NFDI4Objects KG — Poseidon aDNA samples: country × capture type

This browser-executable notebook queries the NFDI4Objects Knowledge Graph for archaeogenetic aDNA samples from the Poseidon Community Archive — loaded into the N4O KG as collection/17 under the ArNO ontology (Straten, Strohm, Thiery & Renz 2025) — and cross-tabulates them by country and capture type. Two complementary views render the result: a stacked bar chart showing both the overall sample volume per country and the methodological mix within each, and a heatmap that makes every country × method combination individually inspectable.

Unlike the companion notebook n4okg-poseidon-sites-map-live.qmd, which asks where the samples are, this notebook asks how they were produced, and whether the answer depends on where they come from. It is part of an Open Educational Resource series on knowledge graphs and linked open data, and is designed to stand on its own. A local-Python variant of this notebook is available as n4okg-poseidon-country-capture.ipynb.

Note

On first load, your browser downloads the Python runtime (Pyodide, ~10 MB). Please allow a moment for it to initialise.

Warning

The NFDI4Objects Knowledge Graph is a research prototype. If this notebook fails to load data with a network error, the endpoint may be temporarily unreachable or may not allow cross-origin browser requests from this page’s domain. The local .ipynb companion is not affected by this and is always a reliable fallback.

About this notebook

Why this dataset?

collection/17 holds the Poseidon Community Archive (PCA), a large open corpus of archaeogenetic genotype data with contextual and bibliographic metadata (Schmid et al. 2024). The metadata contains a field describing the capture type of each sample — the laboratory protocol used to enrich aDNA fragments from a sample (e.g. Shotgun, 1240K, various targeted panels). This is a methodological choice of the original project, not an underlying property of the archaeological record, and its distribution across countries is therefore an interesting question in its own right.

What you’ll learn

  • how a single SPARQL query returning a tidy three-column + one-measure table can drive two different visualisations
  • how to pivot_table a long-format SPARQL result into a country × method matrix with pandas
  • how a stacked bar and a heatmap answer related but distinct questions about the same cross-tabulation

Data-context notes

  • The query traverses the same ArNO path as the sites map (aDNASample → DiscoverySite → Site → Place → Country), but adds a second hop: aDNASample → hasCaptureType → CaptureType. Each sample contributes to exactly one country–method cell.
  • Patterns in the resulting cross-tabulation should not automatically be read as archaeological facts. They may reflect publication standards, project traditions, or technical preferences in particular labs and countries. The briefing document for this query states this very clearly, and it is worth keeping in mind when you interpret the plots below.

Tooling notes

In the browser, SPARQL access goes through pyodide.http.pyfetch, because SPARQLWrapper is not available in Pyodide. Plots use matplotlib with the "Agg" backend that Pyodide ships by default; figures are displayed inline by calling plt.show() at the end of the cell.
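A minimal sketch of browser-side SPARQL access along these lines is shown here. It assumes the endpoint accepts GET requests with a `query` parameter and returns SPARQL JSON results, as described in the SPARQL 1.1 Protocol. The helper names `build_sparql_url` and `fetch_sparql_json` are illustrative, and the endpoint URL is left as a parameter because it is not stated in this text.

```python
from urllib.parse import urlencode

# Media type for SPARQL SELECT results in JSON (W3C standard).
SPARQL_JSON = "application/sparql-results+json"


def build_sparql_url(endpoint: str, query: str) -> str:
    """Encode the query as a GET request URL (hypothetical helper).

    Assumes the endpoint URL carries no query string of its own.
    """
    return f"{endpoint}?{urlencode({'query': query})}"


async def fetch_sparql_json(endpoint: str, query: str) -> dict:
    """Run a SELECT query from inside Pyodide and return the parsed JSON."""
    # Imported lazily: pyodide.http only exists inside the Pyodide runtime.
    from pyodide.http import pyfetch

    response = await pyfetch(
        build_sparql_url(endpoint, query),
        method="GET",
        headers={"Accept": SPARQL_JSON},
    )
    if not response.ok:
        raise RuntimeError(f"SPARQL endpoint returned HTTP {response.status}")
    return await response.json()
```

In a notebook cell this would be used as `result = await fetch_sparql_json(ENDPOINT, QUERY)`, where `ENDPOINT` and `QUERY` are names you define yourself. A cross-origin failure surfaces as an exception from `pyfetch`, which matches the network-error case mentioned in the warning above.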

Step 1 — Define the SPARQL query

The query selects every aDNASample in the graph together with the country it was discovered in (via the chain foundAtDiscoverySite → atSite → atPlace → inCountry) and its capture type (via a direct hasCaptureType edge). COUNT(DISTINCT ?sample) makes sure we count each sample once, even if the graph exposes multiple paths. The two labels are wrapped in OPTIONAL so that samples with a missing country or capture-type label are still represented — visualising missing data is part of understanding the dataset.
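A sketch of a query with this shape, written as a Python string, might look as follows. It is an illustration under stated assumptions, not the notebook's actual cell: the namespace IRI in `ARNO_NS` is a placeholder, the use of `rdfs:label` for both labels is a guess, and the variable names (`?countryLabel`, `?captureTypeLabel`, `?n`) are chosen for this example. Only the class and property local names are taken from the description above.

```python
# Placeholder namespace: replace with the real ArNO ontology IRI.
ARNO_NS = "https://example.org/arno#"

QUERY = f"""
PREFIX arno: <{ARNO_NS}>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?countryLabel ?captureTypeLabel (COUNT(DISTINCT ?sample) AS ?n)
WHERE {{
  ?sample a arno:aDNASample ;
          arno:foundAtDiscoverySite/arno:atSite/arno:atPlace/arno:inCountry ?country ;
          arno:hasCaptureType ?captureType .
  # Labels are optional so that unlabeled nodes still produce a row.
  # rdfs:label is an assumption; the graph may use another label property.
  OPTIONAL {{ ?country rdfs:label ?countryLabel }}
  OPTIONAL {{ ?captureType rdfs:label ?captureTypeLabel }}
}}
GROUP BY ?countryLabel ?captureTypeLabel
ORDER BY DESC(?n)
"""

print(QUERY)
```

The doubled braces are only there because the string is an f-string; the printed query contains single braces. Grouping by the two labels yields one row per country and capture type combination, with unbound labels collected in a group of their own.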

Step 2 — Load the data

The query result is a long-format table: one row per (country, capture type) combination with the sample count attached. We convert it into a pandas DataFrame and print a quick summary to confirm the shape of the data before we plot it.
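The conversion can be sketched as follows, assuming the endpoint returns the standard W3C SPARQL JSON results layout (`results.bindings`, each variable as a `{"type": ..., "value": ...}` object). The variable names `countryLabel`, `captureTypeLabel` and `n`, the helper `bindings_to_frame`, and the placeholder strings for missing labels are assumptions of this example. The toy response contains made-up values so that the block runs without network access.

```python
import pandas as pd


def bindings_to_frame(result: dict) -> pd.DataFrame:
    """Flatten SPARQL JSON bindings into a long-format DataFrame."""
    rows = []
    for binding in result["results"]["bindings"]:
        rows.append({
            # OPTIONAL variables are simply absent from a binding when unbound.
            "country": binding.get("countryLabel", {}).get("value", "(no country label)"),
            "capture_type": binding.get("captureTypeLabel", {}).get("value", "(no capture-type label)"),
            # Literal values arrive as strings, so the count is cast explicitly.
            "n": int(binding["n"]["value"]),
        })
    return pd.DataFrame(rows, columns=["country", "capture_type", "n"])


# Toy response with invented values, for illustration only.
toy_result = {
    "head": {"vars": ["countryLabel", "captureTypeLabel", "n"]},
    "results": {"bindings": [
        {"countryLabel": {"type": "literal", "value": "Country A"},
         "captureTypeLabel": {"type": "literal", "value": "Method X"},
         "n": {"type": "literal", "value": "12"}},
        {"countryLabel": {"type": "literal", "value": "Country A"},
         "captureTypeLabel": {"type": "literal", "value": "Method Y"},
         "n": {"type": "literal", "value": "3"}},
        {"captureTypeLabel": {"type": "literal", "value": "Method X"},
         "n": {"type": "literal", "value": "5"}},
    ]},
}

df = bindings_to_frame(toy_result)
print(df.shape)
print(df["n"].sum(), "samples in", df["country"].nunique(), "country groups")
```

Giving missing labels an explicit placeholder keeps those rows visible in the later plots instead of dropping them silently.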

Step 3 — Build the pivot table

The pivot turns the long-format result into a country × method matrix with sample counts as cell values. Missing combinations are filled with zero: the graph does not know about them, and leaving the cell as NaN would make the bar chart and the heatmap misbehave later.

Both plots below read from this one matrix.
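The pivot step can be sketched with `DataFrame.pivot_table` as follows. The column names `country`, `capture_type` and `n` are assumptions of this example, and the toy values are invented so that the block is self-contained.

```python
import pandas as pd

# Toy long-format result with invented values.
df = pd.DataFrame({
    "country": ["Country A", "Country A", "Country B", "Country C"],
    "capture_type": ["Method X", "Method Y", "Method X", "Method Y"],
    "n": [12, 3, 5, 40],
})

# Rows become countries, columns become capture types, cells hold counts.
# fill_value=0 replaces the NaN that absent combinations would produce.
pivot = df.pivot_table(
    index="country",
    columns="capture_type",
    values="n",
    aggfunc="sum",
    fill_value=0,
)

# Order countries by total sample volume, largest first.
pivot = pivot.loc[pivot.sum(axis=1).sort_values(ascending=False).index]
print(pivot)
```

Using `aggfunc="sum"` is a defensive choice: if the long table ever contains two rows for the same country and method, their counts are added rather than averaged, which is what the default `aggfunc` would do.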

Step 4a — Stacked bar chart

The stacked bar gives two pieces of information at once: bar length encodes the total sample volume per country, and the coloured segments encode the methodological composition within each. A country whose bar is dominated by one colour ran (mostly) one protocol; a country whose bar shows several colours of comparable size ran a mixed pipeline.

pandas’ DataFrame.plot(kind="barh", stacked=True), which draws through matplotlib, does the heavy lifting here. We limit the chart to the top 20 countries to keep labels readable: a long tail of small contributions would compress the informative bars and make the chart harder to read.
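A minimal sketch of such a chart on a toy matrix is shown below. The toy values, the constant `TOP_N`, and the axis labels are assumptions of this example; in the notebook, a final `plt.show()` would display the figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, as used in the browser
import matplotlib.pyplot as plt
import pandas as pd

# Toy country x method matrix with invented values.
pivot = pd.DataFrame(
    {"Method X": [12, 5, 0], "Method Y": [3, 0, 40]},
    index=["Country A", "Country B", "Country C"],
)

TOP_N = 20  # keep only the countries with the most samples

# nlargest returns the row totals in descending order.
top = pivot.loc[pivot.sum(axis=1).nlargest(TOP_N).index]

# barh draws the first row at the bottom, so reverse the order
# to put the largest country at the top of the chart.
ax = top.iloc[::-1].plot(kind="barh", stacked=True, figsize=(8, 4))
ax.set_xlabel("Number of samples")
ax.set_ylabel("Country")
ax.legend(title="Capture type")
plt.tight_layout()
```

Each column of the matrix becomes one coloured segment series, so the legend lists capture types and the total bar length equals the row sum.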

Step 4b — Heatmap

The heatmap answers the complementary question to the bar chart: for every country × method combination, how many samples fall into that cell? Strong and weak combinations, sparsity, and country-specific method profiles all become visible simultaneously. Empty cells are drawn in a very pale colour. An empty cell means either that no samples of that kind reached the archive or that the combination was never tried; on this dataset, the two cases can rarely be told apart from the data alone.

We use a LogNorm-scaled colour map because sample counts per cell are heavily right-skewed (a handful of large cells dominate; many cells have just a few samples). Linear scaling would collapse everything small to near-white and obscure the structure of the long tail.
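A log-scaled heatmap can be sketched as follows. Because a logarithmic scale cannot represent zero, this example masks empty cells and assigns them a pale "bad" colour; that masking approach, the `Blues` colour map, and the toy values are assumptions of the sketch rather than the notebook's confirmed settings.

```python
import copy

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, as used in the browser
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.colors import LogNorm

# Toy country x method matrix with invented values.
pivot = pd.DataFrame(
    {"Method X": [12, 5, 0], "Method Y": [3, 0, 40]},
    index=["Country A", "Country B", "Country C"],
)

# Mask zero cells: log(0) is undefined, so they get the "bad" colour.
values = np.ma.masked_less_equal(pivot.to_numpy(dtype=float), 0)

cmap = copy.copy(plt.get_cmap("Blues"))  # copy before modifying the colour map
cmap.set_bad("#f7f7f7")                  # very pale grey for empty cells

fig, ax = plt.subplots(figsize=(6, 4))
im = ax.imshow(
    values,
    norm=LogNorm(vmin=1, vmax=values.max()),
    cmap=cmap,
    aspect="auto",
)
ax.set_xticks(range(pivot.shape[1]))
ax.set_xticklabels(pivot.columns, rotation=45, ha="right")
ax.set_yticks(range(pivot.shape[0]))
ax.set_yticklabels(pivot.index)
fig.colorbar(im, ax=ax, label="Samples (log scale)")
fig.tight_layout()
```

Setting `vmin=1` anchors the scale at the smallest possible non-empty cell, so a cell with a single sample is still distinguishable from an empty one.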

Step 5 — Exploring the data

The cells below are a free playground. Two starting points: the full pivot is available as the pivot DataFrame, and the long-format result is available as df. Change the thresholds and see how the picture shifts.
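One possible starting point, sketched on a toy matrix: keep only countries above a sample threshold and convert their counts into per-country shares. The threshold `MIN_SAMPLES` and the toy values are arbitrary choices for this example.

```python
import pandas as pd

# Toy country x method matrix with invented values.
pivot = pd.DataFrame(
    {"Method X": [12, 5, 0], "Method Y": [3, 0, 40]},
    index=["Country A", "Country B", "Country C"],
)

MIN_SAMPLES = 10  # arbitrary threshold: change it and rerun

# Keep countries whose total sample count reaches the threshold.
totals = pivot.sum(axis=1)
big = pivot[totals >= MIN_SAMPLES]

# Divide each row by its total to get the method mix per country.
shares = big.div(big.sum(axis=1), axis=0)
print(shares)
```

Shares remove the volume effect, which makes countries with very different sample totals comparable in terms of method mix.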


References

  • Straten, M. thor, Strohm, S., Thiery, F. & Renz, M. (2025). Data-Driven Community Standards for Interdisciplinary Heterogeneous Information Networks. E-Science-Tage 2025. Heidelberg: heiBOOKS. doi: 10.11588/heibooks.1652.c23914
  • Schmid, C. et al. (2024). Poseidon — A framework for archaeogenetic human genotype data management. eLife 13. doi: 10.7554/eLife.98317.1
  • archaeonatural-cloud/poseidon2lod — RDF generation scripts. github.com/archaeonatural-cloud/poseidon2lod

This notebook is part of the Open Educational Resources of NFDI4Objects.