NFDI4Objects KG — Poseidon aDNA sites on a map

This browser-executable notebook queries the NFDI4Objects Knowledge Graph for archaeogenetic aDNA samples from the Poseidon Community Archive, loaded into the N4O KG as collection/17 under the ArNO ontology (Straten, Strohm, Thiery & Renz 2025). It aggregates the samples by discovery site and plots the result on two interactive Leaflet maps. The first map shows individual sites as country-coloured markers; the second bins the samples into a hexagonal grid coloured by sample count, so concentrations of heavily sampled locations are visible at a glance.

It is part of an Open Educational Resource series on knowledge graphs and linked open data, and is designed to stand on its own. A local-Python variant of this notebook is available as n4okg-poseidon-sites-map.ipynb.

Note

On first load, your browser downloads the Python runtime (Pyodide, ~10 MB) and the Leaflet library (~150 KB). Please allow a moment for these to initialise.

Warning

The NFDI4Objects Knowledge Graph is a research prototype. If this notebook fails to load data with a network error, the endpoint may be temporarily unreachable or may not allow cross-origin browser requests from this page’s domain. The local .ipynb companion is not affected by this and is always a reliable fallback.

About this notebook

Why this dataset?

collection/17 holds the Poseidon Community Archive (PCA), a large open corpus of archaeogenetic genotype data with contextual and bibliographic metadata (Schmid et al. 2024). For teaching purposes it offers something most linked-data-for-archaeology examples do not: a natural-science dataset mapped into a cultural-heritage knowledge graph. Samples come with coordinates, countries, dating, capture types, and genetic measurements — enough structure to make meaningful aggregations visible without being overwhelming.

What you’ll learn

  • how to traverse a multi-hop ArNO path (aDNASample → DiscoverySite → Site → Place → Country) in a single SPARQL query
  • how the GeoSPARQL two-step indirection works on this endpoint (geometry is attached to the DiscoverySite, not to the Sample)
  • how to aggregate by a coordinate tuple so that one named site with multiple samples collapses to a single visible marker

Data-context notes

  • Coordinates live on the DiscoverySite, not on the Sample. The ArNO modelling keeps the sample focused on what the sample is (tissue, genetics, dating) and delegates where it was found to the discovery site. The query path reflects this: ?sample arno:foundAtDiscoverySite ?ds . ?ds geo:hasGeometry/geo:asWKT ?wkt.
  • Aggregation is by (site, country, wkt) rather than by site name alone. A single named site may in principle be represented by more than one coordinate in the underlying data, and grouping by the geometry tuple avoids accidentally unifying them. The poseidon2lod README also notes that site and location reference coordinates are created by averaging associated discovery-site coordinates when matching against GeoNames and Wikidata.
  • WKT literals on this endpoint use POINT(lon lat) (uppercase, longitude first). The parser below handles both casings defensively and swaps the order for Leaflet.

Tooling notes

In the browser, SPARQL access goes through pyodide.http.pyfetch. SPARQLWrapper is not available in Pyodide (it issues blocking HTTP requests that need a socket-level networking stack absent in WebAssembly). Mapping uses a hand-rolled Leaflet block returned via _repr_html_ rather than folium, because folium writes its output to disk and relies on file-system paths that Pyodide cannot expose to the parent page.
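As a rough illustration of the pyfetch route, the sketch below sends a query as a form-encoded POST, following the SPARQL 1.1 Protocol. The endpoint URL is a hypothetical placeholder, and the helper names build_sparql_request and run_query are invented for this example; the notebook's own cell may differ.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: substitute the real N4O SPARQL endpoint URL.
ENDPOINT = "https://example.org/sparql"


def build_sparql_request(endpoint, query):
    """Return (url, kwargs) for a form-encoded SPARQL POST request."""
    kwargs = {
        "method": "POST",
        "headers": {
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        "body": urlencode({"query": query}),
    }
    return endpoint, kwargs


async def run_query(endpoint, query):
    """Send the query from inside Pyodide and return the parsed JSON result."""
    # Imported lazily so this module still loads outside the browser.
    from pyodide.http import pyfetch

    url, kwargs = build_sparql_request(endpoint, query)
    response = await pyfetch(url, **kwargs)
    return await response.json()
```

Inside a notebook cell this would be called as `result = await run_query(ENDPOINT, query)`. Keeping the request construction separate from the network call makes it possible to check the request outside the browser.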

Step 1 — Define the SPARQL query

The query walks the ArNO path aDNASample → foundAtDiscoverySite → atSite → atPlace → inCountry to reach each sample’s country, and in parallel pulls the WKT geometry from the discovery site via geo:hasGeometry/geo:asWKT. We then GROUP BY site label, country label, and WKT so that one row of the result is one place on the map, regardless of how many samples it represents.
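A query of this shape might look like the sketch below. The ArNO namespace IRI is a placeholder, and the class name arno:aDNASample, the exact property IRIs for atSite, atPlace and inCountry, and the use of rdfs:label for names are assumptions derived from the path described above, not confirmed against the ontology.

```python
# Assumption: the arno: namespace IRI is a placeholder, not the real one.
# Assumption: labels are exposed via rdfs:label on sites and countries.
QUERY = """
PREFIX arno: <https://example.org/arno#>
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?siteLabel ?countryLabel ?wkt (COUNT(DISTINCT ?sample) AS ?numSamples)
WHERE {
  ?sample a arno:aDNASample ;
          arno:foundAtDiscoverySite ?ds .
  # Geometry hangs off the discovery site, reached in two hops.
  ?ds geo:hasGeometry/geo:asWKT ?wkt ;
      arno:atSite ?site .
  ?site rdfs:label ?siteLabel ;
        arno:atPlace/arno:inCountry ?country .
  ?country rdfs:label ?countryLabel .
}
GROUP BY ?siteLabel ?countryLabel ?wkt
ORDER BY DESC(?numSamples)
"""
```

Counting distinct samples per group yields the numSamples value for each (site, country, coordinate) row.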

Step 2 — Load the data

The bindings from a SPARQL JSON result are flattened into a DataFrame. The WKT literal is split into separate latitude and longitude columns, and we drop rows whose coordinates failed to parse — GeoSPARQL literals occasionally carry non-standard wrappings that the regex does not match, and it is better to surface that gracefully than to fail the whole notebook.
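The flattening and parsing step can be sketched in plain Python as follows. The variable names siteLabel, wkt and numSamples are assumed to match the query's SELECT clause, and the regular expression is one possible pattern, not necessarily the one used in the notebook.

```python
import re

# Matches POINT(lon lat) in either casing, with optional whitespace and an
# optional leading CRS IRI such as "<http://...> ".
WKT_POINT = re.compile(
    r"^\s*(?:<[^>]*>\s*)?POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def parse_wkt_point(wkt):
    """Return (lat, lon) for a WKT point, or None if it does not match."""
    match = WKT_POINT.match(wkt or "")
    if match is None:
        return None
    lon, lat = float(match.group(1)), float(match.group(2))
    return lat, lon  # swapped: Leaflet expects latitude first


def flatten_bindings(result):
    """Turn a SPARQL JSON result into a list of plain row dicts."""
    rows = []
    for binding in result["results"]["bindings"]:
        row = {name: cell["value"] for name, cell in binding.items()}
        coords = parse_wkt_point(row.get("wkt"))
        if coords is None:
            continue  # skip rows whose geometry could not be parsed
        row["lat"], row["lon"] = coords
        row["numSamples"] = int(row.get("numSamples", 0))
        rows.append(row)
    return rows
```

The resulting list of dicts can be handed to pandas.DataFrame(rows) to obtain the DataFrame used in the later steps.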

Step 3a — Map with country-coloured markers

The first map is built as an HTML string in Python and returned as the cell’s value via an object that implements _repr_html_. quarto-live inserts the HTML on the main thread, where the browser happily runs the embedded <script> with full DOM access.

A layer control in the top-right corner (expand it by hovering or tapping) offers four base layers — OpenStreetMap as the default, the Humanitarian OSM style, Esri satellite imagery, and Esri’s world terrain base — and one overlay per country, each with its colour swatch in the label. Unchecking a country hides just its markers, so you can isolate regions interactively.
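A stripped-down version of this pattern is sketched below. The class name LeafletMap, the div id, the initial view and the row keys (lat, lon, siteLabel, countryLabel) are assumptions made for illustration. The sketch also assumes the Leaflet global L is already loaded on the page, and it includes only the OpenStreetMap base layer rather than all four.

```python
import html
import json


def _js(value):
    """Serialise a Python value as a JS literal that is safe inside <script>."""
    return json.dumps(value).replace("</", "<\\/")


class LeafletMap:
    """Minimal object whose HTML representation is a Leaflet map."""

    SCRIPT = """
    (function () {
      var map = L.map(__DIV_ID__).setView([48, 15], 4);
      var osm = L.tileLayer('https://tile.openstreetmap.org/{z}/{x}/{y}.png',
                            {attribution: '&copy; OpenStreetMap contributors'}).addTo(map);
      var colours = __COLOURS__, overlays = {};
      __POINTS__.forEach(function (p) {
        // One layer group per country so each can be toggled separately.
        var group = overlays[p.country] ||
                    (overlays[p.country] = L.layerGroup().addTo(map));
        L.circleMarker([p.lat, p.lon], {color: colours[p.country] || '#555'})
          .bindPopup(p.popup).addTo(group);
      });
      L.control.layers({'OpenStreetMap': osm}, overlays).addTo(map);
    })();
    """

    def __init__(self, rows, colours, div_id="site-map"):
        self.rows = rows          # dicts with lat, lon, siteLabel, countryLabel
        self.colours = colours    # country label -> CSS colour
        self.div_id = div_id

    def _repr_html_(self):
        points = [
            {
                "lat": r["lat"],
                "lon": r["lon"],
                "country": r["countryLabel"],
                # Popups are rendered as HTML, so escape the labels.
                "popup": html.escape(f'{r["siteLabel"]} ({r["countryLabel"]})'),
            }
            for r in self.rows
        ]
        script = (self.SCRIPT
                  .replace("__DIV_ID__", _js(self.div_id))
                  .replace("__COLOURS__", _js(self.colours))
                  .replace("__POINTS__", _js(points)))
        return (f'<div id="{html.escape(self.div_id)}" style="height:400px"></div>'
                f"<script>{script}</script>")
```

Using placeholder substitution instead of an f-string avoids having to escape the many braces in the JavaScript, and serialising the data with json.dumps keeps site names from breaking out of the script.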

Step 3b — Hex-binned sample density

The second map aggregates samples into a hexagonal grid, with each cell coloured by the number of aDNA samples falling inside it. A site with hundreds of samples contributes hundreds to its cell, not one — so the heatmap reflects sampling intensity, not just the presence of a dig. Overlapping sites in the same region compound, which makes heavily-studied clusters (Central Europe, the Eurasian steppe, Anatolia) jump out even at a world zoom level.

We hand-roll the hex binning in Python and draw the polygons as Leaflet L.polygon — no extra plugin, no canvas resize quirks, and the colour legend aligns perfectly with the underlying buckets. The hex size is ~1.0° longitude (roughly 80 km at 45°N), large enough to aggregate multiple neighbouring sites into each cell, and small enough to keep regional structure visible.
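One way to hand-roll such a binning is sketched below, using a pointy-top hexagonal grid in plain longitude/latitude degrees with axial coordinates and cube rounding. This is an illustrative technique, not necessarily the notebook's implementation; in particular, treating size as the centre-to-corner distance in degrees is an assumption, and cells laid out in degree space are not equal-area on the ground.

```python
import math
from collections import defaultdict


def hex_round(q, r):
    """Round fractional axial coordinates to the nearest hex cell."""
    x, z = q, r
    y = -x - z
    rx, ry, rz = round(x), round(y), round(z)
    dx, dy, dz = abs(rx - x), abs(ry - y), abs(rz - z)
    # Recompute the component with the largest rounding error so x+y+z == 0.
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz


def hex_bin(rows, size=1.0):
    """Sum numSamples per hex cell; rows need lat, lon and numSamples keys."""
    counts = defaultdict(int)
    for row in rows:
        q = (math.sqrt(3) / 3 * row["lon"] - row["lat"] / 3) / size
        r = (2 / 3 * row["lat"]) / size
        counts[hex_round(q, r)] += row["numSamples"]
    return dict(counts)


def hex_corners(q, r, size=1.0):
    """Return the six (lat, lon) corners of a cell, ready for L.polygon."""
    cx = size * math.sqrt(3) * (q + r / 2)   # centre longitude
    cy = size * 1.5 * r                      # centre latitude
    return [
        (cy + size * math.sin(math.radians(60 * i - 30)),
         cx + size * math.cos(math.radians(60 * i - 30)))
        for i in range(6)
    ]
```

Each key of the dictionary returned by hex_bin identifies one cell, and passing that key to hex_corners gives the polygon outline whose fill colour can then be chosen from the cell's sample count.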

Step 4 — Exploring the data

The cells below are a free playground — filter the DataFrame by country, rank sites by sample count, or compute per-country aggregates. Remember: one row of df is one (site, country, coordinate) tuple, with a numSamples column holding the number of aDNA samples aggregated there.


References

  • Straten, M. thor, Strohm, S., Thiery, F. & Renz, M. (2025). Data-Driven Community Standards for Interdisciplinary Heterogeneous Information Networks. E-Science-Tage 2025. Heidelberg: heiBOOKS. doi: 10.11588/heibooks.1652.c23914
  • Schmid, C. et al. (2024). Poseidon — A framework for archaeogenetic human genotype data management. eLife 13. doi: 10.7554/eLife.98317.1
  • archaeonatural-cloud/poseidon2lod — RDF generation scripts. github.com/archaeonatural-cloud/poseidon2lod

This notebook is part of the Open Educational Resources of NFDI4Objects.