The Campanian Ignimbrite: sites from a fuzzy-linked dataset
This browser-executable notebook reads a local Turtle (RDF) file into the Pyodide virtual file system, queries it with SPARQL directly from Python via rdflib, and derives several complementary views on the same dataset — the Campanian Ignimbrite findspot catalogue maintained by the Research Squirrel Engineers.
On first load, your browser downloads the Python runtime (Pyodide, ~10 MB) and the TTL file is copied into Pyodide’s virtual filesystem. Please allow a moment for it to initialise.
About this notebook
The Campanian Ignimbrite (CI) is one of the largest volcanic eruptions of the late Quaternary: around 39 000 years ago, a supereruption in the Campanian Volcanic Arc deposited an ash layer reaching as far as the Black Sea. Archaeological and palaeoenvironmental findspots across Europe and the Mediterranean can be correlated to this single stratigraphic marker, giving late Palaeolithic research an unusually precise temporal anchor. The dataset used here catalogues 74 such findspots with literature references, geolocations, and links to external LOD hubs (Wikidata, OpenStreetMap, GeoNames).
This notebook intentionally relies on one main SPARQL query to extract the core tabular view of the data (a second, auxiliary query only collects external LOD links), then derives five complementary views from that one DataFrame: two bar charts and three maps. It is a pattern worth recognising: many “dashboard”-style analyses in linked data are not about writing more queries, but about asking different questions of the same result.
A companion local notebook, local_ttl-campanian-ignimbrite-sites.ipynb, runs the same pipeline with the full scientific Python stack for readers who prefer a Jupyter environment.
Why this dataset?
Three properties make the CI catalogue pedagogically valuable:
- Mixed spatial types: caves, lakes, maars, inhabited places, plateaus, and several “unknown-category” entries coexist, so filtering and faceting have real visual payoff
- Explicit uncertainty: every site carries a certaintyLevel (high, medium, low, dubious, representative), which lets us show FAIR-compatible handling of provenance and confidence
- Partial LOD coverage: roughly 60 % of sites have Wikidata links and 40 % have OSM links; the completeness question itself becomes a research question
What you’ll learn
- how to load a Turtle file directly in the browser and query it with rdflib, bypassing any SPARQL endpoint
- how to distinguish between different kinds of skos/fsl matches (closeMatch, partlyMatch, spatialCloseMatch, dubiousMatch) and why the distinction matters epistemically
- how to produce several focused views from a single, deliberate query
Data-context notes
- 74 unique sites, returned as ~106 rows: sites with multiple literature references are duplicated in the main query; we de-duplicate for the map and keep the reference list per site
- WKT literals come with an SRID prefix, <http://.../EPSG/0/4326> POINT(lon lat); the case-insensitive regex parser in use handles the prefix transparently
- external LOD links live across four properties with different semantic strength: skos:closeMatch (most confident, 51 of 74 sites), fsl:partlyMatch, fsl:spatialCloseMatch, and fsl:dubiousMatch (explicitly marked unreliable)
- UnknownCategory and InhabitedPlace dominate the spatial types; together they account for roughly half of the catalogue
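To make the WKT note concrete, here is a minimal sketch of a case-insensitive regex parser that tolerates an optional SRID prefix. It is not the notebook's actual parser: the function name parse_wkt_point is hypothetical, the example.org URI in the usage note is a placeholder for the elided one, and longitude-latitude axis order is assumed as stated above.

```python
import re

# Optional "<...>" SRID prefix, then POINT(lon lat); matched case-insensitively.
_WKT_POINT = re.compile(
    r"^\s*(?:<[^>]*>\s*)?POINT\s*\(\s*"
    r"([-+]?\d+(?:\.\d+)?)\s+([-+]?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def parse_wkt_point(literal):
    """Return (lat, lon) for a WKT point literal, or None if it does not match.

    Hypothetical helper: assumes the literal stores longitude first.
    """
    match = _WKT_POINT.match(str(literal))
    if match is None:
        return None
    lon, lat = float(match.group(1)), float(match.group(2))
    return lat, lon
```

For instance, parse_wkt_point("<http://example.org/EPSG/0/4326> POINT(14.1 40.8)") yields the pair in latitude-longitude order, which is what most web-mapping libraries expect; anything that is not a point returns None instead of raising.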
Tooling notes
All querying is local — rdflib parses the Turtle file that quarto-live copies into Pyodide’s virtual filesystem via the resources: frontmatter entry. No HTTP, no CORS, no endpoint uptime questions. Mapping uses a hand-rolled Leaflet block returned via _repr_html_; static charts use matplotlib.
Step 1 — Load the Turtle file and define the SPARQL queries
The TTL file is shipped alongside the notebook and copied into the browser’s Pyodide VFS by quarto-live. rdflib then parses it into a Graph object, against which we can run arbitrary SPARQL. Two queries this time: one for the main tabular view (the one proposed in the task), and a second one collecting external LOD links across four different match properties.
Step 2 — Run the queries, build DataFrames
The main query is flattened into a row-per-reference DataFrame; for mapping we also build a de-duplicated one-row-per-site DataFrame with references aggregated into a list. The matches query is pivoted into per-site lists of Wikidata and OSM links.
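As an illustration of the de-duplication step, the sketch below collapses a row-per-reference table into one row per site with pandas. The sample rows and column names are assumptions chosen to mirror the description, not the notebook's real data.

```python
import pandas as pd

# Hypothetical row-per-reference data; column names are assumptions.
df_long = pd.DataFrame(
    [
        {"site": "ex:site1", "label": "Site A", "spatialType": "Cave", "reference": "Ref 1"},
        {"site": "ex:site1", "label": "Site A", "spatialType": "Cave", "reference": "Ref 2"},
        {"site": "ex:site2", "label": "Site B", "spatialType": "Lake", "reference": "Ref 1"},
    ]
)

# One row per site: keep the first value of each site-level column and
# gather the site's references into a list.
df_sites = (
    df_long.drop_duplicates(["site", "reference"])
    .groupby("site", as_index=False)
    .agg(
        label=("label", "first"),
        spatialType=("spatialType", "first"),
        references=("reference", list),
    )
)
```

Taking the first value is only safe if site-level attributes are identical across a site's rows; if they could differ, that would be worth checking before aggregating.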
Step 3a — Distribution of spatial types
The catalogue mixes quite different place concepts. Before looking at the map, a horizontal bar chart of spatial-type counts gives the numerical ground truth for what will follow — in particular it makes clear how much of the data sits in two loose categories (UnknownCategory and InhabitedPlace).
Step 3b — Map 1: findspots coloured by spatial type
Each site appears once, coloured by spatialType, with one toggleable layer per category. The layer control doubles as a colour legend. Popups include the full list of literature references for the site and any Wikidata or OSM links, grouped by match confidence (closeMatch is the strongest, dubiousMatch the weakest).
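A hand-rolled Leaflet block with one toggleable layer per category can be sketched as follows. This is an assumption-laden illustration, not the notebook's implementation: the LeafletMap class, its field names, the initial map view, and the choice to wrap the page in an iframe via srcdoc are all hypothetical, and the Leaflet assets are loaded from the unpkg CDN.

```python
import html
import json


class LeafletMap:
    """Minimal sketch of a Leaflet map exposed through _repr_html_."""

    def __init__(self, sites, colours):
        self.sites = sites      # list of dicts: label, lat, lon, category
        self.colours = colours  # category -> CSS colour

    def _page(self):
        # Escape "</" so embedded data can never close the <script> element.
        sites = json.dumps(self.sites).replace("</", "<\\/")
        colours = json.dumps(self.colours).replace("</", "<\\/")
        return f"""<!DOCTYPE html>
<html><head>
<link rel="stylesheet" href="https://unpkg.com/leaflet@1.9.4/dist/leaflet.css">
<script src="https://unpkg.com/leaflet@1.9.4/dist/leaflet.js"></script>
</head><body style="margin:0">
<div id="map" style="height:400px"></div>
<script>
const sites = {sites};
const colours = {colours};
const map = L.map("map").setView([42, 20], 4);
L.tileLayer("https://tile.openstreetmap.org/{{z}}/{{x}}/{{y}}.png",
            {{attribution: "&copy; OpenStreetMap contributors"}}).addTo(map);
const layers = {{}};
for (const s of sites) {{
  if (!layers[s.category]) {{ layers[s.category] = L.layerGroup().addTo(map); }}
  const popup = document.createElement("div");
  popup.textContent = s.label;  // treat labels as text, not markup
  L.circleMarker([s.lat, s.lon], {{color: colours[s.category] || "#666"}})
    .bindPopup(popup).addTo(layers[s.category]);
}}
L.control.layers(null, layers).addTo(map);  // one checkbox per category
</script>
</body></html>"""

    def _repr_html_(self):
        # srcdoc keeps the map in its own document; the value must be escaped.
        page = html.escape(self._page(), quote=True)
        return f'<iframe srcdoc="{page}" style="width:100%;height:420px;border:0"></iframe>'
```

Returning an instance as the last expression of a cell lets the notebook front end call _repr_html_ and display the map. The iframe is one way to isolate Leaflet's scripts and styles from the host page; injecting the markup directly is an alternative if the front end executes inline scripts.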
Step 3c — Reference frequencies
Which publications underpin the catalogue? A horizontal bar chart of the top 15 literature references shows how dependent the dataset is on a small number of papers: Rosi et al. 1999 alone supplies roughly 20 % of the site entries. This is a useful reminder that the shape of a linked-data resource often mirrors the shape of its bibliography.
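The counting behind such a chart can be sketched with the standard library alone. The sample pairs and the helper name top_references are hypothetical; the share is computed over all row-per-reference entries, which is one reasonable reading of "site entries".

```python
from collections import Counter

# Hypothetical (site, reference) pairs, one per row of the long table.
pairs = [
    ("s1", "Ref A"),
    ("s2", "Ref A"),
    ("s3", "Ref B"),
    ("s3", "Ref A"),
    ("s4", "Ref C"),
]


def top_references(pairs, n=15):
    """Return (reference, count, share of all entries) for the n most frequent references."""
    counts = Counter(ref for _, ref in pairs)
    total = sum(counts.values())
    return [(ref, k, k / total) for ref, k in counts.most_common(n)]
```

The resulting list can be passed to a horizontal bar chart, with the share column making a statement such as "one paper supplies a fifth of the entries" directly checkable.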
Step 3d — Map 2: findspots coloured by certainty level
Each site is drawn once again, this time coloured by certaintyLevel: green (high), yellow (medium), orange (low), red (dubious), grey (representative). The same popups as in Map 1 are reused. This view is didactically the most pointed one — it makes the completeness and reliability distribution of the dataset part of the reading experience.
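The colour scheme described above amounts to a small lookup table. The sketch below encodes it; the function name certainty_colour and the black fallback for unexpected values are assumptions, not part of the notebook.

```python
# Colour per certaintyLevel, as listed in the text.
CERTAINTY_COLOURS = {
    "high": "green",
    "medium": "yellow",
    "low": "orange",
    "dubious": "red",
    "representative": "grey",
}


def certainty_colour(level, default="black"):
    """Look up the marker colour for a certaintyLevel, ignoring case and whitespace.

    Unknown levels fall back to `default` (an assumed choice) so that
    unexpected values stay visible on the map instead of raising.
    """
    return CERTAINTY_COLOURS.get(str(level).strip().lower(), default)
```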
Step 3e — Map 3: archaeological findspots (caves and archaeological sites)
A narrower view, filtered to just the two spatial types with the most direct archaeological interpretation: Cave (9 sites) and ArchaeologicalSite (3 sites) — together 12 of the 74 catalogue entries. This is the Campanian-Ignimbrite analogue of the QGIS workflow “filter the SPARQL query by a particular spatial type”; here we do it in pandas after the fact. The two categories are kept as separate layers so they can be toggled independently.
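Filtering after the fact in pandas can be sketched like this. The sample frame and the spatialType column name are assumptions that mirror the description; the real df_sites has more columns.

```python
import pandas as pd

# Hypothetical one-row-per-site frame.
df_sites = pd.DataFrame(
    {
        "label": ["Site A", "Site B", "Site C", "Site D"],
        "spatialType": ["Cave", "Lake", "ArchaeologicalSite", "Cave"],
    }
)

ARCHAEOLOGICAL_TYPES = ["Cave", "ArchaeologicalSite"]

# Keep only the two spatial types of interest.
df_arch = df_sites[df_sites["spatialType"].isin(ARCHAEOLOGICAL_TYPES)]

# One sub-frame per type, so each can become its own toggleable layer.
layers = {name: group for name, group in df_arch.groupby("spatialType")}
```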
Step 4 — Explore
df_sites (one row per site) and df_long (one row per reference) stay in scope, as does df_matches. Change the cell below to filter, aggregate, or ask your own question of the data.
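As one possible starting point, the sketch below cross-tabulates spatial type against certainty level. It builds a small hypothetical df_sites so that it runs on its own; in the notebook the existing frame would be used instead, and the column names are assumed to match.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df_sites.
df_sites = pd.DataFrame(
    {
        "spatialType": ["Cave", "Cave", "Lake", "InhabitedPlace"],
        "certaintyLevel": ["high", "high", "low", "dubious"],
    }
)

# Rows: spatial types; columns: certainty levels; cells: number of sites.
ct = pd.crosstab(df_sites["spatialType"], df_sites["certaintyLevel"])
```

A table like this shows at a glance whether the less certain entries cluster in particular spatial types.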
Part of an Open Educational Resource series on knowledge graphs and linked open data, produced in the context of NFDI4Objects. Dataset by the Research Squirrel Engineers under CC BY 4.0.