RDF and SPARQL Queries with Python — Deep Dive
RDFLib graph internals
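RDFLib’s Graph class stores triples in an in-memory store backed by three dictionary indexes, keyed subject-first, predicate-first, and object-first. A pattern with at least one bound term is answered from an index in roughly O(1) per lookup (plus the cost of iterating the matches), while a pattern with nothing bound is an O(n) scan over all triples.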
from rdflib import Graph, Namespace, Literal, URIRef, BNode
from rdflib.namespace import RDF, RDFS, OWL, XSD, FOAF
g = Graph()
EX = Namespace("http://example.org/")
g.bind("ex", EX)
g.bind("foaf", FOAF)
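To make the indexing idea concrete, here is a toy sketch in plain Python. It is not RDFLib's actual implementation (the class name `TinyTripleStore` and its flat per-position indexes are assumptions made for illustration), but it shows why a pattern with a bound term avoids a full scan.

```python
from collections import defaultdict


class TinyTripleStore:
    """Toy triple store with one index per position (hypothetical, for illustration)."""

    def __init__(self):
        self._all = set()
        self._by_s = defaultdict(set)
        self._by_p = defaultdict(set)
        self._by_o = defaultdict(set)

    def add(self, triple):
        s, p, o = triple
        self._all.add(triple)
        self._by_s[s].add(triple)
        self._by_p[p].add(triple)
        self._by_o[o].add(triple)

    def triples(self, pattern):
        """Return all triples matching the pattern; None acts as a wildcard."""
        s, p, o = pattern
        candidates = []
        if s is not None:
            candidates.append(self._by_s.get(s, set()))
        if p is not None:
            candidates.append(self._by_p.get(p, set()))
        if o is not None:
            candidates.append(self._by_o.get(o, set()))
        if not candidates:
            return set(self._all)  # nothing bound: full scan
        # Start from the smallest indexed set and check it against the others
        smallest = min(candidates, key=len)
        return {t for t in smallest if all(t in c for c in candidates)}


store = TinyTripleStore()
store.add(("alice", "knows", "bob"))
store.add(("bob", "knows", "carol"))
store.add(("alice", "name", "Alice"))

knows_edges = store.triples((None, "knows", None))
```

Each bound position is a single dictionary lookup; only the fully unbound pattern touches every triple.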
Named graphs and datasets
RDFLib supports named graphs (quads) via Dataset. The older ConjunctiveGraph class offers similar functionality but is deprecated in recent RDFLib releases:
from rdflib import Dataset
ds = Dataset()
g1 = ds.graph(URIRef("http://example.org/graph1"))
g2 = ds.graph(URIRef("http://example.org/graph2"))
g1.add((EX.Alice, FOAF.knows, EX.Bob))
g2.add((EX.Bob, FOAF.knows, EX.Charlie))
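# Query across all named graphs. quads() is used because triples() on a
# Dataset only searches the default graph unless default_union=True is set.
for s, p, o, graph_id in ds.quads((None, FOAF.knows, None, None)):
    print(f"{s} knows {o} (from {graph_id})")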
Named graphs are essential for provenance tracking — storing which source contributed which triples.
Advanced SPARQL with RDFLib
Property paths
SPARQL 1.1 property paths let you traverse arbitrary-length paths:
query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?distant
WHERE {
?person foaf:knows+ ?distant .
}
"""
# foaf:knows+ matches one or more hops
# foaf:knows* matches zero or more hops
# foaf:knows{2,4} is NOT standard SPARQL 1.1; spell the bounds out instead:
# foaf:knows/foaf:knows/foaf:knows?/foaf:knows? matches two to four hops
Subqueries and aggregation
query = """
PREFIX ex: <http://example.org/>
SELECT ?country (COUNT(?city) AS ?cityCount) (AVG(?pop) AS ?avgPop)
WHERE {
?city ex:country ?country .
?city ex:population ?pop .
}
GROUP BY ?country
HAVING (COUNT(?city) > 5)
ORDER BY DESC(?cityCount)
"""
results = g.query(query)
for row in results:
    print(f"{row.country}: {row.cityCount} cities, avg pop {float(row.avgPop):.0f}")
CONSTRUCT for graph transformation
transform_query = """
PREFIX ex: <http://example.org/>
PREFIX schema: <http://schema.org/>
CONSTRUCT {
?city schema:containedInPlace ?country .
?city schema:population ?pop .
}
WHERE {
?city ex:country ?country .
?city ex:population ?pop .
}
"""
new_graph = g.query(transform_query).graph
new_graph.serialize("schema_org_output.ttl", format="turtle")
Querying remote endpoints efficiently
Pagination
Large result sets from public endpoints require pagination:
from SPARQLWrapper import SPARQLWrapper, JSON
def paginated_query(endpoint: str, base_query: str, page_size: int = 1000):
    """Fetch all bindings for base_query, one LIMIT/OFFSET page at a time.

    base_query should end with an ORDER BY clause; without a stable order,
    pages may overlap or skip rows.
    """
    sparql = SPARQLWrapper(endpoint)
    sparql.setReturnFormat(JSON)
    offset = 0
    all_results = []
    while True:
        query = f"{base_query} LIMIT {page_size} OFFSET {offset}"
        sparql.setQuery(query)
        response = sparql.query().convert()
        bindings = response["results"]["bindings"]
        all_results.extend(bindings)
        # A short (or empty) page means the result set is exhausted
        if len(bindings) < page_size:
            break
        offset += page_size
    return all_results
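The same LIMIT/OFFSET loop can be written as a generator that takes the query runner as a parameter, which makes it testable without a network connection. This is only a sketch: `iter_bindings`, `run_query`, `max_pages`, and the fake endpoint are hypothetical names introduced here, not part of SPARQLWrapper.

```python
from typing import Callable, Dict, Iterator, List

Binding = Dict[str, dict]


def iter_bindings(
    run_query: Callable[[str], List[Binding]],
    base_query: str,
    page_size: int = 1000,
    max_pages: int = 100,
) -> Iterator[Binding]:
    """Yield bindings page by page until a short page signals the end."""
    for page in range(max_pages):  # hard cap guards against runaway loops
        query = f"{base_query} LIMIT {page_size} OFFSET {page * page_size}"
        bindings = run_query(query)
        yield from bindings
        if len(bindings) < page_size:
            break


# Stand-in for a remote endpoint: 25 fake rows sliced by LIMIT/OFFSET
FAKE_ROWS = [{"n": {"type": "literal", "value": str(i)}} for i in range(25)]


def fake_endpoint(query: str) -> List[Binding]:
    tokens = query.split()
    limit = int(tokens[tokens.index("LIMIT") + 1])
    offset = int(tokens[tokens.index("OFFSET") + 1])
    return FAKE_ROWS[offset:offset + limit]


rows = list(
    iter_bindings(fake_endpoint, "SELECT ?n WHERE { ?s ?p ?n } ORDER BY ?n", page_size=10)
)
print(len(rows))
```

Against a real endpoint, `run_query` would wrap the `sparql.setQuery(...)` and `sparql.query().convert()` calls shown above and return `response["results"]["bindings"]`.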
Federated queries
SPARQL 1.1 supports cross-endpoint queries with SERVICE:
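PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?item ?dbpediaAbstract WHERE {
  ?item wdt:P31 wd:Q515 .  # cities from Wikidata
  ?item wdt:P1566 ?geonamesId .
  SERVICE <http://dbpedia.org/sparql> {
    # Join on owl:sameAs so each abstract belongs to the matching city
    # (assumes DBpedia links the resource to its Wikidata entity)
    ?dbpItem owl:sameAs ?item ;
             dbo:abstract ?dbpediaAbstract .
    FILTER(LANG(?dbpediaAbstract) = "en")
  }
}
LIMIT 10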
Federated queries are powerful but slow — each SERVICE call is a separate HTTP request. Cache results locally when possible.
Rate limiting and caching
Public endpoints enforce rate limits. Implement polite querying:
import time
import hashlib
import json
from pathlib import Path
CACHE_DIR = Path("sparql_cache")
CACHE_DIR.mkdir(exist_ok=True)
def cached_query(endpoint: str, query: str, cache_hours: int = 24):
    cache_key = hashlib.md5(f"{endpoint}:{query}".encode()).hexdigest()
    cache_file = CACHE_DIR / f"{cache_key}.json"
    if cache_file.exists():
        age_hours = (time.time() - cache_file.stat().st_mtime) / 3600
        if age_hours < cache_hours:
            return json.loads(cache_file.read_text())
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    sparql.addCustomHttpHeader("User-Agent", "MyApp/1.0 (contact@example.com)")
    result = sparql.query().convert()
    cache_file.write_text(json.dumps(result))
    time.sleep(1)  # polite delay
    return result
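A fixed one-second delay does not help when an endpoint starts rejecting requests. A common complement is retrying with exponential backoff, sketched below. The helper `with_backoff` and its parameters are hypothetical; in real use, `retry_on` would be narrowed to the exception types your HTTP client raises for rate-limit or temporary-unavailability responses.

```python
import random
import time


def with_backoff(call, retries=4, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Run call(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            jitter = random.uniform(0, 0.1 * base_delay)
            sleep(base_delay * 2 ** attempt + jitter)


# Simulated endpoint that rejects the first two requests
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated rate-limit response")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = with_backoff(flaky, retry_on=(ConnectionError,), sleep=delays.append)
print(result, len(delays))
```

Passing `sleep` as a parameter keeps the demonstration instant; with the default `time.sleep`, a call such as `with_backoff(lambda: cached_query(endpoint, query))` would wait between attempts.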
OWL reasoning
RDFLib does not perform inference on its own. Basic RDFS and OWL 2 RL reasoning is available through the separate owlrl package (OWL-RL), which operates on RDFLib graphs:
import owlrl
g = Graph()
g.parse("ontology.ttl", format="turtle")
g.parse("data.ttl", format="turtle")
# Apply OWL 2 RL reasoning: inferred triples are added to g in place
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)
# Now g contains inferred triples:
# If Person subClassOf Agent, and Alice rdf:type Person,
# then Alice rdf:type Agent is inferred
Reasoning is computationally expensive: the closure is computed in memory and every inferred triple is added to the graph. For large graphs, use a dedicated reasoner like HermiT or Pellet, or pre-compute inferences offline.
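To illustrate what "expanding" a graph means, the toy sketch below forward-chains just two RDFS rules (subclass transitivity and type propagation) over plain tuples until nothing new appears. It is not how owlrl is implemented and covers only a tiny fraction of OWL 2 RL; the function name and string constants are assumptions for the example.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def rdfs_subclass_closure(triples):
    """Apply two RDFS rules repeatedly until a fixpoint is reached."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = {(s, o) for s, p, o in inferred if p == SUBCLASS}
        new = set()
        # Rule 1: subClassOf is transitive
        for a, b in subclass_pairs:
            for c, d in subclass_pairs:
                if b == c:
                    new.add((a, SUBCLASS, d))
        # Rule 2: an instance of a subclass is an instance of the superclass
        for s, p, o in inferred:
            if p == TYPE:
                for sub, sup in subclass_pairs:
                    if o == sub:
                        new.add((s, TYPE, sup))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


facts = {
    ("ex:Person", SUBCLASS, "ex:Agent"),
    ("ex:Agent", SUBCLASS, "ex:Thing"),
    ("ex:Alice", TYPE, "ex:Person"),
}
closure = rdfs_subclass_closure(facts)
print(len(closure) - len(facts), "triples inferred")
```

Even with three input facts the closure doubles in size, which hints at why full materialization becomes costly on large graphs.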
SHACL validation
Validate RDF data against shape constraints:
from pyshacl import validate
shapes_graph = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/> .
ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
        sh:path ex:name ;
        sh:minCount 1 ;
        sh:datatype xsd:string ;
    ] ;
    sh:property [
        sh:path ex:age ;
        sh:maxCount 1 ;
        sh:datatype xsd:integer ;
        sh:minInclusive 0 ;
    ] .
"""
conforms, results_graph, results_text = validate(
    g,
    shacl_graph=shapes_graph,
    data_graph_format="turtle",
    shacl_graph_format="turtle",
)
if not conforms:
    print("Validation failures:")
    print(results_text)
Performance optimization
Store backends
RDFLib’s default in-memory store works for graphs under roughly 5 million triples, a rule of thumb that depends on available RAM and query complexity. For larger graphs:
- BerkeleyDB store: Persistent, handles tens of millions of triples.
- Oxigraph: A Rust-based RDF store with a Python binding (pyoxigraph). Significantly faster than RDFLib for SPARQL queries.
- Apache Jena Fuseki: Full-featured triple store with SPARQL endpoint. Use SPARQLWrapper to query from Python.
from pyoxigraph import Store
store = Store("./oxigraph_data")
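# load() signature as in pyoxigraph 0.3; later releases changed its
# parameters, so check the documentation for your installed version
store.load("data.ttl", mime_type="text/turtle")
results = store.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
for solution in results:
    print(solution)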
Query optimization tips
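- Put the most selective pattern first: Simple engines evaluate patterns roughly in the order written, so start with the pattern that matches the fewest triples. Engines with a query optimizer may reorder patterns on their own.
- Avoid SELECT *: Project only the variables you need.
- Use FILTER after patterns: Place each FILTER after the patterns that bind its variables. Its position within a group does not change the results, but filters written early can prevent some optimizers from choosing efficient join orders.
- Prefer VALUES over FILTER IN: VALUES ?x { ex:A ex:B ex:C } supplies inline data that many engines can join against directly, for example as a hash lookup.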
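When the candidate terms come from Python data, the VALUES tip can be applied by generating the clause. The helper below is a hypothetical sketch (the name `values_clause` and its validation rule are assumptions); it rejects strings containing characters that are not allowed inside an IRI reference, so a malformed input cannot break out of the `<...>` delimiters.

```python
import re
from typing import Iterable

# Characters that may not appear inside a SPARQL IRI reference
_SAFE_IRI = re.compile(r'^[^\s<>"{}|\\^`]+$')


def values_clause(variable: str, iris: Iterable[str]) -> str:
    """Build a VALUES clause binding ?variable to each IRI in turn."""
    terms = []
    for iri in iris:
        if not _SAFE_IRI.match(iri):
            raise ValueError(f"not a safe IRI: {iri!r}")
        terms.append(f"<{iri}>")
    return f"VALUES ?{variable} {{ {' '.join(terms)} }}"


clause = values_clause(
    "city", ["http://example.org/Paris", "http://example.org/Rome"]
)
query = f"""
PREFIX ex: <http://example.org/>
SELECT ?city ?pop WHERE {{
    {clause}
    ?city ex:population ?pop .
}}
"""
print(clause)
```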
Integration patterns
RDF to pandas
import pandas as pd
query = """
PREFIX ex: <http://example.org/>
SELECT ?city ?country ?population WHERE {
    ?city ex:country ?country .
    ?city ex:population ?population .
}
"""
results = g.query(query)
df = pd.DataFrame(results.bindings)
# Column names are Variable objects — convert to strings
df.columns = [str(col) for col in df.columns]
RDF to NetworkX
import networkx as nx
G_nx = nx.DiGraph()
for s, p, o in g:
    if isinstance(o, URIRef):  # skip literals for graph structure
        G_nx.add_edge(str(s), str(o), predicate=str(p))
print(f"Nodes: {G_nx.number_of_nodes()}, Edges: {G_nx.number_of_edges()}")
One thing to remember: RDF’s power lies in its universality — any dataset using the same URIs can be merged and queried together without schema mapping. Python’s ecosystem (RDFLib, SPARQLWrapper, pyoxigraph) makes this practical for everything from small research projects to enterprise knowledge bases.