RDF and SPARQL Queries with Python — Deep Dive

Build production RDF pipelines with RDFLib graph operations, SPARQL query optimization, federated queries, and OWL reasoning in Python.

RDFLib graph internals

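RDFLib’s Graph class defaults to an in-memory store that keeps three nested dictionary indexes, keyed subject-first, predicate-first, and object-first. A pattern with at least one bound term starts from a dictionary lookup instead of a scan; only the fully unbound pattern (None, None, None) walks every triple, which is O(n).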

from rdflib import Graph, Namespace, Literal, URIRef, BNode
from rdflib.namespace import RDF, RDFS, OWL, XSD, FOAF

g = Graph()
EX = Namespace("http://example.org/")
g.bind("ex", EX)
g.bind("foaf", FOAF)

Named graphs and datasets

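RDFLib supports named graphs (quads) via Dataset. The older ConjunctiveGraph class offers similar behaviour, but recent RDFLib releases deprecate it in favour of Dataset: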

from rdflib import Dataset

ds = Dataset()
g1 = ds.graph(URIRef("http://example.org/graph1"))
g2 = ds.graph(URIRef("http://example.org/graph2"))

g1.add((EX.Alice, FOAF.knows, EX.Bob))
g2.add((EX.Bob, FOAF.knows, EX.Charlie))

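# Query across all named graphs. ds.triples() alone only searches the default
# graph unless the Dataset was created with default_union=True.
for s, p, o, graph_id in ds.quads((None, FOAF.knows, None, None)):
    print(f"{s} knows {o}")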

Named graphs are essential for provenance tracking — storing which source contributed which triples.

Advanced SPARQL with RDFLib

Property paths

SPARQL 1.1 property paths let you traverse arbitrary-length paths:

query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?distant
WHERE {
    ?person foaf:knows+ ?distant .
}
"""
# foaf:knows+ matches one or more hops
# foaf:knows* matches zero or more hops (so every node also matches itself)
# foaf:knows{2,4} is NOT standard SPARQL 1.1; spell out fixed lengths instead,
# e.g. foaf:knows/foaf:knows for exactly two hops

Subqueries and aggregation

query = """
PREFIX ex: <http://example.org/>
SELECT ?country (COUNT(?city) AS ?cityCount) (AVG(?pop) AS ?avgPop)
WHERE {
    ?city ex:country ?country .
    ?city ex:population ?pop .
}
GROUP BY ?country
HAVING (COUNT(?city) > 5)
ORDER BY DESC(?cityCount)
"""
results = g.query(query)
for row in results:
    print(f"{row.country}: {row.cityCount} cities, avg pop {float(row.avgPop):.0f}")

CONSTRUCT for graph transformation

transform_query = """
PREFIX ex: <http://example.org/>
PREFIX schema: <http://schema.org/>

CONSTRUCT {
    ?city schema:containedInPlace ?country .
    ?city schema:population ?pop .
}
WHERE {
    ?city ex:country ?country .
    ?city ex:population ?pop .
}
"""
new_graph = g.query(transform_query).graph
new_graph.serialize("schema_org_output.ttl", format="turtle")

Querying remote endpoints efficiently

Pagination

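Large result sets from public endpoints usually need pagination. The base query should include an ORDER BY clause, because without a stable order LIMIT/OFFSET pages can overlap or skip rows: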

from SPARQLWrapper import SPARQLWrapper, JSON

def paginated_query(endpoint: str, base_query: str, page_size: int = 1000):
    sparql = SPARQLWrapper(endpoint)
    sparql.setReturnFormat(JSON)
    offset = 0
    all_results = []

    while True:
        query = f"{base_query} LIMIT {page_size} OFFSET {offset}"
        sparql.setQuery(query)
        response = sparql.query().convert()
        bindings = response["results"]["bindings"]

        if not bindings:
            break

        all_results.extend(bindings)
        offset += page_size

        if len(bindings) < page_size:
            break

    return all_results

Federated queries

SPARQL 1.1 supports cross-endpoint queries with SERVICE:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
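PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?item ?dbpediaAbstract WHERE {
    ?item wdt:P31 wd:Q515 .  # cities from Wikidata
    ?item wdt:P1566 ?geonamesId .

    SERVICE <http://dbpedia.org/sparql> {
        # Join on owl:sameAs; without a shared variable the two
        # sides would form a cross product
        ?dbpItem owl:sameAs ?item .
        ?dbpItem dbo:abstract ?dbpediaAbstract .
        FILTER(LANG(?dbpediaAbstract) = "en")
    }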
}
LIMIT 10

Federated queries are powerful but slow — each SERVICE call is a separate HTTP request. Cache results locally when possible.

Rate limiting and caching

Public endpoints enforce rate limits. Implement polite querying:

import time
import hashlib
import json
from pathlib import Path

from SPARQLWrapper import SPARQLWrapper, JSON

CACHE_DIR = Path("sparql_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_query(endpoint: str, query: str, cache_hours: int = 24):
    cache_key = hashlib.md5(f"{endpoint}:{query}".encode()).hexdigest()
    cache_file = CACHE_DIR / f"{cache_key}.json"

    if cache_file.exists():
        age_hours = (time.time() - cache_file.stat().st_mtime) / 3600
        if age_hours < cache_hours:
            return json.loads(cache_file.read_text())

    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    sparql.addCustomHttpHeader("User-Agent", "MyApp/1.0 (contact@example.com)")

    result = sparql.query().convert()
    cache_file.write_text(json.dumps(result))
    time.sleep(1)  # polite delay

    return result
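
Caching avoids repeat requests, but a polite client also has to cope with being throttled. The sketch below retries a failing call with exponential backoff. The with_backoff helper and its parameters are hypothetical names for illustration, and catching every Exception is a simplification: in practice, narrow it to the HTTP errors your client actually raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_backoff(
    run_query: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call run_query, waiting base_delay * 2**attempt between failed attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return run_query()
        except Exception:  # assumption: narrow this to throttling/HTTP errors
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")

# Demonstration with a fake endpoint that fails twice, then succeeds.
# Delays are recorded instead of slept so the demo runs instantly.
attempts = {"count": 0}

def flaky_endpoint() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"

recorded_delays = []
outcome = with_backoff(flaky_endpoint, sleep=recorded_delays.append)
print(outcome, recorded_delays)
```

In real use, run_query could be a lambda wrapping a call to cached_query, so cache hits return immediately and only live requests are retried.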

OWL reasoning

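RDFLib itself does not ship a reasoner. RDFS and OWL 2 RL inference is available through the separate owlrl package (pip install owlrl), which operates directly on RDFLib graphs: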

import owlrl

g = Graph()
g.parse("ontology.ttl", format="turtle")
g.parse("data.ttl", format="turtle")

# Apply the OWL 2 RL rules; inferred triples are added to g in place.
# Use owlrl.RDFS_OWLRL_Semantics for the combined RDFS + OWL 2 RL closure.
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)

# Now g contains inferred triples:
# If Person subClassOf Agent, and Alice rdf:type Person,
# then Alice rdf:type Agent is inferred

Reasoning is computationally expensive. For large graphs, use a dedicated reasoner like HermiT or Pellet, or pre-compute inferences offline.

SHACL validation

Validate RDF data against shape constraints:

from pyshacl import validate

shapes_graph = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/> .

ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
        sh:path ex:name ;
        sh:minCount 1 ;
        sh:datatype xsd:string ;
    ] ;
    sh:property [
        sh:path ex:age ;
        sh:maxCount 1 ;
        sh:datatype xsd:integer ;
        sh:minInclusive 0 ;
    ] .
"""

conforms, results_graph, results_text = validate(
    g,
    shacl_graph=shapes_graph,
    data_graph_format="turtle",
    shacl_graph_format="turtle",
)

if not conforms:
    print("Validation failures:")
    print(results_text)

Performance optimization

Store backends

RDFLib’s default in-memory store works well for graphs up to roughly 5 million triples, depending on available memory and query complexity. For larger graphs:

BerkeleyDB store — Persistent, handles tens of millions of triples.
Oxigraph — A Rust-based RDF store with a Python binding (pyoxigraph). Significantly faster than RDFLib for SPARQL queries.
Apache Jena Fuseki — Full-featured triple store with SPARQL endpoint. Use SPARQLWrapper to query from Python.

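from pyoxigraph import RdfFormat, Store

store = Store("./oxigraph_data")  # on-disk store; Store() with no path is in-memory
# pyoxigraph 0.4+ API. Earlier releases used:
#   store.load("data.ttl", mime_type="text/turtle")
store.load(path="data.ttl", format=RdfFormat.TURTLE)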

results = store.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
for solution in results:
    print(solution)

Query optimization tips

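Put the most selective pattern first: engines without a cost-based optimizer evaluate triple patterns roughly in the order written, so start with the pattern that matches the fewest triples.
Avoid SELECT *: project only the variables you need.
Write FILTER after the patterns that bind its variables: SPARQL applies a FILTER to its whole group wherever it appears, but some engines plan better, and the query reads more clearly, when filters follow the patterns they constrain.
Prefer VALUES over FILTER IN: VALUES ?x { ex:A ex:B ex:C } binds the candidates up front, which many engines can use to drive index lookups instead of testing every solution afterwards.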

Integration patterns

RDF to pandas

import pandas as pd

query = """
PREFIX ex: <http://example.org/>
SELECT ?city ?country ?population WHERE {
    ?city ex:country ?country .
    ?city ex:population ?population .
}
"""
results = g.query(query)
# Convert RDFLib terms to plain Python values so pandas can infer dtypes
df = pd.DataFrame(
    [[None if v is None else v.toPython() for v in row] for row in results],
    columns=[str(var) for var in results.vars],  # Variable objects -> plain strings
)

RDF to NetworkX

import networkx as nx

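# DiGraph keeps one edge per (s, o) pair; use nx.MultiDiGraph to keep parallel predicates
G_nx = nx.DiGraph()
for s, p, o in g:
    if isinstance(o, URIRef):  # keep IRI objects only; literals and blank nodes are skipped
        G_nx.add_edge(str(s), str(o), predicate=str(p))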

print(f"Nodes: {G_nx.number_of_nodes()}, Edges: {G_nx.number_of_edges()}")

One thing to remember: RDF’s power lies in its universality — any dataset using the same URIs can be merged and queried together without schema mapping. Python’s ecosystem (RDFLib, SPARQLWrapper, pyoxigraph) makes this practical for everything from small research projects to enterprise knowledge bases.
