Knowledge Graph Construction with Python — Core Concepts

Understand how to extract entities and relationships from text, model them as triples, and build queryable knowledge graphs using Python tools.

What is a knowledge graph

A knowledge graph is a structured representation of facts as a network of entities (nodes) and relationships (edges). Unlike a table, where each row is independent, a knowledge graph captures how facts relate to each other — enabling multi-hop reasoning that tables struggle with.

The fundamental unit is a triple: (subject, predicate, object). “Albert Einstein” → “born_in” → “Ulm” is one triple. Millions of triples form a graph where you can traverse from Einstein to Ulm to Germany to Europe in a chain of connected facts.
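As a rough illustration of the traversal described above, the following sketch stores triples as plain Python tuples and walks from one entity to the next. Only the Einstein/Ulm triple comes from the text; the `located_in` and `part_of` predicates and the `follow_chain` helper are assumptions made for this example.

```python
from collections import defaultdict

# Each fact is a (subject, predicate, object) triple.
# "located_in" and "part_of" are illustrative predicate names.
triples = [
    ("Albert Einstein", "born_in", "Ulm"),
    ("Ulm", "located_in", "Germany"),
    ("Germany", "part_of", "Europe"),
]

# Index outgoing edges by subject so each hop is a dictionary lookup.
outgoing = defaultdict(list)
for subject, predicate, obj in triples:
    outgoing[subject].append((predicate, obj))


def follow_chain(start, max_hops=5):
    """Follow the first outgoing edge from each node, up to max_hops."""
    path = [start]
    node = start
    for _ in range(max_hops):
        edges = outgoing.get(node)
        if not edges:
            break
        _, node = edges[0]
        path.append(node)
    return path


chain = follow_chain("Albert Einstein")
print(" -> ".join(chain))
```

A real graph has many edges per node, so a production traversal would choose which predicates to follow rather than always taking the first edge.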

The construction pipeline

Building a knowledge graph from scratch involves four stages:

1. Data collection

Sources include structured data (databases, CSV files, APIs), semi-structured data (JSON, XML, HTML tables), and unstructured data (plain text, PDFs). Python libraries like requests, BeautifulSoup, and pandas handle ingestion. For large-scale web scraping, Scrapy provides structured crawling.
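For structured sources, ingestion can be as simple as mapping columns to predicates. The sketch below uses only the standard library's csv module rather than pandas, to stay dependency-free. The CSV layout, the column names, and the `COLUMN_TO_PREDICATE` mapping are all hypothetical.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from a file or an API.
raw_csv = """name,birthplace,employer
Albert Einstein,Ulm,
Marie Curie,,University of Paris
"""

# Hypothetical mapping from column name to predicate name.
COLUMN_TO_PREDICATE = {"birthplace": "born_in", "employer": "worked_at"}


def rows_to_triples(csv_text):
    """Turn each non-empty cell into a (subject, predicate, object) triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = row["name"].strip()
        for column, predicate in COLUMN_TO_PREDICATE.items():
            value = (row.get(column) or "").strip()
            if value:  # skip empty cells instead of emitting blank objects
                triples.append((subject, predicate, value))
    return triples


csv_triples = rows_to_triples(raw_csv)
for triple in csv_triples:
    print(triple)
```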

2. Entity extraction

You need to identify named entities (people, organizations, locations, dates) in raw text. spaCy’s named entity recognition (NER) pipeline handles this well for common entity types. For domain-specific entities (gene names, chemical compounds), you typically fine-tune a model or use rule-based matching with spaCy’s Matcher.
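To show the shape of the output without requiring a trained model, here is a dictionary-based (gazetteer) matcher written with the standard library. It is a simplified stand-in for rule-based matching, not spaCy's API. The `GAZETTEER` contents and the label names are assumptions for this example.

```python
import re

# Hypothetical gazetteer: surface form -> entity label.
GAZETTEER = {
    "Marie Curie": "PERSON",
    "Albert Einstein": "PERSON",
    "University of Paris": "ORG",
    "Ulm": "LOC",
}

# Longest names first so a longer name wins over any shorter overlapping one.
_pattern = re.compile(
    "|".join(
        r"\b" + re.escape(name) + r"\b"
        for name in sorted(GAZETTEER, key=len, reverse=True)
    )
)


def extract_entities(text):
    """Return (surface_text, label, start, end) for every gazetteer hit."""
    return [
        (m.group(0), GAZETTEER[m.group(0)], m.start(), m.end())
        for m in _pattern.finditer(text)
    ]


sentence = "Marie Curie worked at the University of Paris."
entities = extract_entities(sentence)
for surface, label, start, end in entities:
    print(surface, label, start, end)
```

A statistical NER model generalizes to names it has never seen; a gazetteer only finds what is listed, which is why it suits closed, domain-specific vocabularies.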

3. Relation extraction

Once you have entities, you need the connections between them. Given the sentence “Marie Curie worked at the University of Paris,” the system should extract (Marie Curie, worked_at, University of Paris).

Approaches range from rule-based pattern matching (fast but brittle) to transformer-based models (more accurate but more expensive to train and run). Libraries such as OpenNRE, and pretrained relation extraction models from the Hugging Face Hub, automate much of this step.
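The rule-based end of that range can be sketched with regular expressions. The patterns, the `NAME` heuristic (runs of capitalized words, optionally joined by "of"), and the predicate names below are assumptions chosen to fit the example sentence, not a general-purpose extractor.

```python
import re

# Heuristic for a name: capitalized words, optionally joined by "of".
NAME = r"[A-Z]\w*(?:\s+(?:of\s+)?[A-Z]\w*)*"

# Hypothetical patterns: each maps a surface phrase to a predicate name.
PATTERNS = [
    (re.compile(rf"(?P<subj>{NAME}) worked at (?:the )?(?P<obj>{NAME})"), "worked_at"),
    (re.compile(rf"(?P<subj>{NAME}) was born in (?P<obj>{NAME})"), "born_in"),
]


def extract_relations(sentence):
    """Return a (subject, predicate, object) triple for every pattern match."""
    triples = []
    for pattern, predicate in PATTERNS:
        for m in pattern.finditer(sentence):
            triples.append((m.group("subj"), predicate, m.group("obj")))
    return triples


relations = extract_relations("Marie Curie worked at the University of Paris.")
print(relations)

# A paraphrase with the same meaning yields nothing: the brittleness in action.
missed = extract_relations("Marie Curie was employed by the University of Paris.")
print(missed)
```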

4. Graph assembly and deduplication

Raw extraction produces duplicates and inconsistencies. “NYC,” “New York City,” and (when the context means the city) “New York” should all point to the same entity. Two related techniques handle this: entity resolution merges mentions that refer to the same thing, and entity linking maps each mention to a canonical identifier in a reference knowledge base. Wikipedia page IDs and Wikidata QIDs are common targets.
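A minimal sketch of this step, assuming a hand-maintained alias table: mentions are normalized, looked up, and replaced with a canonical identifier, after which exact duplicates collapse. The `ex:` identifiers are placeholders rather than real Wikidata QIDs, and "Acme Corp" and `headquartered_in` are made up for the example.

```python
# Hypothetical alias table: normalized mention -> canonical identifier.
# The "ex:" identifiers are placeholders, not real Wikidata QIDs.
ALIASES = {
    "nyc": "ex:new_york_city",
    "new york city": "ex:new_york_city",
    "new york": "ex:new_york_city",
}


def normalize(mention):
    """Lowercase, strip edge punctuation, and collapse whitespace."""
    return " ".join(mention.lower().strip(" .,;:").split())


def resolve(mention):
    """Return the canonical ID for a mention, or None if it is unknown."""
    return ALIASES.get(normalize(mention))


def canonicalize(triples):
    """Rewrite subjects and objects to canonical IDs and drop exact duplicates."""
    seen = set()
    result = []
    for s, p, o in triples:
        triple = (resolve(s) or s, p, resolve(o) or o)
        if triple not in seen:
            seen.add(triple)
            result.append(triple)
    return result


raw = [
    ("Acme Corp", "headquartered_in", "NYC"),
    ("Acme Corp", "headquartered_in", "New York City"),
    ("Acme Corp", "headquartered_in", "New York"),
]
deduplicated = canonicalize(raw)
print(deduplicated)
```

A lookup table cannot tell the city from the state; real systems typically use surrounding context or a trained linker to disambiguate.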

Storage options

Knowledge graphs need a database that handles nodes and edges efficiently:

RDF triple stores (Jena, Blazegraph) — Store triples in (subject, predicate, object) format, queryable with SPARQL.
Property graph databases (Neo4j, Memgraph) — Nodes and edges carry arbitrary key-value properties, queryable with Cypher or GQL.
In-memory with RDFLib — For small graphs (under a million triples), Python’s rdflib library keeps everything in memory.
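Whatever the backend, the core query operation is matching a triple pattern in which some positions are left open. The pure-Python sketch below illustrates that idea with `None` as a wildcard. It is a toy stand-in for a query language such as SPARQL or Cypher, not an example of any of the libraries listed above, and the `match` helper is hypothetical.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


store = [
    ("Albert Einstein", "born_in", "Ulm"),
    ("Marie Curie", "worked_at", "University of Paris"),
]

# "Who was born where?" leaves subject and object open.
born = match(store, predicate="born_in")
print(born)

# "What do we know about Marie Curie?" fixes only the subject.
about_curie = match(store, subject="Marie Curie")
print(about_curie)
```

This linear scan is fine for a few thousand triples; dedicated stores add indexes so that pattern lookups stay fast as the graph grows.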

Quality matters

A knowledge graph is only as useful as its accuracy. Common quality checks:

Coverage — Does the graph contain the facts your application needs?
Accuracy — Are the extracted triples correct? A 90% extraction accuracy means 10% of your graph is wrong.
Freshness — Real-world facts change. People move, companies merge, papers get retracted.
Consistency — Contradictory triples (“Einstein born in 1879” and “Einstein born in 1880”) must be resolved.
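The consistency check lends itself to automation. One possible approach, sketched below, declares which predicates should have a single value per subject and flags any subject that has more than one. The `birth_year` predicate and the `FUNCTIONAL_PREDICATES` set are assumptions for this example.

```python
from collections import defaultdict

# Hypothetical set of predicates that should have one value per subject.
FUNCTIONAL_PREDICATES = {"birth_year"}


def find_conflicts(triples):
    """Map (subject, predicate) to its values wherever more than one exists."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p in FUNCTIONAL_PREDICATES:
            values[(s, p)].add(o)
    return {key: sorted(objs) for key, objs in values.items() if len(objs) > 1}


facts = [
    ("Einstein", "birth_year", "1879"),
    ("Einstein", "birth_year", "1880"),
    ("Einstein", "born_in", "Ulm"),
]
conflicts = find_conflicts(facts)
print(conflicts)
```

Detecting the conflict is the easy half; deciding which value to keep usually depends on source reliability or manual review.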

Common misconception

“You need millions of triples to make a knowledge graph useful.” Small, domain-specific graphs with a few thousand carefully curated triples can power focused applications — like an internal company knowledge base or a medical decision support tool — more effectively than massive but noisy general-purpose graphs.

One thing to remember: Knowledge graph construction is a pipeline (collect data, extract entities, extract relations, assemble and deduplicate), and the hardest part isn’t building the graph but keeping it accurate and up to date.
