Python IPFS Integration — Deep Dive
IPFS architecture for Python developers
IPFS is not a single technology — it’s a stack of protocols:
- libp2p: Peer-to-peer networking (discovery, connections, streams)
- Bitswap: Block exchange protocol (requesting and sending data blocks)
- IPLD (InterPlanetary Linked Data): Data model for content-addressed structures
- UnixFS: File system abstraction on top of IPLD (how files and directories are represented)
Python interacts with this stack through three interfaces: the HTTP API (most common), the gateway API (read-only), and native libp2p bindings (advanced).
Understanding the Merkle DAG
Files on IPFS aren’t stored as flat blobs. They’re split into chunks and organized into a Merkle DAG (Directed Acyclic Graph):
Root CID (QmRoot...)
├── Chunk 0 (QmA...) - 256KB
├── Chunk 1 (QmB...) - 256KB
├── Chunk 2 (QmC...) - 256KB
└── Chunk 3 (QmD...) - 128KB (last chunk, smaller)
The root CID is computed from the child CIDs, which are computed from the data. This means:
- Identical chunks across different files are stored only once (deduplication).
- You can verify any chunk independently.
- Large files can be downloaded from multiple peers in parallel.
Python can inspect this structure:
import ipfshttpclient

client = ipfshttpclient.connect()  # connects to the local daemon's HTTP API

def inspect_dag(cid, depth=0):
    """Walk the Merkle DAG of an IPFS object."""
    obj = client.object.stat(cid)
    links = client.object.links(cid)
    indent = "  " * depth  # two spaces per level of the DAG
    print(f"{indent}CID: {cid}")
    print(f"{indent}Size: {obj['CumulativeSize']} bytes")
    print(f"{indent}Links: {obj['NumLinks']}")
    # Recurse into each child block
    for link in links.get("Links", []):
        inspect_dag(link["Hash"], depth + 1)
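The deduplication and independent-verification properties listed earlier can be illustrated without a running node. The following is a minimal sketch, not the real UnixFS or CID encoding: it assumes a plain hex SHA-256 digest as a stand-in for a CID, and the names `chunk_id`, `build_dag`, and `verify_chunk` are hypothetical helpers introduced only for this illustration.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # mirrors the 256KB chunk size described above


def chunk_id(chunk: bytes) -> str:
    """Stand-in for a CID: the hex SHA-256 digest of the chunk (assumption)."""
    return hashlib.sha256(chunk).hexdigest()


def build_dag(data: bytes, store: dict) -> tuple:
    """Split data into chunks, store each once, and return (root_id, child_ids)."""
    child_ids = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        cid = chunk_id(chunk)
        store.setdefault(cid, chunk)  # an identical chunk is stored only once
        child_ids.append(cid)
    # The root identifier depends only on the ordered child identifiers.
    root_id = hashlib.sha256("".join(child_ids).encode()).hexdigest()
    return root_id, child_ids


def verify_chunk(cid: str, chunk: bytes) -> bool:
    """Check one chunk against its identifier, independently of the others."""
    return chunk_id(chunk) == cid
```

Adding two files that share their first 256KB to the same `store` leaves three entries rather than four, and the two root identifiers differ because the last child differs.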
Custom chunking strategies
IPFS defaults to fixed-size 256KB chunks. For data that changes incrementally, content-defined chunking can improve deduplication. A simplified chunker illustrates the idea:
import hashlib

def rabin_chunk(data: bytes, min_size=8192, max_size=262144, target_size=65536):
    """Content-defined chunking in the style of a Rabin chunker.

    Splits at content boundaries rather than fixed offsets,
    so inserting data in the middle only affects nearby chunks.

    Note: for clarity this hashes each window with SHA-256 rather than
    updating a true rolling Rabin fingerprint, so it is slow on large inputs.
    """
    chunks = []
    offset = 0
    window = 48
    # target_size must be a power of two; the default targets ~64KB chunks
    mask = target_size - 1
    while offset < len(data):
        chunk_end = min(offset + min_size, len(data))
        if chunk_end >= len(data):
            chunks.append(data[offset:])
            break
        limit = min(offset + max_size, len(data))
        # Slide the window forward looking for a boundary
        while chunk_end < limit:
            window_data = data[max(chunk_end - window, 0):chunk_end]
            h = hashlib.sha256(window_data).digest()
            fingerprint = int.from_bytes(h[:4], "big")
            if fingerprint & mask == 0:
                break
            chunk_end += 1
        chunks.append(data[offset:chunk_end])
        offset = chunk_end
    return chunks
Content-defined chunking is valuable when files change incrementally (logs, databases) — only modified chunks get new CIDs, saving storage and bandwidth.
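To make the claim concrete, the sketch below compares fixed-size chunking with a content-defined chunker after a few bytes are inserted near the start of a buffer. It is an illustration under stated assumptions: it uses a simple gear-style rolling hash (not a Rabin fingerprint), deliberately small chunk sizes so the demo runs quickly, and hypothetical helper names (`fixed_chunks`, `gear_chunks`, `shared_count`).

```python
import random

_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # fixed random value per byte


def fixed_chunks(data: bytes, size: int = 1024) -> list:
    """Split at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def gear_chunks(data: bytes, min_size=256, max_size=4096, bits=10) -> list:
    """Cut where the top `bits` bits of a rolling gear hash are all zero."""
    chunks, start, h = [], 0, 0
    for pos, byte in enumerate(data):
        # Each shift pushes older bytes out, so h reflects only the last 32 bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = pos + 1 - start
        at_boundary = length >= min_size and (h >> (32 - bits)) == 0
        if at_boundary or length >= max_size:
            chunks.append(data[start:pos + 1])
            start = pos + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_count(old_chunks: list, new_chunks: list) -> int:
    """Number of chunks in new_chunks whose bytes already exist in old_chunks."""
    known = set(old_chunks)
    return sum(1 for chunk in new_chunks if chunk in known)
```

With random input, an insertion shifts every later fixed-size chunk, so none of them match the old ones, while the content-defined boundaries realign shortly after the edit and most later chunks are reused.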
IPNS: mutable pointers
Since CIDs change when content changes, IPNS provides stable names that can be updated to point to different CIDs:
def publish_to_ipns(client, cid, key_name="self"):
    """Publish a CID to an IPNS name."""
    result = client.name.publish(cid, key=key_name)
    return result["Name"]  # IPNS address (peer ID or key-based)

def resolve_ipns(client, ipns_name):
    """Resolve an IPNS name to its current CID."""
    result = client.name.resolve(ipns_name)
    return result["Path"]  # /ipfs/QmCurrentCID...

# Create a dedicated key for a project
client.key.gen("my-website", type="ed25519")
# new_site_cid holds the root CID of the latest site build
ipns_addr = publish_to_ipns(client, new_site_cid, "my-website")
# ipns_addr stays the same even as new_site_cid changes with updates
IPNS resolution can be slow (10-30 seconds) because it requires DHT lookups. For faster resolution, use DNSLink: a DNS TXT record on the _dnslink subdomain that maps a domain name to a content path such as /ipfs/<CID>.
# DNS TXT record for example.com
_dnslink.example.com TXT "dnslink=/ipfs/QmNewContent..."
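Once the TXT records for a `_dnslink` name have been fetched (with any DNS library), extracting the content path is a small parsing step. The sketch below covers only that step. The function names are hypothetical, and restricting results to the `/ipfs/` and `/ipns/` namespaces is an assumption made for this example.

```python
from typing import Optional


def parse_dnslink(txt_value: str) -> Optional[str]:
    """Return the content path from a DNSLink TXT value, or None if it is not one."""
    value = txt_value.strip().strip('"')
    prefix = "dnslink="
    if not value.startswith(prefix):
        return None
    path = value[len(prefix):]
    # Accept only /ipfs/<cid> or /ipns/<name> with a non-empty identifier.
    if path.startswith(("/ipfs/", "/ipns/")) and len(path) > 6:
        return path
    return None


def first_dnslink(txt_records: list) -> Optional[str]:
    """Pick the first valid DNSLink path from a domain's TXT records."""
    for record in txt_records:
        path = parse_dnslink(record)
        if path is not None:
            return path
    return None
```

A domain often carries unrelated TXT records (SPF, site verification), which is why `first_dnslink` skips anything that does not start with `dnslink=`.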
Building a content-addressed application backend
A Python web application backed by IPFS for immutable data storage:
import uuid

import ipfshttpclient
from fastapi import FastAPI, HTTPException, Response, UploadFile

app = FastAPI()
ipfs = ipfshttpclient.connect()

# Metadata database (maps application IDs to IPFS CIDs)
metadata_store = {}

@app.post("/documents")
async def upload_document(file: UploadFile):
    content = await file.read()
    # Store content on IPFS; add_bytes returns the CID as a string
    cid = ipfs.add_bytes(content)
    # Store metadata locally (or in a traditional DB)
    doc_id = uuid.uuid4().hex
    metadata_store[doc_id] = {
        "cid": cid,
        "filename": file.filename,
        "size": len(content),
        "content_type": file.content_type,
    }
    return {"document_id": doc_id, "cid": cid}

@app.get("/documents/{doc_id}")
async def get_document(doc_id: str):
    meta = metadata_store.get(doc_id)
    if not meta:
        raise HTTPException(status_code=404, detail="Document not found")
    content = ipfs.cat(meta["cid"])
    return Response(
        content=content,
        media_type=meta["content_type"],
        headers={"X-IPFS-CID": meta["cid"]},
    )
This pattern uses IPFS as an immutable content store while keeping mutable metadata (ownership, tags, access control) in a traditional database.
Filecoin integration for archival storage
Pinning keeps data available only as long as the pinning node stays online. Filecoin adds economic guarantees, because storage providers are penalized for losing data:
from lighthouseweb3 import Lighthouse

# Lighthouse provides Python-friendly Filecoin storage.
# Method names and response keys follow the original example;
# confirm them against the SDK version you have installed.
lh = Lighthouse(token="YOUR_API_KEY")

def archive_to_filecoin(file_path):
    """Upload to IPFS + create Filecoin storage deal."""
    result = lh.upload(file_path)
    return {
        "cid": result["Hash"],
        "size": result["Size"],
        "deal_status": "pending",
    }

def check_deal_status(cid):
    """Verify Filecoin deal is active."""
    status = lh.get_deal_status(cid)
    return {
        "deal_id": status.get("dealId"),
        "miner": status.get("miner"),
        "active": status.get("dealStatus") == "active",
        "expiry": status.get("expiry"),
    }
Filecoin deals typically last 6-18 months and can be renewed. For critical data, create deals with multiple storage providers for redundancy.
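Renewal needs a way to tell when a deal is about to lapse. On Filecoin mainnet, deal end times are expressed as chain epochs of 30 seconds each, counted from the mainnet genesis on 2020-08-24 22:00 UTC. The sketch below converts an end epoch to a timestamp and flags deals inside a renewal margin. It assumes your storage service reports expiry as a mainnet epoch, which may not hold for every API, and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Filecoin mainnet genesis time and epoch length
FILECOIN_GENESIS = datetime(2020, 8, 24, 22, 0, tzinfo=timezone.utc)
EPOCH_SECONDS = 30


def epoch_to_datetime(epoch: int) -> datetime:
    """Convert a mainnet chain epoch to a UTC timestamp."""
    return FILECOIN_GENESIS + timedelta(seconds=epoch * EPOCH_SECONDS)


def needs_renewal(end_epoch: int, now: datetime, margin_days: int = 30) -> bool:
    """True if the deal ends within margin_days of now (or has already ended)."""
    return epoch_to_datetime(end_epoch) - now <= timedelta(days=margin_days)
```

Passing `now` explicitly keeps the check deterministic in tests; in production it would typically be `datetime.now(timezone.utc)`.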
Performance optimization
Parallel uploads
Large collections benefit from concurrent uploads:
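import asyncio
from pathlib import Path

import aiohttp

async def upload_batch(files, api_key, max_concurrent=10):
    semaphore = asyncio.Semaphore(max_concurrent)  # caps in-flight uploads
    headers = {"Authorization": f"Bearer {api_key}"}
    results = {}

    async def upload_one(session, file_path):
        file_path = Path(file_path)
        async with semaphore:
            # Close the file handle once the request completes
            with open(file_path, "rb") as fh:
                data = aiohttp.FormData()
                data.add_field("file", fh, filename=file_path.name)
                # Endpoint from the original example; this legacy URL may no
                # longer be served, so substitute your service's upload URL.
                async with session.post(
                    "https://api.web3.storage/upload",
                    data=data,
                ) as resp:
                    resp.raise_for_status()
                    result = await resp.json()
        results[file_path.name] = result["cid"]

    # Share one session so connections are reused across uploads
    async with aiohttp.ClientSession(headers=headers) as session:
        await asyncio.gather(*(upload_one(session, f) for f in files))
    return results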
Caching gateway responses
IPFS content is immutable — once fetched, it never changes. Aggressive caching is safe:
from functools import lru_cache

import requests

@lru_cache(maxsize=10000)
def fetch_ipfs_cached(cid: str) -> bytes:
    """Fetch IPFS content and cache it in memory for the life of the process.

    Safe because CIDs are content-addressed: the same CID always maps to the same data.
    """
    # Public gateways come and go; check that these hosts are still served.
    gateways = [
        "https://ipfs.io",
        "https://cloudflare-ipfs.com",
        "https://gateway.pinata.cloud",
    ]
    for gw in gateways:
        try:
            resp = requests.get(f"{gw}/ipfs/{cid}", timeout=30)
            if resp.status_code == 200:
                return resp.content
        except requests.RequestException:
            continue  # try the next gateway
    # Failures are not cached, so a later call retries every gateway.
    raise RuntimeError(f"Failed to fetch {cid} from all gateways")
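An `lru_cache` lives in memory and is lost when the process restarts. Because content behind a CID never changes, the same idea extends naturally to disk. The following is a minimal sketch of such a cache; the `DiskCache` class and its `fetch` callback are hypothetical names, and the callback can be any function that returns bytes for a CID, such as a gateway fetcher.

```python
import hashlib
from pathlib import Path
from typing import Callable


class DiskCache:
    """Store fetched content on disk, keyed by CID."""

    def __init__(self, directory, fetch: Callable[[str], bytes]):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch

    def _path_for(self, cid: str) -> Path:
        # Hash the CID so the file name is always filesystem-safe.
        return self.directory / hashlib.sha256(cid.encode()).hexdigest()

    def get(self, cid: str) -> bytes:
        path = self._path_for(cid)
        if path.exists():
            return path.read_bytes()
        content = self.fetch(cid)
        # Write to a temporary file first so a crash never leaves a partial entry.
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(content)
        tmp.replace(path)
        return content
```

No expiry logic is needed for `/ipfs/` paths. Content resolved through IPNS or DNSLink is mutable and should not be cached this way.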
CAR file format for bulk operations
CAR (Content Addressable aRchive) files bundle multiple IPFS blocks into a single file for efficient transport:
import subprocess

# Requires the ipfs-car command-line tool to be installed and on PATH
def create_car_archive(files_dir, output_path):
    """Create a CAR file from a directory for batch upload."""
    result = subprocess.run(
        ["ipfs-car", "pack", str(files_dir), "--output", str(output_path)],
        capture_output=True, text=True, check=True,
    )
    # The root CID is taken from the last line of stdout
    root_cid = result.stdout.strip().split("\n")[-1]
    return root_cid
Web3.Storage and other services accept CAR uploads, which is faster than uploading files individually because the DAG structure is pre-computed.
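Part of what makes CAR files cheap to stream is their framing: a CARv1 file is a sequence of sections, each prefixed with its length as an unsigned varint (the first section is the header, and each later one holds a CID followed by block data). The sketch below implements only that length-prefix framing. It does not decode the DAG-CBOR header or the CIDs, and the helper names are hypothetical.

```python
def encode_uvarint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    if value < 0:
        raise ValueError("uvarint cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_uvarint(buf: bytes, pos: int = 0) -> tuple:
    """Return (value, next_position) for the varint starting at pos."""
    value, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def frame_sections(sections: list) -> bytes:
    """Concatenate sections, each preceded by its varint-encoded length."""
    return b"".join(encode_uvarint(len(s)) + s for s in sections)


def split_sections(buf: bytes) -> list:
    """Inverse of frame_sections: walk the buffer one length prefix at a time."""
    sections, pos = [], 0
    while pos < len(buf):
        length, pos = decode_uvarint(buf, pos)
        if pos + length > len(buf):
            raise ValueError("truncated section")
        sections.append(buf[pos:pos + length])
        pos += length
    return sections
```

Because each section announces its own length, a reader can process a CAR stream block by block without loading the whole archive into memory.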
One thing to remember
Production IPFS integration with Python means understanding the Merkle DAG structure for efficient data management, using pinning services and Filecoin for durability guarantees, implementing gateway caching for performance, and treating IPFS as an immutable content layer complemented by a mutable metadata layer in traditional storage.