Python IPFS Integration — Deep Dive

Production IPFS patterns: chunking strategies, DAG manipulation, IPNS publishing, Filecoin archival, and building Python apps on content-addressed storage.

IPFS architecture for Python developers

IPFS is not a single technology — it’s a stack of protocols:

libp2p: Peer-to-peer networking (discovery, connections, streams)
Bitswap: Block exchange protocol (requesting and sending data blocks)
IPLD (InterPlanetary Linked Data): Data model for content-addressed structures
UnixFS: File system abstraction on top of IPLD (how files and directories are represented)

Python interacts with this stack through three interfaces: the HTTP API (most common), the gateway API (read-only), and native libp2p bindings (advanced).
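As a rough illustration of the first two interfaces, the sketch below builds (but does not send) one request for each, using only the standard library. It assumes a local Kubo node on its default ports (5001 for the RPC API, 8080 for the gateway); the helper names `rpc_cat_request` and `gateway_request` are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

RPC_BASE = "http://127.0.0.1:5001"      # assumption: default Kubo RPC address
GATEWAY_BASE = "http://127.0.0.1:8080"  # assumption: default local gateway address


def rpc_cat_request(cid: str) -> Request:
    """HTTP API: commands live under /api/v0 and are sent as POST."""
    query = urlencode({"arg": cid})
    return Request(f"{RPC_BASE}/api/v0/cat?{query}", method="POST")


def gateway_request(cid: str) -> Request:
    """Gateway: a plain read-only GET on /ipfs/<cid>."""
    return Request(f"{GATEWAY_BASE}/ipfs/{cid}", method="GET")


if __name__ == "__main__":
    cid = "QmExampleCid"  # placeholder, not a real CID
    for req in (rpc_cat_request(cid), gateway_request(cid)):
        print(req.get_method(), req.full_url)
```

Libraries such as ipfshttpclient wrap the first kind of request; the second works from any HTTP client, including a browser.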

Understanding the Merkle DAG

Files on IPFS aren’t stored as flat blobs. They’re split into chunks and organized into a Merkle DAG (Directed Acyclic Graph):

Root CID (QmRoot...)
├── Chunk 0 (QmA...) - 256KB
├── Chunk 1 (QmB...) - 256KB
├── Chunk 2 (QmC...) - 256KB
└── Chunk 3 (QmD...) - 128KB (last chunk, smaller)

The root CID is computed from the child CIDs, which are computed from the data. This means:

Identical chunks across different files are stored only once (deduplication).
You can verify any chunk independently.
Large files can be downloaded from multiple peers in parallel.
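To make deduplication and per-chunk verification concrete, here is a toy model of the idea. It is a simplification under stated assumptions: plain SHA-256 hex digests stand in for real CIDs (which also involve multihash and UnixFS encoding), the chunk size is 4 bytes, and a dict plays the role of the block store.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose; IPFS uses 256KB by default


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_dag(data: bytes, store: dict) -> str:
    """Split data into chunks, store each under its hash, return the root hash."""
    child_ids = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        chunk_id = digest(chunk)
        store[chunk_id] = chunk  # an identical chunk lands on the same key
        child_ids.append(chunk_id)
    # The root is derived from the ordered list of child hashes
    root = digest("".join(child_ids).encode())
    store[root] = child_ids
    return root


def read_dag(root: str, store: dict) -> bytes:
    """Reassemble a file, verifying every chunk against its hash."""
    out = b""
    for chunk_id in store[root]:
        chunk = store[chunk_id]
        if digest(chunk) != chunk_id:
            raise ValueError(f"chunk {chunk_id} failed verification")
        out += chunk
    return out


store = {}
root_a = build_dag(b"AAAABBBBCCCC", store)
root_b = build_dag(b"AAAABBBBDDDD", store)  # shares two chunks with the first file
print(len(store), "entries for two files")  # 4 unique chunks + 2 roots
```

The two files have different roots, but the shared `AAAA` and `BBBB` chunks are stored once.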

Python can inspect this structure:

import ipfshttpclient

client = ipfshttpclient.connect()

def inspect_dag(cid, depth=0):
    """Walk the Merkle DAG of an IPFS object."""
    obj = client.object.stat(cid)
    links = client.object.links(cid)

    print(f"{'  ' * depth}CID: {cid}")
    print(f"{'  ' * depth}Size: {obj['CumulativeSize']} bytes")
    print(f"{'  ' * depth}Links: {obj['NumLinks']}")

    for link in links.get("Links", []):
        inspect_dag(link["Hash"], depth + 1)

Custom chunking strategies

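IPFS defaults to 256KB fixed-size chunks. With fixed offsets, inserting a single byte near the start of a file shifts every later chunk boundary, so none of the following chunks deduplicate against the old version. Content-defined chunking avoids this by choosing boundaries from the data itself: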

import hashlib

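def rabin_chunk(data: bytes, min_size=8192, max_size=262144, target_size=65536):
    """Content-defined chunking with a windowed hash (Rabin-style sketch).

    Splits at content boundaries rather than fixed offsets,
    so inserting data in the middle only affects nearby chunks.
    Hashing every window with SHA-256 keeps the code short but is slow;
    a true rolling hash (Rabin, buzhash) updates in O(1) per byte.
    """
    chunks = []
    offset = 0
    window = 48
    # target_size must be a power of two; past min_size, a boundary
    # fires on average once every target_size bytes.
    mask = target_size - 1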

    while offset < len(data):
        chunk_end = min(offset + min_size, len(data))

        if chunk_end >= len(data):
            chunks.append(data[offset:])
            break

        # Roll hash forward looking for boundary
        while chunk_end < min(offset + max_size, len(data)):
            window_data = data[chunk_end - window:chunk_end]
            h = hashlib.sha256(window_data).digest()
            fingerprint = int.from_bytes(h[:4], "big")

            if fingerprint & mask == 0:
                break
            chunk_end += 1

        chunks.append(data[offset:chunk_end])
        offset = chunk_end

    return chunks

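Content-defined chunking is valuable when files change incrementally (logs, databases): only the chunks around an edit get new CIDs, which saves storage and bandwidth. Kubo also ships built-in content-defined chunkers, selectable with the --chunker option of ipfs add (for example rabin or buzhash).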
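The effect of an insert on chunk boundaries can be checked with a small experiment. The sketch below swaps in a gear-style rolling hash (not Rabin fingerprinting, and not Kubo's implementation) so that it runs quickly in pure Python; the table seed, size limits, and mask are arbitrary choices for the demo.

```python
import random

# Fixed pseudo-random table: one 32-bit value per possible byte (a "gear" table)
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def gear_chunk(data: bytes, min_size=256, max_size=8192, mask=(1 << 10) - 1):
    """Content-defined chunking with a gear rolling hash (O(1) work per byte)."""
    chunks = []
    start = 0
    while start < len(data):
        h = 0
        end = start
        limit = min(start + max_size, len(data))
        while end < limit:
            # Shifting left ages out old bytes; the low bits depend on recent input
            h = ((h << 1) + GEAR[data[end]]) & 0xFFFFFFFF
            end += 1
            if end - start >= min_size and (h & mask) == 0:
                break
        chunks.append(data[start:end])
        start = end
    return chunks


def shared_prefix(a, b):
    """Number of leading chunks that are identical in both lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


_data_rng = random.Random(1)
original = bytes(_data_rng.getrandbits(8) for _ in range(200_000))
edited = original[:100_000] + b"inserted bytes" + original[100_000:]

before = gear_chunk(original)
after = gear_chunk(edited)
reused = len(set(before) & set(after))
print(f"{reused} of {len(after)} chunks unchanged after the insert")
```

Chunks that end before the insertion point are always identical. Chunks after it typically realign within a boundary or two, which is where the deduplication benefit over fixed-size chunking comes from.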

IPNS: mutable pointers

Since CIDs change when content changes, IPNS provides stable names that can be updated to point to different CIDs:

def publish_to_ipns(client, cid, key_name="self"):
    """Publish a CID to an IPNS name."""
    result = client.name.publish(cid, key=key_name)
    return result["Name"]  # IPNS address (peer ID or key-based)

def resolve_ipns(client, ipns_name):
    """Resolve an IPNS name to its current CID."""
    result = client.name.resolve(ipns_name)
    return result["Path"]  # /ipfs/QmCurrentCID...

# Create a dedicated key for a project
client.key.gen("my-website", type="ed25519")
# new_site_cid is the root CID of the latest site build (for example, from client.add)
ipns_addr = publish_to_ipns(client, new_site_cid, "my-website")
# ipns_addr stays the same even as new_site_cid changes with updates

IPNS resolution can be slow (often 10-30 seconds) because it requires DHT lookups. For faster resolution, use DNSLink: a DNS TXT record on the _dnslink subdomain that maps a domain name to a content path, either an /ipfs/ CID or an /ipns/ name.

# DNS TXT record for example.com
_dnslink.example.com  TXT  "dnslink=/ipfs/QmNewContent..."
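On the reading side, a client has to pick the DNSLink entry out of whatever TXT values the domain returns. A minimal sketch of that parsing step follows; it assumes the TXT strings were already fetched by some DNS library, and `parse_dnslink` is a name invented for this example.

```python
from typing import Iterable, Optional


def parse_dnslink(txt_values: Iterable[str]) -> Optional[str]:
    """Return the content path from the first valid dnslink TXT value, else None."""
    for value in txt_values:
        value = value.strip().strip('"')
        if not value.startswith("dnslink="):
            continue  # unrelated TXT record (SPF, site verification, ...)
        path = value[len("dnslink="):]
        # A DNSLink target is a content path such as /ipfs/<cid> or /ipns/<name>
        if path.startswith(("/ipfs/", "/ipns/")):
            return path
    return None


records = ["v=spf1 -all", '"dnslink=/ipfs/QmNewContent"']
print(parse_dnslink(records))
```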

Building a content-addressed application backend

A Python web application backed by IPFS for immutable data storage:

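import uuid

import ipfshttpclient
from fastapi import FastAPI, HTTPException, Response, UploadFile

app = FastAPI()
ipfs = ipfshttpclient.connect()

# Metadata database (maps application IDs to IPFS CIDs)
metadata_store = {}


def generate_id() -> str:
    """Return a random application-level document ID."""
    return uuid.uuid4().hex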

@app.post("/documents")
async def upload_document(file: UploadFile):
    content = await file.read()

    # Store content on IPFS
    cid = ipfs.add_bytes(content)  # add_bytes returns the CID as a string

    # Store metadata locally (or in a traditional DB)
    doc_id = generate_id()
    metadata_store[doc_id] = {
        "cid": cid,
        "filename": file.filename,
        "size": len(content),
        "content_type": file.content_type,
    }

    return {"document_id": doc_id, "cid": cid}

@app.get("/documents/{doc_id}")
async def get_document(doc_id: str):
    meta = metadata_store.get(doc_id)
    if not meta:
        raise HTTPException(status_code=404, detail="Document not found")

    content = ipfs.cat(meta["cid"])
    return Response(
        content=content,
        media_type=meta["content_type"],
        headers={"X-IPFS-CID": meta["cid"]},
    )

This pattern uses IPFS as an immutable content store while keeping mutable metadata (ownership, tags, access control) in a traditional database.
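For the "traditional database" half of this pattern, one possible shape is an append-only table of versions, so that updating a document means recording a new CID rather than overwriting anything. The sketch below uses SQLite from the standard library; the table layout and function names are illustrative assumptions, not part of the application above.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the metadata database."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS versions (
               doc_id TEXT NOT NULL,
               cid TEXT NOT NULL,
               created_at REAL NOT NULL
           )"""
    )
    return db


def record_version(db, doc_id, cid):
    """Append a new CID for a document; older versions stay addressable."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO versions VALUES (?, ?, ?)",
            (doc_id, cid, time.time()),
        )


def current_cid(db, doc_id):
    """Return the most recently recorded CID, or None for unknown documents."""
    row = db.execute(
        "SELECT cid FROM versions WHERE doc_id = ? ORDER BY rowid DESC LIMIT 1",
        (doc_id,),
    ).fetchone()
    return row[0] if row else None


def history(db, doc_id):
    """Return every recorded CID for a document, oldest first."""
    rows = db.execute(
        "SELECT cid FROM versions WHERE doc_id = ? ORDER BY rowid", (doc_id,)
    )
    return [r[0] for r in rows]


db = open_store()
record_version(db, "doc1", "QmV1")  # placeholder CIDs
record_version(db, "doc1", "QmV2")
print(current_cid(db, "doc1"), history(db, "doc1"))
```

Because the content itself is immutable, the mutable layer only ever needs to store small pointers like these.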

Filecoin integration for archival storage

IPFS pinning keeps data available only as long as the node or service that pinned it stays online and keeps the pin. Filecoin adds economic guarantees, because storage providers post collateral and are penalized for losing data:

from lighthouseweb3 import Lighthouse

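# Lighthouse provides Python-friendly Filecoin storage.
# Method names and response keys below may differ between SDK releases;
# check them against the version you have installed.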
lh = Lighthouse(token="YOUR_API_KEY")

def archive_to_filecoin(file_path):
    """Upload to IPFS + create Filecoin storage deal."""
    result = lh.upload(file_path)
    return {
        "cid": result["Hash"],
        "size": result["Size"],
        "deal_status": "pending",
    }

def check_deal_status(cid):
    """Verify Filecoin deal is active."""
    status = lh.get_deal_status(cid)
    return {
        "deal_id": status.get("dealId"),
        "miner": status.get("miner"),
        "active": status.get("dealStatus") == "active",
        "expiry": status.get("expiry"),
    }

Filecoin deals typically last 6-18 months and can be renewed. For critical data, create deals with multiple storage providers for redundancy.
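Tracking expiry and redundancy is easy to automate once deal information is in hand. The following sketch flags CIDs that need a renewal or an extra provider. The `Deal` record, the 30-day renewal window, and the two-provider minimum are all assumptions made for illustration; they do not correspond to any Lighthouse or Filecoin type.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Deal:  # hypothetical record, filled from whatever your provider reports
    cid: str
    provider: str
    expiry: datetime


def needs_attention(deals, now, min_providers=2, renew_within=timedelta(days=30)):
    """Return CIDs backed by fewer than min_providers deals that outlive the window."""
    healthy = {}
    for deal in deals:
        providers = healthy.setdefault(deal.cid, set())
        if deal.expiry - now > renew_within:
            providers.add(deal.provider)
    return sorted(cid for cid, provs in healthy.items() if len(provs) < min_providers)


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
deals = [
    Deal("QmA", "f01", now + timedelta(days=200)),
    Deal("QmA", "f02", now + timedelta(days=300)),
    Deal("QmB", "f01", now + timedelta(days=200)),
    Deal("QmB", "f03", now + timedelta(days=10)),  # about to expire
]
print(needs_attention(deals, now))
```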

Performance optimization

Parallel uploads

Large collections benefit from concurrent uploads:

import asyncio
import aiohttp

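async def upload_batch(files, api_key, max_concurrent=10):
    """Upload pathlib.Path objects concurrently; returns {filename: cid}."""
    semaphore = asyncio.Semaphore(max_concurrent)
    headers = {"Authorization": f"Bearer {api_key}"}
    results = {}

    async def upload_one(session, file_path):
        async with semaphore:
            # Open inside the semaphore so at most max_concurrent file
            # handles are open, and close each one when its upload ends.
            with open(file_path, "rb") as fh:
                data = aiohttp.FormData()
                data.add_field("file", fh, filename=file_path.name)
                # Verify this endpoint against your provider's current API docs
                async with session.post(
                    "https://api.web3.storage/upload",
                    data=data, headers=headers
                ) as resp:
                    resp.raise_for_status()
                    result = await resp.json()
                    results[file_path.name] = result["cid"]

    # One shared session reuses connections across all uploads
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(upload_one(session, f) for f in files))
    return results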

Caching gateway responses

IPFS content is immutable — once fetched, it never changes. Aggressive caching is safe:

from functools import lru_cache
import requests

@lru_cache(maxsize=10000)
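def fetch_ipfs_cached(cid: str) -> bytes:
    """Fetch IPFS content and cache it in memory for the life of the process.

    Safe against staleness because CIDs are content-addressed: the same CID
    always names the same data. Note that maxsize counts entries, not bytes,
    so caching many large files can still exhaust memory.
    """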
    gateways = [
        "https://ipfs.io",
        "https://dweb.link",
        "https://gateway.pinata.cloud",
    ]
    for gw in gateways:
        try:
            resp = requests.get(f"{gw}/ipfs/{cid}", timeout=30)
            if resp.status_code == 200:
                return resp.content
        except requests.RequestException:
            continue
    raise RuntimeError(f"Failed to fetch {cid} from all gateways")
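An in-memory cache is lost on restart. Since a CID never goes stale, the same idea extends naturally to disk. The class below is one possible sketch: `DiskCache` and its `fetch` callback are names invented here, and the callback stands in for any function that takes a CID and returns bytes, such as the gateway fetcher above.

```python
import tempfile
from pathlib import Path
from typing import Callable


class DiskCache:
    """Persist fetched content on disk, one file per CID."""

    def __init__(self, directory, fetch: Callable[[str], bytes]):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch

    def get(self, cid: str) -> bytes:
        if "/" in cid or "\\" in cid:
            raise ValueError("expected a bare CID, not a path")
        path = self.directory / cid
        if path.exists():
            return path.read_bytes()  # cache hit: no network access
        content = self.fetch(cid)
        # Write to a temp file, then rename, so readers never see a partial file
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_bytes(content)
        tmp.replace(path)
        return content


if __name__ == "__main__":
    calls = []

    def fake_fetch(cid: str) -> bytes:  # stand-in for a real gateway request
        calls.append(cid)
        return b"data-" + cid.encode()

    with tempfile.TemporaryDirectory() as d:
        cache = DiskCache(d, fake_fetch)
        cache.get("QmX")
        cache.get("QmX")
        print(len(calls), "network fetch for two reads")
```

Nothing here evicts old entries, so a long-running service would still want a size limit or periodic cleanup.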

CAR file format for bulk operations

CAR (Content-Addressed aRchive) files bundle multiple IPFS blocks into a single file for efficient transport:

import subprocess

# Shells out to the ipfs-car CLI, which must be installed separately
def create_car_archive(files_dir, output_path):
    """Create a CAR file from a directory for batch upload."""
    result = subprocess.run(
        ["ipfs-car", "pack", str(files_dir), "--output", str(output_path)],
        capture_output=True, text=True, check=True,
    )
    # The root CID is printed to stdout
    root_cid = result.stdout.strip().split("\n")[-1]
    return root_cid

Web3.Storage and other services accept CAR uploads, which is faster than uploading files individually because the DAG structure is pre-computed.
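Part of why CAR files are cheap to stream is their framing: in CARv1, a header and then each block section are written as an unsigned varint length followed by that many bytes. The sketch below shows only that length-prefix framing. It does not decode the CBOR header or the CIDs inside each section, and the function names are illustrative.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned LEB128: 7 bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0):
    """Return (value, position just past the varint)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def frame(payload: bytes) -> bytes:
    """Prefix a payload with its length, as CAR does for header and sections."""
    return encode_varint(len(payload)) + payload


def iter_frames(buf: bytes):
    """Yield each length-prefixed payload in order."""
    pos = 0
    while pos < len(buf):
        length, pos = decode_varint(buf, pos)
        yield buf[pos:pos + length]
        pos += length


stream = frame(b"header") + frame(b"x" * 200) + frame(b"")
print([len(p) for p in iter_frames(stream)])
```

A reader can walk such a stream front to back without an index, which suits bulk import and export.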

One thing to remember

Production IPFS integration with Python means understanding the Merkle DAG structure for efficient data management, using pinning services and Filecoin for durability guarantees, implementing gateway caching for performance, and treating IPFS as an immutable content layer complemented by a mutable metadata layer in traditional storage.
