Python Hashlib Hashing — Deep Dive
Architecture of hashlib
CPython’s hashlib module is a hybrid: it prefers OpenSSL’s implementations (via the _hashlib extension module) for performance and algorithm breadth, and falls back to CPython’s own built-in C modules (_sha256, _sha512, etc. in older releases; module names vary by version) when OpenSSL is unavailable. Neither path is pure Python. On most production systems, you’re using OpenSSL’s battle-tested C implementations.
import hashlib
# Every algorithm usable in this interpreter (includes extras provided by OpenSSL)
print(hashlib.algorithms_available)
# {'sha256', 'sha512', 'blake2b', 'sha3_256', 'md5', ...}
# Guaranteed by the module on all platforms
print(hashlib.algorithms_guaranteed)
# {'sha256', 'sha384', 'sha512', 'sha224', 'sha1', 'md5',
# 'sha3_256', 'sha3_384', 'sha3_512', 'sha3_224',
# 'blake2b', 'blake2s', 'shake_128', 'shake_256'}
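When the algorithm name comes from configuration, it can help to check it against these sets before calling hashlib.new(). The following is a minimal sketch; the helper name pick_algorithm and the preference list are hypothetical and not part of hashlib.

```python
import hashlib
from collections.abc import Sequence


def pick_algorithm(preferred: Sequence[str]) -> str:
    """Return the first name in `preferred` that this interpreter provides.

    Hypothetical helper for illustration; hashlib has no such function.
    """
    for name in preferred:
        if name in hashlib.algorithms_available:
            return name
    raise ValueError(f"none of {list(preferred)} is available")


# Prefer BLAKE2b, fall back to SHA-256 (both are in algorithms_guaranteed).
algorithm = pick_algorithm(["blake2b", "sha256"])
digest = hashlib.new(algorithm, b"data").hexdigest()
```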
The usedforsecurity parameter (Python 3.9+) allows using “broken” algorithms in non-security contexts without triggering FIPS-mode errors:
# In FIPS-mode OpenSSL, MD5 is blocked by default
h = hashlib.md5(b"data", usedforsecurity=False) # OK for checksums
Streaming Hash Computation
The update-digest pattern is essential for production systems handling large data:
import hashlib
from pathlib import Path
def hash_file(path: Path, algorithm: str = "sha256",
              chunk_size: int = 65536) -> str:
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
# Memory usage: constant regardless of file size
digest = hash_file(Path("/var/log/syslog"))
The 64 KB chunk size is a multiple of SHA-256’s 64-byte internal block size and is in the range of typical filesystem read-ahead buffers, yet small enough not to waste memory on small files.
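Feeding data through update() in pieces yields the same digest as hashing it in one call, wherever the chunk boundaries fall. A small sketch that checks this on an in-memory stream; hash_stream is a hypothetical variant of the function above that accepts any binary stream:

```python
import hashlib
import io
from typing import BinaryIO


def hash_stream(stream: BinaryIO, algorithm: str = "sha256",
                chunk_size: int = 65536) -> str:
    """Hash any binary stream in fixed-size chunks."""
    h = hashlib.new(algorithm)
    while chunk := stream.read(chunk_size):
        h.update(chunk)
    return h.hexdigest()


payload = b"x" * 200_000  # spans several 64 KB chunks
streamed = hash_stream(io.BytesIO(payload))
one_shot = hashlib.sha256(payload).hexdigest()
assert streamed == one_shot
```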
The file_digest Shortcut (Python 3.11+)
import hashlib
with open("large_file.bin", "rb") as f:
    digest = hashlib.file_digest(f, "sha256")
print(digest.hexdigest())
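This function handles chunked reading internally and can read through readinto() into a reusable buffer, which avoids allocating a new bytes object for every chunk. It returns a hash object, so call hexdigest() or digest() on the result.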
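Code that must also run on interpreters older than 3.11 can wrap the shortcut with a fallback to the manual loop. A minimal sketch; file_digest_compat is a hypothetical name, not a hashlib function:

```python
import hashlib
import io
from typing import BinaryIO


def file_digest_compat(fileobj: BinaryIO, algorithm: str = "sha256",
                       chunk_size: int = 65536):
    """Return a hash object for the contents of a binary file object."""
    if hasattr(hashlib, "file_digest"):  # Python 3.11+
        return hashlib.file_digest(fileobj, algorithm)
    h = hashlib.new(algorithm)  # manual loop for older versions
    while chunk := fileobj.read(chunk_size):
        h.update(chunk)
    return h


data = b"example contents"
result = file_digest_compat(io.BytesIO(data))
```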
PBKDF2 for Password Hashing
import hashlib
import secrets
def hash_password(password: str) -> tuple[bytes, bytes]:
    salt = secrets.token_bytes(32)
    dk = hashlib.pbkdf2_hmac(
        hash_name="sha256",
        password=password.encode("utf-8"),
        salt=salt,
        iterations=600_000,  # OWASP 2023 minimum for SHA-256
    )
    return salt, dk


def verify_password(password: str, salt: bytes, stored_dk: bytes) -> bool:
    dk = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                             salt, 600_000)
    return secrets.compare_digest(dk, stored_dk)
Tuning Iteration Count
The iteration count should make verification take 100–500 ms on your server hardware. Benchmark on your production machines:
import hashlib
import time
password = b"benchmark"
salt = b"0" * 32
for iterations in [100_000, 300_000, 600_000, 1_000_000]:
    start = time.perf_counter()
    hashlib.pbkdf2_hmac("sha256", password, salt, iterations)
    elapsed = (time.perf_counter() - start) * 1000
    print(f"{iterations:>10,} iterations: {elapsed:.1f} ms")
OWASP’s 2023 recommendations: 600,000 for PBKDF2-SHA256, 210,000 for PBKDF2-SHA512. These numbers assume commodity hardware; adjust upward for high-value targets.
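Because the right iteration count changes over time, it is useful to store the count next to each hash so that old records remain verifiable after the target is raised. The sketch below assumes a made-up record layout (`pbkdf2_sha256$<iterations>$<salt hex>$<hash hex>`); the function names and the format are illustrative, not a standard.

```python
import hashlib
import secrets


def hash_password_record(password: str, iterations: int = 600_000) -> str:
    """Return 'pbkdf2_sha256$<iterations>$<salt hex>$<hash hex>'.

    The field layout is an assumption made for this example.
    """
    salt = secrets.token_bytes(32)
    dk = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return f"pbkdf2_sha256${iterations}${salt.hex()}${dk.hex()}"


def verify_password_record(password: str, record: str) -> bool:
    """Recompute the hash with the parameters stored in the record."""
    scheme, iterations, salt_hex, dk_hex = record.split("$")
    if scheme != "pbkdf2_sha256":
        return False
    dk = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                             bytes.fromhex(salt_hex), int(iterations))
    return secrets.compare_digest(dk, bytes.fromhex(dk_hex))


def needs_rehash(record: str, current_iterations: int = 600_000) -> bool:
    """True if the record used fewer iterations than the current target."""
    return int(record.split("$")[1]) < current_iterations
```

A typical flow would be to call needs_rehash() after a successful login and, if it returns True, store a fresh record while the plaintext password is still in hand.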
BLAKE2: The Performance Champion
BLAKE2 was designed as a drop-in SHA-256 replacement that’s faster while maintaining equivalent security margins. CPython includes both BLAKE2b (64-bit optimized, up to 64-byte digest) and BLAKE2s (32-bit optimized, up to 32-byte digest).
import hashlib
# Keyed hashing (MAC) without needing HMAC
h = hashlib.blake2b(key=b"secret-key-here!", digest_size=32)
h.update(b"message to authenticate")
mac = h.hexdigest()
# Personalization — domain separation for different uses
h1 = hashlib.blake2b(b"data", person=b"cache-key")
h2 = hashlib.blake2b(b"data", person=b"dedup-key")
assert h1.hexdigest() != h2.hexdigest() # Different domains, different hashes
BLAKE2’s built-in keying, salting, and personalization eliminate the need for HMAC in many scenarios while being faster than HMAC-SHA256.
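On the receiving side, a keyed BLAKE2 tag is checked by recomputing it and comparing in constant time. A minimal sketch; blake2_mac and blake2_verify are hypothetical helper names, and the key is the placeholder from the example above:

```python
import hashlib
import hmac


def blake2_mac(key: bytes, message: bytes) -> bytes:
    """32-byte keyed BLAKE2b tag; the key may be up to 64 bytes long."""
    return hashlib.blake2b(message, key=key, digest_size=32).digest()


def blake2_verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare without leaking timing information."""
    return hmac.compare_digest(blake2_mac(key, message), tag)


key = b"secret-key-here!"  # placeholder; use a random key in practice
tag = blake2_mac(key, b"message to authenticate")
```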
Benchmark Comparison
Typical throughput on a modern x86-64 CPU (single core):
| Algorithm | Throughput (MB/s) | Relative |
|---|---|---|
| MD5 | ~700 | 2.3x |
| SHA-1 | ~600 | 2.0x |
| SHA-256 | ~300 | 1.0x (baseline) |
| SHA-512 | ~450 | 1.5x |
| BLAKE2b | ~900 | 3.0x |
| SHA3-256 | ~200 | 0.67x |
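In this table BLAKE2b outperforms SHA-256 by about 3x, largely because it operates on 64-bit words, which suits 64-bit CPUs, and its round function is cheap in software. Treat the figures as rough: on CPUs with SHA hardware extensions, which OpenSSL uses when present, SHA-256 can be considerably faster than shown.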
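Throughput depends heavily on the CPU and the OpenSSL build, so it is worth measuring on the target machine. A rough single-pass sketch; measure_throughput is a hypothetical helper and makes no attempt at warm-up or statistical rigor:

```python
import hashlib
import time


def measure_throughput(algorithms: list[str], size_mb: int = 16) -> dict[str, float]:
    """Return MB/s per algorithm for one pass over an in-memory buffer."""
    data = bytes(size_mb * 1024 * 1024)  # zero-filled test buffer
    results = {}
    for name in algorithms:
        start = time.perf_counter()
        hashlib.new(name, data).digest()
        elapsed = time.perf_counter() - start
        results[name] = size_mb / elapsed
    return results


if __name__ == "__main__":
    names = ["sha1", "sha256", "sha512", "blake2b", "sha3_256"]
    for name, mbps in measure_throughput(names).items():
        print(f"{name:>10}: {mbps:,.0f} MB/s")
```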
Length Extension Attacks
SHA-256, SHA-512, and SHA-1 use the Merkle–Damgård construction, which is vulnerable to length extension attacks. Given H(message) and the length of message (but not the message itself), an attacker can compute H(message || padding || attacker_data) without knowing message.
This breaks naive MAC schemes:
# VULNERABLE: H(secret || message)
mac = hashlib.sha256(secret + message).hexdigest()
# Attacker can forge H(secret || message || padding || evil_data)
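The padding in the forged input depends only on the total length of secret + message, which is why knowing the length is enough for the attacker. The sketch below computes that padding for SHA-256; sha256_padding is a hypothetical helper. It stops short of the attack itself, since hashlib does not expose a hash object's internal state and a forgery would need a custom SHA-256 implementation.

```python
import struct


def sha256_padding(message_len: int) -> bytes:
    """Padding SHA-256 appends to a message of `message_len` bytes."""
    # 0x80 marker, zero bytes until the length is 56 mod 64,
    # then the message length in bits as a 64-bit big-endian integer.
    zeros = (55 - message_len) % 64
    return b"\x80" + b"\x00" * zeros + struct.pack(">Q", message_len * 8)


# The padded message always fills a whole number of 64-byte blocks.
assert (20 + len(sha256_padding(20))) % 64 == 0
```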
Defenses:
- Use HMAC: hmac.new(key, message, hashlib.sha256) applies a nested double-hashing construction that is immune to length extension.
- Use SHA-3 or BLAKE2: SHA-3’s sponge construction and BLAKE2’s HAIFA-based design are both inherently resistant.
- Use HMAC even with SHA-3: it doesn’t hurt and provides a uniform API.
import hmac
import hashlib
# SAFE: HMAC construction
mac = hmac.new(
    key=secret,
    msg=message,
    digestmod=hashlib.sha256
).hexdigest()
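The receiver recomputes the MAC over the message it got and compares the two values in constant time. A minimal sketch, with sign and verify as hypothetical helper names:

```python
import hashlib
import hmac


def sign(secret: bytes, message: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of `message` under `secret`."""
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, message: bytes, received_mac: str) -> bool:
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(sign(secret, message).encode(),
                               received_mac.encode())
```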
Hash-Based Data Structures
Content-Addressable Storage
import hashlib
from pathlib import Path
class ContentStore:
    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Fan-out: first 2 chars as directory (like Git)
        dir_path = self.root / digest[:2]
        dir_path.mkdir(exist_ok=True)
        file_path = dir_path / digest[2:]
        if not file_path.exists():
            file_path.write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes | None:
        file_path = self.root / digest[:2] / digest[2:]
        return file_path.read_bytes() if file_path.exists() else None
Git uses this pattern with SHA-1 (migrating to SHA-256). Docker uses it for layer storage. IPFS uses it for content addressing across a distributed network.
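A useful property of content addressing is that every read can be verified: re-hashing the bytes must reproduce the address. The sketch below assumes the same on-disk layout as the class above; blob_path and read_verified are hypothetical standalone helpers, not methods of that class.

```python
import hashlib
import tempfile
from pathlib import Path


def blob_path(root: Path, digest: str) -> Path:
    """Same fan-out layout as above: <root>/<first 2 hex chars>/<rest>."""
    return root / digest[:2] / digest[2:]


def read_verified(root: Path, digest: str) -> bytes:
    """Read a blob and confirm its content still matches its address."""
    data = blob_path(root, digest).read_bytes()
    actual = hashlib.sha256(data).hexdigest()
    if actual != digest:
        raise ValueError(f"corrupt blob: expected {digest}, got {actual}")
    return data


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data = b"hello"
    digest = hashlib.sha256(data).hexdigest()
    path = blob_path(root, digest)
    path.parent.mkdir(parents=True)
    path.write_bytes(data)
    assert read_verified(root, digest) == data
```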
Merkle Trees
Hash trees enable efficient verification of large datasets. Each leaf is the hash of a data block; each internal node is the hash of its children. Changing one block requires recomputing only O(log n) hashes to update the root.
import hashlib
def merkle_root(items: list[bytes]) -> str:
    if not items:
        return hashlib.sha256(b"").hexdigest()
    layer = [hashlib.sha256(item).digest() for item in items]
    while len(layer) > 1:
        if len(layer) % 2 == 1:
            layer.append(layer[-1])  # Duplicate last for odd count
        layer = [
            hashlib.sha256(layer[i] + layer[i + 1]).digest()
            for i in range(0, len(layer), 2)
        ]
    return layer[0].hex()
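Bitcoin (double SHA-256) and Certificate Transparency logs (SHA-256) rely on Merkle trees like this one. Ethereum uses a related structure, the Merkle Patricia trie, hashed with Keccak-256 rather than SHA-256.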
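The O(log n) property also applies to verification: to prove that one item belongs to a tree, it is enough to supply the sibling hash at each level. The sketch below builds and checks such an inclusion proof using the same pairing and odd-count rule as merkle_root above; the function names and the proof representation are assumptions made for this example.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root_and_proof(items: list[bytes],
                          index: int) -> tuple[bytes, list[tuple[bytes, bool]]]:
    """Return (root, proof) for items[index].

    Each proof step is (sibling_hash, sibling_is_on_the_right).
    """
    if not 0 <= index < len(items):
        raise IndexError("index out of range")
    layer = [_h(item) for item in items]
    proof = []
    while len(layer) > 1:
        if len(layer) % 2 == 1:
            layer.append(layer[-1])  # same odd-count rule as merkle_root
        sibling = index ^ 1  # the other node in this pair
        proof.append((layer[sibling], sibling > index))
        layer = [_h(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
        index //= 2  # position of the parent in the next layer
    return layer[0], proof


def verify_proof(item: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Rebuild the path from the leaf to the root using the sibling hashes."""
    node = _h(item)
    for sibling, sibling_on_right in proof:
        node = _h(node + sibling) if sibling_on_right else _h(sibling + node)
    return node == root
```

For five items the proof holds three hashes, however large each item is. Like the function above, this sketch does not distinguish leaf hashes from internal-node hashes; RFC 6962 prefixes them with different bytes to rule out ambiguity, which a production design would likely want as well.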
SHAKE: Variable-Length Output
SHA-3 includes SHAKE128 and SHAKE256 — extendable-output functions (XOFs) that produce digests of arbitrary length:
import hashlib
# Generate 64 bytes of deterministic output from a seed
shake = hashlib.shake_256(b"seed-value")
output = shake.hexdigest(64) # 128 hex chars = 64 bytes
# Useful for key derivation, deterministic randomness,
# and generating multiple keys from one seed
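XOFs are valuable when you need more output than a fixed-length hash provides, for example deriving both an encryption key and an IV from a single high-entropy secret. A low-entropy password should go through a password KDF such as PBKDF2 first; SHAKE alone adds no brute-force resistance.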
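One way to get two values from one seed is to request a longer output and slice it. A minimal sketch, assuming the seed is already a high-entropy secret; derive_key_and_iv and the 32/16-byte split are choices made for this example:

```python
import hashlib


def derive_key_and_iv(seed: bytes) -> tuple[bytes, bytes]:
    """Split 48 bytes of SHAKE256 output into a 32-byte key and a 16-byte IV."""
    stream = hashlib.shake_256(seed).digest(48)
    return stream[:32], stream[32:]


key, iv = derive_key_and_iv(b"seed-value")
```

The output is deterministic, so the same seed always yields the same key and IV; whether that is acceptable depends on the cipher mode in use.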
Production Hardening
Constant-time comparison everywhere:
import secrets
# Every token/hash comparison must be timing-safe
if secrets.compare_digest(computed_hash, stored_hash):
    authenticate()
Logging without leaking:
def safe_log_token(token: str) -> str:
    """Log enough to identify, not enough to use."""
    return f"{token[:8]}...({len(token)} chars)"
Hash algorithm agility:
import hashlib
ALGORITHM = "sha256" # Single config point for migration
def compute_hash(data: bytes) -> str:
    return hashlib.new(ALGORITHM, data).hexdigest()
When SHA-256 eventually needs replacement (decades away, but it will happen), changing one constant updates the entire system. Certificate authorities learned this lesson painfully during the SHA-1 to SHA-256 migration.
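Changing the constant only affects new digests; values already stored were produced by the old algorithm. One possible approach is to record the algorithm name alongside each digest so both generations stay verifiable during a migration. A sketch under that assumption; the `<algorithm>:<hex>` format, the ALLOWED set, and the function names are all hypothetical:

```python
import hashlib
import hmac

ALGORITHM = "sha256"  # current default, as in the snippet above
ALLOWED = {"sha256", "sha3_256", "blake2b"}  # algorithms accepted on verify


def compute_tagged_hash(data: bytes, algorithm: str = ALGORITHM) -> str:
    """Return '<algorithm>:<hex digest>' so the digest records how it was made."""
    return f"{algorithm}:{hashlib.new(algorithm, data).hexdigest()}"


def verify_tagged_hash(data: bytes, tagged: str) -> bool:
    """Check `data` against a tagged digest, whichever allowed algorithm made it."""
    algorithm, _, expected = tagged.partition(":")
    if algorithm not in ALLOWED:
        return False
    actual = hashlib.new(algorithm, data).hexdigest()
    return hmac.compare_digest(actual.encode(), expected.encode())
```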
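Thread safety: don’t share a hash object between threads. CPython serializes concurrent update() calls on one object, so they won’t corrupt it, but the order in which chunks are absorbed is undefined and so is the resulting digest. Each thread should create its own hash object. One-shot calls such as hashlib.sha256(data).hexdigest() are safe because the object never leaves the calling thread.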
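A simple way to follow the one-object-per-thread rule is to give each worker a whole input and let it create its own hash object. A minimal sketch using a thread pool; hash_blob and the sample data are placeholders:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_blob(data: bytes) -> str:
    # The hash object is created, used, and discarded inside one thread.
    return hashlib.sha256(data).hexdigest()


blobs = [bytes([i]) * 100_000 for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(hash_blob, blobs))  # map preserves input order
```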
The one thing to remember: hashlib is the foundation of data integrity in Python — master its streaming interface, understand when raw hashing isn’t enough (passwords, MACs), and you’ll build systems that verify trust at every layer.