Python Hashlib Hashing — Core Concepts
What Hashlib Actually Does
The hashlib module provides a uniform interface to dozens of cryptographic hash functions. Each function takes arbitrary-length input and produces a fixed-length output — the digest. The three properties that make this useful:
- Deterministic — the same input always yields the same digest
- Pre-image resistant — given a digest, you can’t find the input
- Collision resistant — it’s computationally infeasible to find two different inputs with the same digest
Choosing an Algorithm
| Algorithm | Digest Size | Speed | Status |
|---|---|---|---|
| MD5 | 128 bits | Fast | Broken — collisions found |
| SHA-1 | 160 bits | Fast | Deprecated — collision attacks practical |
| SHA-256 | 256 bits | Moderate | Current standard |
| SHA-512 | 512 bits | Moderate | Preferred for larger security margins |
| SHA-3 | 224–512 bits | Moderate | Newest standard, different internal design |
| BLAKE2b | Up to 512 bits | Very fast | Modern, safe, faster than SHA-256 |
For most applications, SHA-256 is the right choice. It’s universally supported, well-studied, and has no known weaknesses. Use BLAKE2 when performance matters — it’s faster than SHA-256 while maintaining strong security guarantees.
Never use MD5 or SHA-1 for security. They’re fine for non-security checksums (cache keys, deduplication), but an attacker can craft collisions in seconds.
Basic Usage
The workflow is always the same: create a hash object, feed it data with .update(), and extract the digest.
You can call .update() multiple times — the result is identical to hashing the concatenated data. This is critical for large files: instead of loading a 4 GB file into memory, you read it in chunks and update the hash incrementally.
The .hexdigest() method returns the digest as a hex string. The .digest() method returns raw bytes — useful when feeding the hash into another cryptographic function.
Hashing Large Files
Reading an entire file into memory defeats the purpose of streaming hash computation. The standard pattern reads in 8 KB or 64 KB blocks:
The key insight: the hash object maintains internal state, so update(chunk1) followed by update(chunk2) equals update(chunk1 + chunk2). This makes hashlib naturally suited for streaming data, network protocols, and pipeline processing.
Password Hashing — The Nuance
Raw hashlib is not suitable for password hashing. It’s too fast. An attacker with a GPU can compute billions of SHA-256 hashes per second, making brute-force attacks on common passwords trivial.
Secure password hashing requires three additions:
- Salt — a random value unique to each password, preventing rainbow table attacks
- Iteration count — making each hash computation deliberately slow
- Memory hardness (optional) — using large amounts of RAM to resist GPU/ASIC attacks
Python provides hashlib.pbkdf2_hmac() for this purpose. It applies HMAC-SHA256 repeatedly (typically 600,000+ iterations as of OWASP 2023 recommendations), forcing each password guess to take measurable time.
For new projects, consider bcrypt, scrypt, or argon2 via third-party libraries. These add memory hardness, which makes GPU-based cracking dramatically more expensive.
Common Misconception
“Hashing encrypts data.” Hashing and encryption solve different problems. Encryption is reversible — you can decrypt with the key. Hashing is irreversible by design. You hash to verify integrity and prove knowledge without revealing content. You encrypt to protect data you need to read later.
Real-World Applications
File integrity verification — Software distributors publish SHA-256 digests. After downloading, you hash the file locally and compare. Package managers like pip and apt do this automatically.
Content-addressable storage — Git identifies every object (commit, tree, blob) by its SHA-1 hash. This means identical content always gets the same address, enabling deduplication and integrity checks across the entire repository history.
Digital signatures — Before signing a document, you hash it. The signature covers the hash, not the full document. This makes signing fast regardless of document size and ensures any modification invalidates the signature.
Cache keys — Hash request parameters to create cache keys. Two identical requests produce the same hash, yielding a cache hit without complex key construction logic.
The one thing to remember: hashlib gives you fast, reliable fingerprinting for data — but for passwords specifically, you need slow, salted hashing that hashlib supports through pbkdf2_hmac.
See Also
- Python Certificate Pinning Why your Python app should remember which ID card a server uses — and refuse impostors even if they have official-looking badges.
- Python Cryptography Library Understand Python Cryptography Library with a vivid mental model so secure Python choices feel obvious, not scary.
- Python Dependency Vulnerability Scanning Why the libraries your Python project uses might be secretly broken — and how to find out before hackers do.
- Python Hmac Authentication How Python proves a message wasn't tampered with — using a secret handshake only you and the receiver know.
- Python Owasp Top Ten The ten most common ways hackers break into web apps — and how Python developers can stop every single one.