Python Hashlib Hashing — Core Concepts

What Hashlib Actually Does

The hashlib module provides a uniform interface to dozens of cryptographic hash functions. Each function takes arbitrary-length input and produces a fixed-length output — the digest. The three properties that make this useful:

  1. Deterministic — the same input always yields the same digest
  2. Pre-image resistant — given a digest, you can’t find the input
  3. Collision resistant — it’s computationally infeasible to find two different inputs with the same digest

Choosing an Algorithm

AlgorithmDigest SizeSpeedStatus
MD5128 bitsFastBroken — collisions found
SHA-1160 bitsFastDeprecated — collision attacks practical
SHA-256256 bitsModerateCurrent standard
SHA-512512 bitsModeratePreferred for larger security margins
SHA-3224–512 bitsModerateNewest standard, different internal design
BLAKE2bUp to 512 bitsVery fastModern, safe, faster than SHA-256

For most applications, SHA-256 is the right choice. It’s universally supported, well-studied, and has no known weaknesses. Use BLAKE2 when performance matters — it’s faster than SHA-256 while maintaining strong security guarantees.

Never use MD5 or SHA-1 for security. They’re fine for non-security checksums (cache keys, deduplication), but an attacker can craft collisions in seconds.

Basic Usage

The workflow is always the same: create a hash object, feed it data with .update(), and extract the digest.

You can call .update() multiple times — the result is identical to hashing the concatenated data. This is critical for large files: instead of loading a 4 GB file into memory, you read it in chunks and update the hash incrementally.

The .hexdigest() method returns the digest as a hex string. The .digest() method returns raw bytes — useful when feeding the hash into another cryptographic function.

Hashing Large Files

Reading an entire file into memory defeats the purpose of streaming hash computation. The standard pattern reads in 8 KB or 64 KB blocks:

The key insight: the hash object maintains internal state, so update(chunk1) followed by update(chunk2) equals update(chunk1 + chunk2). This makes hashlib naturally suited for streaming data, network protocols, and pipeline processing.

Password Hashing — The Nuance

Raw hashlib is not suitable for password hashing. It’s too fast. An attacker with a GPU can compute billions of SHA-256 hashes per second, making brute-force attacks on common passwords trivial.

Secure password hashing requires three additions:

  • Salt — a random value unique to each password, preventing rainbow table attacks
  • Iteration count — making each hash computation deliberately slow
  • Memory hardness (optional) — using large amounts of RAM to resist GPU/ASIC attacks

Python provides hashlib.pbkdf2_hmac() for this purpose. It applies HMAC-SHA256 repeatedly (typically 600,000+ iterations as of OWASP 2023 recommendations), forcing each password guess to take measurable time.

For new projects, consider bcrypt, scrypt, or argon2 via third-party libraries. These add memory hardness, which makes GPU-based cracking dramatically more expensive.

Common Misconception

“Hashing encrypts data.” Hashing and encryption solve different problems. Encryption is reversible — you can decrypt with the key. Hashing is irreversible by design. You hash to verify integrity and prove knowledge without revealing content. You encrypt to protect data you need to read later.

Real-World Applications

File integrity verification — Software distributors publish SHA-256 digests. After downloading, you hash the file locally and compare. Package managers like pip and apt do this automatically.

Content-addressable storageGit identifies every object (commit, tree, blob) by its SHA-1 hash. This means identical content always gets the same address, enabling deduplication and integrity checks across the entire repository history.

Digital signatures — Before signing a document, you hash it. The signature covers the hash, not the full document. This makes signing fast regardless of document size and ensures any modification invalidates the signature.

Cache keys — Hash request parameters to create cache keys. Two identical requests produce the same hash, yielding a cache hit without complex key construction logic.

The one thing to remember: hashlib gives you fast, reliable fingerprinting for data — but for passwords specifically, you need slow, salted hashing that hashlib supports through pbkdf2_hmac.

pythonsecuritycryptography

See Also