Python Memory-Mapped Files — Deep Dive

Virtual Memory Mechanics

Memory mapping works through the CPU’s virtual memory hardware. When mmap() is called at the OS level:

  1. The kernel creates a virtual memory area (VMA) in the process’s address space, marking a range of virtual addresses as backed by the file.
  2. These pages are initially marked as not present — no physical RAM is allocated.
  3. When the process accesses a mapped address, a page fault occurs.
  4. The kernel’s page fault handler reads the corresponding page (typically 4 KB) from disk into a physical frame, unless it is already in the page cache, and updates the page table.
  5. The CPU retries the instruction — this time it succeeds, transparently.

Subsequent accesses to the same page hit physical RAM directly. The OS tracks which pages are “hot” and evicts cold pages under memory pressure, using an approximation of LRU (such as the CLOCK algorithm).
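The fault-on-first-touch behaviour can be exercised from Python. The following is a minimal sketch, not a measurement tool: touch_each_page and demo are illustrative names, and the code assumes a sparse temporary file is an acceptable stand-in for real data. It reads one byte per page, so each loop iteration is the first access to a new page.

```python
import mmap
import tempfile

def touch_each_page(mm):
    """Read one byte from every page of the mapping and return the page count."""
    pages = 0
    for offset in range(0, len(mm), mmap.PAGESIZE):
        _ = mm[offset]  # first access to a non-resident page causes a page fault
        pages += 1
    return pages

def demo(n_pages=8):
    """Map a temporary file of n_pages pages and touch each page once."""
    with tempfile.TemporaryFile() as f:
        f.truncate(n_pages * mmap.PAGESIZE)  # extend with zeros, no data written
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return touch_each_page(mm)
```

Creating the mapping is cheap regardless of file size; the cost is paid page by page inside the loop.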

Python’s mmap Module Internals

Python’s mmap.mmap object wraps the OS-level mmap() system call and exposes it with Python buffer protocol support:

import mmap

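with open("data.bin", "r+b") as f:
    mm = mmap.mmap(
        f.fileno(),        # File descriptor
        0,                 # Length (0 = entire file)
        access=mmap.ACCESS_WRITE,  # Read-write, changes are written through to the file
        offset=0           # Starting offset (multiple of mmap.ALLOCATIONGRANULARITY)
    )
    # ... use mm like a mutable bytes-like object ...
    mm.close()             # Release the mapping when done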

The offset parameter must be a multiple of mmap.ALLOCATIONGRANULARITY (typically 4096 on Linux, 65536 on Windows). On Linux this matches the page size used by the hardware page tables; on Windows it is the coarser allocation granularity of the virtual memory manager.
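To read from an arbitrary byte position, the usual approach is to round the offset down to the granularity and skip the difference inside the mapping. A minimal sketch, assuming a read-only use case; read_window is a hypothetical helper name, not part of the mmap module:

```python
import mmap

def read_window(path, start, length):
    """Return `length` bytes beginning at byte `start`, mapping only the needed range."""
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = (start // gran) * gran   # largest valid offset not exceeding start
    delta = start - aligned            # bytes to skip inside the mapping
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), delta + length,
                       access=mmap.ACCESS_READ, offset=aligned) as mm:
            return mm[delta:delta + length]
```

The requested range must lie within the file; mapping past the end of the file raises ValueError on Unix.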

Access Modes

Mode          Description                                       OS flags
ACCESS_READ   Read-only; writes raise TypeError                 PROT_READ
ACCESS_WRITE  Read-write; changes go to the file                PROT_READ | PROT_WRITE, MAP_SHARED
ACCESS_COPY   Read-write; changes are private (copy-on-write)   PROT_READ | PROT_WRITE, MAP_PRIVATE

ACCESS_COPY is particularly useful for analysis: you can modify the mapped data (e.g., patching bytes for testing) without affecting the original file. The OS uses copy-on-write semantics — modified pages get their own physical memory while unmodified pages continue sharing the file’s pages.
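The copy-on-write behaviour is easy to check. In this sketch, patch_privately is an illustrative name and the four-byte patch is arbitrary; the function modifies the mapping and then re-reads the file through a normal file handle to show that the bytes on disk are untouched.

```python
import mmap

def patch_privately(path):
    """Patch the first 4 bytes in a private mapping; return (patched view, bytes on disk)."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as mm:
            mm[:4] = b"TEST"          # modifies a private copy of the page only
            patched = mm[:8]
    with open(path, "rb") as f:
        on_disk = f.read(8)           # the file itself is unchanged
    return patched, on_disk
```

The modified pages are discarded when the mapping is closed, so nothing needs to be undone afterwards.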

NumPy Integration

NumPy arrays can be backed directly by memory-mapped files, combining vectorized operations with mmap efficiency:

import numpy as np

# Create a memory-mapped array
arr = np.memmap("features.dat", dtype=np.float32,
                mode='r+', shape=(1_000_000, 128))

# Operate on slices without loading everything
batch = arr[5000:5100]  # Only pages for rows 5000-5099 are read, and only when accessed
norms = np.linalg.norm(batch, axis=1)

# Write results back
arr[5000:5100] /= norms[:, np.newaxis]
arr.flush()

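This pattern lets algorithms that consume data in batches work on arrays that exceed available memory, because the OS pages data in and out as different portions are accessed. In the scikit-learn ecosystem, joblib applies the same idea: large NumPy arrays passed to parallel workers are memory-mapped so that the processes share one copy. (sklearn.datasets.load_svmlight_file, by contrast, parses the file into an in-memory sparse matrix rather than mapping it.)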

Shared Memory IPC via mmap

Two processes can communicate through a shared memory-mapped file:

Writer Process

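import mmap
import struct
import time

PATH = "/tmp/shared_data.bin"

# mmap cannot map beyond the end of the file, so create it at full size first
with open(PATH, "wb") as f:
    f.truncate(1024)

with open(PATH, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 1024)

    for i in range(1000):
        # Write a counter and timestamp (8 + 8 bytes)
        data = struct.pack('Qd', i, time.time())
        mm[:16] = data
        # flush() forces the pages to disk; other processes that map the
        # file see the new bytes even without it
        mm.flush()
        time.sleep(0.01)

    mm.close()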

Reader Process

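import mmap
import struct
import time

# Read-only mapping, so the file only needs to be opened for reading
with open("/tmp/shared_data.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 1024, access=mmap.ACCESS_READ)

    last_counter = -1
    while last_counter < 999:      # The writer stops after counter 999
        data = mm[:16]
        counter, timestamp = struct.unpack('Qd', data)
        if counter != last_counter:
            print(f"Counter: {counter}, Time: {timestamp}")
            last_counter = counter
        time.sleep(0.001)          # Poll instead of spinning at 100% CPU

    mm.close()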

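This approach can be faster than sockets or pipes for large data transfers because the data never passes through a kernel buffer: both processes access the same physical pages through their respective page tables. The example has no synchronization, so a reader can observe a half-written record; real systems add a lock or a sequence counter.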

For Python 3.8+, multiprocessing.shared_memory provides a higher-level API for named shared memory blocks that are not backed by a regular file. File-backed mmap remains useful when you need persistence or cross-language compatibility.
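For comparison, here is a minimal sketch of the same counter-and-timestamp record passed through multiprocessing.shared_memory. The roundtrip function is illustrative; for brevity it attaches to the block from the same process, whereas a real consumer would call SharedMemory(name=...) in a second process after receiving the name.

```python
import struct
from multiprocessing import shared_memory

def roundtrip(counter, timestamp):
    """Write a record into a new shared block and read it back via a second handle."""
    shm = shared_memory.SharedMemory(create=True, size=16)
    try:
        struct.pack_into("Qd", shm.buf, 0, counter, timestamp)
        peer = shared_memory.SharedMemory(name=shm.name)  # attach by name
        try:
            return struct.unpack_from("Qd", peer.buf, 0)
        finally:
            peer.close()
    finally:
        shm.close()
        shm.unlink()  # remove the block once every user is done with it
```

Unlike a file-backed mapping, the block disappears after unlink(), so nothing persists across restarts.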

Anonymous Memory Mapping

You can create memory-mapped regions without a backing file using mmap.mmap(-1, size):

import mmap

# Create 10 MB anonymous mapping
mm = mmap.mmap(-1, 10 * 1024 * 1024)

# Use as a fast, mutable byte buffer (resize() support varies by platform)
mm[:4] = b'\x01\x02\x03\x04'
data = mm[:4]

mm.close()

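Anonymous mappings have no backing file: pages are zero-filled, allocated lazily on first touch, and can be written to swap under memory pressure. Ordinary heap objects such as bytearray can be swapped out too, so that is not the distinguishing feature. The practical advantages are that the memory is returned to the OS as soon as the mapping is closed, and that on Unix the mapping is shared by default with child processes created by fork().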

Page Cache Behavior and Tuning

Understanding the OS page cache is crucial for mmap performance:

import mmap
import os

# Advise the kernel about access patterns
with open("sequential_data.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Tell OS we'll read sequentially — enables readahead
    mm.madvise(mmap.MADV_SEQUENTIAL)

    # Or for random access — disables readahead
    # mm.madvise(mmap.MADV_RANDOM)

    # Process data...
    mm.close()

madvise hints (Python 3.8+, on systems that provide madvise; which constants exist depends on the platform):

Hint             Effect
MADV_SEQUENTIAL  Aggressive readahead; pages may be freed soon after they are read
MADV_RANDOM      Disable readahead
MADV_WILLNEED    Start reading the specified pages in the background (like prefetch)
MADV_DONTNEED    Drop the specified pages from the process; a later access faults them in again

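For sequential scans through large files, MADV_SEQUENTIAL can improve throughput, by a factor of 2-3x in favorable cases, because the kernel reads ahead while you process the current page. The gain depends on the storage device and kernel, so measure on your own workload.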
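Because madvise and its constants are not available everywhere, portable code can guard the call. A minimal sketch of a sequential scan, where sequential_sum and the 1 MiB chunk size are illustrative choices rather than recommendations:

```python
import mmap

def sequential_sum(path, chunk_size=1 << 20):
    """Sum every byte of a non-empty file, scanning front to back in chunks."""
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only give the hint where the platform and Python version support it
            if hasattr(mm, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                mm.madvise(mmap.MADV_SEQUENTIAL)
            for start in range(0, len(mm), chunk_size):
                total += sum(mm[start:start + chunk_size])
    return total
```

The hint is advisory: the function returns the same result with or without it, only the I/O pattern changes.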

Error Handling: SIGBUS

On Unix, if a memory-mapped file is truncated by another process while you’re reading it, accessing pages beyond the new end triggers a SIGBUS signal, which kills the process by default. Python cannot catch this with a try/except — it’s a signal, not an exception.

Defensive strategies:

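import os
import mmap

# A Python-level signal.signal(SIGBUS, ...) handler cannot recover here:
# after the handler returns, the faulting memory access is retried and
# faults again. Check the file size before touching the mapping instead.

def safe_read(f, mm, start, length):
    """Return mm[start:start+length] only if the file still covers that range."""
    if os.fstat(f.fileno()).st_size < start + length:
        raise OSError("Memory-mapped file was truncated")
    # A truncation between the check and the read can still fault;
    # this narrows the window but does not close it.
    return mm[start:start + length]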

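Better yet: prevent the truncation in the first place. Have every process that maps or truncates the file take a lock with fcntl.flock(); the locks are advisory, so this only works if all parties cooperate. ACCESS_COPY does not help here: it gives private copies of pages you have modified, but untouched pages still come from the file and still fault if the file shrinks.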
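A minimal sketch of cooperative locking on Unix follows (fcntl is not available on Windows). read_locked and truncate_locked are hypothetical helpers: readers hold a shared lock while the mapping is in use, and anything that shrinks the file takes an exclusive lock first.

```python
import fcntl
import mmap

def read_locked(path, start, length):
    """Read a slice through mmap while holding a shared advisory lock."""
    with open(path, "rb") as f:
        fcntl.flock(f, fcntl.LOCK_SH)      # blocks while a truncater holds LOCK_EX
        try:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                return mm[start:start + length]
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

def truncate_locked(path, new_size):
    """Shrink the file only when no cooperating reader has it locked."""
    with open(path, "r+b") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # waits for all shared locks to be released
        try:
            f.truncate(new_size)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

A process that ignores the lock can still truncate the file, so this protects only against cooperating writers.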

Production Pattern: Memory-Mapped Ring Buffer

A high-performance logging system can use mmap as a ring buffer:

import mmap
import struct
import os

class MmapRingBuffer:
    HEADER_SIZE = 16  # write_pos (8 bytes) + count (8 bytes)

    def __init__(self, path, capacity=1_000_000, record_size=256):
        self.record_size = record_size
        self.capacity = capacity
        total_size = self.HEADER_SIZE + capacity * record_size

        if not os.path.exists(path):
            with open(path, 'wb') as f:
                # truncate() extends the file with zeros without first
                # building a total_size-byte string in memory
                f.truncate(total_size)

        self._f = open(path, 'r+b')
        # On Unix this raises ValueError if an existing file is too small
        self._mm = mmap.mmap(self._f.fileno(), total_size)

    def write(self, data: bytes):
        # Not safe for concurrent writers: add a lock if several threads
        # or processes write to the same buffer
        if len(data) > self.record_size:
            raise ValueError("record is larger than record_size")
        padded = data.ljust(self.record_size, b'\x00')

        write_pos, count = struct.unpack('QQ', self._mm[:16])
        offset = self.HEADER_SIZE + (write_pos % self.capacity) * self.record_size
        self._mm[offset:offset + self.record_size] = padded

        # Update the header after the record so the header never points
        # at a slot that has not been written yet
        write_pos += 1
        count = min(count + 1, self.capacity)
        self._mm[:16] = struct.pack('QQ', write_pos, count)

    def close(self):
        self._mm.flush()
        self._mm.close()
        self._f.close()

This pattern provides:

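  • Crash recovery: records survive a crash of the process because they already live in the page cache; surviving a power loss or kernel crash requires flush() first.
  • Cheap writes: each record is copied once, straight into the mapped pages, with no write() system call.
  • Bounded memory: fixed-size ring buffer; the OS manages page residency.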
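Reading the buffer back, for example after a crash, only needs the header and the slot arithmetic. The sketch below is a standalone reader under stated assumptions: read_records is a hypothetical function, the layout is the 16-byte 'QQ' header followed by fixed-size slots described above, and trailing zero bytes are treated as padding (so records that legitimately end in zero bytes would be shortened).

```python
import mmap
import struct

HEADER_SIZE = 16  # write_pos (8 bytes) + count (8 bytes)

def read_records(path, capacity, record_size):
    """Return the stored records, oldest first."""
    records = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            write_pos, count = struct.unpack("QQ", mm[:HEADER_SIZE])
            # The oldest surviving record was written `count` writes ago
            for i in range(write_pos - count, write_pos):
                offset = HEADER_SIZE + (i % capacity) * record_size
                records.append(mm[offset:offset + record_size].rstrip(b"\x00"))
    return records
```

The capacity and record_size passed in must match the values used when the file was written, since they are not stored in the header.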

Benchmarks: mmap vs Alternatives

Reading 100,000 random 4 KB blocks from a 4 GB file:

Method                     Time                            Peak RSS
f.seek() + f.read()        4.2 s                           12 MB
mmap random access         1.8 s                           180 MB (OS-managed)
mmap + MADV_RANDOM         1.6 s                           120 MB
Full f.read() then index   14 s (load) + 0.01 s (access)   4 GB

In this benchmark mmap is fastest for random access: each read is a memory access that at worst takes a page fault, with no seek() and read() system calls per block. The higher RSS reflects cached pages that the OS will reclaim under pressure.
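Numbers like these vary with hardware, file system, and cache state, so it is worth rerunning the comparison locally. The following is a rough timing harness, not the code behind the table above; the function names, block count, and seed are assumptions made for illustration.

```python
import mmap
import os
import random
import time

def read_with_seek(path, offsets, block_size):
    """Read each block with seek() + read(); return total bytes read."""
    total = 0
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            total += len(f.read(block_size))
    return total

def read_with_mmap(path, offsets, block_size):
    """Read each block by slicing a read-only mapping; return total bytes read."""
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for off in offsets:
                total += len(mm[off:off + block_size])
    return total

def compare(path, n_blocks=1000, block_size=4096, seed=0):
    """Time both methods on the same random offsets; return seconds per method."""
    size = os.path.getsize(path)
    rng = random.Random(seed)
    offsets = [rng.randrange(0, size - block_size + 1) for _ in range(n_blocks)]
    timings = {}
    for name, fn in (("seek+read", read_with_seek), ("mmap", read_with_mmap)):
        start = time.perf_counter()
        fn(path, offsets, block_size)
        timings[name] = time.perf_counter() - start
    return timings
```

On a file that is already in the page cache both methods mostly measure CPU overhead; to compare cold-cache behaviour the file has to be larger than RAM or the cache dropped between runs.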

The one thing to remember: Memory-mapped files leverage the OS’s virtual memory system for zero-copy file access with automatic page management — use madvise to match your access pattern, NumPy memmap for numerical data, and file-backed mappings for inter-process communication where copying overhead is unacceptable.
