Python Zero-Copy Buffers — Deep Dive

The Buffer Protocol: C-Level Details

Python’s buffer protocol (PEP 3118) defines how objects expose raw memory to other objects. At the C level, when memoryview(obj) is called, Python calls the bf_getbuffer slot of obj’s type (exposed to Python code as __buffer__() since Python 3.12), which fills a Py_buffer struct:

typedef struct {
    void *buf;          // Pointer to the memory block
    PyObject *obj;      // Exporting object (a reference is held while the view lives)
    Py_ssize_t len;     // Total bytes
    Py_ssize_t itemsize;// Size of a single element
    int readonly;       // Read-only flag
    int ndim;           // Number of dimensions
    char *format;       // struct-style format string ('B', 'd', 'i', etc.)
    Py_ssize_t *shape;  // Size of each dimension
    Py_ssize_t *strides;// Bytes to skip for each dimension
    Py_ssize_t *suboffsets; // For indirect arrays (PIL-style images)
    void *internal;     // Private to the exporter
} Py_buffer;

The key insight: the buf pointer points directly into the source object’s memory. No copy occurs. The strides array enables non-contiguous views — a column slice of a 2D array can be represented as a strided view without copying.
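A minimal sketch of this sharing, using only the standard library: a memoryview over a bytearray aliases the same memory, so a write through the view shows up in the source object. The variable names are illustrative.

```python
source = bytearray(b"hello world")
view = memoryview(source)

# A slice of a memoryview is another view onto the same memory, not a copy.
tail = view[6:]
tail[0] = ord("W")                  # write through the view...

print(source)                       # bytearray(b'hello World')
print(tail.obj is source)           # True: the view references its exporter
print(memoryview(b"abc").readonly)  # True: views over bytes are read-only
```

Because bytes is immutable, a view over it reports readonly and rejects item assignment; a bytearray, array.array, or mmap is needed when the consumer must write.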

Python 3.12+: The __buffer__ Protocol

PEP 688 (Python 3.12) added __buffer__ as a Python-level method, making it possible to create buffer-exporting objects in pure Python:

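class SharedBuffer:
    """Pure Python object that exports its internal buffer."""

    def __init__(self, size):
        self._data = bytearray(size)

    def __buffer__(self, flags):
        # PEP 688: __buffer__ must return a memoryview. `flags` carries the
        # consumer's request (see inspect.BufferFlags); this sketch ignores
        # it and always exposes the writable bytearray.
        return memoryview(self._data)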

Before 3.12, a class could export a buffer only through a C extension, or by subclassing a built-in exporter such as bytearray.

Multi-Dimensional Zero-Copy Views

memoryview supports multi-dimensional data natively:

import array

# Create a flat array of 12 integers
flat = array.array('i', range(12))
view = memoryview(flat)

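# Reshape to 3×4 without copying. cast() requires one side of the
# conversion to be a byte format, so go through 'B' first.
matrix = view.cast('B').cast('i', shape=(3, 4))

print(matrix[1, 2])  # 6: element at row 1, col 2
print(matrix.strides)  # (16, 4) with 4-byte ints: bytes per row, bytes per element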

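The cast() method reinterprets the buffer with a different format or shape. Since it only changes the view metadata (format, shape, strides), no data movement occurs. Two restrictions apply: the source must be C-contiguous, and one of the two formats involved must be a byte format ('B', 'b', or 'c').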
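The following sketch shows that casts only relabel the same bytes. It packs four 16-bit values, views them as bytes, as 16-bit integers, and as a 2×2 grid, then writes through one view and reads through another. The names raw, as_u16, and grid are illustrative.

```python
import struct

raw = bytearray(struct.pack("4H", 1, 2, 3, 4))   # 8 bytes, native byte order
as_bytes = memoryview(raw)                        # format 'B', shape (8,)
as_u16 = as_bytes.cast("H")                       # format 'H', shape (4,)
grid = as_u16.cast("B").cast("H", shape=(2, 2))   # 2x2 view over the same bytes

as_u16[0] = 500                                   # write through the 16-bit view
print(grid.tolist())                              # [[500, 2], [3, 4]]
print(struct.unpack_from("H", raw, 0)[0])         # 500: the bytearray changed

try:
    as_u16.cast("I")      # 'H' to 'I': neither side is a byte format
except TypeError as exc:
    print("rejected:", exc)
```

Going through 'B' as an intermediate step is the usual way around the byte-format restriction.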

NumPy-Style Slicing

import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)  # 8 MB
view = memoryview(arr)

# Slice with step — creates strided view, zero-copy
every_10th = view[::10]
print(every_10th.strides)  # (80,) — skips 10 elements × 8 bytes
print(len(every_10th))     # 100,000 elements, but no data copied
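The same stepped slicing works without NumPy. The sketch below uses array.array from the standard library (an assumption made to keep it dependency-free) and shows two consequences of a strided view: writes still land in the original buffer, and the view is no longer contiguous, so producing packed bytes requires an explicit copy.

```python
import array

values = array.array("d", range(100))   # 100 doubles, 8 bytes each
view = memoryview(values)

every_10th = view[::10]                  # strided view, no data copied
print(every_10th.strides)                # (80,)
print(every_10th.tolist()[:3])           # [0.0, 10.0, 20.0]
print(every_10th.contiguous)             # False

every_10th[1] = -1.0                     # the write lands in the original array
print(values[10])                        # -1.0

packed = every_10th.tobytes()            # explicit copy into contiguous bytes
print(len(packed))                       # 80
```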

Zero-Copy I/O with the Kernel

os.sendfile: True Kernel-Level Zero Copy

os.sendfile() wraps the sendfile(2) system call, which transfers data between file descriptors entirely within the kernel, so the file contents never pass through user-space memory:

import os
import socket

def send_file_zero_copy(sock, filepath):
    with open(filepath, 'rb') as f:
        file_size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < file_size:
            sent = os.sendfile(
                sock.fileno(),
                f.fileno(),
                offset,
                file_size - offset
            )
            if sent == 0:
                # EOF reached early (e.g. the file was truncated); stop
                # instead of looping forever.
                break
            offset += sent

This is how high-performance web servers such as nginx and Apache can serve static files when their sendfile support is enabled. The data goes from disk → kernel page cache → network stack → network card, never entering the server process’s memory.
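For portable code, the standard library also offers socket.sendfile(), a higher-level method that uses os.sendfile() where the platform supports it and falls back to plain send() otherwise. The sketch below exercises it over a local socket pair with a small temporary file; send_path and the 4 KB payload are illustrative choices, not part of any established API.

```python
import os
import socket
import tempfile

def send_path(sock, path):
    """Send a whole file over a connected stream socket; return bytes sent."""
    with open(path, "rb") as f:
        return sock.sendfile(f)

# Demo: a small temporary file sent across a local socket pair.
payload = b"x" * 4096
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

left, right = socket.socketpair()
try:
    sent = send_path(left, path)
    left.shutdown(socket.SHUT_WR)        # signal end-of-stream to the reader
    received = bytearray()
    while chunk := right.recv(65536):
        received += chunk
finally:
    left.close()
    right.close()
    os.unlink(path)

print(sent, bytes(received) == payload)
```

The payload is kept small so that it fits in the socket buffer; with a larger file the reader would have to run concurrently with the sender.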

Socket Methods with Buffer Protocol

Python’s socket methods accept any buffer protocol object:

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("example.com", 80))

# Zero-copy send from a memoryview
data = bytearray(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
view = memoryview(data)
sock.sendall(view)

# Zero-copy receive into existing buffer
buf = bytearray(65536)
n = sock.recv_into(buf)
response = memoryview(buf)[:n]  # Zero-copy slice of received data

The recv_into() method writes directly into an existing buffer instead of allocating a new bytes object, which is critical for high-frequency network I/O.
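Since recv_into() may return fewer bytes than the buffer can hold, a common pattern is to loop while advancing a memoryview slice over one preallocated buffer. A minimal sketch, where recv_exact is a hypothetical helper name and a local socket pair stands in for a real connection:

```python
import socket

def recv_exact(sock, view):
    """Fill `view` completely from `sock`; raise if the peer closes early."""
    received = 0
    while received < len(view):
        # view[received:] is a zero-copy window onto the unfilled tail.
        n = sock.recv_into(view[received:])
        if n == 0:
            raise ConnectionError("peer closed before the buffer was filled")
        received += n
    return received

a, b = socket.socketpair()
try:
    a.sendall(b"hello world!")
    buf = bytearray(12)                  # allocated once, reusable across reads
    count = recv_exact(b, memoryview(buf))
finally:
    a.close()
    b.close()

print(count, buf)
```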

Building a Zero-Copy Parser

A binary protocol parser using zero-copy throughout:

import struct

class ZeroCopyParser:
    """Parse a binary stream without copying data."""

    def __init__(self, data):
        # Accepts any bytes-like object (bytes, bytearray, mmap, ...).
        self._view = memoryview(data)
        self._pos = 0

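    def read_bytes(self, n):
        """Return a zero-copy slice of the next n bytes."""
        if n > self.remaining():
            # Slicing past the end would silently return a short view.
            raise ValueError(f"need {n} bytes, only {self.remaining()} left")
        result = self._view[self._pos:self._pos + n]
        self._pos += n
        return result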

    def read_uint32(self):
        result = struct.unpack_from('>I', self._view, self._pos)[0]
        self._pos += 4
        return result

    def read_string(self):
        length = self.read_uint32()
        data = self.read_bytes(length)
        return str(data, 'utf-8')  # Decodes straight from the view; the str is the only copy

    def remaining(self):
        return len(self._view) - self._pos

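# Usage with a 100 MB binary blob
# read_from_network() and process_message() are placeholders for your own code.
blob = read_from_network()  # bytes or bytearray; wrapping it in bytearray() would copy all 100 MB
parser = ZeroCopyParser(blob)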

while parser.remaining() > 0:
    msg_type = parser.read_uint32()
    payload_len = parser.read_uint32()
    payload = parser.read_bytes(payload_len)  # Zero-copy!
    process_message(msg_type, payload)

struct.unpack_from reads directly from the buffer at an offset — no slicing or copying needed for scalar values.
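The framing used above (a 4-byte big-endian type followed by a 4-byte big-endian length) can also be parsed with one precompiled struct.Struct, which reads both header fields in a single call. This is a sketch under that assumed layout; HEADER and iter_messages are illustrative names.

```python
import struct

HEADER = struct.Struct(">II")   # msg_type, payload_len (assumed wire layout)

def iter_messages(buf):
    """Yield (msg_type, payload_view) pairs; payloads are zero-copy slices."""
    view = memoryview(buf)
    pos = 0
    while pos < len(view):
        if len(view) - pos < HEADER.size:
            raise ValueError("truncated header")
        msg_type, length = HEADER.unpack_from(view, pos)
        pos += HEADER.size
        if len(view) - pos < length:
            raise ValueError("truncated payload")
        yield msg_type, view[pos:pos + length]
        pos += length

blob = HEADER.pack(1, 5) + b"hello" + HEADER.pack(2, 0) + HEADER.pack(3, 2) + b"ok"
messages = [(t, bytes(p)) for t, p in iter_messages(blob)]
print(messages)   # [(1, b'hello'), (2, b''), (3, b'ok')]
```

The bytes(p) call in the last step copies only for display; a real consumer would keep working on the views.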

Zero-Copy with ctypes and FFI

When interfacing with C libraries, zero-copy avoids expensive data marshaling:

import ctypes
import numpy as np

# Create a NumPy array
arr = np.zeros(1000, dtype=np.float64)

# Get a ctypes pointer to the array's data — zero-copy
ptr = arr.ctypes.data_as(ctypes.POINTER(ctypes.c_double))

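# Pass to C function. "./mylib.so" and process_data are placeholders for a
# library exposing: void process_data(double *data, size_t n)
lib = ctypes.CDLL("./mylib.so")
lib.process_data.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_size_t]
lib.process_data.restype = None
lib.process_data(ptr, len(arr))
# C function reads/writes the NumPy array's memory directly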

The same principle underlies Python scientific libraries (SciPy, OpenCV, TensorFlow): they hand pointers into array memory to compiled code instead of converting the data element by element, which is how they achieve near-C performance.
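The pointer sharing can be observed without NumPy or a compiled library. The sketch below maps a ctypes array type onto an array.array using from_buffer(), which requires a writable buffer and does not copy; the write through the ctypes object stands in for what a C function would do.

```python
import array
import ctypes

values = array.array("d", [1.0, 2.0, 3.0])

# Build a ctypes array type of matching length and map it onto the buffer.
DoubleArray = ctypes.c_double * len(values)
c_values = DoubleArray.from_buffer(values)

# Both objects describe the same address.
address, _count = values.buffer_info()
print(ctypes.addressof(c_values) == address)   # True

# A write on the ctypes side is visible from Python.
c_values[0] = 10.0
print(values[0])                               # 10.0
```

While c_values is alive, the array holds an exported buffer and cannot be resized, which protects the C side from a dangling pointer.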

Advanced: Scatter/Gather I/O

For protocols with headers and payloads from different buffers, scatter/gather avoids concatenation:

import socket

header = bytearray(b"\x00\x01\x00\x64")  # 4-byte header
payload = bytearray(100)                    # 100-byte payload

# Traditional: concatenate then send (copies both)
# sock.sendall(header + payload)

# Zero-copy: send multiple buffers
sock.sendmsg([header, payload])  # Kernel gathers from both buffers

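sock.sendmsg() hands the list of buffers to the sendmsg(2) system call as an iovec array, the same gather mechanism writev() uses, so the kernel reads the header and payload from their separate locations without user-space concatenation. On a TCP socket the data joins the byte stream, and how it is split into segments is up to the kernel. Like send(), the call may transmit only part of the data, so check the returned byte count. sendmsg() is available on Unix-like platforms only.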
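Handling a partial send means dropping the buffers that went out completely and trimming the first one that did not. A minimal sketch for Unix-like systems, where send_all_buffers is a hypothetical helper and a local socket pair stands in for a real connection:

```python
import socket

def send_all_buffers(sock, buffers):
    """Send every buffer in order via sendmsg(), retrying after partial sends."""
    # cast("B") makes len() and slicing count bytes whatever the source format.
    views = [memoryview(b).cast("B") for b in buffers]
    total = 0
    while views:
        sent = sock.sendmsg(views)
        total += sent
        # Drop buffers that were transmitted in full.
        while views and sent >= len(views[0]):
            sent -= len(views[0])
            views.pop(0)
        # Trim the buffer that was only partly transmitted (zero-copy slice).
        if views and sent:
            views[0] = views[0][sent:]
    return total

header = bytearray(b"\x00\x01\x00\x64")
payload = bytearray(100)

a, b = socket.socketpair()
try:
    total = send_all_buffers(a, [header, payload])
    a.shutdown(socket.SHUT_WR)
    received = bytearray()
    while chunk := b.recv(4096):
        received += chunk
finally:
    a.close()
    b.close()

print(total, len(received))
```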

Performance: Copy vs Zero-Copy Pipeline

Processing 1 million network-like messages of 1 KB each (8-byte header + 1,016-byte payload):

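import os
import time

def process(header, payload):
    # Placeholder for real work; replace with your own handler.
    return len(header) + len(payload)

messages = [bytearray(os.urandom(1024)) for _ in range(1_000_000)]  # about 1 GB of RAM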

# Copy-based pipeline
start = time.perf_counter()
for msg in messages:
    header = msg[:8]        # Copy
    payload = msg[8:]       # Copy
    result = process(header, payload)
copy_time = time.perf_counter() - start

# Zero-copy pipeline
start = time.perf_counter()
for msg in messages:
    view = memoryview(msg)
    header = view[:8]       # Zero-copy
    payload = view[8:]      # Zero-copy
    result = process(header, payload)
zc_time = time.perf_counter() - start

Approach      Time        Allocations
Copy-based    1.8s        2M new bytearrays (header + payload copies)
Zero-copy     0.9s        Small fixed-size memoryview objects only
Improvement   2x faster   About 1 GB less data copied

The speedup comes from skipping the per-message allocate-and-copy of the payload: the allocator does less work, and the CPU cache isn’t polluted with temporary buffers. Actual numbers depend on the message size and on what process() does with the data.

When Zero-Copy Is Not Worth It

  • Small data (<100 bytes) — The overhead of creating a memoryview object exceeds the cost of copying small byte strings.
  • Data that needs transformation — If you’re going to decode, compress, or encrypt the data anyway, a copy is unavoidable.
  • Thread-safety concerns — Shared mutable buffers require careful synchronization. Sometimes a defensive copy is cheaper than adding locks.
  • Long-lived slices of short-lived data — A tiny memoryview slice keeps the entire source buffer alive. If the source is large and temporary, the memory savings from zero-copy can become memory waste.
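The last point can be observed directly. In this sketch a 16-byte view keeps a 10 MB bytearray alive and also blocks resizing it; copying the few bytes worth keeping and releasing the view is one way out. The sizes and names are illustrative.

```python
big = bytearray(10_000_000)           # large, supposedly temporary buffer
tiny = memoryview(big)[:16]           # 16-byte view into it

# While a view is exported, the bytearray cannot be resized...
try:
    big.extend(b"x")
except BufferError as exc:
    print("resize blocked:", exc)

# ...and dropping our own name does not free the 10 MB, because the view
# still holds a reference to the exporter.
del big
print(len(tiny.obj))                  # 10000000

# Remedy: copy the bytes worth keeping, then release the view.
kept = bytes(tiny)
tiny.release()
print(len(kept))                      # 16
```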

The one thing to remember: Zero-copy in Python flows from the buffer protocol through memoryview for user-space slicing, to sendfile and sendmsg for kernel-level I/O — the art is building pipelines where data flows from source to destination touching memory as few times as possible.
