Python struct — Deep Dive
Alignment and padding
C compilers insert padding bytes to align struct members to their natural boundaries. Python’s struct replicates this behavior only in native mode, which is selected by @ or by omitting the prefix altogether. The other prefixes (<, >, !, =) use standard sizes and no padding.
import struct
# Native alignment: compiler may add padding
print(struct.calcsize("@bI")) # likely 8 (1 + 3 padding + 4)
# Standard, no padding
print(struct.calcsize("<bI")) # 5 (1 + 4, no padding)
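When interoperating with C structs compiled with #pragma pack(1) or __attribute__((packed)), use < or > to avoid padding mismatches. For normal C structs, @ reproduces the compiler’s member alignment, but it never adds trailing padding: to round the total size up the way a C compiler does, end the format with a zero-count code for the most strictly aligned type (for example 0I). Either way, native mode makes your code platform-dependent.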
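One way to keep a padded C layout portable is to spell the padding out with x pad bytes under a standard prefix. The sketch below assumes a hypothetical C struct holding a char followed by a 4-byte-aligned uint32_t; the field names are made up for illustration.

```python
import struct

# Hypothetical C layout: char tag; 3 padding bytes; uint32_t count.
# "<" disables automatic padding, so "3x" writes the pad bytes explicitly.
PORTABLE = struct.Struct("<b3xI")

packed = PORTABLE.pack(7, 1000)
print(PORTABLE.size)   # 8 on every platform
print(packed.hex())    # 07000000e8030000

tag, count = PORTABLE.unpack(packed)
```

Because the padding is part of the format string, the size no longer depends on the compiler or CPU that Python happens to run on.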
The buffer protocol
struct.unpack_from reads from any object supporting the buffer protocol (bytes, bytearray, memoryview, mmap), and struct.pack_into writes into any writable one. Neither needs an intermediate bytes object or a slice of the buffer:
import struct
buf = bytearray(1024)
fmt = struct.Struct(">HHI")
# Write directly into buffer at offset 0
fmt.pack_into(buf, 0, 0xCAFE, 0x0001, 42)
# Read from buffer at offset 0
magic, version, payload = fmt.unpack_from(buf, 0)
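The offset argument also makes it easy to treat one buffer as a table of fixed-size records. A small sketch reusing the same header layout; the slot count and field values are invented for illustration.

```python
import struct

fmt = struct.Struct(">HHI")
slots = 4                                  # illustrative record count
buf = bytearray(fmt.size * slots)

# Write record i at byte offset i * fmt.size, in place.
for i in range(slots):
    fmt.pack_into(buf, i * fmt.size, 0xCAFE, 1, i * 10)

# Read one record back without slicing (and therefore copying) the buffer.
magic, version, payload = fmt.unpack_from(buf, 2 * fmt.size)
print(hex(magic), version, payload)        # 0xcafe 1 20
```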
For large binary files, combine mmap with unpack_from to parse gigabyte-scale data without loading it all into memory:
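import mmap, struct
record_fmt = struct.Struct("<IIf")
record_size = record_fmt.size
with open("huge.bin", "rb") as f:
    # Map the file read-only; the context manager unmaps it on exit
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Stop before a trailing partial record, which would raise struct.error
        for offset in range(0, len(mm) - record_size + 1, record_size):
            id_, flags, value = record_fmt.unpack_from(mm, offset)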
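When the file is known to contain only whole records, Struct.iter_unpack can replace the manual offset arithmetic. The following is a self-contained sketch that writes a small temporary file first; the file contents are made up, and a real file would need its length checked against the record size.

```python
import mmap
import os
import struct
import tempfile

record_fmt = struct.Struct("<IIf")

# Create a small sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    for i in range(3):
        tmp.write(record_fmt.pack(i, i * 2, i * 0.5))
    path = tmp.name

records = []
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # iter_unpack requires the buffer length to be a multiple of the
        # record size; it raises struct.error otherwise.
        for id_, flags, value in record_fmt.iter_unpack(mm):
            records.append((id_, flags, value))

os.remove(path)
print(records)  # [(0, 0, 0.0), (1, 2, 0.5), (2, 4, 1.0)]
```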
Performance characteristics
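struct.pack and struct.unpack have to resolve the format string on every call. The module caches recently compiled formats, so the string is not re-parsed each time, but the cache lookup still costs something. struct.Struct compiles the format once and skips the lookup. In tight loops processing millions of records, the difference can matter, so measure it on your own platform:
import struct, timeit
fmt_str = "<IIf"
compiled = struct.Struct(fmt_str)
data = compiled.pack(1, 2, 3.14)
# Module-level function: format lookup on every call
print(timeit.timeit(lambda: struct.unpack(fmt_str, data), number=1_000_000))
# Precompiled Struct: no lookup (absolute timings vary by platform)
print(timeit.timeit(lambda: compiled.unpack(data), number=1_000_000))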
For truly hot paths, consider memoryview casting (Python 3.3+) instead:
buf = bytearray(12000)
view = memoryview(buf).cast("f")  # treat as array of native-order C floats
view[0] = 3.14  # direct memory write, no packing overhead
This bypasses struct entirely for homogeneous arrays — but only works when every element is the same type.
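A cast view and struct agree on the bytes as long as both use the native layout. The sketch below writes through a memoryview and reads the same buffer back with struct; the element count is arbitrary.

```python
import struct

count = 4
item_size = struct.calcsize("@I")          # native unsigned int, usually 4 bytes
buf = bytearray(count * item_size)

view = memoryview(buf).cast("I")           # view the bytes as native unsigned ints
for i in range(count):
    view[i] = i * 100                      # writes straight into buf

# struct sees the same bytes when asked for the same native layout.
values = struct.unpack(f"@{count}I", buf)
print(values)                              # (0, 100, 200, 300)
view.release()
```

Note that cast always uses native byte order and sizes, so this shortcut does not apply to data in a foreign byte order.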
Parsing real binary formats
BMP file header
The BMP format starts with a 14-byte file header followed by a 40-byte info header (for BITMAPINFOHEADER):
import struct
def parse_bmp(path):
    with open(path, "rb") as f:
        # File header: 2s magic, I size, H reserved, H reserved, I offset
        file_hdr = struct.unpack("<2sIHHI", f.read(14))
        magic, file_size, _, _, pixel_offset = file_hdr
        if magic != b"BM":
            raise ValueError("not a BMP file")
        # Info header: I size, i width, i height, H planes, H bpp, ...
        info_hdr = struct.unpack("<IiiHH", f.read(16))
        _, width, height, planes, bpp = info_hdr
    return {
        "magic": magic,
        "file_size": file_size,
        # A negative height marks a top-down bitmap, hence abs()
        "dimensions": (width, abs(height)),
        "bpp": bpp,
    }
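To try the header layout without a real image on disk, the same two formats can build a synthetic header in memory. This is only a sketch: parse_bmp_header is a hypothetical helper, and the sample values describe an imaginary 2x2, 24-bit, top-down bitmap.

```python
import struct

FILE_HDR = struct.Struct("<2sIHHI")   # 14 bytes
INFO_HDR = struct.Struct("<IiiHH")    # first 16 bytes of BITMAPINFOHEADER

def parse_bmp_header(data):
    """Parse the leading 30 bytes of a BMP from an in-memory buffer."""
    if len(data) < FILE_HDR.size + INFO_HDR.size:
        raise ValueError("buffer too short for BMP headers")
    magic, file_size, _, _, pixel_offset = FILE_HDR.unpack_from(data, 0)
    if magic != b"BM":
        raise ValueError("missing BM signature")
    _, width, height, _, bpp = INFO_HDR.unpack_from(data, FILE_HDR.size)
    return {
        "file_size": file_size,
        "pixel_offset": pixel_offset,
        "dimensions": (width, abs(height)),
        "top_down": height < 0,
        "bpp": bpp,
    }

# Synthetic header for a 2x2, 24-bit, top-down image (values are made up).
sample = FILE_HDR.pack(b"BM", 70, 0, 0, 54) + INFO_HDR.pack(40, 2, -2, 1, 24)
info = parse_bmp_header(sample)
print(info["dimensions"], info["top_down"], info["bpp"])   # (2, 2) True 24
```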
Network packet parsing
Custom UDP protocols often have fixed headers:
import struct, socket
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 9000))
HEADER = struct.Struct("!BBH I")  # version, type, length, sequence
while True:
    data, addr = sock.recvfrom(65535)
    if len(data) < HEADER.size:
        continue  # too short to hold a header; drop the datagram
    version, msg_type, length, seq = HEADER.unpack_from(data, 0)
    payload = data[HEADER.size:HEADER.size + length]
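The same header can be exercised without opening a socket by building a datagram in memory and parsing it back. In this sketch, build_packet and parse_packet are hypothetical helper names, and the length check is one possible policy for malformed input rather than a requirement of the protocol.

```python
import struct

HEADER = struct.Struct("!BBHI")   # version, type, payload length, sequence

def build_packet(version, msg_type, seq, payload):
    return HEADER.pack(version, msg_type, len(payload), seq) + payload

def parse_packet(data):
    if len(data) < HEADER.size:
        raise ValueError("datagram shorter than header")
    version, msg_type, length, seq = HEADER.unpack_from(data, 0)
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("declared length exceeds datagram size")
    return version, msg_type, seq, payload

packet = build_packet(1, 2, 7, b"hello")
print(packet.hex())           # 010200050000000768656c6c6f
print(parse_packet(packet))   # (1, 2, 7, b'hello')
```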
Handling variable-length records
struct works with fixed layouts, but real formats often have variable parts. The pattern: parse a fixed header to learn the variable length, then read/unpack that many bytes:
import struct
def read_records(f):
    header = struct.Struct(">HI")  # type (2) + length (4)
    while True:
        raw = f.read(header.size)
        if len(raw) < header.size:
            break  # clean EOF or truncated header
        rec_type, rec_len = header.unpack(raw)
        payload = f.read(rec_len)
        if len(payload) < rec_len:
            raise EOFError("truncated record payload")
        yield rec_type, payload
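A round trip through an in-memory stream shows the header-then-payload pattern end to end. The sketch assumes the same >HI header; write_record is a hypothetical counterpart to the reader, added only for the demonstration.

```python
import io
import struct

HEADER = struct.Struct(">HI")   # type (2 bytes) + payload length (4 bytes)

def write_record(f, rec_type, payload):
    f.write(HEADER.pack(rec_type, len(payload)))
    f.write(payload)

def read_records(f):
    while True:
        raw = f.read(HEADER.size)
        if len(raw) < HEADER.size:
            break
        rec_type, rec_len = HEADER.unpack(raw)
        payload = f.read(rec_len)
        if len(payload) < rec_len:
            raise EOFError("truncated record payload")
        yield rec_type, payload

stream = io.BytesIO()
write_record(stream, 1, b"abc")
write_record(stream, 2, b"")
stream.seek(0)
records = list(read_records(stream))
print(records)   # [(1, b'abc'), (2, b'')]
```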
Struct vs. alternatives
| Feature | struct | ctypes | array | numpy |
|---|---|---|---|---|
| Fixed binary layout | ✅ | ✅ | ❌ | ✅ |
| No dependencies | ✅ | ✅ | ✅ | ❌ |
| Heterogeneous fields | ✅ | ✅ | ❌ | ✅ (structured) |
| Bulk operations | ❌ | ❌ | ✅ | ✅ |
| Nested structures | ❌ | ✅ | ❌ | ✅ |
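Use struct for one-off or moderate-volume binary I/O. For millions of records with a uniform layout, numpy’s fromfile with a custom dtype decodes the whole file in one vectorized call instead of one Python-level call per record, which is often one to two orders of magnitude faster (roughly 10-100×, depending on the record layout and workload).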
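For homogeneous data, the stdlib array module already gives a taste of the bulk approach from the table, with no third-party dependency. A rough sketch comparing it with per-item struct decoding; the data is generated in place purely for illustration.

```python
import array
import struct

# 1000 native-order floats, built with array so both readers see the same bytes.
source = array.array("f", (i * 0.5 for i in range(1000)))
raw = source.tobytes()

# Per-item decoding with struct: one tuple per value.
via_struct = [v for (v,) in struct.iter_unpack("@f", raw)]

# Bulk decoding with array: a single call converts the whole buffer.
via_array = array.array("f")
via_array.frombytes(raw)

print(via_struct == list(via_array))   # True
```

Like memoryview casting, array works in native byte order; call byteswap() on the array when the data came from a machine with the opposite order.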
Endianness detection at runtime
If you’re writing code that needs to handle both byte orders, Python exposes the system’s native byte order:
import struct, sys
print(sys.byteorder)  # 'little' on most modern hardware
# Dynamically choose format prefix
prefix = "<" if sys.byteorder == "little" else ">"
# Same layout as the "=" prefix: native byte order, standard sizes, no padding
fmt = struct.Struct(f"{prefix}IHd")
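For network protocols, use ! (network byte order, which is big-endian) regardless of the local machine, unless the protocol’s own specification says otherwise. Big-endian is the convention documented in RFC 1700 and followed by the core Internet protocols.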
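The effect of each prefix is easiest to see by packing one value several ways. A short illustration; the value itself is arbitrary.

```python
import struct
import sys

value = 0x12345678

little = struct.pack("<I", value)
big = struct.pack(">I", value)
network = struct.pack("!I", value)
native = struct.pack("=I", value)

print(little.hex())    # 78563412
print(big.hex())       # 12345678
print(network == big)  # True: "!" is big-endian
# "=" follows the host, so it matches whichever order sys.byteorder names.
print(native == (little if sys.byteorder == "little" else big))  # True
```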
Common pitfalls
1. Forgetting byte order. The default, @, uses native byte order, native sizes, AND native alignment. If you’re parsing a file from another system, always specify < or >.
2. String handling. The s format always produces/expects bytes, not str. And 4s packs exactly 4 bytes: a shorter input is null-padded, and a longer one is silently truncated.
3. Signed vs. unsigned mismatch. Using i (signed) when the file format defines the field as unsigned will produce negative values for bit patterns with the high bit set. Use I for unsigned.
4. Platform-dependent sizes. In native mode (@ or no prefix), l and L follow the C long type: 4 bytes on 32-bit platforms and 64-bit Windows, 8 bytes on most 64-bit Unix systems. With an explicit <, >, = or ! prefix, sizes are standardized (i/I and l/L are 4 bytes, q/Q are 8), so prefer a standard prefix whenever the layout must be portable.
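Pitfalls 2 through 4 can be reproduced in a few lines, which makes them easier to recognize later. A quick demonstration using only behavior described above:

```python
import struct

# Pitfall 2: "4s" always occupies exactly 4 bytes.
print(struct.pack("4s", b"ab"))        # b'ab\x00\x00'  (null-padded)
print(struct.pack("4s", b"abcdef"))    # b'abcd'        (silently truncated)

# Pitfall 3: the same bytes read as signed vs. unsigned.
raw = b"\xff\xff\xff\xff"
signed, = struct.unpack("<i", raw)
unsigned, = struct.unpack("<I", raw)
print(signed, unsigned)                # -1 4294967295

# Pitfall 4: standard sizes are fixed; native sizes depend on the platform.
print(struct.calcsize("<l"), struct.calcsize("<q"))   # 4 8
print(struct.calcsize("@l"))           # 4 or 8, depending on the platform
```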
One thing to remember
struct is a surgical tool — pair it with memoryview or mmap for zero-copy performance, always specify byte order explicitly, and know when to hand off to numpy for bulk binary work.