NumPy Structured Arrays — Deep Dive
Technical foundation
A NumPy structured array is a contiguous block of memory where each element has a fixed-size, multi-field layout defined by a structured dtype. Unlike Python objects, there is no per-element overhead — each record occupies exactly dtype.itemsize bytes, packed sequentially.
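As a quick check of the fixed-size claim, the sketch below builds an illustrative dtype (the field names id and score are made up for this example) and confirms that the buffer size is the record count times dtype.itemsize.

```python
import numpy as np

# Hypothetical record layout, used only for illustration
dt = np.dtype([('id', 'i4'), ('score', 'f8')])
arr = np.zeros(1000, dtype=dt)

print(dt.itemsize)                            # bytes per record (packed: 4 + 8)
print(arr.nbytes)                             # total size of the data buffer
print(arr.nbytes == len(arr) * dt.itemsize)   # no per-element overhead
print(arr.strides)                            # byte distance between consecutive records
```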
Memory layout and alignment
By default, NumPy packs fields tightly with no padding:
import numpy as np
dt = np.dtype([('x', 'i1'), ('y', 'f8')])
print(dt.itemsize) # 9 bytes (1 + 8, no padding)
This differs from C compilers, which typically align double to 8-byte boundaries. To match C struct layout (necessary for interop), use align=True:
dt_aligned = np.dtype([('x', 'i1'), ('y', 'f8')], align=True)
print(dt_aligned.itemsize) # 16 bytes (1 + 7 padding + 8)
Inspect exact offsets:
for name in dt_aligned.names:
    field_dtype, offset = dt_aligned.fields[name]
    print(f"{name}: dtype={field_dtype}, offset={offset}")
# x: dtype=int8, offset=0
# y: dtype=float64, offset=8
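When neither the packed layout nor align=True matches an external format, the offsets can be spelled out by hand with the dictionary form of np.dtype. The layout below (a float at offset 4 inside a 12-byte record) is invented purely for illustration and is not taken from any real format.

```python
import numpy as np

# Invented layout: 1-byte field at offset 0, 8-byte float at offset 4,
# total record size fixed at 12 bytes.
dt_manual = np.dtype({
    'names': ['x', 'y'],
    'formats': ['i1', 'f8'],
    'offsets': [0, 4],
    'itemsize': 12,
})

for name in dt_manual.names:
    field_dtype, offset = dt_manual.fields[name]
    print(f"{name}: dtype={field_dtype}, offset={offset}")
print(dt_manual.itemsize)
```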
Nested structured dtypes
Fields can themselves be structured:
point_dt = np.dtype([('x', 'f4'), ('y', 'f4'), ('z', 'f4')])
particle_dt = np.dtype([
    ('id', 'i4'),
    ('position', point_dt),
    ('velocity', point_dt),
    ('mass', 'f8'),
])
particles = np.zeros(1000, dtype=particle_dt)
particles['position']['x'] = np.random.randn(1000)
Nested access returns views all the way down — no copies until you explicitly request one.
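The view behavior can be checked directly with np.shares_memory. This is a small sketch using a trimmed-down version of the particle dtype above; the array size and values are arbitrary.

```python
import numpy as np

point_dt = np.dtype([('x', 'f4'), ('y', 'f4'), ('z', 'f4')])
particle_dt = np.dtype([('id', 'i4'), ('position', point_dt), ('mass', 'f8')])
particles = np.zeros(4, dtype=particle_dt)

xs = particles['position']['x']           # nested field access
print(np.shares_memory(xs, particles))    # does xs alias the parent buffer?

xs[:] = [1.0, 2.0, 3.0, 4.0]              # write through the view
print(particles['position']['x'])

xs_copy = particles['position']['x'].copy()   # explicit, independent copy
xs_copy[0] = 99.0
print(particles['position']['x'][0])      # parent is unaffected by the copy
```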
Sub-arrays in dtypes
A field can be an array itself:
dt = np.dtype([('label', 'U10'), ('readings', 'f4', (5,))])
data = np.zeros(3, dtype=dt)
data['readings'][0] = [1.1, 2.2, 3.3, 4.4, 5.5]
print(data['readings'].shape) # (3, 5) — a regular 2D float array
This is powerful for fixed-size vector fields (RGB pixels, 3D coordinates, sensor channels) without needing separate arrays.
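For instance, an RGB triple stored as a sub-array field can be processed per channel like any 2D array. The pixel dtype and the values below are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np

# Hypothetical pixel record: a 2D position plus an RGB triple
pixel_dt = np.dtype([('pos', 'i4', (2,)), ('rgb', 'u1', (3,))])
pixels = np.zeros(4, dtype=pixel_dt)

pixels['rgb'] = [[255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 255]]

rgb = pixels['rgb']                    # shape (4, 3), ordinary uint8 array
mean_per_channel = rgb.mean(axis=0)    # one mean per colour channel
red = rgb[:, 0]                        # first channel across all records
print(rgb.shape, mean_per_channel, red)
```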
Memory-mapped structured arrays
For datasets too large for RAM, combine structured dtypes with memory mapping:
# Write
dt = np.dtype([('timestamp', 'f8'), ('sensor_id', 'i4'), ('value', 'f8')])
data = np.memmap('sensors.dat', dtype=dt, mode='w+', shape=(10_000_000,))
data['timestamp'][:100] = np.arange(100, dtype='f8')
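data.flush() # write dirty pages to disk explicitly
del data # release the mapping
# Read: the OS loads pages into RAM only when they are touched
mapped = np.memmap('sensors.dat', dtype=dt, mode='r')
recent = mapped[mapped['timestamp'] > 50] # scans the whole timestamp column; the result is an in-memory copy
This pattern handles multi-gigabyte binary logs without loading the whole file up front. A full-column comparison still touches every page and allocates a mask with one byte per record, so memory stays bounded only if you process the file in slices.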
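One way to keep memory use roughly bounded is to walk the mapped file in fixed-size slices and accumulate results as you go. The sketch below writes a small demo file to a temporary directory; the file name, chunk_size and the aggregate being computed are all assumptions made for the example.

```python
import os
import tempfile
import numpy as np

dt = np.dtype([('timestamp', 'f8'), ('sensor_id', 'i4'), ('value', 'f8')])
n_records = 100_000    # small on purpose, for the demo
chunk_size = 10_000    # hypothetical; tune to the RAM you can spare

path = os.path.join(tempfile.mkdtemp(), 'sensors_demo.dat')

# Write a demo file
out = np.memmap(path, dtype=dt, mode='w+', shape=(n_records,))
out['timestamp'] = np.arange(n_records, dtype='f8')
out['value'] = 1.0
out.flush()
del out

# Scan it slice by slice; only one chunk's mask and selection live in RAM at a time
mapped = np.memmap(path, dtype=dt, mode='r')
total = 0.0
count = 0
for start in range(0, len(mapped), chunk_size):
    chunk = mapped[start:start + chunk_size]    # slicing gives a view of the mapping
    mask = chunk['timestamp'] > 50
    total += float(chunk['value'][mask].sum())
    count += int(mask.sum())
del mapped
print(count, total)
```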
C struct interop
Structured arrays map directly to C structs, enabling zero-copy data exchange:
// C side
typedef struct {
    int32_t id;
    double x;
    double y;
} Point;
# Python side — must match C layout exactly
point_dt = np.dtype([('id', '<i4'), ('x', '<f8'), ('y', '<f8')], align=True)
# Read binary file written by C program
points = np.fromfile('points.bin', dtype=point_dt)
# Pass to C function via ctypes
import ctypes
lib = ctypes.CDLL('./libpoints.so')
lib.process_points(
    points.ctypes.data_as(ctypes.c_void_p),
    ctypes.c_int(len(points))
)
Key gotchas for C interop:
- Byte order must match (< for little-endian on x86).
- Alignment must match (align=True if the C compiler uses default alignment).
- String fields in NumPy are Unicode (U); C uses char[], so use S (byte strings) for C interop.
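A cheap way to catch layout mismatches before touching real data is to mirror the C struct as a ctypes.Structure and compare sizes and offsets with the NumPy dtype. This is a sketch, assuming the C library was built with the platform's default alignment (which is also what ctypes applies); it uses native byte order rather than an explicit < prefix.

```python
import ctypes
import numpy as np

class Point(ctypes.Structure):
    # Mirrors the C struct Point; ctypes applies the platform's default alignment
    _fields_ = [
        ('id', ctypes.c_int32),
        ('x', ctypes.c_double),
        ('y', ctypes.c_double),
    ]

point_dt = np.dtype([('id', 'i4'), ('x', 'f8'), ('y', 'f8')], align=True)

# Compare total size and every field offset
size_ok = ctypes.sizeof(Point) == point_dt.itemsize
offsets_ok = all(
    getattr(Point, name).offset == point_dt.fields[name][1]
    for name in point_dt.names
)
print(ctypes.sizeof(Point), point_dt.itemsize, size_ok, offsets_ok)
```

If either check fails, the dtype does not describe the same bytes as the struct, and reading or passing the array will silently produce garbage.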
Multi-field indexing
Selecting multiple fields returns a view (NumPy 1.16+):
dt = np.dtype([('a', 'i4'), ('b', 'f8'), ('c', 'f8')])
data = np.zeros(10, dtype=dt)
subset = data[['a', 'c']] # view with fields a and c only
In older NumPy versions, this returned a packed copy. The view avoids the copy, but it keeps the original itemsize and field offsets, and writes through it modify the original array. Check your NumPy version if this matters.
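The consequences of the view behavior are easy to see in a short experiment. The sketch below assumes NumPy 1.16 or later and uses repack_fields from numpy.lib.recfunctions to obtain an independent, tightly packed copy when one is needed.

```python
import numpy as np
from numpy.lib.recfunctions import repack_fields

dt = np.dtype([('a', 'i4'), ('b', 'f8'), ('c', 'f8')])
data = np.zeros(10, dtype=dt)

subset = data[['a', 'c']]
print(np.shares_memory(subset, data))   # a view on NumPy 1.16+
print(subset.dtype.itemsize)            # still spans the skipped field 'b'

subset['a'] = 7                         # writes through to data
print(data['a'][:3])

packed = repack_fields(subset)          # separate array without the gap
print(packed.dtype.itemsize)
packed['a'] = 0
print(data['a'][:3])                    # unchanged by edits to the packed copy
```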
Converting between structured and unstructured
from numpy.lib.recfunctions import structured_to_unstructured, unstructured_to_structured
# Structured → regular 2D array (fields of different types are cast to a common dtype)
dt = np.dtype([('x', 'f8'), ('y', 'f8'), ('z', 'f8')])
structured = np.zeros(100, dtype=dt)
plain = structured_to_unstructured(structured)
print(plain.shape) # (100, 3)
# Regular → structured
back = unstructured_to_structured(plain, dt)
structured_to_unstructured returns a view when possible (fields are contiguous and same type), avoiding copies.
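Whether a given conversion is a view or a copy can be checked with np.shares_memory. The sketch below compares a same-type dtype with a mixed-type one (both invented for the example); the mixed case has to cast, so it cannot share memory, while the outcome for the same-type case may depend on the NumPy version.

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

same = np.zeros(5, dtype=[('x', 'f8'), ('y', 'f8')])
mixed = np.zeros(5, dtype=[('n', 'i4'), ('w', 'f8')])
mixed['n'] = np.arange(5)

plain_same = structured_to_unstructured(same)
plain_mixed = structured_to_unstructured(mixed)   # int32 and float64 promote to float64

print(plain_same.dtype, np.shares_memory(plain_same, same))
print(plain_mixed.dtype, plain_mixed.shape, np.shares_memory(plain_mixed, mixed))
```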
Performance: structured vs separate arrays
Structured arrays offer better cache locality when you access multiple fields per record (row-oriented access). Separate arrays win when you process one field across all records (column-oriented access).
# Row-oriented: structured wins
for record in structured_data:
    process(record['x'], record['y'], record['z'])
# Column-oriented: separate arrays win
result = x_array * 2 + y_array # pure vectorized, one field at a time
In practice, structured arrays shine for I/O and data transport. For heavy computation, extract fields into separate arrays first.
Practical recipe: parsing a binary protocol
header_dt = np.dtype([
    ('magic', 'S4'),
    ('version', 'u2'),
    ('num_records', 'u4'),
    ('reserved', 'u2'),
])
record_dt = np.dtype([
    ('timestamp', 'u8'),
    ('channel', 'u1'),
    ('flags', 'u1'),
    ('value', 'f4'),
])
with open('protocol_dump.bin', 'rb') as f:
    header = np.frombuffer(f.read(header_dt.itemsize), dtype=header_dt)[0]
    n = int(header['num_records'])
    records = np.frombuffer(f.read(n * record_dt.itemsize), dtype=record_dt)
active = records[(records['flags'] & 0x01) > 0]
print(f"Parsed {n} records, {len(active)} active")
No manual struct unpacking, no loops, no intermediate lists. The entire parse is two frombuffer calls.
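To try the recipe without a real dump file, the same dtypes can be used to build a synthetic message in memory and parse it back. This is a sketch: the magic value, the record contents and the explicit little-endian byte order are all assumptions made for the demo.

```python
import io
import numpy as np

# Explicit little-endian types; the byte order is an assumption of this sketch
header_dt = np.dtype([('magic', 'S4'), ('version', '<u2'),
                      ('num_records', '<u4'), ('reserved', '<u2')])
record_dt = np.dtype([('timestamp', '<u8'), ('channel', 'u1'),
                      ('flags', 'u1'), ('value', '<f4')])

# Build a synthetic dump in memory
header_out = np.zeros(1, dtype=header_dt)
header_out['magic'] = b'DEMO'      # made-up magic value
header_out['version'] = 1
header_out['num_records'] = 4
records_out = np.zeros(4, dtype=record_dt)
records_out['timestamp'] = [10, 20, 30, 40]
records_out['flags'] = [1, 0, 3, 2]
records_out['value'] = [0.5, 1.5, 2.5, 3.5]
buf = io.BytesIO(header_out.tobytes() + records_out.tobytes())

# Parse it back with the same two frombuffer calls
header = np.frombuffer(buf.read(header_dt.itemsize), dtype=header_dt)[0]
n = int(header['num_records'])
records = np.frombuffer(buf.read(n * record_dt.itemsize), dtype=record_dt)
active = records[(records['flags'] & 0x01) > 0]   # keep records with bit 0 set
print(header['magic'], n, active['timestamp'])
```

Arrays returned by np.frombuffer over a bytes object are read-only; call .copy() on them if the records need to be modified.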
Performance considerations
Structured arrays store records contiguously (row-oriented layout). This is efficient when you access many fields per record but slower when you process one field across millions of records. For column-oriented workloads, extract individual fields into separate arrays first:
timestamps = records['timestamp'].copy() # contiguous column
values = records['value'].copy()
# Column operations are now cache-friendly
filtered_values = values[timestamps > some_threshold]
The copy cost is paid once; subsequent vectorized operations then read densely packed values instead of striding across whole records.
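The difference between a field view and an extracted column shows up in the array strides. The sketch below reuses the record layout from the binary-protocol recipe with zero-filled data, and uses np.ascontiguousarray as one possible way to make the contiguous copy.

```python
import numpy as np

record_dt = np.dtype([('timestamp', 'u8'), ('channel', 'u1'),
                      ('flags', 'u1'), ('value', 'f4')])
records = np.zeros(1000, dtype=record_dt)

field_view = records['value']                # view: steps over whole records
column = np.ascontiguousarray(field_view)    # dense copy of just this field

print(field_view.strides, field_view.flags['C_CONTIGUOUS'])
print(column.strides, column.flags['C_CONTIGUOUS'])
```

A stride equal to the record size means every element access skips over the other fields; a stride equal to the field's own size means the values sit back to back.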
The one thing to remember: Structured arrays are NumPy’s zero-copy bridge between raw binary data and typed Python access — master the dtype definition and everything else follows.