NumPy Structured Arrays — Core Concepts

Why this topic matters

Most NumPy tutorials focus on homogeneous arrays of floats or ints. But real data — sensor logs, database exports, binary file formats — mixes types. Structured arrays let you define a custom dtype with named fields of different types, giving you fast, typed, tabular storage without leaving NumPy.

How to create a structured array

Define a dtype as a list of (name, type) tuples:

import numpy as np

dt = np.dtype([('name', 'U20'), ('age', 'i4'), ('score', 'f8')])
students = np.array([
    ('Alice', 22, 91.5),
    ('Bob', 19, 85.0),
    ('Carol', 21, 93.2),
], dtype=dt)

Each element of students is a record. Access fields by name:

print(students['name'])   # ['Alice' 'Bob' 'Carol']
print(students['score'])  # [91.5 85.  93.2]
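
To see how the fields are laid out in memory, the dtype object can be inspected directly. The following is a minimal sketch that reuses the dtype defined above; the variable name aligned is only illustrative, and the padded size it reports depends on the platform, so no value is shown for it.

```python
import numpy as np

dt = np.dtype([('name', 'U20'), ('age', 'i4'), ('score', 'f8')])

print(dt.names)     # ('name', 'age', 'score')
print(dt.itemsize)  # 92: 80 bytes for U20 (4 bytes per character) + 4 + 8

# dt.fields maps each field name to a tuple starting with (dtype, byte offset)
for field_name in dt.names:
    field_dtype, offset = dt.fields[field_name][:2]
    print(field_name, field_dtype, offset)

# align=True asks NumPy to pad fields roughly the way a C compiler would;
# the resulting size is platform dependent
aligned = np.dtype([('name', 'U20'), ('age', 'i4'), ('score', 'f8')], align=True)
print(aligned.itemsize)
```

By default the fields are packed back to back with no padding, which matters when the dtype has to match an external byte layout.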

Field access returns views

When you access a single field, NumPy returns a view into the original data — no copy. Modifying the field view modifies the structured array:

scores = students['score']
scores[0] = 99.0
print(students[0])  # ('Alice', 22, 99.0) — changed in place

This avoids copying, but it also means that writing through the view silently changes the parent array. If you need independent data, call .copy() explicitly.
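
The difference between a field view and a copy can be checked with np.shares_memory. This is a small sketch on a two-record version of the students array; the names view and snapshot are illustrative.

```python
import numpy as np

dt = np.dtype([('name', 'U20'), ('age', 'i4'), ('score', 'f8')])
students = np.array([('Alice', 22, 91.5), ('Bob', 19, 85.0)], dtype=dt)

view = students['score']             # shares the parent's buffer
snapshot = students['score'].copy()  # independent buffer

print(np.shares_memory(view, students))      # True
print(np.shares_memory(snapshot, students))  # False

view[0] = 99.0
print(students['score'][0])  # 99.0 (parent changed through the view)
print(snapshot[0])           # 91.5 (copy is unaffected)

# The view steps over whole 92-byte records, so it is not contiguous
print(view.strides)                # (92,)
print(view.flags['C_CONTIGUOUS'])  # False
```

Because the view is strided, a copy can also be worthwhile when a single field is processed repeatedly and contiguous memory helps.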

Sorting and filtering

Structured arrays support sorting by field name:

sorted_by_age = np.sort(students, order='age')

Boolean filtering works the same as regular arrays:

high_scorers = students[students['score'] > 90]
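
Sorting and filtering can be combined in a few common ways. The sketch below adds a fourth, made-up record (Dave) so that the tie-breaking is visible; it is not part of the original data.

```python
import numpy as np

dt = np.dtype([('name', 'U20'), ('age', 'i4'), ('score', 'f8')])
students = np.array([
    ('Alice', 22, 91.5),
    ('Bob', 19, 85.0),
    ('Carol', 21, 93.2),
    ('Dave', 19, 88.0),   # hypothetical extra record with a tied age
], dtype=dt)

# Sort by age, breaking ties by score (both ascending)
by_age_then_score = np.sort(students, order=['age', 'score'])
print(by_age_then_score['name'])  # ['Bob' 'Dave' 'Carol' 'Alice']

# Descending order: argsort one field, then reverse the index
top_first = students[np.argsort(students['score'])[::-1]]
print(top_first['name'])  # ['Carol' 'Alice' 'Dave' 'Bob']

# Combine conditions with & and parentheses around each comparison
young_high = students[(students['age'] < 22) & (students['score'] > 87)]
print(young_high['name'])  # ['Carol' 'Dave']
```

np.sort with order only sorts ascending, which is why the descending case goes through argsort on a single field.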

Record arrays — dot access

np.recarray is an ndarray subclass that exposes the fields of a structured array as attributes:

rec = students.view(np.recarray)
print(rec.name)   # ['Alice' 'Bob' 'Carol']
print(rec.age)    # [22 19 21]

This is convenient, but every attribute lookup goes through extra Python-level dispatch, so it is slightly slower than indexing by field name. For performance-critical code, index with the field name instead, as in students['age'].
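
One pitfall of attribute-style access is that a field whose name matches an existing ndarray attribute is hidden by that attribute. The sketch below uses a made-up dtype with a field called size to show the effect.

```python
import numpy as np

# Hypothetical dtype: 'size' collides with the ndarray.size attribute
dt = np.dtype([('size', 'i4'), ('mass', 'f8')])
rec = np.array([(1, 2.0), (3, 4.0)], dtype=dt).view(np.recarray)

print(rec.size)     # 2 (number of records, not the field)
print(rec['size'])  # [1 3] (indexing by name still reaches the field)
print(rec.mass)     # [2. 4.] (no collision, attribute access works)
```

Indexing by field name is unambiguous in both cases, which is another reason to prefer it in library code.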

Common misconception

People often assume structured arrays are the same as pandas DataFrames. They serve different purposes. Structured arrays are for low-level, fixed-schema storage with a predictable byte layout. DataFrames add indexing, missing value handling, groupby, and dozens of other features. If you need those, use pandas. If you need a compact, fixed layout for millions of fixed-format records, structured arrays are the better fit.

Nested dtypes

Fields can contain sub-fields or fixed-size sub-arrays:

dt = np.dtype([
    ('label', 'U10'),
    ('readings', 'f4', (5,)),  # each record holds 5 floats
])
data = np.zeros(3, dtype=dt)
data['readings'][0] = [1.1, 2.2, 3.3, 4.4, 5.5]
print(data['readings'].shape)  # (3, 5) — a regular 2D array

This is useful for fixed-size vector fields like RGB colors, 3D coordinates, or multi-channel sensor readings.
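
The example above shows a sub-array field. A field can also hold another structured dtype, giving nested sub-fields. This is a minimal sketch; the names position, sample, and points are illustrative.

```python
import numpy as np

# Inner record type, reused as the type of a field in the outer dtype
position = np.dtype([('x', 'f4'), ('y', 'f4'), ('z', 'f4')])
sample = np.dtype([('label', 'U10'), ('pos', position)])

points = np.zeros(2, dtype=sample)
points['label'] = ['a', 'b']
points['pos']['x'] = [1.0, 2.0]   # assign one sub-field across all records
points['pos']['z'][1] = 9.0       # assign one sub-field of one record

print(points['pos'].dtype.names)  # ('x', 'y', 'z')
print(points['pos']['x'])         # [1. 2.]
print(points[1]['pos']['z'])      # 9.0
```

Indexing with points['pos'] yields a structured view, so chained field access writes back into the original array.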

When to use structured arrays

  • Reading record-oriented binary data (HDF5 compound datasets via h5py, FITS tables via astropy, custom protocols)
  • Interfacing with C structs via ctypes or cffi
  • Processing sensor data with fixed record layouts
  • Memory-mapped files where each record has a known byte layout
  • High-volume logging where row count matters more than feature richness
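
As an illustration of the binary-data use cases above, the following sketch parses a packed byte string with np.frombuffer. The record layout (sensor_id, timestamp, value) and the sample values are invented for the example, and struct is used only to fabricate the input bytes.

```python
import struct
import numpy as np

# Hypothetical little-endian, unpadded record: uint16, uint32, float32 (10 bytes)
record_dt = np.dtype([('sensor_id', '<u2'), ('timestamp', '<u4'), ('value', '<f4')])

# Build example bytes as an external writer might produce them
rows = [(1, 1000, 0.5), (2, 1001, 1.5), (1, 1002, 2.5)]
raw = b''.join(struct.pack('<HIf', *row) for row in rows)

# Interpret the buffer in place, without copying
records = np.frombuffer(raw, dtype=record_dt)

print(record_dt.itemsize)       # 10
print(records['sensor_id'])     # [1 2 1]
print(records['value'])         # [0.5 1.5 2.5]
print(records.flags.writeable)  # False (bytes objects are immutable)

# Regular filtering works on the parsed records
print(records[records['sensor_id'] == 1]['timestamp'])  # [1000 1002]
```

The same dtype could be passed to np.fromfile or np.memmap for data stored on disk, provided the file really uses this layout.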

The one thing to remember: Structured arrays combine the speed of NumPy with the flexibility of mixed types — they are your bridge between raw binary data and Python.

