NumPy Random Generator — Core Concepts

Master the modern NumPy random API — Generator objects, reproducibility, distributions, and why you should stop using np.random.seed().

Why this topic matters

Random number generation underpins simulations, machine learning, statistical testing, and data augmentation. NumPy overhauled its random API in version 1.17, introducing Generator objects that are faster, more flexible, and safer for parallel code. Understanding the new API avoids subtle reproducibility bugs and global-state issues that plague the legacy interface.

Legacy vs modern API

Legacy (avoid in new code)

import numpy as np
np.random.seed(42)
data = np.random.randn(100)

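Problems: np.random.seed() sets global state. Any code in the same process that calls np.random functions, including libraries you import, draws from that same state and can reseed it. An extra draw anywhere, whether at import time or deep inside a library, shifts every value your code sees afterwards, so results can change without any edit to your own code.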

Modern (recommended)

rng = np.random.default_rng(42)
data = rng.standard_normal(100)

Each Generator owns its own state. No global side effects. Multiple generators can coexist independently.
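
The difference can be seen in a minimal sketch. The function `third_party_helper` is hypothetical and only stands in for any library code that happens to draw from the global stream; the seeds are arbitrary.

```python
import numpy as np


def third_party_helper():
    """Hypothetical stand-in for library code that uses the global stream."""
    return np.random.rand()


# Legacy: an extra draw from the global stream shifts every later value.
np.random.seed(42)
clean = np.random.randn(3)

np.random.seed(42)
third_party_helper()            # consumes a value from the shared state
disturbed = np.random.randn(3)

legacy_shifted = not np.array_equal(clean, disturbed)

# Modern: each Generator carries its own state, so unrelated draws do not interact.
rng_a = np.random.default_rng(42)
expected = rng_a.standard_normal(3)

rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(7)
rng_b.random()                  # drawing from rng_b leaves rng_a untouched
isolated = rng_a.standard_normal(3)

generator_isolated = np.array_equal(expected, isolated)
print(legacy_shifted, generator_isolated)
```

The legacy half changes its output as soon as the helper is called, while the Generator half is unaffected by whatever other generators do.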

Creating generators

import numpy as np
from numpy.random import SeedSequence

# Seeded: reproducible
rng = np.random.default_rng(seed=12345)

# Unseeded: different every run (entropy from OS)
rng = np.random.default_rng()

# From a SeedSequence: best for spawning independent streams
ss = SeedSequence(42)
child_seeds = ss.spawn(4)  # four child SeedSequence objects
generators = [np.random.default_rng(s) for s in child_seeds]

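SeedSequence is the key to parallel reproducibility. Children spawned from one parent seed are fully determined by that seed and are designed to be independent of one another with very high probability, which avoids the overlap risk of hand-picked seeds.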
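
As a rough sketch of how this might look for a parallel job, the example below builds one generator per worker from a single root seed. The names `make_worker_rngs` and `worker_task` are hypothetical, and the workers run sequentially here to keep the example self-contained.

```python
import numpy as np
from numpy.random import SeedSequence


def make_worker_rngs(root_seed, n_workers):
    """Build one Generator per worker from a single root seed."""
    children = SeedSequence(root_seed).spawn(n_workers)
    return [np.random.default_rng(child) for child in children]


def worker_task(rng, n=5):
    """Hypothetical unit of work: here, just the mean of a few draws."""
    return rng.standard_normal(n).mean()


# Two runs with the same root seed give the same per-worker results...
run_1 = [worker_task(rng) for rng in make_worker_rngs(42, 4)]
run_2 = [worker_task(rng) for rng in make_worker_rngs(42, 4)]

# ...while the workers within one run draw from different streams.
print(run_1 == run_2)
print(len(set(run_1)) == 4)
```

In a real setup each child SeedSequence (or the Generator built from it) would be handed to its own process or thread.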

Common distributions

Method	Distribution	Typical use
`rng.random(n)`	Uniform on [0, 1)	Random probabilities
`rng.integers(lo, hi, n)`	Discrete uniform on [lo, hi), upper bound exclusive	Dice rolls
`rng.standard_normal(n)`	Normal (μ=0, σ=1)	Noise generation
`rng.normal(loc, scale, n)`	Normal (mean loc, standard deviation scale)	Simulating measurements
`rng.exponential(scale, n)`	Exponential (mean scale)	Wait times
`rng.choice(arr, n)`	Sampling from an array (with replacement by default)	Bootstrap resampling
`rng.shuffle(arr)`	In-place permutation	Shuffling a dataset
`rng.permutation(n)`	Random permutation of range(n), returned as a new array	Index shuffling
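
A few of these methods in use, as an illustrative sketch. The seed, sizes, and parameter values are arbitrary choices for the example, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(2024)   # arbitrary seed for illustration

# Dice rolls: the upper bound is exclusive, so use 7 to include 6.
rolls = rng.integers(1, 7, size=10)

# The same range with an inclusive upper bound, if preferred.
rolls_inclusive = rng.integers(1, 6, size=10, endpoint=True)

# Simulated measurements: mean 20.0, standard deviation 0.5.
measurements = rng.normal(loc=20.0, scale=0.5, size=1000)

# Bootstrap resample: same size as the data, drawn with replacement.
boot = rng.choice(measurements, size=measurements.size, replace=True)

# shuffle works in place and returns None; permutation returns a new array.
deck = np.arange(10)
shuffled_copy = rng.permutation(deck)
rng.shuffle(deck)
```

Note the difference between the last two calls: `rng.permutation` leaves its input untouched, while `rng.shuffle` modifies it.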

Common misconception

A common belief is that setting a seed guarantees identical results across different NumPy versions. It does not. NumPy explicitly does not guarantee cross-version stream compatibility for Generator, so the same seed may produce a different sequence after an upgrade, for example when a distribution's sampling algorithm is improved. If you need exact reproducibility across versions, save the generated data rather than relying on the seed alone.

The legacy RandomState does guarantee stream compatibility — which is one reason it still exists.
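
One way to follow that advice is sketched below: store the drawn values themselves rather than the seed. An in-memory buffer is used only to keep the example self-contained; a file path passed to `np.save` and `np.load` would work the same way.

```python
import io
import numpy as np

rng = np.random.default_rng(42)
samples = rng.standard_normal(1000)

# Persist the values themselves, not just the seed that produced them.
buffer = io.BytesIO()
np.save(buffer, samples)

# Later, possibly under a different NumPy version: load instead of regenerating.
buffer.seek(0)
restored = np.load(buffer)

# If a frozen stream matters more than speed, the legacy class is still available.
legacy = np.random.RandomState(42)
legacy_samples = legacy.standard_normal(3)
```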

Reproducibility best practices

Always use explicit Generator objects — never rely on global state.
Pass generators as function arguments — makes dependencies clear.
Use SeedSequence.spawn() for parallel workloads.
Log the seed alongside results so experiments can be replayed.
Pin your NumPy version if exact stream reproducibility is critical.

def train_model(data, rng):
    """Accept an explicit generator — no hidden global state."""
    idx = rng.permutation(len(data))
    shuffled = data[idx]
    # ... training logic
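
The seed-logging practice can be combined with explicit generator passing. The sketch below is one possible arrangement, not a prescribed pattern; `shuffle_rows` and `run_experiment` are hypothetical names, and the toy array stands in for real data.

```python
import numpy as np
from numpy.random import SeedSequence


def shuffle_rows(data, rng):
    """Return the rows of `data` in a random order drawn from `rng`."""
    return data[rng.permutation(len(data))]


def run_experiment(seed=None):
    """Run once; return the seed used together with the result."""
    # With seed=None, SeedSequence pulls fresh entropy from the OS and
    # exposes it as .entropy, so it can be logged and reused later.
    ss = SeedSequence(seed)
    rng = np.random.default_rng(ss)
    data = np.arange(20).reshape(10, 2)
    return ss.entropy, shuffle_rows(data, rng)


seed_used, first = run_experiment()      # fresh seed, recorded
_, replay = run_experiment(seed_used)    # replay from the logged seed
print(seed_used)
```

Writing `seed_used` to the experiment log is enough to replay the run later, provided the NumPy version is the same.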

Performance

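The modern Generator with its default bit generator (PCG64) is typically faster than the legacy RandomState, which is built on the Mersenne Twister. The figures below are illustrative; actual timings depend on hardware, NumPy version, and dtype: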

Operation (1M samples)	Legacy `RandomState`	Modern `Generator`
Uniform floats	4.2 ms	2.8 ms
Standard normal	12.1 ms	5.3 ms
Random integers	6.8 ms	3.1 ms

The speedup comes from both the PCG64 bit generator and improved distribution sampling methods (e.g., a Ziggurat method for normals in place of the polar Box-Muller variant used by the legacy code).
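
Timings like these are easy to measure on your own machine. The following is a rough benchmark sketch using the standard-library `timeit` module; the sample count and repeat count are arbitrary, and the numbers it prints will differ from the table above.

```python
import timeit
import numpy as np

N = 1_000_000
legacy = np.random.RandomState(0)
modern = np.random.default_rng(0)

# Each entry pairs a legacy call with its closest Generator equivalent.
cases = {
    "uniform floats": (lambda: legacy.random_sample(N), lambda: modern.random(N)),
    "standard normal": (lambda: legacy.standard_normal(N), lambda: modern.standard_normal(N)),
    "random integers": (lambda: legacy.randint(0, 100, N), lambda: modern.integers(0, 100, N)),
}

timings = {}
for name, (old, new) in cases.items():
    # Best of several repeats, reported in milliseconds per call.
    t_old = min(timeit.repeat(old, number=1, repeat=5)) * 1e3
    t_new = min(timeit.repeat(new, number=1, repeat=5)) * 1e3
    timings[name] = (t_old, t_new)
    print(f"{name:16s} legacy {t_old:6.2f} ms   modern {t_new:6.2f} ms")
```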

The one thing to remember: Use np.random.default_rng(seed) instead of np.random.seed() — it is faster, safer, and gives you independent random streams without global-state headaches.
