NumPy Random Generator — Deep Dive

Explore NumPy's random architecture — bit generators, SeedSequence internals, parallel streams, and custom distribution sampling.

Technical foundation

NumPy’s random module has a two-layer architecture:

Bit Generator — produces raw pseudo-random bits. Stateless math, no knowledge of distributions.
Generator — consumes bits and transforms them into samples from specific distributions (normal, exponential, etc.).

This separation means you can swap bit generators without changing your sampling code.

Bit generators available in NumPy

from numpy.random import PCG64, MT19937, Philox, SFC64, Generator

rng_pcg = Generator(PCG64(42))       # Default — fast, small state
rng_mt = Generator(MT19937(42))      # Legacy Mersenne Twister
rng_philox = Generator(Philox(42))   # Counter-based — ideal for parallel
rng_sfc = Generator(SFC64(42))       # Fastest for single-threaded use

Bit Generator	Period	State size	Speed	Best for
PCG64	2^128	32 bytes	Fast	General purpose (default)
MT19937	2^19937	2.5 KB	Medium	Legacy compatibility
Philox	2^256	64 bytes	Medium	Parallel/distributed
SFC64	~2^256	32 bytes	Fastest	Single-threaded throughput

PCG64 internals

PCG (Permuted Congruential Generator) uses a linear congruential generator as its core, then applies a permutation function to improve statistical quality:

state = state * multiplier + increment    (mod 2^128)
output = permute(state)                    (64-bit result)

The permutation step is what distinguishes PCG from a raw LCG. It ensures the output passes stringent statistical tests (TestU01 BigCrush) while keeping the state transition simple and fast.

PCG64 supports advance and jump operations in O(log n) time:

from numpy.random import PCG64

bg = PCG64(42)
bg.advance(1_000_000)  # Skip 1M states — O(log n), not O(n)

# Jumped returns an independent copy advanced by a huge offset
bg2 = bg.jumped()  # Effectively independent stream

SeedSequence: the reproducibility backbone

SeedSequence solves the fundamental problem of turning one user seed into many independent bit-generator initializations:

from numpy.random import SeedSequence, PCG64, Generator

ss = SeedSequence(42)
children = ss.spawn(8)  # 8 independent child sequences

generators = [Generator(PCG64(child)) for child in children]
# All 8 generators are statistically independent

Under the hood, SeedSequence uses a hash-based algorithm that:

Mixes the user seed with a spawn key (a sequence of integers tracking the position in the spawn tree).
Applies multiple rounds of mixing to ensure avalanche properties.
Generates enough state to initialize any bit generator.

This is hierarchical: children can spawn grandchildren:

worker_ss = children[3]
sub_streams = worker_ss.spawn(4)  # 4 sub-streams from worker 3

Parallel random number generation

Pattern 1: SeedSequence spawn (recommended)

from concurrent.futures import ProcessPoolExecutor

def simulate(seed):
    rng = np.random.default_rng(seed)
    return rng.standard_normal(1_000_000).mean()

ss = SeedSequence(42)
seeds = ss.spawn(8)

with ProcessPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(simulate, seeds))

Pattern 2: Counter-based (Philox) for reproducible parallelism

Philox is a counter-based RNG: it encrypts a counter with a key. Different counters give different streams — no coordination needed:

from numpy.random import Philox, Generator

key = 42
generators = [Generator(Philox(key=key, counter=i)) for i in range(8)]
# Each generator produces an independent stream

Counter-based RNGs are particularly useful in distributed computing where workers cannot communicate to coordinate seeds.

Pattern 3: Jump-ahead for MT19937

from numpy.random import MT19937, Generator

bg = MT19937(42)
generators = []
for _ in range(4):
    generators.append(Generator(bg.jumped()))
    # jumped() advances state by 2^128 steps — enough to guarantee no overlap

Distribution sampling algorithms

The Generator class uses optimized algorithms for each distribution:

Normal distribution — Ziggurat method

The legacy RandomState used Box-Muller (compute two normals from two uniforms). Generator uses the Ziggurat algorithm, which is ~2x faster:

# Ziggurat precomputes a set of horizontal rectangles that approximate
# the normal PDF. Most samples fall inside a rectangle and need only
# a uniform random number and a comparison. Only edge cases require
# expensive exponential or tail sampling.
rng = np.random.default_rng(42)
samples = rng.standard_normal(10_000_000)  # ~5ms vs ~12ms with Box-Muller

Discrete sampling — alias method

For rng.choice(arr, p=weights), the Generator uses the alias method for O(1) sampling after O(n) setup:

weights = np.array([0.1, 0.3, 0.2, 0.4])
rng = np.random.default_rng(42)
samples = rng.choice(4, size=1_000_000, p=weights)
# After initial setup, each sample is O(1) — one uniform random + one comparison

Custom distributions via inverse CDF

For distributions not built into NumPy, use inverse transform sampling:

def sample_pareto_custom(rng, alpha, size):
    """Sample from Pareto distribution using inverse CDF."""
    u = rng.random(size)
    return (1 - u) ** (-1 / alpha)

rng = np.random.default_rng(42)
samples = sample_pareto_custom(rng, alpha=2.5, size=100_000)

For distributions where the inverse CDF is not available in closed form, use acceptance-rejection sampling:

def rejection_sample(rng, target_pdf, proposal_rng, M, size):
    """Generic rejection sampler."""
    samples = []
    while len(samples) < size:
        batch_size = int((size - len(samples)) * 1.5 / M) + 100
        x = proposal_rng(rng, batch_size)
        u = rng.random(batch_size)
        accept = u < target_pdf(x) / (M * proposal_pdf(x))
        samples.extend(x[accept])
    return np.array(samples[:size])

Saving and restoring state

For checkpoint-restart workflows:

rng = np.random.default_rng(42)
rng.standard_normal(1000)  # advance state

# Save state
state = rng.bit_generator.state

# ... later, restore
rng2 = np.random.default_rng()
rng2.bit_generator.state = state
# rng2 now produces the exact same sequence as rng from this point forward

The state is a dictionary containing the bit generator name, internal state arrays, and the position within any buffered output.

Entropy and true randomness

For cryptographic or security-sensitive applications, NumPy is not appropriate. Its generators are designed for speed and statistical quality, not unpredictability. Use secrets or os.urandom() for security-sensitive randomness.

However, np.random.default_rng() (no seed) does use OS entropy (/dev/urandom on Linux) for initialization, making the initial state unpredictable even if the subsequent stream is deterministic.

Migration from legacy API

Legacy	Modern	Notes
`np.random.seed(42)`	`rng = default_rng(42)`	No global state
`np.random.rand(3, 4)`	`rng.random((3, 4))`	Shape as tuple
`np.random.randn(3, 4)`	`rng.standard_normal((3, 4))`	Shape as tuple
`np.random.randint(0, 10, 5)`	`rng.integers(0, 10, 5)`	`endpoint=True` option
`np.random.choice(a, 5)`	`rng.choice(a, 5)`	Same interface
`np.random.shuffle(a)`	`rng.shuffle(a)`	Same interface

The one thing to remember: NumPy’s random architecture cleanly separates bit generation from distribution sampling — understanding this separation lets you control reproducibility, parallelize safely, and choose the right performance tradeoffs.

pythonnumpydata-science