Linear Algebra with NumPy — Deep Dive

Advanced NumPy linear algebra: custom BLAS tuning, structured matrices, iterative solvers, and production patterns for large-scale numerical computation.

Under the hood: LAPACK and BLAS

NumPy’s linalg module is a thin wrapper around LAPACK (Linear Algebra PACKage) and BLAS (Basic Linear Algebra Subprograms). When you call np.linalg.solve, NumPy dispatches to dgesv (double-precision general solver). When you call eigh, it dispatches to dsyevd (divide-and-conquer symmetric eigensolver).

The BLAS implementation determines raw performance:

Backend	Source	Typical speedup over reference BLAS
OpenBLAS	Open source, auto-detects CPU	5–20×
Intel MKL	Commercial (free in conda)	10–30× on Intel CPUs
Apple Accelerate	macOS built-in	10–25× on Apple Silicon
BLIS	AMD-focused open source	10–20× on AMD CPUs

To check and switch:

import numpy as np
np.__config__.show()  # Shows linked BLAS/LAPACK

# Install MKL-backed NumPy via conda:
# conda install numpy blas=*=mkl

Thread control for BLAS

BLAS implementations use multiple threads by default. In production, this interacts badly with Python’s own multiprocessing or job schedulers:

import os
os.environ['OMP_NUM_THREADS'] = '1'       # OpenMP (OpenBLAS)
os.environ['MKL_NUM_THREADS'] = '1'       # Intel MKL
os.environ['OPENBLAS_NUM_THREADS'] = '1'  # OpenBLAS specific

# Must be set BEFORE importing numpy
import numpy as np

Use threadpoolctl for runtime control:

from threadpoolctl import threadpool_limits

with threadpool_limits(limits=4, user_api='blas'):
    result = np.linalg.svd(large_matrix)

Structured matrix optimizations

General solvers treat every matrix as dense. When your matrix has structure, specialized routines are orders of magnitude faster:

Banded matrices

Matrices where non-zero entries cluster near the diagonal (common in differential equations):

from scipy.linalg import solve_banded

# ab contains the bands; l and u are lower/upper bandwidth
ab = np.array([[0, 2, 2],    # upper diagonal
               [5, 5, 5],     # main diagonal
               [1, 1, 0]])    # lower diagonal
b = np.array([1.0, 2.0, 3.0])
x = solve_banded((1, 1), ab, b)

Symmetric positive definite

Use Cholesky decomposition instead of general LU — it is 2× faster and guaranteed stable:

L = np.linalg.cholesky(A)  # A = L @ L.T
# Solve via forward then back substitution
from scipy.linalg import cho_solve
x = cho_solve((L, True), b)

Triangular systems

If you already have a triangular matrix, skip the factorization:

from scipy.linalg import solve_triangular
x = solve_triangular(L, b, lower=True)

Iterative solvers for large sparse systems

Direct solvers (LU, Cholesky) have O(n³) complexity and O(n²) memory. For sparse systems with millions of unknowns, iterative solvers are the only option:

from scipy.sparse.linalg import cg, gmres, lgmres

# Conjugate Gradient — for symmetric positive definite
x, info = cg(A_sparse, b, tol=1e-10, maxiter=5000)

# GMRES — for general non-symmetric systems
x, info = gmres(A_sparse, b, tol=1e-10, restart=50)

Convergence depends heavily on preconditioning. An ideal preconditioner M approximates A⁻¹ cheaply:

from scipy.sparse.linalg import LinearOperator, spilu

# Incomplete LU preconditioner
ilu = spilu(A_sparse.tocsc())
M = LinearOperator(A_sparse.shape, matvec=ilu.solve)
x, info = gmres(A_sparse, b, M=M, tol=1e-10)

Randomized algorithms

For matrices too large for exact SVD, randomized methods approximate the top-k singular values in O(mnk) instead of O(mn·min(m,n)):

from sklearn.utils.extmath import randomized_svd

U, sigma, Vt = randomized_svd(A, n_components=50, random_state=42)

The Johnson-Lindenstrauss lemma guarantees that random projections preserve distances with high probability, making these approximations reliable for dimensionality reduction.

Batched operations

Modern applications (deep learning, Monte Carlo simulations) operate on stacks of matrices. NumPy supports batched operations natively:

# 100 matrices, each 4x4
batch = np.random.randn(100, 4, 4)

# Invert all 100 in one call
inverses = np.linalg.inv(batch)

# Batched eigenvalues
eigenvalues = np.linalg.eigvalsh(batch)  # shape: (100, 4)

# Batched matrix multiply
A = np.random.randn(100, 4, 3)
B = np.random.randn(100, 3, 5)
C = A @ B  # shape: (100, 4, 5)

This avoids Python-level loops and enables BLAS to optimize memory access across the batch.

Numerical stability patterns

Pivoting

LU decomposition with partial pivoting reorders rows to avoid dividing by small numbers. NumPy’s solve does this automatically. If you implement custom solvers, always pivot.

Regularization

For ill-conditioned systems, add a small diagonal term (Tikhonov regularization):

# Instead of solving A @ x = b directly:
alpha = 1e-6
x = np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ b)

# Better: use lstsq with built-in conditioning
x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)

Mixed precision

For large-scale problems, compute in float32 for speed and refine in float64:

A32 = A.astype(np.float32)
x32 = np.linalg.solve(A32, b.astype(np.float32))
# Iterative refinement
residual = b - A @ x32.astype(np.float64)
correction = np.linalg.solve(A, residual)
x_refined = x32.astype(np.float64) + correction

Real-world case: PCA from scratch

Principal Component Analysis is eigendecomposition of the covariance matrix:

def pca(X, n_components):
    """X: (n_samples, n_features) centered data."""
    # Covariance matrix
    cov = (X.T @ X) / (X.shape[0] - 1)
    
    # Top eigenvalues/vectors (eigh returns ascending order)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    
    # Take top-k (reverse for descending)
    idx = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, idx]
    
    # Project data
    return X @ components, eigenvalues[idx]

For datasets with more samples than features, this is efficient. When features outnumber samples, use SVD on the data matrix directly — it avoids forming the n_features × n_features covariance matrix.

One thing to remember: Know your matrix structure — symmetric, sparse, banded, positive definite — because choosing the right solver for that structure is the single biggest performance lever in numerical linear algebra.

pythonnumpylinear-algebramathperformance