Linear Algebra with NumPy — Deep Dive
Under the hood: LAPACK and BLAS
NumPy’s linalg module is a thin wrapper around LAPACK (Linear Algebra PACKage) and BLAS (Basic Linear Algebra Subprograms). When you call np.linalg.solve, NumPy dispatches to dgesv (double-precision general solver). When you call eigh, it dispatches to dsyevd (divide-and-conquer symmetric eigensolver).
The BLAS implementation determines raw performance:
| Backend | Source | Typical speedup over reference BLAS |
|---|---|---|
| OpenBLAS | Open source, auto-detects CPU | 5–20× |
| Intel MKL | Commercial (free in conda) | 10–30× on Intel CPUs |
| Apple Accelerate | macOS built-in | 10–25× on Apple Silicon |
| BLIS | AMD-focused open source | 10–20× on AMD CPUs |
To check and switch:
import numpy as np
np.__config__.show() # Shows linked BLAS/LAPACK
# Install MKL-backed NumPy via conda:
# conda install numpy blas=*=mkl
Thread control for BLAS
BLAS implementations use multiple threads by default. In production, this interacts badly with Python’s own multiprocessing or job schedulers:
import os
os.environ['OMP_NUM_THREADS'] = '1' # OpenMP (OpenBLAS)
os.environ['MKL_NUM_THREADS'] = '1' # Intel MKL
os.environ['OPENBLAS_NUM_THREADS'] = '1' # OpenBLAS specific
# Must be set BEFORE importing numpy
import numpy as np
Use threadpoolctl for runtime control:
from threadpoolctl import threadpool_limits
with threadpool_limits(limits=4, user_api='blas'):
result = np.linalg.svd(large_matrix)
Structured matrix optimizations
General solvers treat every matrix as dense. When your matrix has structure, specialized routines are orders of magnitude faster:
Banded matrices
Matrices where non-zero entries cluster near the diagonal (common in differential equations):
from scipy.linalg import solve_banded
# ab contains the bands; l and u are lower/upper bandwidth
ab = np.array([[0, 2, 2], # upper diagonal
[5, 5, 5], # main diagonal
[1, 1, 0]]) # lower diagonal
b = np.array([1.0, 2.0, 3.0])
x = solve_banded((1, 1), ab, b)
Symmetric positive definite
Use Cholesky decomposition instead of general LU — it is 2× faster and guaranteed stable:
L = np.linalg.cholesky(A) # A = L @ L.T
# Solve via forward then back substitution
from scipy.linalg import cho_solve
x = cho_solve((L, True), b)
Triangular systems
If you already have a triangular matrix, skip the factorization:
from scipy.linalg import solve_triangular
x = solve_triangular(L, b, lower=True)
Iterative solvers for large sparse systems
Direct solvers (LU, Cholesky) have O(n³) complexity and O(n²) memory. For sparse systems with millions of unknowns, iterative solvers are the only option:
from scipy.sparse.linalg import cg, gmres, lgmres
# Conjugate Gradient — for symmetric positive definite
x, info = cg(A_sparse, b, tol=1e-10, maxiter=5000)
# GMRES — for general non-symmetric systems
x, info = gmres(A_sparse, b, tol=1e-10, restart=50)
Convergence depends heavily on preconditioning. An ideal preconditioner M approximates A⁻¹ cheaply:
from scipy.sparse.linalg import LinearOperator, spilu
# Incomplete LU preconditioner
ilu = spilu(A_sparse.tocsc())
M = LinearOperator(A_sparse.shape, matvec=ilu.solve)
x, info = gmres(A_sparse, b, M=M, tol=1e-10)
Randomized algorithms
For matrices too large for exact SVD, randomized methods approximate the top-k singular values in O(mnk) instead of O(mn·min(m,n)):
from sklearn.utils.extmath import randomized_svd
U, sigma, Vt = randomized_svd(A, n_components=50, random_state=42)
The Johnson-Lindenstrauss lemma guarantees that random projections preserve distances with high probability, making these approximations reliable for dimensionality reduction.
Batched operations
Modern applications (deep learning, Monte Carlo simulations) operate on stacks of matrices. NumPy supports batched operations natively:
# 100 matrices, each 4x4
batch = np.random.randn(100, 4, 4)
# Invert all 100 in one call
inverses = np.linalg.inv(batch)
# Batched eigenvalues
eigenvalues = np.linalg.eigvalsh(batch) # shape: (100, 4)
# Batched matrix multiply
A = np.random.randn(100, 4, 3)
B = np.random.randn(100, 3, 5)
C = A @ B # shape: (100, 4, 5)
This avoids Python-level loops and enables BLAS to optimize memory access across the batch.
Numerical stability patterns
Pivoting
LU decomposition with partial pivoting reorders rows to avoid dividing by small numbers. NumPy’s solve does this automatically. If you implement custom solvers, always pivot.
Regularization
For ill-conditioned systems, add a small diagonal term (Tikhonov regularization):
# Instead of solving A @ x = b directly:
alpha = 1e-6
x = np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ b)
# Better: use lstsq with built-in conditioning
x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
Mixed precision
For large-scale problems, compute in float32 for speed and refine in float64:
A32 = A.astype(np.float32)
x32 = np.linalg.solve(A32, b.astype(np.float32))
# Iterative refinement
residual = b - A @ x32.astype(np.float64)
correction = np.linalg.solve(A, residual)
x_refined = x32.astype(np.float64) + correction
Real-world case: PCA from scratch
Principal Component Analysis is eigendecomposition of the covariance matrix:
def pca(X, n_components):
"""X: (n_samples, n_features) centered data."""
# Covariance matrix
cov = (X.T @ X) / (X.shape[0] - 1)
# Top eigenvalues/vectors (eigh returns ascending order)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# Take top-k (reverse for descending)
idx = np.argsort(eigenvalues)[::-1][:n_components]
components = eigenvectors[:, idx]
# Project data
return X @ components, eigenvalues[idx]
For datasets with more samples than features, this is efficient. When features outnumber samples, use SVD on the data matrix directly — it avoids forming the n_features × n_features covariance matrix.
One thing to remember: Know your matrix structure — symmetric, sparse, banded, positive definite — because choosing the right solver for that structure is the single biggest performance lever in numerical linear algebra.
See Also
- Python Bayesian Inference How updating your beliefs with new evidence works — and why it helps computers make smarter guesses.
- Python Convolution Operations The sliding-window trick that lets computers sharpen photos, recognize faces, and hear words in noisy audio.
- Python Fourier Transforms How breaking any sound, image, or signal into simple waves reveals hidden patterns invisible to the naked eye.
- Python Genetic Algorithms How computers borrow evolution's playbook — survival of the fittest, mutation, and reproduction — to solve problems too complicated for brute force.
- Python Markov Chains Why the next thing that happens often depends only on what is happening right now — and how that one rule generates text, predicts weather, and powers board games.