Matrix Operations — Deep Dive
Memory layout and performance
NumPy arrays have a flags attribute that tells you whether the data is stored in C-contiguous (row-major) or Fortran-contiguous (column-major) order. This matters enormously for performance because modern CPUs load data in cache lines. Iterating along the contiguous dimension is fast; iterating across it causes cache misses.
import numpy as np
A = np.random.randn(1000, 1000)
print(A.flags['C_CONTIGUOUS']) # True by default
# Force Fortran order
A_f = np.asfortranarray(A)
print(A_f.flags['F_CONTIGUOUS']) # True
Matrix multiplication with A @ B delegates to BLAS (Basic Linear Algebra Subprograms). The BLAS implementation — OpenBLAS, MKL, or ATLAS — determines actual speed. MKL-backed NumPy can be 2–5× faster than default OpenBLAS on Intel hardware for large matrices.
Check your BLAS backend:
np.show_config()
Strided tricks for advanced views
NumPy’s as_strided allows creating views with custom memory strides. This is powerful but dangerous — incorrect strides read garbage memory.
A practical use case is building a sliding window matrix without copying data:
from numpy.lib.stride_tricks import sliding_window_view
signal = np.arange(100, dtype=np.float64)
windows = sliding_window_view(signal, window_shape=10)
# windows.shape: (91, 10) — zero-copy view
For Toeplitz or Hankel matrix construction, strided views avoid allocating the full dense matrix.
Sparse matrices with SciPy
When matrices are mostly zeros (common in recommendation systems, graph adjacency, and NLP), storing every element wastes memory and computation. SciPy provides several sparse formats:
| Format | Best for | Access pattern |
|---|---|---|
| CSR (Compressed Sparse Row) | Row slicing, matrix-vector multiply | Fast row access |
| CSC (Compressed Sparse Column) | Column slicing, transposed operations | Fast column access |
| COO (Coordinate) | Incremental construction | Fast construction |
| LIL (List of Lists) | Row-by-row construction | Flexible insertion |
from scipy import sparse
# Build in COO, convert to CSR for computation
rows = [0, 0, 1, 2, 2]
cols = [0, 2, 1, 0, 2]
data = [1.0, 3.0, 2.0, 4.0, 5.0]
coo = sparse.coo_matrix((data, (rows, cols)), shape=(3, 3))
csr = coo.tocsr()
# Sparse matrix-vector multiply
x = np.array([1.0, 2.0, 3.0])
result = csr @ x # Only touches non-zero entries
For a 100,000×100,000 matrix with 0.01% density, sparse storage uses ~80KB vs ~80GB dense.
Blocked and tiled multiplication
For matrices that exceed CPU cache, blocking (tiling) improves performance by operating on sub-matrices that fit in L1 or L2 cache. BLAS implementations do this internally, but understanding the principle matters when you write custom kernels or use Numba:
import numba
@numba.njit(parallel=True)
def blocked_matmul(A, B, block_size=64):
m, n = A.shape[0], B.shape[1]
k = A.shape[1]
C = np.zeros((m, n))
for ii in numba.prange(0, m, block_size):
for jj in range(0, n, block_size):
for kk in range(0, k, block_size):
for i in range(ii, min(ii + block_size, m)):
for j in range(jj, min(jj + block_size, n)):
s = 0.0
for ki in range(kk, min(kk + block_size, k)):
s += A[i, ki] * B[ki, j]
C[i, j] += s
return C
Eigendecomposition and SVD
Two decompositions dominate applied linear algebra:
Eigendecomposition factors a square matrix A into A = QΛQ⁻¹, where Λ is a diagonal matrix of eigenvalues and Q holds eigenvectors. Used in PCA, stability analysis, and graph spectral methods.
eigenvalues, eigenvectors = np.linalg.eigh(symmetric_matrix)
# eigh is faster and more stable than eig for symmetric matrices
Singular Value Decomposition (SVD) works on any matrix: A = UΣVᵀ. It powers dimensionality reduction, matrix completion (Netflix recommendation), and least-squares solutions.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
# Truncated SVD: keep top k components
k = 50
A_approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
For large-scale SVD, use scipy.sparse.linalg.svds or sklearn.decomposition.TruncatedSVD which operate on sparse matrices without densifying.
GPU-accelerated matrices
For matrices beyond 10,000×10,000, GPU acceleration delivers 10–100× speedups. CuPy provides a NumPy-compatible API backed by CUDA:
import cupy as cp
A_gpu = cp.random.randn(10000, 10000)
B_gpu = cp.random.randn(10000, 10000)
C_gpu = A_gpu @ B_gpu # Runs on GPU via cuBLAS
C_cpu = cp.asnumpy(C_gpu) # Transfer back
PyTorch tensors offer similar GPU matrix operations with the added benefit of automatic differentiation for machine learning workflows.
Numerical stability
Matrix operations accumulate floating-point errors. Key practices:
- Condition number —
np.linalg.cond(A)measures how sensitive the output is to input perturbations. Condition numbers above 10¹⁰ signal trouble. - Use
lstsqinstead of solving normal equations —np.linalg.lstsq(A, b)is numerically stable; computinginv(A.T @ A) @ A.T @ bis not. - Prefer
eighovereigfor symmetric matrices — exploits symmetry for better stability and half the computation. - Scale your data — matrices with entries spanning many orders of magnitude suffer worse rounding errors.
Real-world application: image transformation
Every 2D geometric transformation — rotation, scaling, shearing, translation — can be represented as a 3×3 matrix (using homogeneous coordinates). Applying a transformation to an image means multiplying each pixel coordinate by this matrix:
import cv2
# 30-degree rotation around image center
rows, cols = img.shape[:2]
center = (cols / 2, rows / 2)
M = cv2.getRotationMatrix2D(center, 30, 1.0)
rotated = cv2.warpAffine(img, M, (cols, rows))
Chaining transformations is just matrix multiplication: rotate then scale = scale_matrix @ rotate_matrix (applied right to left).
One thing to remember: Matrix operations are only as fast as your BLAS backend and memory layout allow — profiling np.show_config() and checking contiguity with .flags are the first steps to real performance.
See Also
- Python Bayesian Inference How updating your beliefs with new evidence works — and why it helps computers make smarter guesses.
- Python Convolution Operations The sliding-window trick that lets computers sharpen photos, recognize faces, and hear words in noisy audio.
- Python Fourier Transforms How breaking any sound, image, or signal into simple waves reveals hidden patterns invisible to the naked eye.
- Python Genetic Algorithms How computers borrow evolution's playbook — survival of the fittest, mutation, and reproduction — to solve problems too complicated for brute force.
- Python Linear Algebra Numpy Why solving puzzles with rows and columns of numbers is the secret engine behind search engines, video games, and AI.