Matrix Operations — Deep Dive

Performance-oriented matrix operations in Python: memory layout, BLAS acceleration, sparse matrices, and GPU offloading for production-scale linear algebra.

Memory layout and performance

NumPy arrays have a flags attribute that tells you whether the data is stored in C-contiguous (row-major) or Fortran-contiguous (column-major) order. This matters enormously for performance because modern CPUs load data in cache lines. Iterating along the contiguous dimension is fast; iterating across it causes cache misses.

import numpy as np

A = np.random.randn(1000, 1000)
print(A.flags['C_CONTIGUOUS'])  # True by default

# Force Fortran order
A_f = np.asfortranarray(A)
print(A_f.flags['F_CONTIGUOUS'])  # True

Matrix multiplication with A @ B delegates to BLAS (Basic Linear Algebra Subprograms). The BLAS implementation — OpenBLAS, MKL, or ATLAS — determines actual speed. MKL-backed NumPy can be 2–5× faster than default OpenBLAS on Intel hardware for large matrices.

Check your BLAS backend:

np.show_config()

Strided tricks for advanced views

NumPy’s as_strided allows creating views with custom memory strides. This is powerful but dangerous — incorrect strides read garbage memory.

A practical use case is building a sliding window matrix without copying data:

from numpy.lib.stride_tricks import sliding_window_view

signal = np.arange(100, dtype=np.float64)
windows = sliding_window_view(signal, window_shape=10)
# windows.shape: (91, 10) — zero-copy view

For Toeplitz or Hankel matrix construction, strided views avoid allocating the full dense matrix.

Sparse matrices with SciPy

When matrices are mostly zeros (common in recommendation systems, graph adjacency, and NLP), storing every element wastes memory and computation. SciPy provides several sparse formats:

Format	Best for	Access pattern
CSR (Compressed Sparse Row)	Row slicing, matrix-vector multiply	Fast row access
CSC (Compressed Sparse Column)	Column slicing, transposed operations	Fast column access
COO (Coordinate)	Incremental construction	Fast construction
LIL (List of Lists)	Row-by-row construction	Flexible insertion

from scipy import sparse

# Build in COO, convert to CSR for computation
rows = [0, 0, 1, 2, 2]
cols = [0, 2, 1, 0, 2]
data = [1.0, 3.0, 2.0, 4.0, 5.0]
coo = sparse.coo_matrix((data, (rows, cols)), shape=(3, 3))
csr = coo.tocsr()

# Sparse matrix-vector multiply
x = np.array([1.0, 2.0, 3.0])
result = csr @ x  # Only touches non-zero entries

For a 100,000×100,000 matrix with 0.01% density, sparse storage uses ~80KB vs ~80GB dense.

Blocked and tiled multiplication

For matrices that exceed CPU cache, blocking (tiling) improves performance by operating on sub-matrices that fit in L1 or L2 cache. BLAS implementations do this internally, but understanding the principle matters when you write custom kernels or use Numba:

import numba

@numba.njit(parallel=True)
def blocked_matmul(A, B, block_size=64):
    m, n = A.shape[0], B.shape[1]
    k = A.shape[1]
    C = np.zeros((m, n))
    for ii in numba.prange(0, m, block_size):
        for jj in range(0, n, block_size):
            for kk in range(0, k, block_size):
                for i in range(ii, min(ii + block_size, m)):
                    for j in range(jj, min(jj + block_size, n)):
                        s = 0.0
                        for ki in range(kk, min(kk + block_size, k)):
                            s += A[i, ki] * B[ki, j]
                        C[i, j] += s
    return C

Eigendecomposition and SVD

Two decompositions dominate applied linear algebra:

Eigendecomposition factors a square matrix A into A = QΛQ⁻¹, where Λ is a diagonal matrix of eigenvalues and Q holds eigenvectors. Used in PCA, stability analysis, and graph spectral methods.

eigenvalues, eigenvectors = np.linalg.eigh(symmetric_matrix)
# eigh is faster and more stable than eig for symmetric matrices

Singular Value Decomposition (SVD) works on any matrix: A = UΣVᵀ. It powers dimensionality reduction, matrix completion (Netflix recommendation), and least-squares solutions.

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
# Truncated SVD: keep top k components
k = 50
A_approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

For large-scale SVD, use scipy.sparse.linalg.svds or sklearn.decomposition.TruncatedSVD which operate on sparse matrices without densifying.

GPU-accelerated matrices

For matrices beyond 10,000×10,000, GPU acceleration delivers 10–100× speedups. CuPy provides a NumPy-compatible API backed by CUDA:

import cupy as cp

A_gpu = cp.random.randn(10000, 10000)
B_gpu = cp.random.randn(10000, 10000)
C_gpu = A_gpu @ B_gpu  # Runs on GPU via cuBLAS
C_cpu = cp.asnumpy(C_gpu)  # Transfer back

PyTorch tensors offer similar GPU matrix operations with the added benefit of automatic differentiation for machine learning workflows.

Numerical stability

Matrix operations accumulate floating-point errors. Key practices:

Condition number — np.linalg.cond(A) measures how sensitive the output is to input perturbations. Condition numbers above 10¹⁰ signal trouble.
Use lstsq instead of solving normal equations — np.linalg.lstsq(A, b) is numerically stable; computing inv(A.T @ A) @ A.T @ b is not.
Prefer eigh over eig for symmetric matrices — exploits symmetry for better stability and half the computation.
Scale your data — matrices with entries spanning many orders of magnitude suffer worse rounding errors.

Real-world application: image transformation

Every 2D geometric transformation — rotation, scaling, shearing, translation — can be represented as a 3×3 matrix (using homogeneous coordinates). Applying a transformation to an image means multiplying each pixel coordinate by this matrix:

import cv2

# 30-degree rotation around image center
rows, cols = img.shape[:2]
center = (cols / 2, rows / 2)
M = cv2.getRotationMatrix2D(center, 30, 1.0)
rotated = cv2.warpAffine(img, M, (cols, rows))

Chaining transformations is just matrix multiplication: rotate then scale = scale_matrix @ rotate_matrix (applied right to left).

One thing to remember: Matrix operations are only as fast as your BLAS backend and memory layout allow — profiling np.show_config() and checking contiguity with .flags are the first steps to real performance.

pythonmathlinear-algebraperformance