NumPy Ufunc Creation — Deep Dive
Technical foundation
A NumPy ufunc is a C struct (PyUFuncObject) that wraps one or more typed inner loops. When called, the ufunc machinery selects the appropriate loop based on input dtypes, sets up the broadcasting iterator, and dispatches the computation. Creating a custom ufunc means registering your own inner loops into this machinery.
The ufunc protocol in detail
Every ufunc call follows this sequence:
- Type resolution — determine input/output dtypes from the call signature and available loops.
- Broadcasting — align input shapes via the broadcast iterator.
- Loop selection — pick the inner loop matching the resolved types.
- Execution — iterate over elements, calling the inner loop with stride information.
- Output — write results to the output array (pre-allocated or newly created).
Custom ufuncs participate in this same pipeline, which is why they automatically support out, where, casting, order, and other standard ufunc keyword arguments.
Numba @vectorize — the practical choice
Basic usage with multiple signatures
from numba import vectorize, int32, int64, float32, float64
@vectorize([
int32(int32, int32),
int64(int64, int64),
float32(float32, float32),
float64(float64, float64),
])
def safe_divide(a, b):
if b == 0:
return 0
return a / b
Numba compiles four separate inner loops, one per type signature. NumPy’s type resolution selects the appropriate one at call time.
Using .reduce() and .accumulate()
import numpy as np
# safe_divide.reduce works like np.add.reduce but with your function
arr = np.array([100.0, 2.0, 5.0, 2.0])
print(safe_divide.reduce(arr)) # ((100 / 2) / 5) / 2 = 5.0
print(safe_divide.accumulate(arr)) # [100. 50. 10. 5.]
For .reduce() to work correctly, your function should be associative (or at least the reduction order should match your intent — NumPy reduces left-to-right by default).
Generalized ufuncs (@guvectorize)
Regular ufuncs operate element-wise. Generalized ufuncs operate on sub-arrays, defined by a signature string:
from numba import guvectorize, float64
@guvectorize([(float64[:], float64[:], float64[:])], '(n),(n)->(n)')
def weighted_sum(a, b, result):
total = 0.0
for i in range(a.shape[0]):
total += a[i]
for i in range(result.shape[0]):
result[i] = a[i] * b[i] / total
# Works on 1D arrays
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])
print(weighted_sum(a, b)) # [1.67, 6.67, 15.0]
# Broadcasting: works on batches too
batch_a = np.random.randn(100, 5)
batch_b = np.random.randn(100, 5)
result = weighted_sum(batch_a, batch_b) # (100, 5) — each row processed independently
The signature '(n),(n)->(n)' means: “take two 1D arrays of the same length and produce one 1D array of the same length.” The outer dimensions (here, the 100 rows) are broadcast automatically.
Signature patterns
| Signature | Meaning |
|---|---|
'(),()->()' | Element-wise (same as @vectorize) |
'(n)->()' | Reduce a vector to a scalar |
'(m,n)->(n)' | Process matrix rows |
'(m,n),(n,p)->(m,p)' | Matrix multiplication pattern |
'(n),(n)->()' | Dot product pattern |
Building ufuncs with the C API
For maximum control, you can write ufunc inner loops in C:
// Inner loop function signature
static void double_add_loop(
char **args, // input/output data pointers
npy_intp *dimensions, // number of elements
npy_intp *steps, // byte strides
void *data // extra data (unused here)
) {
npy_intp n = dimensions[0];
char *in1 = args[0], *in2 = args[1], *out = args[2];
npy_intp s1 = steps[0], s2 = steps[1], so = steps[2];
for (npy_intp i = 0; i < n; i++) {
*(double *)out = *(double *)in1 + *(double *)in2;
in1 += s1; in2 += s2; out += so;
}
}
Register it in your module’s init function:
PyUFuncObject *ufunc = (PyUFuncObject *)PyUFunc_FromFuncAndData(
loop_funcs, // array of function pointers
loop_data, // array of void* data per loop
loop_types, // array of type signatures
1, // number of loops
2, // number of inputs
1, // number of outputs
PyUFunc_None, // identity for reduce (None = no identity)
"custom_add", // name
"Custom add doc", // docstring
0 // unused
);
This is what SciPy uses internally for its special functions. The overhead is significant (build system, type arrays, error handling), but the result is a ufunc indistinguishable from NumPy’s built-in ones.
np.frompyfunc internals and limitations
np.frompyfunc creates a ufunc whose inner loop calls PyObject_Call for each element. This means:
- Every element is boxed to a Python object, passed to the function, and the result is unboxed.
- The GIL is held for the entire operation.
- Output dtype is always
object.
import numpy as np
def cube(x):
return x ** 3
cube_ufunc = np.frompyfunc(cube, 1, 1)
result = cube_ufunc(np.arange(5))
print(result.dtype) # object — not int64!
print(result.astype('i8')) # must cast manually
Despite the overhead, frompyfunc creates a real ufunc with reduce, accumulate, and outer methods:
mul = np.frompyfunc(lambda a, b: a * b, 2, 1)
print(mul.reduce([1, 2, 3, 4, 5])) # 120 — factorial via reduce
Performance comparison
Benchmarking a simple operation f(a, b) = sqrt(a^2 + b^2) on 1M float64 elements:
| Method | Time (ms) | Relative |
|---|---|---|
| Pure NumPy expression | 3.2 | 1x |
Numba @vectorize | 3.5 | 1.1x |
np.vectorize | 850 | 265x |
np.frompyfunc | 920 | 287x |
| Python for-loop | 1400 | 437x |
Numba matches pure NumPy for simple expressions and pulls ahead for complex branching logic that NumPy cannot express without temporaries.
Advanced: custom type resolution
NumPy 2.0 introduced PyUFunc_AddWrappingLoop for custom type resolution. This lets your ufunc handle new dtypes (like datetime64 or custom extension types) without modifying the ufunc’s core loops:
# Conceptual — actual API is C-level
# Register a loop that handles your custom dtype
ufunc.add_loop(
my_custom_dtype,
inner_loop_func,
output_dtype=my_custom_dtype,
)
Practical recipe: domain-specific ufunc
from numba import vectorize, float64
import numpy as np
@vectorize([float64(float64, float64, float64)])
def black_scholes_d1(S, K, sigma):
"""Compute d1 term of Black-Scholes formula (simplified, T=1, r=0)."""
if sigma <= 0 or S <= 0 or K <= 0:
return 0.0
from math import log, sqrt
return (log(S / K) + 0.5 * sigma ** 2) / (sigma * sqrt(1.0))
# Vectorized over arrays of options
spots = np.linspace(80, 120, 10000)
strikes = np.full(10000, 100.0)
vols = np.full(10000, 0.2)
d1 = black_scholes_d1(spots, strikes, vols)
This processes 10,000 options in microseconds, with full broadcasting support if you pass arrays of different shapes.
The one thing to remember: Numba @vectorize gives you C-speed custom ufuncs with full NumPy protocol support — it is the sweet spot between np.vectorize convenience and C API power.
See Also
- Python Bokeh Get an intuitive feel for Bokeh so Python behavior stops feeling unpredictable.
- Python Numpy Advanced Indexing How to cherry-pick exactly the data you want from a NumPy array using lists, masks, and fancy tricks.
- Python Numpy Broadcasting Rules How NumPy magically makes different-sized arrays work together without you writing any loops.
- Python Numpy Einsum One tiny function that replaces dozens of NumPy operations — once you learn its shorthand, array math becomes a breeze.
- Python Numpy Fft Spectral How NumPy breaks apart a signal into its hidden frequencies — like separating a chord into individual notes.