Backtesting Trading Strategies with Python — Deep Dive
Event-driven vs. vectorized backtesting
There are two architectural approaches, each with distinct tradeoffs.
Vectorized backtesting processes entire arrays of data at once using NumPy and pandas operations. It is fast, often orders of magnitude faster than a bar-by-bar Python loop, but it cannot easily model complex order types, partial fills, or path-dependent logic such as trailing stops.
import numpy as np
import pandas as pd


def vectorized_sma_crossover(prices: pd.Series, fast: int = 10, slow: int = 30) -> pd.DataFrame:
    sma_fast = prices.rolling(fast).mean()
    sma_slow = prices.rolling(slow).mean()

    # +1 = long, -1 = short
    signal = pd.Series(np.where(sma_fast > sma_slow, 1.0, -1.0), index=prices.index)
    # Stay flat during warm-up: NaN compares as False and would otherwise
    # be read as a short signal
    signal[sma_fast.isna() | sma_slow.isna()] = 0.0

    daily_returns = prices.pct_change()
    strategy_returns = signal.shift(1) * daily_returns  # shift avoids look-ahead

    return pd.DataFrame({
        "market_return": daily_returns,
        "strategy_return": strategy_returns,
        "cumulative": (1 + strategy_returns).cumprod(),
    })
Event-driven backtesting processes each bar (or tick) sequentially, maintaining state for open orders, positions, and cash. It is slower but mirrors live trading logic closely, making the transition from backtest to production smoother.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Position:
    symbol: str
    shares: int
    entry_price: float
    entry_date: str


@dataclass
class Portfolio:
    cash: float
    positions: dict[str, Position] = field(default_factory=dict)

    @property
    def equity(self) -> float:
        # In a real system, mark-to-market positions here
        return self.cash + sum(
            p.shares * p.entry_price for p in self.positions.values()
        )

    def execute_buy(self, symbol: str, shares: int, price: float, date: str, cost_bps: float = 10.0) -> bool:
        gross = shares * price
        cost = gross * cost_bps / 10_000  # transaction cost in basis points
        # One position per symbol: adding to an open position is not modeled,
        # and overwriting it would silently discard the existing shares
        if symbol in self.positions or gross + cost > self.cash:
            return False
        self.cash -= gross + cost
        self.positions[symbol] = Position(symbol, shares, price, date)
        return True

    def execute_sell(self, symbol: str, price: float, cost_bps: float = 10.0) -> Optional[float]:
        if symbol not in self.positions:
            return None
        pos = self.positions.pop(symbol)
        gross = pos.shares * price
        cost = gross * cost_bps / 10_000
        self.cash += gross - cost
        # Trade return on price alone, before transaction costs
        return (price - pos.entry_price) / pos.entry_price
The choice depends on strategy complexity. Use vectorized for rapid signal research; switch to event-driven for strategies involving stop-losses, limit orders, or multi-asset rebalancing.
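To make the path-dependence point concrete, here is a minimal sketch of a trailing stop evaluated bar by bar. The helper name `trailing_stop_exit` and the 5% trail are illustrative assumptions, not part of any library. The exit depends on the running high since entry, which is the kind of state an event-driven loop tracks naturally.

```python
from typing import Optional


def trailing_stop_exit(prices: list[float], trail_pct: float = 0.05) -> Optional[int]:
    """Return the index of the bar where a long position entered at bar 0
    is stopped out, or None if the stop is never hit.

    Hypothetical helper for illustration: the stop sits trail_pct below the
    highest price seen since entry and only ever moves up.
    """
    peak = prices[0]
    for i, price in enumerate(prices[1:], start=1):
        peak = max(peak, price)              # ratchet the reference high
        stop_level = peak * (1 - trail_pct)
        if price <= stop_level:
            return i                         # first bar at or below the stop
    return None


prices = [100.0, 104.0, 110.0, 107.0, 103.0, 112.0]
exit_bar = trailing_stop_exit(prices, trail_pct=0.05)
print(exit_bar)  # 4: the stop ratcheted up to 104.5 after the 110 high
```

Note that the later rally to 112 is irrelevant once the stop has fired. A backtest that only looks at entry and final prices would miss this.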
Walk-forward optimization
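Walk-forward analysis is widely treated as the reference validation method because it simulates how a strategy would actually be maintained over time: parameters are re-optimized on recent history, then traded on data the optimizer has never seen.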
import pandas as pd
import numpy as np
from typing import Callable


def walk_forward(
    prices: pd.Series,
    train_window: int,
    test_window: int,
    optimize_fn: Callable,
    backtest_fn: Callable,
) -> list[dict]:
    """
    Slide a training window through the data, optimize parameters,
    then test on the following out-of-sample period.
    """
    results = []
    start = 0

    while start + train_window + test_window <= len(prices):
        train = prices.iloc[start : start + train_window]
        test = prices.iloc[start + train_window : start + train_window + test_window]

        best_params = optimize_fn(train)
        oos_return = backtest_fn(test, best_params)

        results.append({
            "train_start": train.index[0],
            "test_start": test.index[0],
            "test_end": test.index[-1],
            "params": best_params,
            "oos_return": oos_return,
        })
        start += test_window  # slide forward

    return results
The key insight: if out-of-sample performance degrades sharply relative to in-sample performance, the strategy is likely overfit. Consistent (even if lower) out-of-sample returns across windows are evidence of a more robust edge.
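The in-sample versus out-of-sample comparison can be sketched as follows. Everything here is an illustrative assumption: the synthetic random-walk prices, the three-pair parameter grid, the long/flat SMA rule, and the helper name `sma_strategy_returns`. The signal is computed on the combined train and test window so the moving averages are already warmed up when the out-of-sample bars begin; only the out-of-sample bars are scored.

```python
import numpy as np
import pandas as pd


def sma_strategy_returns(prices: pd.Series, fast: int, slow: int) -> pd.Series:
    """Daily returns of a long/flat SMA crossover, with no transaction costs."""
    signal = (prices.rolling(fast).mean() > prices.rolling(slow).mean()).astype(float)
    # shift(1): yesterday's signal earns today's return (no look-ahead)
    return (signal.shift(1) * prices.pct_change()).fillna(0.0)


# Synthetic random-walk prices, used only so the sketch runs end to end
rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 1000))))

grid = [(5, 20), (10, 30), (20, 60)]   # hypothetical (fast, slow) candidates
train_window, test_window = 250, 60
rows = []

for start in range(0, len(prices) - train_window - test_window + 1, test_window):
    window = prices.iloc[start : start + train_window + test_window]
    train = window.iloc[:train_window]

    # Optimize on the training slice only
    best = max(grid, key=lambda p: sma_strategy_returns(train, *p).mean())
    is_mean = sma_strategy_returns(train, *best).mean()

    # Score the chosen parameters on the bars after the training slice
    oos_mean = sma_strategy_returns(window, *best).iloc[train_window:].mean()

    rows.append({"is_mean_daily": is_mean, "oos_mean_daily": oos_mean})

summary = pd.DataFrame(rows)
print(summary.mean())
```

Mean daily return is used for both columns so that the 250-bar training slices and 60-bar test slices are directly comparable. On data with no real edge, such as this random walk, any in-sample advantage is selection bias and should not be expected to carry over out of sample.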
Statistical validation
The multiple testing problem
If you test 100 strategy variants that have no real edge, roughly 5 will still appear significant at the 5% level by pure chance. Adjustments are essential:
- Bonferroni correction: divide the significance threshold by the number of tests. Simple but conservative.
- Deflated Sharpe Ratio (Bailey and López de Prado): adjusts the Sharpe Ratio for the number of strategies tried, their correlation, and skewness/kurtosis of returns.
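from scipy.stats import norm
import numpy as np


def deflated_sharpe_ratio(
    observed_sharpe: float,
    num_trials: int,
    variance_of_sharpes: float,
    skewness: float = 0.0,
    kurtosis: float = 3.0,
    num_returns: int = 252,
) -> float:
    """
    Probability that the observed Sharpe Ratio is genuine,
    adjusting for multiple testing.

    observed_sharpe is per period (not annualized), at the same frequency
    as the num_returns observations. Requires num_trials >= 2.
    """
    if num_trials < 2:
        raise ValueError("num_trials must be at least 2")

    # The expected maximum Sharpe across trials scales with the standard
    # deviation (not the variance) of the trial Sharpe Ratios
    expected_max_sharpe = np.sqrt(variance_of_sharpes) * (
        (1 - np.euler_gamma) * norm.ppf(1 - 1 / num_trials)
        + np.euler_gamma * norm.ppf(1 - 1 / (num_trials * np.e))
    )

    # Standard error of the Sharpe estimate, adjusted for non-normal returns
    se = np.sqrt(
        (1 - skewness * observed_sharpe + (kurtosis - 1) / 4 * observed_sharpe**2)
        / (num_returns - 1)
    )

    return float(norm.cdf((observed_sharpe - expected_max_sharpe) / se))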
A DSR above 0.95 suggests the Sharpe Ratio is unlikely to be a statistical artifact.
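The false-positive problem and the Bonferroni correction from the list above can be checked with a small simulation. This is a sketch under stated assumptions: 100 hypothetical strategies whose daily returns are drawn from a normal distribution with a true mean of zero, each tested with a one-sample t-test.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(42)
num_strategies, num_days = 100, 252

# Each row is one strategy's daily returns, drawn with zero true mean (no edge)
returns = rng.normal(loc=0.0, scale=0.01, size=(num_strategies, num_days))

# Two-sided t-test of "mean return is zero", one test per strategy
p_values = ttest_1samp(returns, popmean=0.0, axis=1).pvalue

alpha = 0.05
naive_hits = int((p_values < alpha).sum())
bonferroni_hits = int((p_values < alpha / num_strategies).sum())

print(f"significant at {alpha:.0%}: {naive_hits} of {num_strategies}")
print(f"significant after Bonferroni: {bonferroni_hits} of {num_strategies}")
```

Every "significant" strategy in the first count is a false positive by construction. The Bonferroni threshold of 0.05 / 100 = 0.0005 removes most or all of them, at the cost of also making genuine edges harder to detect.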
Combinatorial purged cross-validation
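Standard k-fold cross-validation leaks information between folds because financial observations are serially correlated and labels often span several bars. Combinatorial purged cross-validation (CPCV, by López de Prado) splits the data into contiguous groups, tests on every combination of those groups, and purges the training observations adjacent to each test block, which sharply reduces temporal leakage.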
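The splitting scheme can be sketched as below. This is a simplified illustration, not a reference implementation: the function name `cpcv_splits` and its parameters are assumptions, and purging is done with a fixed number of bars on each side of a test block. A fuller version would purge based on when each label's outcome is known and would typically add an embargo period after each test block.

```python
from itertools import combinations

import numpy as np


def cpcv_splits(n_samples: int, n_groups: int = 6, n_test_groups: int = 2, purge: int = 5):
    """Yield (train_idx, test_idx) arrays for every combination of test groups."""
    groups = np.array_split(np.arange(n_samples), n_groups)  # contiguous blocks

    for test_ids in combinations(range(n_groups), n_test_groups):
        test_idx = np.concatenate([groups[g] for g in test_ids])

        keep = np.ones(n_samples, dtype=bool)
        for g in test_ids:
            lo = max(groups[g][0] - purge, 0)
            hi = min(groups[g][-1] + purge, n_samples - 1)
            keep[lo : hi + 1] = False   # drop the test block and its purge margins

        yield np.flatnonzero(keep), test_idx


splits = list(cpcv_splits(n_samples=600, n_groups=6, n_test_groups=2, purge=5))
print(len(splits))  # C(6, 2) = 15 train/test splits
```

Because every group appears in several test sets, the splits can be recombined into multiple backtest paths instead of the single path a walk-forward run produces.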
Realistic simulation details
Order fill modeling
Naive backtests assume you buy at the close price. Reality is messier:
def simulate_fill(
    order_price: float,
    high: float,
    low: float,
    volume: int,
    order_shares: int,
    spread_pct: float = 0.05,
    side: str = "buy",
) -> tuple[float, int]:
    """
    Estimate fill price and filled quantity.
    Applies spread and limits fill to a fraction of bar volume.
    """
    max_fill = int(volume * 0.02)  # max 2% of bar volume
    filled = min(order_shares, max_fill)

    # Adverse fill: buy slightly above mid, sell slightly below.
    # spread_pct is in percent, so 0.05 means 0.05%.
    direction = 1 if side == "buy" else -1
    fill_price = order_price * (1 + direction * spread_pct / 100)

    # Ensure fill price was achievable within the bar's range
    fill_price = min(fill_price, high)
    fill_price = max(fill_price, low)

    return fill_price, filled
Margin and leverage
Leveraged strategies need accurate margin modeling. A 2× leveraged portfolio does not simply double returns — borrowing costs, margin calls during drawdowns, and forced liquidation at the worst times create nonlinear effects that must be simulated.
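A minimal sketch of the nonlinearity, covering only daily compounding and financing cost. The function name `leveraged_growth`, the 5% borrow rate, and the alternating +2% / -2% return path are illustrative assumptions; margin calls and forced liquidation are deliberately not modeled here and would make the result worse.

```python
import numpy as np


def leveraged_growth(daily_returns: np.ndarray, leverage: float = 2.0, borrow_rate: float = 0.05) -> float:
    """Final equity multiple of a portfolio rebalanced daily to constant leverage.

    Financing is charged daily on the borrowed fraction (leverage - 1)
    at borrow_rate / 252. Margin calls are ignored.
    """
    financing = (leverage - 1) * borrow_rate / 252
    levered = leverage * daily_returns - financing
    return float(np.prod(1 + levered))


# A choppy market: +2% and -2% days alternating for one trading year
daily = np.tile([0.02, -0.02], 126)

unlevered = leveraged_growth(daily, leverage=1.0)
levered = leveraged_growth(daily, leverage=2.0)

print(f"1x total return: {unlevered - 1:.2%}")
print(f"2x total return: {levered - 1:.2%}")
```

On this path the 2× portfolio loses considerably more than twice what the unlevered portfolio loses, because larger daily swings compound into a larger drag and financing is paid every day regardless of direction.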
Performance reporting
A production-quality backtest report covers return, risk-adjusted return, drawdown, and a benchmark reference:
import numpy as np
import pandas as pd


def performance_report(returns: pd.Series, benchmark: pd.Series, risk_free: float = 0.04) -> dict:
    """Comprehensive strategy performance metrics from daily returns."""
    returns = returns.dropna()
    benchmark = benchmark.dropna()
    total_days = len(returns)
    excess = returns - risk_free / 252

    cumulative = (1 + returns).cumprod()
    running_max = cumulative.cummax()
    drawdown = (cumulative - running_max) / running_max

    # Geometric annualization: compound realized growth rather than the mean return
    annual_return = cumulative.iloc[-1] ** (252 / total_days) - 1
    benchmark_annual = (1 + benchmark).prod() ** (252 / len(benchmark)) - 1
    annual_vol = returns.std() * np.sqrt(252)
    sharpe = excess.mean() / returns.std() * np.sqrt(252) if returns.std() > 0 else 0

    # Sortino: only downside deviation, annualized the same way as Sharpe
    downside = returns[returns < 0].std() * np.sqrt(252)
    sortino = excess.mean() * 252 / downside if downside > 0 else 0

    # Calmar: annual return / max drawdown
    max_dd = abs(drawdown.min())
    calmar = annual_return / max_dd if max_dd > 0 else 0

    # Win rate
    winning_days = (returns > 0).sum()

    return {
        "annual_return": f"{annual_return:.2%}",
        "benchmark_annual_return": f"{benchmark_annual:.2%}",
        "annual_volatility": f"{annual_vol:.2%}",
        "sharpe_ratio": round(sharpe, 2),
        "sortino_ratio": round(sortino, 2),
        "calmar_ratio": round(calmar, 2),
        "max_drawdown": f"{max_dd:.2%}",
        "win_rate": f"{winning_days / total_days:.1%}",
        "best_day": f"{returns.max():.2%}",
        "worst_day": f"{returns.min():.2%}",
        "total_trading_days": total_days,
    }
From backtest to live trading
The transition requires:
- Paper trading period: run the strategy with real market data but simulated fills for 1–3 months. Compare paper results against what the backtest predicted for the same period.
- Gradual capital allocation: start with a fraction of intended capital. Scale up only after live metrics match backtest expectations within tolerance.
- Kill switch: automated rules that halt trading if drawdown exceeds a threshold or if data feeds go stale.
- Reconciliation: daily comparison of expected positions (from the strategy engine) versus actual positions (from the broker).
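The kill switch and reconciliation items above can be sketched as two small checks. The function names, the 10% drawdown limit, and the 60-second staleness limit are hypothetical choices for illustration. A real deployment would read equity, tick timestamps, and positions from its own engine and broker interface.

```python
from datetime import datetime, timedelta


def should_halt(
    equity_curve: list[float],
    last_tick: datetime,
    now: datetime,
    max_drawdown: float = 0.10,
    max_staleness: timedelta = timedelta(seconds=60),
) -> bool:
    """Return True if trading should stop: drawdown too deep or data too old."""
    peak = max(equity_curve)
    drawdown = (peak - equity_curve[-1]) / peak
    stale = now - last_tick > max_staleness
    return drawdown > max_drawdown or stale


def reconcile(expected: dict[str, int], actual: dict[str, int]) -> dict[str, int]:
    """Return {symbol: actual - expected} for every symbol whose share count differs."""
    symbols = expected.keys() | actual.keys()
    diffs = {s: actual.get(s, 0) - expected.get(s, 0) for s in symbols}
    return {s: d for s, d in diffs.items() if d != 0}


now = datetime(2024, 1, 2, 15, 30)

# Equity fell from a 104,000 peak to 92,000: about 11.5%, beyond the 10% limit
halt = should_halt([100_000, 104_000, 92_000], last_tick=now - timedelta(seconds=5), now=now)

# The broker shows 10 fewer MSFT shares and an unexpected TSLA position
breaks = reconcile({"AAPL": 100, "MSFT": 50}, {"AAPL": 100, "MSFT": 40, "TSLA": 10})

print(halt)    # True
print(breaks)  # MSFT: -10, TSLA: 10 (key order may vary)
```

Any non-empty result from the reconciliation check is worth investigating before the next session, since it means the strategy engine and the broker disagree about what is actually held.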
Common anti-patterns
- Optimization on the full dataset: always hold out data for testing.
- Ignoring regime changes: a momentum strategy tuned on a bull market will suffer in sideways or bear conditions.
- Cherry-picking start dates: shifting the start date by a few months can dramatically change results. Report performance across multiple windows.
- No benchmark comparison: a strategy that returns 10% annually sounds great until you realize the S&P 500 returned 12% in the same period with zero effort.
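Start-date sensitivity is straightforward to measure. The sketch below recomputes annualized return for the same return series from several start offsets. The helper name `annualized_return_by_start`, the quarterly offsets, and the synthetic returns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def annualized_return_by_start(
    returns: pd.Series, starts: list[int], periods_per_year: int = 252
) -> pd.Series:
    """Annualized return of one return series, measured from several start offsets."""
    out = {}
    for s in starts:
        window = returns.iloc[s:]
        growth = float((1 + window).prod())
        out[s] = growth ** (periods_per_year / len(window)) - 1
    return pd.Series(out, name="annualized_return")


# Synthetic daily returns (about five years), for illustration only
rng = np.random.default_rng(7)
returns = pd.Series(rng.normal(0.0004, 0.01, 1260))

by_start = annualized_return_by_start(returns, starts=[0, 63, 126, 189, 252])
print(by_start)
print("spread between best and worst start:", by_start.max() - by_start.min())
```

Reporting the whole table, or at least its range, is more honest than quoting the single most flattering start date.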
The one thing to remember: A trustworthy backtest treats every assumption as suspect — fill prices, data cleanliness, parameter stability, and statistical significance all need explicit validation before a strategy earns the right to trade real capital.