Simulated Annealing — Deep Dive
Theoretical foundations
Simulated annealing’s convergence guarantee rests on the theory of inhomogeneous Markov chains. At each temperature, the Metropolis criterion defines a transition probability that satisfies detailed balance with respect to the Boltzmann distribution:
P(x) ∝ exp(-E(x) / T)
As T → 0, this distribution concentrates on the global minimum. The logarithmic cooling schedule T(k) = C / log(1 + k) guarantees convergence to the global optimum with probability 1 — but convergence is so slow that practitioners always use faster schedules and accept approximate solutions.
Adaptive cooling schedules
Lam-Delosme schedule
Maintains a target acceptance rate (typically ~44% for continuous, ~23% for combinatorial) by adjusting temperature based on observed acceptance:
class AdaptiveSA:
def __init__(self, target_acceptance=0.44, damping=0.7):
self.target = target_acceptance
self.damping = damping
self.acceptance_history = []
def update_temperature(self, T, accepted, total, window=100):
if total < window:
return T
recent_rate = sum(self.acceptance_history[-window:]) / window
if recent_rate > self.target:
# Too many acceptances — cool faster
return T * (1 - self.damping * (recent_rate - self.target))
else:
# Too few acceptances — cool slower (or reheat slightly)
return T * (1 + self.damping * (self.target - recent_rate))
def record(self, accepted):
self.acceptance_history.append(1 if accepted else 0)
Thermodynamic speed limit
The cooling rate should not exceed the system’s ability to equilibrate. A practical rule: at each temperature, run enough iterations for the chain to explore the local energy landscape before cooling. This leads to isothermal step counts that scale with problem difficulty.
Reheating and restart strategies
When the algorithm stagnates, reheating can escape deep local minima:
def sa_with_reheat(cost_fn, initial, neighbor_fn, T_start=100, alpha=0.995,
max_iter=500000, reheat_interval=50000, reheat_factor=2.0):
current = initial
current_cost = cost_fn(current)
best = current.copy()
best_cost = current_cost
T = T_start
stagnation = 0
for i in range(max_iter):
candidate = neighbor_fn(current)
candidate_cost = cost_fn(candidate)
delta = candidate_cost - current_cost
if delta < 0 or np.random.random() < np.exp(-delta / max(T, 1e-15)):
current = candidate
current_cost = candidate_cost
stagnation = 0
if current_cost < best_cost:
best = current.copy()
best_cost = current_cost
else:
stagnation += 1
T *= alpha
# Reheat if stagnating
if stagnation >= reheat_interval:
T = max(T * reheat_factor, T_start * 0.1)
current = best.copy() # Restart from best known
current_cost = best_cost
stagnation = 0
return best, best_cost
Parallel tempering (replica exchange)
Run multiple SA instances at different temperatures simultaneously. Periodically, adjacent temperature replicas attempt to swap their solutions:
from multiprocessing import Process, Queue
import numpy as np
def parallel_tempering(cost_fn, initial_fn, neighbor_fn,
temperatures, swap_interval=100, total_steps=100000):
n_replicas = len(temperatures)
# Initialize replicas
replicas = [initial_fn() for _ in range(n_replicas)]
costs = [cost_fn(r) for r in replicas]
best = min(zip(costs, replicas), key=lambda x: x[0])
for step in range(total_steps):
# Each replica does one SA step at its temperature
for i in range(n_replicas):
candidate = neighbor_fn(replicas[i])
candidate_cost = cost_fn(candidate)
delta = candidate_cost - costs[i]
if delta < 0 or np.random.random() < np.exp(-delta / temperatures[i]):
replicas[i] = candidate
costs[i] = candidate_cost
if costs[i] < best[0]:
best = (costs[i], replicas[i].copy())
# Swap adjacent replicas
if step % swap_interval == 0:
for i in range(n_replicas - 1):
delta_swap = (1/temperatures[i] - 1/temperatures[i+1]) * (costs[i+1] - costs[i])
if np.random.random() < np.exp(delta_swap):
replicas[i], replicas[i+1] = replicas[i+1], replicas[i]
costs[i], costs[i+1] = costs[i+1], costs[i]
return best
# Temperature ladder: geometric spacing
temperatures = [100 * 0.5**i for i in range(8)] # 100, 50, 25, ..., 0.78
Parallel tempering is the gold standard for difficult optimization landscapes. Hot replicas explore widely and pass promising regions to cold replicas via swaps.
Constraint handling strategies
Penalty method with adaptive weights
class AdaptivePenalty:
def __init__(self, constraints, initial_weight=1.0, growth_rate=1.5):
self.constraints = constraints
self.weights = [initial_weight] * len(constraints)
self.growth_rate = growth_rate
self.feasible_count = 0
self.total_count = 0
def penalized_cost(self, solution, base_cost):
penalty = 0
all_feasible = True
for i, constraint in enumerate(self.constraints):
violation = max(0, constraint(solution))
if violation > 0:
all_feasible = False
penalty += self.weights[i] * violation ** 2
self.total_count += 1
if all_feasible:
self.feasible_count += 1
return base_cost + penalty
def adapt_weights(self, target_feasible_ratio=0.5):
if self.total_count == 0:
return
ratio = self.feasible_count / self.total_count
if ratio < target_feasible_ratio:
self.weights = [w * self.growth_rate for w in self.weights]
elif ratio > target_feasible_ratio + 0.1:
self.weights = [w / self.growth_rate for w in self.weights]
self.feasible_count = 0
self.total_count = 0
Repair operators
Instead of penalizing infeasible solutions, repair them:
def repair_schedule(schedule, max_hours_per_worker=8):
"""Fix constraint violations by redistributing excess hours."""
for worker in range(len(schedule)):
hours = sum(schedule[worker])
if hours > max_hours_per_worker:
# Remove random assignments until feasible
assigned = [i for i, h in enumerate(schedule[worker]) if h > 0]
while sum(schedule[worker]) > max_hours_per_worker and assigned:
idx = np.random.choice(assigned)
schedule[worker][idx] = 0
assigned.remove(idx)
return schedule
Neighborhood structures for combinatorial problems
2-opt for routing problems
Reverses a segment of the route — much more effective than random swaps:
def two_opt_neighbor(route):
"""Reverse a random segment of the route."""
new_route = route.copy()
n = len(route)
i = np.random.randint(0, n - 1)
j = np.random.randint(i + 1, n)
new_route[i:j+1] = new_route[i:j+1][::-1]
return new_route
Or-opt
Moves a segment of 1–3 consecutive elements to a different position. Often combined with 2-opt.
Variable neighborhood search
Cycle through multiple neighborhood structures when the algorithm stagnates:
def variable_neighborhood(current, neighborhoods, cost_fn, max_no_improve=1000):
k = 0
no_improve = 0
current_cost = cost_fn(current)
while no_improve < max_no_improve:
neighbor = neighborhoods[k](current)
neighbor_cost = cost_fn(neighbor)
if neighbor_cost < current_cost:
current = neighbor
current_cost = neighbor_cost
k = 0 # Reset to first neighborhood
no_improve = 0
else:
k = (k + 1) % len(neighborhoods)
no_improve += 1
return current, current_cost
SA for machine learning hyperparameter tuning
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
def hyperparameter_cost(params):
lr, n_estimators, max_depth, min_samples_split = params
model = GradientBoostingClassifier(
learning_rate=10**lr, # log scale: -3 to 0
n_estimators=int(n_estimators),
max_depth=int(max_depth),
min_samples_split=int(min_samples_split)
)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
return -scores.mean() # Minimize negative accuracy
def hyperparameter_neighbor(params, step_sizes=[0.1, 20, 1, 2]):
new_params = params.copy()
dim = np.random.randint(len(params))
new_params[dim] += np.random.normal(0, step_sizes[dim])
# Clip to valid ranges
new_params = np.clip(new_params, [−3, 50, 2, 2], [0, 500, 10, 20])
return new_params
Benchmarking and comparison
| Algorithm | Best for | Evaluations needed | Parallelizable |
|---|---|---|---|
| SA | Single-solution, combinatorial | 10⁴–10⁶ | Via parallel tempering |
| GA | Population-based, mixed variables | 10⁴–10⁶ | Naturally parallel |
| Gradient descent | Smooth, convex | 10²–10⁴ | Batch parallelism |
| Bayesian optimization | Expensive black-box | 10¹–10³ | Sequential by nature |
SA’s advantage: extremely simple to implement, minimal memory (one solution), and competitive with more complex methods when tuned properly.
SciPy’s built-in implementation
from scipy.optimize import dual_annealing
def rosenbrock(x):
return sum(100*(x[1:]-x[:-1]**2)**2 + (1-x[:-1])**2)
result = dual_annealing(
rosenbrock,
bounds=[(-5, 5)] * 10,
maxiter=1000,
initial_temp=5230.0,
restart_temp_ratio=2e-5,
visit=2.62,
seed=42
)
print(f"Minimum: {result.fun:.10f} at {result.x}")
dual_annealing combines classical SA with a local search (L-BFGS-B) at each temperature step, making it a strong out-of-the-box optimizer for continuous problems.
One thing to remember: Simulated annealing’s effectiveness depends almost entirely on three choices: the cooling schedule, the neighborhood structure, and the initial temperature — get those right and SA competes with far more complex optimization algorithms.
See Also
- Python Bayesian Inference How updating your beliefs with new evidence works — and why it helps computers make smarter guesses.
- Python Convolution Operations The sliding-window trick that lets computers sharpen photos, recognize faces, and hear words in noisy audio.
- Python Fourier Transforms How breaking any sound, image, or signal into simple waves reveals hidden patterns invisible to the naked eye.
- Python Genetic Algorithms How computers borrow evolution's playbook — survival of the fittest, mutation, and reproduction — to solve problems too complicated for brute force.
- Python Linear Algebra Numpy Why solving puzzles with rows and columns of numbers is the secret engine behind search engines, video games, and AI.