Simulated Annealing — Deep Dive

Theoretical foundations

Simulated annealing’s convergence guarantee rests on the theory of inhomogeneous Markov chains. At each temperature, the Metropolis criterion defines a transition probability that satisfies detailed balance with respect to the Boltzmann distribution:

P(x) ∝ exp(-E(x) / T)

As T → 0, this distribution concentrates on the global minimum. The logarithmic cooling schedule T(k) = C / log(1 + k) guarantees convergence to the global optimum with probability 1 — but convergence is so slow that practitioners always use faster schedules and accept approximate solutions.

Adaptive cooling schedules

Lam-Delosme schedule

Maintains a target acceptance rate (typically ~44% for continuous, ~23% for combinatorial) by adjusting temperature based on observed acceptance:

class AdaptiveSA:
    def __init__(self, target_acceptance=0.44, damping=0.7):
        self.target = target_acceptance
        self.damping = damping
        self.acceptance_history = []
    
    def update_temperature(self, T, accepted, total, window=100):
        if total < window:
            return T
        
        recent_rate = sum(self.acceptance_history[-window:]) / window
        
        if recent_rate > self.target:
            # Too many acceptances — cool faster
            return T * (1 - self.damping * (recent_rate - self.target))
        else:
            # Too few acceptances — cool slower (or reheat slightly)
            return T * (1 + self.damping * (self.target - recent_rate))
    
    def record(self, accepted):
        self.acceptance_history.append(1 if accepted else 0)

Thermodynamic speed limit

The cooling rate should not exceed the system’s ability to equilibrate. A practical rule: at each temperature, run enough iterations for the chain to explore the local energy landscape before cooling. This leads to isothermal step counts that scale with problem difficulty.

Reheating and restart strategies

When the algorithm stagnates, reheating can escape deep local minima:

def sa_with_reheat(cost_fn, initial, neighbor_fn, T_start=100, alpha=0.995,
                    max_iter=500000, reheat_interval=50000, reheat_factor=2.0):
    current = initial
    current_cost = cost_fn(current)
    best = current.copy()
    best_cost = current_cost
    T = T_start
    stagnation = 0
    
    for i in range(max_iter):
        candidate = neighbor_fn(current)
        candidate_cost = cost_fn(candidate)
        delta = candidate_cost - current_cost
        
        if delta < 0 or np.random.random() < np.exp(-delta / max(T, 1e-15)):
            current = candidate
            current_cost = candidate_cost
            stagnation = 0
            
            if current_cost < best_cost:
                best = current.copy()
                best_cost = current_cost
        else:
            stagnation += 1
        
        T *= alpha
        
        # Reheat if stagnating
        if stagnation >= reheat_interval:
            T = max(T * reheat_factor, T_start * 0.1)
            current = best.copy()  # Restart from best known
            current_cost = best_cost
            stagnation = 0
    
    return best, best_cost

Parallel tempering (replica exchange)

Run multiple SA instances at different temperatures simultaneously. Periodically, adjacent temperature replicas attempt to swap their solutions:

from multiprocessing import Process, Queue
import numpy as np

def parallel_tempering(cost_fn, initial_fn, neighbor_fn, 
                        temperatures, swap_interval=100, total_steps=100000):
    n_replicas = len(temperatures)
    
    # Initialize replicas
    replicas = [initial_fn() for _ in range(n_replicas)]
    costs = [cost_fn(r) for r in replicas]
    best = min(zip(costs, replicas), key=lambda x: x[0])
    
    for step in range(total_steps):
        # Each replica does one SA step at its temperature
        for i in range(n_replicas):
            candidate = neighbor_fn(replicas[i])
            candidate_cost = cost_fn(candidate)
            delta = candidate_cost - costs[i]
            
            if delta < 0 or np.random.random() < np.exp(-delta / temperatures[i]):
                replicas[i] = candidate
                costs[i] = candidate_cost
                
                if costs[i] < best[0]:
                    best = (costs[i], replicas[i].copy())
        
        # Swap adjacent replicas
        if step % swap_interval == 0:
            for i in range(n_replicas - 1):
                delta_swap = (1/temperatures[i] - 1/temperatures[i+1]) * (costs[i+1] - costs[i])
                if np.random.random() < np.exp(delta_swap):
                    replicas[i], replicas[i+1] = replicas[i+1], replicas[i]
                    costs[i], costs[i+1] = costs[i+1], costs[i]
    
    return best

# Temperature ladder: geometric spacing
temperatures = [100 * 0.5**i for i in range(8)]  # 100, 50, 25, ..., 0.78

Parallel tempering is the gold standard for difficult optimization landscapes. Hot replicas explore widely and pass promising regions to cold replicas via swaps.

Constraint handling strategies

Penalty method with adaptive weights

class AdaptivePenalty:
    def __init__(self, constraints, initial_weight=1.0, growth_rate=1.5):
        self.constraints = constraints
        self.weights = [initial_weight] * len(constraints)
        self.growth_rate = growth_rate
        self.feasible_count = 0
        self.total_count = 0
    
    def penalized_cost(self, solution, base_cost):
        penalty = 0
        all_feasible = True
        for i, constraint in enumerate(self.constraints):
            violation = max(0, constraint(solution))
            if violation > 0:
                all_feasible = False
            penalty += self.weights[i] * violation ** 2
        
        self.total_count += 1
        if all_feasible:
            self.feasible_count += 1
        
        return base_cost + penalty
    
    def adapt_weights(self, target_feasible_ratio=0.5):
        if self.total_count == 0:
            return
        ratio = self.feasible_count / self.total_count
        if ratio < target_feasible_ratio:
            self.weights = [w * self.growth_rate for w in self.weights]
        elif ratio > target_feasible_ratio + 0.1:
            self.weights = [w / self.growth_rate for w in self.weights]
        self.feasible_count = 0
        self.total_count = 0

Repair operators

Instead of penalizing infeasible solutions, repair them:

def repair_schedule(schedule, max_hours_per_worker=8):
    """Fix constraint violations by redistributing excess hours."""
    for worker in range(len(schedule)):
        hours = sum(schedule[worker])
        if hours > max_hours_per_worker:
            # Remove random assignments until feasible
            assigned = [i for i, h in enumerate(schedule[worker]) if h > 0]
            while sum(schedule[worker]) > max_hours_per_worker and assigned:
                idx = np.random.choice(assigned)
                schedule[worker][idx] = 0
                assigned.remove(idx)
    return schedule

Neighborhood structures for combinatorial problems

2-opt for routing problems

Reverses a segment of the route — much more effective than random swaps:

def two_opt_neighbor(route):
    """Reverse a random segment of the route."""
    new_route = route.copy()
    n = len(route)
    i = np.random.randint(0, n - 1)
    j = np.random.randint(i + 1, n)
    new_route[i:j+1] = new_route[i:j+1][::-1]
    return new_route

Or-opt

Moves a segment of 1–3 consecutive elements to a different position. Often combined with 2-opt.

Cycle through multiple neighborhood structures when the algorithm stagnates:

def variable_neighborhood(current, neighborhoods, cost_fn, max_no_improve=1000):
    k = 0
    no_improve = 0
    current_cost = cost_fn(current)
    
    while no_improve < max_no_improve:
        neighbor = neighborhoods[k](current)
        neighbor_cost = cost_fn(neighbor)
        
        if neighbor_cost < current_cost:
            current = neighbor
            current_cost = neighbor_cost
            k = 0  # Reset to first neighborhood
            no_improve = 0
        else:
            k = (k + 1) % len(neighborhoods)
            no_improve += 1
    
    return current, current_cost

SA for machine learning hyperparameter tuning

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def hyperparameter_cost(params):
    lr, n_estimators, max_depth, min_samples_split = params
    model = GradientBoostingClassifier(
        learning_rate=10**lr,        # log scale: -3 to 0
        n_estimators=int(n_estimators),
        max_depth=int(max_depth),
        min_samples_split=int(min_samples_split)
    )
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    return -scores.mean()  # Minimize negative accuracy

def hyperparameter_neighbor(params, step_sizes=[0.1, 20, 1, 2]):
    new_params = params.copy()
    dim = np.random.randint(len(params))
    new_params[dim] += np.random.normal(0, step_sizes[dim])
    # Clip to valid ranges
    new_params = np.clip(new_params, [−3, 50, 2, 2], [0, 500, 10, 20])
    return new_params

Benchmarking and comparison

AlgorithmBest forEvaluations neededParallelizable
SASingle-solution, combinatorial10⁴–10⁶Via parallel tempering
GAPopulation-based, mixed variables10⁴–10⁶Naturally parallel
Gradient descentSmooth, convex10²–10⁴Batch parallelism
Bayesian optimizationExpensive black-box10¹–10³Sequential by nature

SA’s advantage: extremely simple to implement, minimal memory (one solution), and competitive with more complex methods when tuned properly.

SciPy’s built-in implementation

from scipy.optimize import dual_annealing

def rosenbrock(x):
    return sum(100*(x[1:]-x[:-1]**2)**2 + (1-x[:-1])**2)

result = dual_annealing(
    rosenbrock,
    bounds=[(-5, 5)] * 10,
    maxiter=1000,
    initial_temp=5230.0,
    restart_temp_ratio=2e-5,
    visit=2.62,
    seed=42
)
print(f"Minimum: {result.fun:.10f} at {result.x}")

dual_annealing combines classical SA with a local search (L-BFGS-B) at each temperature step, making it a strong out-of-the-box optimizer for continuous problems.

One thing to remember: Simulated annealing’s effectiveness depends almost entirely on three choices: the cooling schedule, the neighborhood structure, and the initial temperature — get those right and SA competes with far more complex optimization algorithms.

pythonalgorithmsoptimizationmetaheuristics

See Also

  • Python Bayesian Inference How updating your beliefs with new evidence works — and why it helps computers make smarter guesses.
  • Python Convolution Operations The sliding-window trick that lets computers sharpen photos, recognize faces, and hear words in noisy audio.
  • Python Fourier Transforms How breaking any sound, image, or signal into simple waves reveals hidden patterns invisible to the naked eye.
  • Python Genetic Algorithms How computers borrow evolution's playbook — survival of the fittest, mutation, and reproduction — to solve problems too complicated for brute force.
  • Python Linear Algebra Numpy Why solving puzzles with rows and columns of numbers is the secret engine behind search engines, video games, and AI.