Stable-Baselines3 — Core Concepts

What Stable-Baselines3 solves

Writing a reinforcement learning algorithm correctly is surprisingly hard. Off-by-one bugs in advantage calculations, incorrect gradient clipping, or a bad random seed can silently tank performance. Stable-Baselines3 provides battle-tested implementations so you can focus on the environment and the reward, not the plumbing.

Available algorithms

SB3 ships several algorithms, each suited to different action and observation types:

Algorithm   Action type               Typical use case
PPO         Discrete or Continuous    General purpose; first thing to try
A2C         Discrete or Continuous    Simpler/faster PPO alternative
SAC         Continuous only           Sample-efficient continuous control
TD3         Continuous only           Deterministic continuous control
DQN         Discrete only             Classic Atari, tabular-like problems
HER         (wraps any off-policy)    Goal-conditioned tasks (robotics)

A companion package, SB3-Contrib, adds algorithms like TQC, TRPO, RecurrentPPO, and CrossQ for more advanced scenarios.

The five-line training pattern

from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=50_000)
model.save("cartpole_ppo")

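That is a complete, runnable training script once stable-baselines3 is installed. Passing the environment ID as a string lets SB3 create the environment for you. "MlpPolicy" tells SB3 to use a multilayer perceptron (for PPO, two hidden layers of 64 units by default); for image observations you would use "CnnPolicy".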

Policy networks

SB3 auto-builds the neural network from the observation space shape. You can customise architecture through policy_kwargs:

model = PPO(
    "MlpPolicy", "LunarLander-v3",
    policy_kwargs=dict(net_arch=[256, 256]),
)

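This gives two hidden layers of 256 units each. For on-policy actor-critic methods (PPO, A2C), you can give the actor and critic separate architectures with policy_kwargs=dict(net_arch=dict(pi=[256, 256], vf=[128, 128])). Off-policy methods such as SAC and TD3 accept the same form but use the key qf instead of vf for the critic.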

Callbacks

Callbacks hook into the training loop at fixed points. Common uses:

  • EvalCallback — periodically evaluate the policy on a separate environment and save the best model.
  • CheckpointCallback — save model snapshots every N steps.
  • StopTrainingOnRewardThreshold — halt training once average reward exceeds a target.
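
import gymnasium as gym
from stable_baselines3.common.callbacks import EvalCallback

# Evaluate on a separate environment so training episodes are not disturbed
eval_env = gym.make("CartPole-v1")
eval_cb = EvalCallback(eval_env, best_model_save_path="./best/",
                       eval_freq=5000, n_eval_episodes=10)
# model is the PPO instance created earlier
model.learn(total_timesteps=100_000, callback=eval_cb)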

Vectorised environments

SB3 natively supports vectorised environments (multiple parallel instances). Use make_vec_env for easy setup:

from stable_baselines3.common.env_util import make_vec_env
vec_env = make_vec_env("CartPole-v1", n_envs=4)
model = PPO("MlpPolicy", vec_env)

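More environments mean more transitions collected per rollout (for PPO, n_steps × n_envs per update), which usually shortens wall-clock training time. By default make_vec_env steps the environments one after another in a single process (DummyVecEnv); pass vec_env_cls=SubprocVecEnv to run each environment in its own process.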

Logging and monitoring

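SB3 prints training statistics to stdout when verbose is 1 or higher. TensorBoard logging is off by default and requires the tensorboard package; enable it by pointing the model at a log directory: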

model = PPO("MlpPolicy", "CartPole-v1", tensorboard_log="./tb_logs/")

Then run tensorboard --logdir ./tb_logs/ to see reward curves, loss, and entropy in real time.

Common misconception

People often assume SB3 will “just work” on any custom environment. In practice, reward design and observation engineering usually matter more than algorithm choice. If the reward signal is very sparse or misleading, switching between SB3 algorithms rarely rescues training.

The one thing to remember: SB3 removes the implementation pain from reinforcement learning, but the real skill is in designing the environment and reward — the library handles the rest.
