Stable-Baselines3 — Core Concepts
What Stable-Baselines3 solves
Writing a reinforcement learning algorithm correctly is surprisingly hard. Off-by-one bugs in advantage calculations, incorrect gradient clipping, or a bad random seed can silently tank performance. Stable-Baselines3 provides battle-tested implementations so you can focus on the environment and the reward, not the plumbing.
Available algorithms
SB3 ships several algorithms, each suited to particular action spaces:
| Algorithm | Action type | Typical use case |
|---|---|---|
| PPO | Discrete or Continuous | General purpose; first thing to try |
| A2C | Discrete or Continuous | Simpler/faster PPO alternative |
| SAC | Continuous only | Sample-efficient continuous control |
| TD3 | Continuous only | Deterministic continuous control |
| DQN | Discrete only | Classic Atari, tabular-like problems |
| HER | Same as the off-policy algorithm it is attached to | Goal-conditioned tasks (robotics); supplied as a replay buffer (HerReplayBuffer) to SAC, TD3, or DQN |
A companion package, SB3-Contrib, adds algorithms like TQC, TRPO, RecurrentPPO, and CrossQ for more advanced scenarios.
The four-line training pattern
from stable_baselines3 import PPO
model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=50_000)
model.save("cartpole_ppo")
That is a complete, runnable training script. "MlpPolicy" tells SB3 to use a multilayer perceptron (for PPO, two hidden layers of 64 units by default); for image observations you would use "CnnPolicy".
Policy networks
SB3 auto-builds the neural network from the observation space shape. You can customise architecture through policy_kwargs:
model = PPO(
    "MlpPolicy", "LunarLander-v3",
    policy_kwargs=dict(net_arch=[256, 256]),
)
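This gives two hidden layers of 256 units each. For the on-policy actor-critic methods (PPO, A2C), you can give the actor and critic different architectures with policy_kwargs=dict(net_arch=dict(pi=[256, 256], vf=[128, 128])). The off-policy methods SAC and TD3 use the key qf instead of vf for the critic.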
Callbacks
Callbacks hook into the training loop at fixed points. Common uses:
- EvalCallback: periodically evaluates the policy on a separate environment and saves the best model.
- CheckpointCallback: saves model snapshots every N steps.
- StopTrainingOnRewardThreshold: halts training once the mean evaluation reward exceeds a target; it is passed to EvalCallback through callback_on_new_best rather than used on its own.
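import gymnasium as gym
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.monitor import Monitor
# Separate evaluation environment; use the same environment id the model trains on
eval_env = Monitor(gym.make("CartPole-v1"))
eval_cb = EvalCallback(eval_env, best_model_save_path="./best/",
                       eval_freq=5000, n_eval_episodes=10)
model.learn(total_timesteps=100_000, callback=eval_cb)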
Vectorised environments
SB3 natively supports vectorised environments (multiple parallel instances). Use make_vec_env for easy setup:
from stable_baselines3.common.env_util import make_vec_env
vec_env = make_vec_env("CartPole-v1", n_envs=4)
model = PPO("MlpPolicy", vec_env)
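For on-policy algorithms such as PPO, each update uses n_steps × n_envs transitions, so more environments mean more data per update, which usually shortens wall-clock training time. By default make_vec_env uses DummyVecEnv, which steps the environments one after another in a single process; pass vec_env_cls=SubprocVecEnv to run them in separate processes.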
Logging and monitoring
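SB3 prints training statistics to stdout when verbose=1. TensorBoard logging is off by default; enable it by pointing the model at a log directory (the tensorboard package must be installed):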
model = PPO("MlpPolicy", "CartPole-v1", tensorboard_log="./tb_logs/")
Then run tensorboard --logdir ./tb_logs/ to see reward curves, loss, and entropy in real time.
Common misconception
People assume SB3 will “just work” on any custom environment. In reality, reward design and observation engineering often matter more than algorithm choice. If the reward signal is very sparse or misleading, switching between algorithms in SB3 rarely rescues training.
The one thing to remember: SB3 removes the implementation pain from reinforcement learning, but the real skill is in designing the environment and reward — the library handles the rest.