Environment Wrappers — Core Concepts
Why wrappers exist
Raw environments rarely output data in the exact format a learning algorithm wants. Images might be too large, rewards might be unbounded, or the observation might include irrelevant information. Wrappers let you transform inputs and outputs without forking the environment source code. They follow the decorator pattern — each wrapper wraps the environment (or another wrapper) and exposes the same interface.
The wrapper hierarchy
Gymnasium provides four base classes:
| Base class | What you override | Purpose |
|---|---|---|
| Wrapper | step, reset, or any method | General-purpose |
| ObservationWrapper | observation(obs) | Transform what the agent sees |
| ActionWrapper | action(act) | Transform what the agent does |
| RewardWrapper | reward(rew) | Transform the score |
Each one is a thin layer. The base Wrapper delegates everything to the wrapped environment unless you override a method.
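The delegation described above can be sketched in plain Python. This is a toy model and not Gymnasium's source: `ToyEnv`, `Wrapper`, `ObservationWrapper`, and `DoubleObservation` are hypothetical stand-ins that only mimic the two-value `reset` and five-value `step` return shapes.

```python
class ToyEnv:
    """Hypothetical stand-in environment: the observation is a step counter."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t, {}  # (observation, info)

    def step(self, action):
        self.t += 1
        # (observation, reward, terminated, truncated, info)
        return self.t, 1.0, self.t >= 3, False, {}

    def render(self):
        return f"t={self.t}"


class Wrapper:
    """Forwards every attribute it does not define to the wrapped env."""

    def __init__(self, env):
        self.env = env

    def __getattr__(self, name):
        # Only called when normal lookup fails, so overridden methods win.
        return getattr(self.env, name)


class ObservationWrapper(Wrapper):
    """Routes observations from reset() and step() through observation()."""

    def reset(self):
        obs, info = self.env.reset()
        return self.observation(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return self.observation(obs), reward, terminated, truncated, info

    def observation(self, obs):
        raise NotImplementedError


class DoubleObservation(ObservationWrapper):
    """Subclass that overrides only the observation hook."""

    def observation(self, obs):
        return obs * 2


env = DoubleObservation(ToyEnv())
first_obs, _ = env.reset()
step_result = env.step(0)
rendered = env.render()  # not overridden, so the call reaches ToyEnv.render
```

The subclass touches one method; everything else falls through to the wrapped object, which is what keeps each layer thin.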
Common built-in wrappers
Observation wrappers
- FlattenObservation — converts Dict or Tuple observations into a single 1-D array. Essential when your algorithm expects flat input.
- GrayscaleObservation — converts RGB images to greyscale, reducing input size by 3x.
- ResizeObservation — resizes image observations to a target shape (e.g., 84×84 for Atari).
- FrameStack — stacks the last N observations along a new axis, giving the agent temporal context. Gymnasium 1.0 and later name this wrapper FrameStackObservation.
- NormalizeObservation — applies running mean/std normalisation.
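Running mean/std normalisation can be sketched with Welford's online algorithm. This is a scalar illustration under stated assumptions: `RunningStats` and `normalize` are hypothetical names, and the real wrapper works on arrays and may update its statistics differently.

```python
import math


class RunningStats:
    """Welford's online algorithm for the mean and variance of a scalar stream."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        # Population variance; 0.0 until at least one sample has arrived.
        return self.m2 / self.count if self.count else 0.0


def normalize(x, stats, eps=1e-8):
    """Update the statistics with x, then return the standardised value."""
    stats.update(x)
    return (x - stats.mean) / math.sqrt(stats.var + eps)


stats = RunningStats()
outputs = [normalize(x, stats) for x in [1.0, 2.0, 3.0, 4.0]]
```

Because the statistics keep changing during training, the same raw observation maps to different normalised values over time, so the statistics need to be saved alongside the policy.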
Reward wrappers
- ClipReward — clamps rewards to a fixed range like [-1, 1].
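- NormalizeReward — divides each reward by a running estimate of the standard deviation of the discounted return, so the reward scale stays roughly constant during training.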
Action wrappers
- ClipAction — clamps continuous actions to the action space bounds.
- RescaleAction — linearly maps actions from one range to another.
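The arithmetic behind the two action wrappers is short enough to write out. The function names below are hypothetical and operate on a single scalar action; the real wrappers apply the same idea element-wise to arrays.

```python
def clip_action(action, low, high):
    """Clamp a scalar action into [low, high]."""
    return max(low, min(high, action))


def rescale_action(action, src_low, src_high, dst_low, dst_high):
    """Linearly map action from [src_low, src_high] to [dst_low, dst_high]."""
    fraction = (action - src_low) / (src_high - src_low)
    return dst_low + fraction * (dst_high - dst_low)


# A policy that outputs values in [-1, 1] driving an actuator that expects [0, 10].
midpoint = rescale_action(0.0, -1.0, 1.0, 0.0, 10.0)
saturated = rescale_action(clip_action(1.7, -1.0, 1.0), -1.0, 1.0, 0.0, 10.0)
```

Clipping before rescaling keeps an out-of-range policy output from being mapped outside the target range.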
Utility wrappers
- TimeLimit — truncates episodes after N steps. Applied automatically by gymnasium.make() for registered environments that specify max_episode_steps.
- RecordVideo — saves rendered frames as video files.
- RecordEpisodeStatistics — tracks episode length and return in the info dict.
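A rough sketch of what a step limit and an episode-statistics wrapper do. `EndlessEnv`, `StepLimit`, and `EpisodeStats` are hypothetical toy classes, not Gymnasium code; the `"episode"` entry with `"r"` (return) and `"l"` (length) follows the convention used by RecordEpisodeStatistics.

```python
class EndlessEnv:
    """Hypothetical environment that never ends on its own."""

    def reset(self):
        return 0, {}  # (observation, info)

    def step(self, action):
        # (observation, reward, terminated, truncated, info)
        return 0, 1.0, False, False, {}


class StepLimit:
    """Marks the episode as truncated once max_steps steps have elapsed."""

    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps
        self.elapsed = 0

    def reset(self):
        self.elapsed = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_steps:
            truncated = True  # a time-out, not a terminal state of the task
        return obs, reward, terminated, truncated, info


class EpisodeStats:
    """Adds the episode return and length to info when an episode ends."""

    def __init__(self, env):
        self.env = env
        self.ep_return = 0.0
        self.ep_length = 0

    def reset(self):
        self.ep_return = 0.0
        self.ep_length = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.ep_return += reward
        self.ep_length += 1
        if terminated or truncated:
            info = {**info, "episode": {"r": self.ep_return, "l": self.ep_length}}
        return obs, reward, terminated, truncated, info


env = EpisodeStats(StepLimit(EndlessEnv(), max_steps=5))
obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(0)
```

The limit sets `truncated` and leaves `terminated` alone, which lets a learning algorithm tell a time-out apart from a genuine end of the task.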
Stacking order matters
Wrappers form a chain. The outermost wrapper processes data first on the way in (actions) and last on the way out (observations). A typical Atari stack:
Agent ↔ FrameStack ↔ GrayscaleObservation ↔ ResizeObservation ↔ ClipReward ↔ Atari env
When the agent sends an action, it passes through the wrappers left-to-right. When the environment returns an observation, it passes right-to-left. Understanding this flow prevents confusing bugs where a wrapper expects input in a shape that an inner wrapper already changed.
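The direction of flow is easy to verify with a trace. In this toy sketch, `EchoEnv` and `Tagger` are hypothetical classes that only log when they are entered and left; `step` returns a bare observation to keep the example short.

```python
log = []


class EchoEnv:
    """Hypothetical innermost environment."""

    def step(self, action):
        log.append(f"env receives action {action}")
        return "obs"


class Tagger:
    """Wrapper that records when the action passes in and the observation passes out."""

    def __init__(self, env, name):
        self.env = env
        self.name = name

    def step(self, action):
        log.append(f"{self.name}: action in")
        obs = self.env.step(action)
        log.append(f"{self.name}: observation out")
        return obs


# The last wrapper applied is the outermost one.
env = Tagger(Tagger(EchoEnv(), "inner"), "outer")
env.step(1)

for line in log:
    print(line)
```

The log reads outer, inner, env on the way in and inner, outer on the way out, so an outer observation wrapper always sees data that the inner ones have already transformed.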
The classic Atari preprocessing stack
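Many Atari RL papers use a preprocessing stack close to this one. The first four names below are descriptive; libraries ship them under their own names (Stable-Baselines3, for example, calls them NoopResetEnv, MaxAndSkipEnv, EpisodicLifeEnv, and FireResetEnv):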
- NoopResetWrapper — apply random number of no-op actions at start for variety.
- MaxAndSkipWrapper — repeat each action for 4 frames and return the max of the last 2 (reduces flickering).
- EpisodicLifeWrapper — treat loss of life as episode end during training.
- FireResetWrapper — press FIRE after reset (some games require it to start).
- ResizeObservation(84, 84) — standard size for CNN policies.
- GrayscaleObservation — 1 channel instead of 3.
- FrameStack(4) — 4 frames of context for detecting motion.
- ClipReward — clamps rewards to [-1, 1]; because Atari scores are integers, every reward becomes -1, 0, or +1.
This stack follows the preprocessing introduced with DQN and is still a common baseline.
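Three of the steps above (max-and-skip, frame stacking, and reward clipping) can be sketched in plain Python. Frames are flat lists of pixel values here, and every name (`max_and_skip`, `FrameStacker`, `clip_reward`, `make_flicker_emulator`) is a hypothetical helper, not a library function. Padding the stack with the first frame on reset is one common choice, assumed here.

```python
from collections import deque


def max_and_skip(step_fn, action, skip=4):
    """Repeat action up to `skip` times; return the pixel-wise max of the
    last two frames, the summed reward, and the done flag."""
    total_reward = 0.0
    last_two = deque(maxlen=2)
    done = False
    for _ in range(skip):
        frame, reward, done = step_fn(action)
        last_two.append(frame)
        total_reward += reward
        if done:
            break
    max_frame = [max(pixels) for pixels in zip(*last_two)]
    return max_frame, total_reward, done


class FrameStacker:
    """Keeps the most recent num_frames frames, oldest first."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # Assumption: pad the stack with copies of the first frame.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        self.frames.append(frame)
        return list(self.frames)


def clip_reward(reward):
    """Clamp a reward to [-1, 1]."""
    return max(-1.0, min(1.0, reward))


def make_flicker_emulator():
    """Hypothetical emulator whose sprite is drawn only on odd-numbered frames."""
    state = {"t": 0}

    def step_fn(action):
        state["t"] += 1
        frame = [0, 9, 0] if state["t"] % 2 else [0, 0, 0]
        return frame, 1.0, False

    return step_fn


frame, reward, done = max_and_skip(make_flicker_emulator(), action=0)
stacker = FrameStacker(num_frames=4)
stacker.reset(frame)
stacked = stacker.push([1, 1, 1])
clipped = clip_reward(reward)
```

The fourth emulator frame is blank, yet the sprite survives in `frame` because the maximum is taken over the last two frames. That is the flicker fix the list describes.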
Common misconception
Beginners often assume that wrappers modify the environment permanently. They do not: each wrapper holds a reference to the environment and transforms only what passes through it. The original environment sits unchanged at the bottom of the chain, and env.unwrapped returns it at any time, which is useful for debugging or for calling environment-specific methods.
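A minimal sketch of how an `unwrapped` lookup can reach the base environment through any number of layers. `BaseEnv` and `PassThrough` are hypothetical toy classes, and `secret_level` is an invented environment-specific attribute.

```python
class BaseEnv:
    """Hypothetical environment with an environment-specific attribute."""

    def __init__(self):
        self.secret_level = 3

    @property
    def unwrapped(self):
        return self  # the base environment is its own unwrapped form


class PassThrough:
    """Wrapper that stores a reference and asks the next layer down."""

    def __init__(self, env):
        self.env = env

    @property
    def unwrapped(self):
        return self.env.unwrapped


base = BaseEnv()
env = PassThrough(PassThrough(base))

same_object = env.unwrapped is base
level = env.unwrapped.secret_level
```

Wrapping stores a reference instead of copying, so the object returned by `unwrapped` is the very instance that was wrapped.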
The one thing to remember: Wrappers are composable, single-responsibility decorators that sit between the agent and the environment — stack them to shape observations, actions, and rewards without changing the game.