Environment Wrappers — Core Concepts

Why wrappers exist

Raw environments rarely output data in the exact format a learning algorithm wants. Images might be too large, rewards might be unbounded, or the observation might include irrelevant information. Wrappers let you transform an environment's observations, actions, and rewards without forking its source code. They follow the decorator pattern: each wrapper wraps the environment (or another wrapper) and exposes the same interface.

The wrapper hierarchy

Gymnasium provides four base classes:

Base class          | What you override          | Purpose
Wrapper             | step, reset, or any method | General-purpose
ObservationWrapper  | observation(obs)           | Transform what the agent sees
ActionWrapper       | action(act)                | Transform what the agent does
RewardWrapper       | reward(rew)                | Transform the score

Each one is a thin layer. The base Wrapper delegates everything to the wrapped environment unless you override a method.
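
To make the delegation idea concrete, here is a minimal sketch that re-creates the pattern with toy classes instead of importing Gymnasium, so it runs with only the standard library. The classes borrow the names of the Gymnasium base classes for readability, but they are simplified stand-ins, and CountingEnv, ScaleObservation, and DoubleReward are hypothetical names invented for this example.

```python
class CountingEnv:
    """Toy environment (hypothetical): the observation is a step counter."""

    def reset(self):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        reward = float(action)
        terminated = self.t >= 3
        return self.t, reward, terminated, False, {}


class Wrapper:
    """Simplified stand-in: forwards reset/step unless a subclass overrides them."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)


class ObservationWrapper(Wrapper):
    """Routes every observation through self.observation()."""

    def reset(self):
        obs, info = self.env.reset()
        return self.observation(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return self.observation(obs), reward, terminated, truncated, info

    def observation(self, obs):
        raise NotImplementedError


class RewardWrapper(Wrapper):
    """Routes every reward through self.reward()."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return obs, self.reward(reward), terminated, truncated, info

    def reward(self, reward):
        raise NotImplementedError


class ScaleObservation(ObservationWrapper):
    def observation(self, obs):
        return obs / 10.0


class DoubleReward(RewardWrapper):
    def reward(self, reward):
        return 2.0 * reward


# Each layer overrides one small hook; everything else is delegated.
env = DoubleReward(ScaleObservation(CountingEnv()))
first_obs, _ = env.reset()
obs, reward, terminated, truncated, info = env.step(1)
```

Each subclass only fills in one hook, which is the same division of labour the table describes: the specialised base class handles the plumbing, and your wrapper supplies the transformation.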

Common built-in wrappers

Observation wrappers

  • FlattenObservation: flattens structured observations (such as Dict or Tuple spaces) into a single 1-D array. Essential when your algorithm expects flat input.
  • GrayscaleObservation: converts RGB images to greyscale, cutting the number of channels from 3 to 1.
  • ResizeObservation: resizes image observations to a target shape (e.g., 84×84 for Atari).
  • FrameStack: stacks the last N observations along a new axis, giving the agent temporal context. (Called FrameStackObservation in Gymnasium 1.0 and later.)
  • NormalizeObservation: normalises observations using a running estimate of their mean and standard deviation.
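
Frame stacking is the least obvious of these, so here is a rough sketch of the idea using a fixed-length deque. FrameStacker is a hypothetical helper, not the Gymnasium wrapper, and padding the buffer with copies of the first observation is an assumption made for this example; real implementations may pad differently.

```python
from collections import deque


class FrameStacker:
    """Keeps the most recent `num_frames` observations (hypothetical helper)."""

    def __init__(self, num_frames):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_obs):
        # Assumption: fill the buffer with copies of the first observation
        # so the stacked output has a fixed length from the first step.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_obs)
        return list(self.frames)

    def push(self, obs):
        # Appending to a full deque silently drops the oldest frame.
        self.frames.append(obs)
        return list(self.frames)


stacker = FrameStacker(num_frames=4)
initial = stacker.reset("f0")
stacker.push("f1")
after_two = stacker.push("f2")
```

After two steps the stack holds two padding frames followed by the two newest frames, which is what lets a policy infer motion from otherwise static images.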

Reward wrappers

  • ClipReward: clamps rewards to a fixed range such as [-1, 1].
  • NormalizeReward: rescales rewards using a running estimate of the standard deviation of the discounted return.
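
Clamping a reward is not the same as taking its sign, and the two are easy to confuse. The sketch below contrasts them with plain functions; clip_reward and sign_reward are hypothetical helpers written for this illustration, not library calls.

```python
def clip_reward(reward, low=-1.0, high=1.0):
    """Clamp the reward into [low, high]; values already inside are unchanged."""
    return max(low, min(high, reward))


def sign_reward(reward):
    """Keep only the sign of the reward: -1.0, 0.0, or 1.0."""
    return float((reward > 0) - (reward < 0))


raw = [-5.0, -0.2, 0.0, 0.3, 10.0]
clipped = [clip_reward(r) for r in raw]
signed = [sign_reward(r) for r in raw]
```

Small rewards such as 0.3 survive clamping untouched but are pushed to 1.0 by the sign variant. For games with integer scores the two give the same result, which is why they are often described interchangeably.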

Action wrappers

  • ClipAction: clamps continuous actions to the bounds of the action space.
  • RescaleAction: linearly maps actions from a range you choose (for example [-1, 1]) onto the environment's action bounds.
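
Both action transforms come down to a line or two of arithmetic. The following sketch works on a single scalar action; the function names and the torque range of [-2, 2] are assumptions chosen for the example.

```python
def clip_action(action, low, high):
    """Clamp a scalar action into [low, high]."""
    return max(low, min(high, action))


def rescale_action(action, src_low, src_high, dst_low, dst_high):
    """Linearly map `action` from [src_low, src_high] to [dst_low, dst_high]."""
    fraction = (action - src_low) / (src_high - src_low)
    return dst_low + fraction * (dst_high - dst_low)


# A policy that outputs values in [-1, 1] driving a torque in [-2, 2].
torque = rescale_action(0.5, -1.0, 1.0, -2.0, 2.0)
too_large = clip_action(3.7, -2.0, 2.0)
```

Rescaling is handy when a policy network ends in tanh and therefore emits values in [-1, 1], while clipping is a safety net for policies whose outputs are unbounded.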

Utility wrappers

  • TimeLimit: truncates episodes after N steps. gymnasium.make() applies it automatically when a registered environment specifies max_episode_steps.
  • RecordVideo: saves rendered frames as video files.
  • RecordEpisodeStatistics: records each episode's length and return in the info dict when the episode ends.
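
As an illustration of what a time limit does, here is a toy version that counts steps and raises the truncated flag once the budget is used up. It is a simplified stand-in written for this example, not Gymnasium's implementation, and EndlessEnv is a hypothetical environment that never ends on its own.

```python
class EndlessEnv:
    """Toy environment (hypothetical) that never terminates by itself."""

    def reset(self):
        return 0, {}

    def step(self, action):
        return 0, 1.0, False, False, {}


class TimeLimit:
    """Simplified stand-in: sets truncated=True after max_episode_steps steps."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self.elapsed = 0

    def reset(self):
        self.elapsed = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_episode_steps:
            truncated = True  # cut off by the wrapper, not ended by the task
        return obs, reward, terminated, truncated, info


env = TimeLimit(EndlessEnv(), max_episode_steps=5)
env.reset()
steps, episode_return = 0, 0.0
terminated = truncated = False
while not (terminated or truncated):
    _, reward, terminated, truncated, _ = env.step(0)
    steps += 1
    episode_return += reward
```

The loop also tracks episode length and return by hand, which is the kind of bookkeeping a statistics wrapper would otherwise do for you. Note that the episode ends with truncated set and terminated still false, a distinction that matters when bootstrapping value estimates.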

Stacking order matters

Wrappers form a chain. The outermost wrapper processes data first on the way in (actions) and last on the way out (observations). A typical Atari stack:

Agent ↔ FrameStack ↔ GrayscaleObservation ↔ ResizeObservation ↔ ClipReward ↔ Atari env

When the agent sends an action, it passes through the wrappers left-to-right. When the environment returns an observation, it passes right-to-left. Understanding this flow prevents confusing bugs where a wrapper expects input in a shape that an inner wrapper already changed.
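
One way to see this flow is to have each wrapper leave a mark on whatever passes through it. The sketch below uses two hypothetical toy classes, EchoEnv and Tag, and plain strings in place of real actions and observations.

```python
class EchoEnv:
    """Toy environment (hypothetical): echoes the action it received."""

    def reset(self):
        return "obs", {}

    def step(self, action):
        return f"obs({action})", 0.0, False, False, {}


class Tag:
    """Toy wrapper that tags actions on the way in and observations on the way out."""

    def __init__(self, env, name):
        self.env = env
        self.name = name

    def step(self, action):
        # Way in: mark the action, then hand it to the next layer.
        obs, reward, terminated, truncated, info = self.env.step(
            f"{action}>{self.name}"
        )
        # Way out: mark the observation before returning it upward.
        return f"{self.name}<{obs}", reward, terminated, truncated, info


env = Tag(Tag(EchoEnv(), "inner"), "outer")
obs, *_ = env.step("a")
```

Reading the resulting string, the action was tagged by the outer wrapper first and the inner wrapper second, while the observation was tagged by the inner wrapper first and the outer wrapper last.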

The classic Atari preprocessing stack

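Many Atari RL papers use a stack along these lines (OpenAI Baselines and Stable-Baselines3 ship the first four as NoopResetEnv, MaxAndSkipEnv, EpisodicLifeEnv, and FireResetEnv):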

  1. NoopResetWrapper: applies a random number of no-op actions after reset so episodes start from varied states.
  2. MaxAndSkipWrapper: repeats each action for 4 frames and returns the pixel-wise maximum of the last 2 frames (reduces flickering).
  3. EpisodicLifeWrapper: treats the loss of a life as the end of an episode during training.
  4. FireResetWrapper: presses FIRE after reset (some games require it to start).
  5. ResizeObservation(84, 84): the standard input size for CNN policies.
  6. GrayscaleObservation: 1 channel instead of 3.
  7. FrameStack(4): 4 frames of context for detecting motion.
  8. ClipReward: rewards are clipped to [-1, 1]; many implementations keep only the sign, so every reward becomes -1, 0, or +1.

Most of this stack traces back to the original DQN work and is still widely used as a baseline.
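
The max-and-skip step is the one that usually needs a second look, so here is a rough sketch of it as a standalone function. FlickerEnv is a hypothetical toy environment whose two-pixel frames alternate, imitating a sprite that is only drawn on every other frame; real wrappers operate on image arrays rather than lists.

```python
class FlickerEnv:
    """Toy environment (hypothetical): each pixel is lit only on alternate frames."""

    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        frame = [9, 0] if self.t % 2 == 0 else [0, 5]
        return frame, 1.0, False, False, {}


def max_and_skip_step(env, action, skip=4):
    """Repeat `action` for `skip` frames, sum rewards, max-pool the last two frames."""
    total_reward = 0.0
    last_two = []
    for _ in range(skip):
        frame, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        last_two = (last_two + [frame])[-2:]  # keep only the two newest frames
        if terminated or truncated:
            break
    # Pixel-wise maximum, so an object visible in either frame survives.
    merged = [max(pixels) for pixels in zip(*last_two)]
    return merged, total_reward, terminated, truncated, info


obs, reward, terminated, truncated, info = max_and_skip_step(FlickerEnv(), action=0)
```

Neither of the last two raw frames shows both pixels lit, but the merged observation does, and the agent receives the summed reward for all four skipped frames.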

Common misconception

A common beginner assumption is that wrappers modify the environment permanently. They do not: a wrapper only holds a reference to the environment and intercepts calls to it. The original environment remains unchanged inside the wrapper chain, and you can reach it at any time via env.unwrapped, which is useful for debugging or for calling environment-specific methods.
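
A small sketch of how such an unwrapped lookup can work: each layer forwards the request inward until the base environment returns itself. BaseEnv, PassThrough, and get_lives are hypothetical names used only for this illustration.

```python
class BaseEnv:
    """Toy environment (hypothetical) with an environment-specific method."""

    def __init__(self):
        self.lives = 3

    @property
    def unwrapped(self):
        return self  # the innermost environment is its own unwrapped form

    def get_lives(self):
        return self.lives


class PassThrough:
    """Toy wrapper that only stores a reference to what it wraps."""

    def __init__(self, env):
        self.env = env

    @property
    def unwrapped(self):
        return self.env.unwrapped  # forward the request one layer inward


base = BaseEnv()
wrapped = PassThrough(PassThrough(base))
lives = wrapped.unwrapped.get_lives()
```

Because the wrappers hold a reference rather than a copy, wrapped.unwrapped is the very same object as base, so anything you inspect through it reflects the live state of the environment.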

The one thing to remember: wrappers are composable, single-responsibility decorators that sit between the agent and the environment. Stack them to shape observations, actions, and rewards without changing the game.
