Multi-Agent Reinforcement Learning — Core Concepts
Why multi-agent is fundamentally different
In single-agent RL, the environment is (mostly) stationary — its rules do not change while the agent learns. Add a second learning agent and that assumption collapses. Each agent’s policy shift changes the effective environment for every other agent. This non-stationarity is the root challenge of MARL.
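A minimal sketch can make the non-stationarity concrete. It assumes a hypothetical 2x2 coordination game (the payoff table and the two policies for agent B are invented for illustration) and shows that the value of each of agent A’s actions depends entirely on what B is currently doing.

```python
# Hypothetical 2x2 coordination game: rows are agent A's actions,
# columns are agent B's actions, entries are A's reward.
REWARD_A = [[1.0, 0.0],
            [0.0, 1.0]]


def expected_rewards_for_a(policy_b):
    """Expected reward of each of A's actions under B's action probabilities."""
    return [sum(r * p for r, p in zip(row, policy_b)) for row in REWARD_A]


# Early in training B mostly plays action 0; later it mostly plays action 1.
early = expected_rewards_for_a([0.9, 0.1])
late = expected_rewards_for_a([0.1, 0.9])

# A's best response flips even though the game's rules never changed.
best_early = early.index(max(early))
best_late = late.index(max(late))
print("best action early:", best_early, "| best action late:", best_late)
```

From A’s point of view the reward for the same action drifted over time, which is exactly what a single-agent algorithm assumes will not happen.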
Three interaction modes
Cooperative
All agents share a common goal and receive the same (or closely aligned) reward. Example: a team of drones searching for survivors after an earthquake. Success depends on coordination.
Competitive
Agents have opposing objectives — one agent’s gain is another’s loss. Example: two players in a board game. This is a zero-sum setting.
Mixed (cooperative-competitive)
Teams cooperate internally but compete against other teams. Example: a 5v5 video game. Many real-world scenarios look like this, and it is typically the hardest mode to train because agents must coordinate with teammates while adapting to opponents.
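One rough way to tell the three modes apart is to compare the agents’ reward tables. The sketch below does this for two-player matrix games only; the function name, the labels, and the example payoffs are assumptions made for illustration, and the "mixed" label here simply means "neither fully aligned nor fully opposed" rather than the team-versus-team structure described above.

```python
def classify_two_player_game(payoffs_a, payoffs_b, tol=1e-9):
    """Label a two-player matrix game by how the two payoff tables relate."""
    pairs = [(a, b)
             for row_a, row_b in zip(payoffs_a, payoffs_b)
             for a, b in zip(row_a, row_b)]
    # Identical rewards everywhere: the agents want exactly the same thing.
    if all(abs(a - b) <= tol for a, b in pairs):
        return "cooperative"
    # Rewards always sum to the same constant: one agent's gain is the other's loss.
    total = pairs[0][0] + pairs[0][1]
    if all(abs(a + b - total) <= tol for a, b in pairs):
        return "competitive"
    return "mixed"


coordination = classify_two_player_game([[1, 0], [0, 1]], [[1, 0], [0, 1]])
matching_pennies = classify_two_player_game([[1, -1], [-1, 1]], [[-1, 1], [1, -1]])
prisoners_dilemma = classify_two_player_game([[-1, -3], [0, -2]], [[-1, 0], [-3, -2]])
print(coordination, matching_pennies, prisoners_dilemma)
```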
Key paradigms
| Paradigm | Idea | Tradeoff |
|---|---|---|
| Independent learners | Each agent runs its own single-agent algorithm and treats the other agents as part of the environment | Simple but unstable; environment looks non-stationary |
| Centralised training, decentralised execution (CTDE) | A centralised critic or mixing network uses every agent’s observations during training, but each agent acts only on its own observations at test time | Uses global information during training without needing it at execution; dominant paradigm |
| Fully centralised | One “super-agent” controls all agents | Scales poorly; action space explodes combinatorially |
| Communication learning | Agents learn to send messages to each other | Enables coordination but adds complexity |
CTDE is the most widely used approach. Algorithms such as MAPPO, QMIX, and MADDPG all follow this pattern.
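The split between training-time and execution-time information can be sketched with plain linear functions. Everything here is an assumption for illustration (the sizes, the random weights, the helper names); it is not how MAPPO, QMIX, or MADDPG are implemented, only a picture of who sees what.

```python
import random

N_AGENTS, OBS_DIM, N_ACTIONS = 3, 4, 2
rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# One actor per agent: maps that agent's own observation to action scores.
actors = [random_matrix(N_ACTIONS, OBS_DIM) for _ in range(N_AGENTS)]
# One shared critic: maps the concatenation of all observations to one value.
critic = random_matrix(1, N_AGENTS * OBS_DIM)


def act(agent_index, own_observation):
    """Decentralised execution: greedy action from the agent's own observation only."""
    scores = matvec(actors[agent_index], own_observation)
    return scores.index(max(scores))


def centralised_value(all_observations):
    """Training-time critic: sees every agent's observation at once."""
    joint = [x for obs in all_observations for x in obs]
    return matvec(critic, joint)[0]


observations = [[rng.uniform(-1, 1) for _ in range(OBS_DIM)] for _ in range(N_AGENTS)]
actions = [act(i, obs) for i, obs in enumerate(observations)]
value = centralised_value(observations)
print("actions:", actions, "| centralised value:", round(value, 3))
```

At test time only `act` is needed, so the critic and the other agents’ observations can be discarded.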
Python frameworks
PettingZoo
The multi-agent equivalent of Gymnasium. It provides a standard API where environments expose agents as an iterable and follow either a parallel model (all agents act simultaneously) or an AEC model (Agent Environment Cycle, where agents act one at a time).
```python
from pettingzoo.mpe import simple_spread_v3
env = simple_spread_v3.parallel_env()
observations, infos = env.reset()
```
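The parallel model implies a simple episode loop: build one action per live agent, step once, repeat until no agents remain. The sketch below runs that loop against `ToyParallelEnv`, a hypothetical stand-in class written for this example so that it runs without any dependencies; it is not part of PettingZoo, and `sample_action` is an invented helper. With PettingZoo itself, a random action would typically come from `env.action_space(agent).sample()`.

```python
import random


class ToyParallelEnv:
    """Hypothetical stand-in that mimics the shape of a parallel multi-agent API."""

    def __init__(self, n_agents=3, max_steps=5):
        self.possible_agents = [f"agent_{i}" for i in range(n_agents)]
        self.max_steps = max_steps
        self.agents = []
        self._step_count = 0
        self._rng = random.Random()

    def reset(self, seed=None):
        self._rng = random.Random(seed)
        self.agents = list(self.possible_agents)
        self._step_count = 0
        observations = {agent: 0.0 for agent in self.agents}
        infos = {agent: {} for agent in self.agents}
        return observations, infos

    def sample_action(self, agent):
        return self._rng.choice([0, 1])

    def step(self, actions):
        self._step_count += 1
        # Shared team reward: 1 when every agent picked the same action.
        team_reward = 1.0 if len(set(actions.values())) == 1 else 0.0
        done = self._step_count >= self.max_steps
        observations = {agent: float(actions[agent]) for agent in self.agents}
        rewards = {agent: team_reward for agent in self.agents}
        terminations = {agent: False for agent in self.agents}
        truncations = {agent: done for agent in self.agents}
        infos = {agent: {} for agent in self.agents}
        if done:
            self.agents = []  # an empty agent list ends the episode loop
        return observations, rewards, terminations, truncations, infos


env = ToyParallelEnv()
observations, infos = env.reset(seed=0)
steps = 0
while env.agents:
    # All agents act simultaneously: one dict of actions per step.
    actions = {agent: env.sample_action(agent) for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)
    steps += 1
print("episode length:", steps)
```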
EPyMARL / PyMARL2
Research frameworks for CTDE algorithms. They bundle QMIX, VDN, MAPPO, and others with StarCraft Multi-Agent Challenge (SMAC) environments.
RLlib (Ray)
A scalable RL library with first-class multi-agent support. You define a policy mapping function that assigns agents to policies, and RLlib handles parallel rollout collection.
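A policy mapping function is just a callable from agent id to policy id. The sketch below shows one plausible shape for a two-team setup; the agent ids (`red_0`, `blue_1`) and policy names are hypothetical, and the argument list is modelled loosely on RLlib’s callback, whose exact signature has varied between Ray versions, so check the documentation for the version in use.

```python
def policy_mapping_fn(agent_id, episode=None, **kwargs):
    """Map an agent id such as 'red_0' or 'blue_3' to the policy that controls it."""
    team = agent_id.split("_")[0]
    return f"{team}_policy"


agent_ids = ["red_0", "red_1", "blue_0", "blue_1"]
assignment = {agent_id: policy_mapping_fn(agent_id) for agent_id in agent_ids}
policies_needed = sorted(set(assignment.values()))
print(assignment)
print("policies to define:", policies_needed)
```

Mapping every teammate to the same policy id gives parameter sharing within a team, while mapping each agent to its own id would give fully separate policies.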
Communication between agents
Some architectures let agents exchange learned messages:
- CommNet — agents broadcast a continuous vector that gets averaged into a shared message.
- TarMAC — agents use attention to decide whose messages to listen to.
- DIAL — messages are discrete during execution and continuous (differentiable) during training.
Communication helps in partially observable settings where no single agent sees the full picture.
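The first two ideas in the list can be sketched in a few lines. This is a simplified illustration, not the published architectures: `commnet_messages` averages the other agents’ hidden vectors, and `attention_weights` shows attention-style weighting in the spirit of TarMAC. The function names and the example vectors are assumptions.

```python
import math


def commnet_messages(hidden_states):
    """For each agent, average the hidden vectors of all other agents."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    totals = [sum(h[d] for h in hidden_states) for d in range(dim)]
    return [[(totals[d] - h[d]) / (n - 1) for d in range(dim)]
            for h in hidden_states]


def attention_weights(query, keys):
    """Softmax over scaled dot products: how much a listener attends to each sender."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
messages = commnet_messages(hidden)
weights = attention_weights(query=[1.0, 0.0], keys=hidden)
print("averaged messages:", messages)
print("attention weights:", [round(w, 3) for w in weights])
```

Averaging treats every sender equally, whereas the attention weights let a listener favour the senders whose keys match its query.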
Reward design challenges
In cooperative MARL, a shared team reward can cause credit assignment problems: did the team succeed because of agent A’s great move or despite agent B’s mistake? Techniques to address this include:
- Individual rewards alongside team rewards.
- Counterfactual baselines (COMA) — estimate what would have happened if an agent had acted differently.
- Value decomposition (QMIX, VDN) — factor the team value function into per-agent components.
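The last two techniques can be sketched on a toy example. The per-agent values below are hypothetical, the additive team value follows the VDN idea, and the advantage function follows the COMA idea of comparing the chosen action with a policy-weighted average over that agent’s alternatives. In COMA the joint value would come from a learned centralised critic; the additive `q_total` is used here only as a stand-in.

```python
from itertools import product

# Hypothetical per-agent action values for one state (two agents, three actions each).
q_agent = [[0.2, 1.0, 0.5],
           [0.7, 0.1, 0.4]]


def q_total(joint_action):
    """VDN-style factorisation: the team value is the sum of per-agent values."""
    return sum(q[a] for q, a in zip(q_agent, joint_action))


# Each agent picks its own greedy action, with no knowledge of the others.
decentralised = tuple(q.index(max(q)) for q in q_agent)
# Brute-force search over every joint action for comparison.
joint_best = max(product(range(3), repeat=2), key=q_total)


def counterfactual_advantage(q_joint, policy_i, joint_action, i):
    """COMA-style advantage: value of the taken joint action minus a baseline that
    averages over agent i's alternatives while holding the other actions fixed."""
    baseline = 0.0
    for alt, prob in enumerate(policy_i):
        swapped = joint_action[:i] + (alt,) + joint_action[i + 1:]
        baseline += prob * q_joint(swapped)
    return q_joint(joint_action) - baseline


uniform_policy = [1 / 3, 1 / 3, 1 / 3]
advantage = counterfactual_advantage(q_total, uniform_policy, decentralised, i=0)
print("decentralised greedy:", decentralised, "| joint optimum:", joint_best)
print("agent 0 counterfactual advantage:", round(advantage, 3))
```

Because the team value is a sum, each agent’s local argmax recovers the joint argmax, which is what makes decentralised execution possible. QMIX relaxes the plain sum to any mixing function that is monotonic in each agent’s value.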
Common misconception
Many people think you can just run N independent PPO agents and get good multi-agent behaviour. For simple settings this might work, but for tasks requiring coordination or competition, independent learners often fail to converge because each agent’s changing policy makes the others’ learning signal noisy.
The one thing to remember: MARL’s defining challenge is non-stationarity — every agent’s learning changes every other agent’s world — and the best solutions train centrally but act independently.