Multi-Agent Reinforcement Learning — Core Concepts

Why multi-agent is fundamentally different

In single-agent RL, the environment is (mostly) stationary — its rules do not change while the agent learns. Add a second learning agent and that assumption collapses. Each agent’s policy shift changes the effective environment for every other agent. This non-stationarity is the root challenge of MARL.
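To make the non-stationarity concrete, here is a minimal sketch of a two-action coordination game. The payoff table and the sequence of policies for agent B are illustrative assumptions, not taken from any benchmark. The sketch shows that the value of one fixed action for agent A shifts purely because agent B's policy drifts.

```python
# Hypothetical 2x2 coordination game: reward to agent A for (a_action, b_action).
# All numbers are illustrative assumptions.
PAYOFF_A = {
    (0, 0): 1.0, (0, 1): 0.0,
    (1, 0): 0.0, (1, 1): 1.0,
}

def expected_reward(a_action, b_policy):
    """Expected reward of agent A's action under B's action probabilities."""
    return sum(prob * PAYOFF_A[(a_action, b_action)]
               for b_action, prob in enumerate(b_policy))

# Agent B's policy drifts while B learns (assumed snapshots).
b_policies = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]

# The same action for A is worth something different at every snapshot.
values_for_action_0 = [expected_reward(0, policy) for policy in b_policies]
print(values_for_action_0)  # [0.9, 0.5, 0.1]
```

From agent A's point of view nothing in the game rules changed, yet the return it observes for action 0 fell from 0.9 to 0.1.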

Three interaction modes

Cooperative

All agents share a common goal and receive the same (or closely aligned) reward. Example: a team of drones searching for survivors after an earthquake. Success depends on coordination.

Competitive

Agents have opposing objectives — one agent’s gain is another’s loss. Example: two players in a board game. This is a zero-sum setting.

Mixed (cooperative-competitive)

Teams cooperate internally but compete against other teams. Example: a 5v5 video game. Many real-world problems fall into this category, and it is typically the hardest of the three to train.
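One way to tell the three modes apart is to look at the reward structure directly. The sketch below is a rough illustration with two agents and two actions each. The payoff numbers and the `classify` helper are assumptions made up for this example, and a two-player general-sum game stands in for the mixed case (real mixed settings usually involve teams).

```python
# Each game maps a joint action (a0, a1) to a reward pair (r0, r1).
# All payoffs are illustrative assumptions.
COOPERATIVE = {(0, 0): (1, 1), (0, 1): (0, 0), (1, 0): (0, 0), (1, 1): (1, 1)}
COMPETITIVE = {(0, 0): (1, -1), (0, 1): (-1, 1), (1, 0): (-1, 1), (1, 1): (1, -1)}
GENERAL_SUM = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}

def classify(game):
    """Label a two-agent game by its reward structure."""
    rewards = list(game.values())
    if all(r0 == r1 for r0, r1 in rewards):
        return "cooperative"             # identical rewards for both agents
    if all(r0 + r1 == 0 for r0, r1 in rewards):
        return "competitive (zero-sum)"  # one agent's gain is the other's loss
    return "mixed (general-sum)"         # partly aligned, partly opposed

for name, game in [("COOPERATIVE", COOPERATIVE),
                   ("COMPETITIVE", COMPETITIVE),
                   ("GENERAL_SUM", GENERAL_SUM)]:
    print(name, "->", classify(game))
```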

Key paradigms

| Paradigm | Idea | Tradeoff |
|---|---|---|
| Independent learners | Each agent runs its own single-agent algorithm, ignoring others | Simple but unstable; environment looks non-stationary |
| Centralised training, decentralised execution (CTDE) | A central critic sees everyone’s observations during training, but each agent acts only on its own observations at test time | Best of both worlds; dominant paradigm |
| Fully centralised | One “super-agent” controls all agents | Scales poorly; action space explodes combinatorially |
| Communication learning | Agents learn to send messages to each other | Enables coordination but adds complexity |
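The combinatorial explosion of the fully centralised approach is easy to quantify. As a small worked example, assume every agent has the same number of discrete actions (five here, an arbitrary choice): a single super-agent must then choose among all joint actions, while decentralised policies only need one output per individual action.

```python
def joint_action_count(actions_per_agent: int, n_agents: int) -> int:
    """Number of joint actions a single 'super-agent' must choose between."""
    return actions_per_agent ** n_agents

def decentralised_output_count(actions_per_agent: int, n_agents: int) -> int:
    """Total policy outputs when each agent picks only its own action."""
    return actions_per_agent * n_agents

ACTIONS_PER_AGENT = 5  # assumed for illustration
for n_agents in (1, 2, 5, 10):
    print(n_agents,
          joint_action_count(ACTIONS_PER_AGENT, n_agents),
          decentralised_output_count(ACTIONS_PER_AGENT, n_agents))
# 1 5 5
# 2 25 10
# 5 3125 25
# 10 9765625 50
```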

CTDE is the most widely used pattern in current practice. Algorithms such as MAPPO, QMIX, and MADDPG all follow it.
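The split between what the actors see and what the critic sees can be sketched in a few lines. This is not MAPPO, QMIX, or MADDPG: random linear functions stand in for neural networks, no learning takes place, and every name and dimension below is an assumption chosen for illustration. The only point is which inputs each component receives.

```python
import random

def make_linear(n_inputs, rng):
    """Random linear scorer standing in for a neural network (assumption)."""
    weights = [rng.uniform(-1, 1) for _ in range(n_inputs)]
    return lambda inputs: sum(w * x for w, x in zip(weights, inputs))

rng = random.Random(0)
OBS_DIM, N_AGENTS = 3, 2  # hypothetical sizes
agents = [f"agent_{i}" for i in range(N_AGENTS)]

# Decentralised actors: each one is sized for a single agent's observation.
actors = {agent: make_linear(OBS_DIM, rng) for agent in agents}
# Centralised critic: sized for the concatenation of all observations.
critic = make_linear(OBS_DIM * N_AGENTS, rng)

observations = {agent: [rng.random() for _ in range(OBS_DIM)] for agent in agents}

# Execution time: every agent acts on its own observation only.
actions = {agent: int(actors[agent](observations[agent]) > 0) for agent in agents}

# Training time only: the critic evaluates the joint observation.
joint_observation = [x for agent in agents for x in observations[agent]]
team_value = critic(joint_observation)
print(actions, team_value)
```

At deployment the critic is discarded, so the agents need no access to each other's observations.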

Python frameworks

PettingZoo

The multi-agent equivalent of Gymnasium. It provides a standard API in which environments expose their agents as an iterable and follow one of two models: parallel (all agents act simultaneously) or AEC, the Agent Environment Cycle (agents act one at a time).

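```python
from pettingzoo.mpe import simple_spread_v3

# Parallel API: every agent submits an action on each step.
env = simple_spread_v3.parallel_env()
observations, infos = env.reset(seed=42)  # both are dicts keyed by agent name
```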

EPyMARL / PyMARL2

Research frameworks for CTDE algorithms. They bundle QMIX, VDN, MAPPO, and others with StarCraft Multi-Agent Challenge (SMAC) environments.

RLlib (Ray)

A scalable RL library with first-class multi-agent support. You define a policy mapping function that assigns agents to policies, and RLlib handles parallel rollout collection.
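A policy mapping function is ordinary Python: it takes an agent ID and returns the name of the policy that should control that agent. The sketch below runs without Ray installed. The agent IDs and policy names are hypothetical, and the `(agent_id, episode, **kwargs)` signature is an assumption, since the exact arguments RLlib passes have changed between Ray versions; check the documentation for your release.

```python
# Hypothetical setup: two teams, with each team sharing one policy.
def policy_mapping_fn(agent_id, episode=None, **kwargs):
    """Return the policy name for an agent (assumed signature, see above)."""
    return "red_policy" if agent_id.startswith("red_") else "blue_policy"

agent_ids = ["red_0", "red_1", "blue_0", "blue_1"]  # illustrative agent IDs
assignment = {agent_id: policy_mapping_fn(agent_id) for agent_id in agent_ids}
print(assignment)
# {'red_0': 'red_policy', 'red_1': 'red_policy',
#  'blue_0': 'blue_policy', 'blue_1': 'blue_policy'}
```

Mapping several agents to one policy name means they share parameters; mapping each agent to its own name gives fully separate policies.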

Communication between agents

Some architectures let agents exchange learned messages:

  • CommNet — agents broadcast a continuous vector that gets averaged into a shared message.
  • TarMAC — agents use attention to decide whose messages to listen to.
  • DIAL — messages are discrete during execution and continuous (differentiable) during training.

Communication helps in partially observable settings where no single agent sees the full picture.
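The averaging idea behind CommNet can be sketched without any deep-learning library. The vectors here are hand-written stand-ins for learned hidden states, and the choice to exclude the receiver's own vector from the average is an assumption of this sketch.

```python
def commnet_messages(hidden):
    """For each agent, average the vectors broadcast by all other agents."""
    agents = list(hidden)
    messages = {}
    for receiver in agents:
        others = [hidden[sender] for sender in agents if sender != receiver]
        # Element-wise mean across the other agents' vectors.
        messages[receiver] = [sum(values) / len(others) for values in zip(*others)]
    return messages

# Hand-written stand-ins for learned hidden states (illustrative values).
hidden = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
messages = commnet_messages(hidden)
print(messages)
# {'a': [0.5, 1.0], 'b': [1.0, 0.5], 'c': [0.5, 0.5]}
```

An attention-based scheme such as TarMAC would replace the uniform average with learned, per-sender weights.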

Reward design challenges

In cooperative MARL, a shared team reward can cause credit assignment problems: did the team succeed because of agent A’s great move or despite agent B’s mistake? Techniques to address this include:

  • Individual rewards alongside team rewards.
  • Counterfactual baselines (COMA) — estimate what would have happened if an agent had acted differently.
  • Value decomposition (QMIX, VDN) — factor the team value function into per-agent components.
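As a rough illustration of the counterfactual idea, the sketch below computes a COMA-style advantage for each agent: the value of the joint action actually taken, minus the expected value if only that agent's action were resampled from its policy. In COMA the Q-values come from a learned centralised critic; here a hand-written table and uniform policies are assumed purely for illustration.

```python
def counterfactual_advantage(q_values, policy, joint_action, agent_index):
    """Q of the taken joint action minus a baseline that marginalises
    out one agent's action while holding the other agents' actions fixed."""
    baseline = 0.0
    for alt_action, prob in enumerate(policy):
        alt_joint = list(joint_action)
        alt_joint[agent_index] = alt_action
        baseline += prob * q_values[tuple(alt_joint)]
    return q_values[tuple(joint_action)] - baseline

# Assumed joint Q-table: only agent 0's choice affects the outcome.
Q = {(0, 0): 2.0, (0, 1): 2.0, (1, 0): 10.0, (1, 1): 10.0}
uniform = [0.5, 0.5]  # assumed policy for both agents
taken = (1, 0)        # joint action the team actually took

adv_agent_0 = counterfactual_advantage(Q, uniform, taken, agent_index=0)
adv_agent_1 = counterfactual_advantage(Q, uniform, taken, agent_index=1)
print(adv_agent_0, adv_agent_1)  # 4.0 0.0
```

Both agents received the same team outcome, but the advantages assign the credit to agent 0, whose choice actually mattered.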

Common misconception

A common assumption is that you can just run N independent PPO agents and get good multi-agent behaviour. In simple settings this can work, but in tasks that require tight coordination or competition, independent learners often fail to converge: each agent’s changing policy makes every other agent’s learning target non-stationary.

The one thing to remember: MARL’s defining challenge is non-stationarity. Every agent’s learning changes every other agent’s world, and the most widely used remedy is to train centrally but act independently.
