Mixture of Experts (MoE) is a neural network architecture in which multiple specialist sub-networks (called “experts”) are combined with a routing mechanism (called a “gating network” or “router”) that selects which experts to activate for each input. The key insight is that not all parts of a model need to process every input. By activating only a subset of experts per token or input, MoE models can have very large total parameter counts while keeping the computational cost of processing any single input manageable.

How MoE Works

In a standard transformer model, every token passes through every layer, and every parameter participates in every computation. In an MoE transformer, certain feed-forward layers are replaced with a set of parallel expert networks and a router. For each input token, the router produces a probability distribution over the available experts and selects the top-k experts (typically k=1 or k=2) to process that token. The outputs of the selected experts are combined, weighted by the router’s confidence scores, and passed to the next layer.

This architecture creates two important properties. First, the model’s total knowledge capacity scales with the total number of parameters across all experts, which can be very large. Second, the computational cost per forward pass scales with the number of active parameters, which is much smaller. A model with 8 experts where 2 are active per token uses roughly the compute of a dense model one-quarter its size while potentially matching the quality of a dense model closer to its full size.

Historical Development

The MoE concept dates back to Jacobs et al. (1991), but it gained significant practical relevance with the development of sparsely-gated MoE layers by Shazeer et al. at Google in 2017. Google’s Switch Transformer (2021) simplified the architecture by routing each token to a single expert, making training more stable and efficient. Subsequent models including Mixtral from Mistral AI (2023) and Grok from xAI demonstrated that MoE architectures could achieve competitive quality with dense models at significantly lower inference cost.

Router Design

The router is the critical component that determines MoE effectiveness. Router design involves several trade-offs:

Load balancing. Without explicit balancing, routers tend to collapse, sending most tokens to a small number of popular experts while others remain unused. This wastes parameters and reduces the model’s effective capacity. Load balancing losses or auxiliary objectives are typically added during training to encourage even distribution across experts.

Token dropping. When experts have fixed capacity and receive more tokens than they can process, excess tokens must be dropped or rerouted. This can degrade quality for dropped tokens. Capacity factors and buffer mechanisms manage this trade-off.

Expert granularity. The choice of how many experts to use and how large each expert should be affects both quality and efficiency. More, smaller experts provide finer-grained specialization but increase routing complexity. Fewer, larger experts are simpler to manage but provide less specialization benefit.

Advantages

Cost-effective scaling. MoE models can scale to trillions of parameters while remaining feasible to train and serve, because only a fraction of parameters are active per input.

Specialization. Experts tend to develop specializations during training, with different experts becoming proficient at different types of inputs, languages, or domains. This emergent specialization can improve quality on diverse workloads.

Flexible compute allocation. Because different inputs activate different experts, the model implicitly allocates more processing to inputs that need it (by routing them to more relevant experts) and less to straightforward inputs.

Challenges

Memory requirements. Although only a subset of experts are active per input, all expert weights must be loaded into memory. This makes MoE models more memory-intensive to serve than dense models with equivalent active parameter counts.

Training instability. MoE models can exhibit training instability due to router collapse, load imbalance, and the discrete nature of routing decisions. Techniques such as jitter noise, load balancing losses, and expert parallelism address these issues but add complexity.

Serving complexity. Efficient MoE serving requires expert parallelism across multiple accelerators, with careful communication patterns to route tokens to the correct experts. This is more complex than serving dense models and requires specialized infrastructure.

MoE in Practice

MoE architecture is now used in several of the most capable production language models. It has become a standard approach for building models that need to be both large (for quality) and efficient (for serving cost). The pattern also influences system-level design: the principle of routing different inputs to different specialist processors applies at the system level in compound AI architectures that route queries to different models or processing paths.