Transformers Learn Mixture-of-Experts Regression In Context
Subject
Transformers
Large Language Models (LLM)
Machine Learning Theory
Abstract
We study whether small transformers can learn to perform mixture-of-experts (MoE) regression purely in context. In our synthetic setup, each prompt contains a few (x, y) pairs produced by one of many linear “experts,” plus a query x*. A tiny GPT-2-style model (L layers, H heads) is trained end-to-end to predict y*, with no parameter updates at inference time. Our theory models the problem via an expert “band gap” γ that separates the correct expert from distractors; the resulting signal-to-noise ratio (SNR = γ²/σ²) predicts when accurate routing is possible and how much residual error remains once the right expert is selected. Experiments across depths, heads, and data regimes validate these predictions: at high SNR, the model quickly achieves near-perfect gating and approaches the oracle loss floor; at lower SNR, training plateaus with higher residual error. Increasing depth primarily improves the post-gating regression stage rather than routing itself. Together, these results suggest transformers naturally decompose in-context MoE tasks into two phases (gating, then per-expert linear regression) and that a simple SNR control parameter explains success and failure. We discuss implications for tool use, modular reasoning, and designing curricula that emphasize separability between routing and computation.
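
To make the data setup concrete, below is a minimal Python sketch, assuming a hypothetical generator rather than the paper's exact construction: expert weight vectors whose separation scales with gamma, prompts of n (x, y) pairs drawn from a single expert with observation noise sigma, and a least-squares oracle fit on the revealed context. Names such as make_experts, make_prompt, and oracle_predict are illustrative and not taken from the paper.

import numpy as np

def make_experts(num_experts, dim, gamma, rng):
    # Sample unit-norm expert weight vectors, then scale by gamma so the
    # "band gap" between experts grows with gamma (an assumed construction).
    base = rng.standard_normal((num_experts, dim))
    base /= np.linalg.norm(base, axis=1, keepdims=True)
    return gamma * base

def make_prompt(experts, n_context, sigma, rng):
    # One prompt: n_context (x, y) pairs generated by a single randomly
    # chosen expert, plus a held-out query x* with target y*.
    k = rng.integers(len(experts))
    w = experts[k]
    xs = rng.standard_normal((n_context + 1, experts.shape[1]))
    ys = xs @ w + sigma * rng.standard_normal(n_context + 1)
    return xs[:-1], ys[:-1], xs[-1], ys[-1], k

def oracle_predict(xs, ys, x_query):
    # Oracle baseline: least-squares regression on the context pairs,
    # then predict the query label (the post-gating regression stage).
    w_hat, *_ = np.linalg.lstsq(xs, ys, rcond=None)
    return x_query @ w_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gamma, sigma = 2.0, 0.5
    print("SNR =", gamma**2 / sigma**2)  # the control parameter from the abstract
    experts = make_experts(num_experts=8, dim=4, gamma=gamma, rng=rng)
    xs, ys, xq, yq, k = make_prompt(experts, n_context=16, sigma=sigma, rng=rng)
    print("expert", k, "oracle prediction", oracle_predict(xs, ys, xq), "target", yq)

In this sketch the transformer's job would be to match the oracle: infer which expert produced the context (gating) and then regress within that expert, with larger gamma relative to sigma making the gating step easier.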