Transformers Learn Mixture-of-Experts Regression In Context

Penn collection
Interdisciplinary Centers, Units and Projects::Center for Undergraduate Research and Fellowships (CURF)::Fall Research Expo
Discipline
Computer Sciences
Subject
Machine Learning
Transformers
Large Language Models (LLM)
Machine Learning Theory
License
Author or copyright holder retains all copyrights in the submitted work
Copyright date
2025-10-06
Author
Zhang, Lyuxin David
Xue, Anton
Edelman, Ezra
Goel, Surbhi
Wong, Eric
Abstract

We study whether small transformers can learn to perform mixture-of-experts (MoE) regression purely in context. In a synthetic setup, each prompt contains a few (x, y) pairs produced by one of many linear “experts,” plus a query x*. A tiny GPT-2-style model (L layers, H heads) is trained end-to-end to predict y*, with no parameter updates at inference time. Our theory models the problem via an expert “band gap” γ that separates the correct expert from distractors; the resulting signal-to-noise ratio (SNR = γ²/σ²) predicts when accurate routing is possible and how much residual error remains once the right expert is selected. Experiments across depths, heads, and data regimes validate these predictions: at high SNR, the model quickly achieves near-perfect gating and approaches the oracle loss floor; at lower SNR, training plateaus with higher residual error. Increasing depth primarily improves the post-gating regression stage rather than routing itself. Together, these results suggest that transformers naturally decompose in-context MoE tasks into two phases, gating followed by per-expert linear regression, and that a simple SNR control parameter explains both success and failure. We discuss implications for tool use, modular reasoning, and designing curricula that emphasize separability between routing and computation.
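The setup described above can be illustrated with a short sketch. The code below is a minimal, hypothetical reconstruction (not the authors' released code): it samples prompts from randomly drawn linear experts, then runs the two-phase oracle the abstract implies, gating to the expert that best fits the in-context pairs and regressing with the context data. The specific choices here (dimension d, number of experts K, context length n_ctx, and taking γ as the minimum pairwise separation between expert weight vectors and σ as the label-noise standard deviation) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, K = 8, 4     # input dimension, number of experts (assumed values)
n_ctx = 16      # (x, y) pairs per prompt
sigma = 0.1     # label-noise std; SNR = gamma**2 / sigma**2

# Draw K linear experts w_1..w_K; take gamma as the minimum pairwise
# separation (the "band gap" between the correct expert and distractors).
W = rng.normal(size=(K, d))
gamma = min(np.linalg.norm(W[i] - W[j])
            for i in range(K) for j in range(i + 1, K))
print(f"band gap gamma = {gamma:.3f}, SNR = {(gamma / sigma) ** 2:.1f}")

def make_prompt():
    """One prompt: n_ctx noisy (x, y) pairs from one random expert, plus a query x*."""
    k = rng.integers(K)
    X = rng.normal(size=(n_ctx, d))
    y = X @ W[k] + sigma * rng.normal(size=n_ctx)
    x_star = rng.normal(size=d)
    return X, y, x_star, k

def oracle_predict(X, y, x_star):
    """Two-phase oracle: gate to the best-fitting expert, then regress on the context."""
    # Phase 1 (gating): pick the expert minimizing in-context squared error.
    errs = ((X @ W.T - y[:, None]) ** 2).mean(axis=0)
    k_hat = int(errs.argmin())
    # Phase 2 (regression): least-squares fit on the context pairs, predict y*.
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat @ x_star, k_hat

X, y, x_star, k_true = make_prompt()
y_hat, k_hat = oracle_predict(X, y, x_star)
print(f"routed to expert {k_hat} (true {k_true}); "
      f"prediction {y_hat:.3f} vs clean target {W[k_true] @ x_star:.3f}")
```

Under this sketch, large γ relative to σ makes the gating step almost always select the true expert, so residual error approaches the least-squares floor; shrinking γ or raising σ makes routing ambiguous, matching the plateau behavior the abstract describes at low SNR.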

Date of presentation
2025-09-15
Comments
This project was funded by the Pincus-Magaziner Family Undergraduate Research and Travel Fund.