Tech · Hugging Face
Compiled by KHAO Editorial — aggregated from 1 outlet. See llms.txt for citation guidance.
★ Tier-1 Source

Releasing these models should help the community to study these questions and build toward modular language models that are easier to deploy, adapt, inspect, and compose.
Key facts
- When they keep only 25% of the experts (a 32-expert subset), EMO loses only about 1% absolute performance across all benchmarks; the degradation remains modest even when they keep only 12.5% of the experts (a 16-expert subset). See the subset-routing sketch after the summary.
- For example, in an MoE with 10 total experts and 2 active experts per token, all tokens in a document are restricted to routing within the same pool of 4 experts, as shown in the figure and sketched in code after this list.
- To see what EMO learned after training, they clustered router activations of the first 100 tokens across 12K pretraining documents
- Comparison of training between a standard MoE and EMO (k = 2, n = 10, shared experts omitted for simplicity). (Left) In a standard MoE, each token independently selects its top-k experts. (Right) In EMO, all tokens in a document route within the same small pool of experts.
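The contrast described in the key facts above can be made concrete with a minimal sketch, not the authors' released code: a standard per-token top-k router versus a router where every token in a document is confined to one shared pool of experts. The function names (`standard_topk_route`, `document_pool_route`) and the pool-selection rule (top experts of the document-averaged router logits) are assumptions for illustration; the numbers mirror the example above (n = 10 experts, k = 2, pool of 4).

```python
import torch
import torch.nn.functional as F

n_experts, k, pool_size, d_model = 10, 2, 4, 64
router = torch.nn.Linear(d_model, n_experts, bias=False)

def standard_topk_route(tokens):
    """Standard MoE: each token independently picks its top-k of all n experts."""
    logits = router(tokens)                                   # (seq_len, n_experts)
    weights, experts = torch.topk(F.softmax(logits, dim=-1), k, dim=-1)
    return experts, weights                                   # (seq_len, k) each

def document_pool_route(tokens):
    """Pool-restricted routing: every token in the document routes within one
    shared pool of `pool_size` experts. Picking the pool from the
    document-averaged router logits is an illustrative assumption."""
    logits = router(tokens)                                   # (seq_len, n_experts)
    pool = torch.topk(logits.mean(dim=0), pool_size).indices  # (pool_size,)
    mask = torch.full_like(logits, float("-inf"))
    mask[:, pool] = 0.0                                       # only pool experts allowed
    weights, experts = torch.topk(F.softmax(logits + mask, dim=-1), k, dim=-1)
    return experts, weights

doc = torch.randn(16, d_model)          # one document of 16 token embeddings
print(standard_topk_route(doc)[0])      # expert ids drawn from all 10 experts
print(document_pool_route(doc)[0])      # expert ids drawn from the 4-expert pool
```

The only difference between the two routers is the mask: restricting the softmax to a per-document pool is what makes groups of experts, rather than individual experts, the unit that a document activates.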
Summary
Large language models are typically trained and deployed as monolithic systems: a single model is initialized, pretrained, fine-tuned, and served as one unified entity. Mixture-of-experts (MoE) models seem like a natural way to relax this constraint. In practice, however, existing MoEs still need the full model to work well. The team instead wants MoE models whose experts organize into coherent groups that can be selectively used and composed. One way to encourage this during pretraining is to route tokens to experts based on predefined semantic domains, such as math, biology, or code. Fixing the domains upfront, however, also fixes the model's modular structure: if a new domain or capability emerges at inference time, it isn't obvious which experts should be used.
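To make "selectively used" concrete, here is a minimal sketch, not the released implementation, of deploying only a subset of experts at inference: rank experts by how often the router picked them on a small sample of documents, keep the top fraction, and mask out the rest so every token routes within the kept subset. The helper names (`expert_usage`, `route_within_subset`) and the usage-count selection rule are assumptions; the totals mirror the key fact above (128 experts, keep 32 for a 25% subset), while the number of active experts per token (k = 8 here) is purely illustrative.

```python
import torch
import torch.nn.functional as F

n_experts, k, d_model = 128, 8, 64
router = torch.nn.Linear(d_model, n_experts, bias=False)

@torch.no_grad()
def expert_usage(sample_docs):
    """Count how often each expert appears in the per-token top-k on sample documents."""
    counts = torch.zeros(n_experts)
    for doc in sample_docs:
        chosen = torch.topk(router(doc), k, dim=-1).indices            # (seq_len, k)
        counts += torch.bincount(chosen.flatten(), minlength=n_experts).float()
    return counts

def route_within_subset(tokens, kept):
    """Top-k routing restricted to the kept experts; all others are masked out."""
    logits = router(tokens)                                            # (seq_len, n_experts)
    mask = torch.full_like(logits, float("-inf"))
    mask[:, kept] = 0.0                                                # only kept experts allowed
    weights, experts = torch.topk(F.softmax(logits + mask, dim=-1), k, dim=-1)
    return experts, weights

sample = [torch.randn(64, d_model) for _ in range(8)]                  # held-out sample documents
kept = torch.topk(expert_usage(sample), n_experts // 4).indices        # keep 25% (32 of 128)
experts, weights = route_within_subset(torch.randn(64, d_model), kept)
```

Under this kind of subset restriction, the key fact above reports only about a 1% absolute drop at 25% of the experts, which is what makes the "keep only the experts you need" deployment story plausible.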