Mixture of Experts Papers
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
• arXiv:2401.15947
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
• arXiv:2401.06066
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
• arXiv:2312.07987
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
• arXiv:2101.03961
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
• arXiv:1701.06538
Sparse Networks from Scratch: Faster Training without Losing Performance
• arXiv:1907.04840
A Mixture of h-1 Heads is Better than h Heads
• arXiv:2005.06537
FastMoE: A Fast Mixture-of-Expert Training System
• arXiv:2103.13262
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts
• arXiv:2105.03036
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
• arXiv:2006.16668
A Review of Sparse Expert Models in Deep Learning
• arXiv:2209.01667
Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition
• arXiv:2112.05820
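The common idea running through these papers is the sparsely-gated mixture-of-experts layer (arXiv:1701.06538): a router scores a set of expert networks for each token, only the top-k experts are executed, and their outputs are combined with the renormalized router weights. The sketch below is a minimal illustration of that routing pattern, not code from any of the listed papers; the class name, dimensions, and expert architecture are illustrative assumptions.

```python
# Minimal top-k sparsely-gated MoE sketch (illustrative, not from the papers above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # gating network: one score per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        logits = self.router(x)                        # (tokens, n_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)         # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                    # expert chosen for this slot, per token
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e                        # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out

# Usage: route 16 tokens of width 512 through 8 experts, 2 active per token.
tokens = torch.randn(16, 512)
layer = TopKMoE()
print(layer(tokens).shape)  # torch.Size([16, 512])
```

Production systems such as those in the Switch Transformers, GShard, and FastMoE papers replace the per-expert Python loop with batched dispatch, capacity limits, and load-balancing losses; this sketch only shows the routing and weighted combination.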