
Mixture of Experts


An ensemble learning (meta-learning) technique in the field of neural networks.

Note reference: Mixture of Experts

Keywords: MoE, Gating Model, Ensemble Learning, Meta-learning, Divide and Conquer, Stacked Generalization


Relationship with other techniques:

Classification And Regression Trees (CART): related as a divide-and-conquer approach.

Decision trees: related as a hierarchical mixture of experts.

Stacked generalization (stacking): related as a meta-learning model.

Further readings

Literature review of MoE in 2012: Twenty Years of Mixture of Experts

Improvement of MoE in 2021: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Mistral AI's open-source Mistral 7B model in 2023: Mistral 7B


Note reference: Mixture of Experts from HuggingFace.

Sparse MoE layers

The “experts”: each one is a neural network, typically a feed-forward network (FFN).

‘Experts’ are sparse Switch FFN layers. Reference: Switch Transformers paper (https://arxiv.org/abs/2101.03961).

Gate network - ‘Router’

Determines which expert(s) each input is sent to. The router is composed of learned parameters and is pretrained together with the experts.
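
To make the two components above concrete, here is a minimal sketch of a sparse MoE layer in plain Python. It is an illustration under assumptions, not the implementation from any of the cited papers: the class names (Expert, SparseMoE), the sizes, the random weights, and the choice to take a softmax over only the top-k router logits are all hypothetical choices made for this example. Real routers typically add noise, capacity limits, and load-balancing losses, which are omitted here.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; stands in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """One expert: a tiny two-layer FFN computing W2 @ relu(W1 @ x)."""

    def __init__(self, d_model, d_hidden, rng):
        self.W1 = rand_matrix(d_hidden, d_model, rng)
        self.W2 = rand_matrix(d_model, d_hidden, rng)

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.W1, x)]
        return matvec(self.W2, hidden)


class SparseMoE:
    """Router plus experts; only top_k experts run for each input."""

    def __init__(self, d_model, d_hidden, n_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.experts = [Expert(d_model, d_hidden, rng) for _ in range(n_experts)]
        # Router ("gate network"): one row of weights per expert.
        self.W_gate = rand_matrix(n_experts, d_model, rng)
        self.top_k = top_k

    def route(self, x):
        """Return indices of the chosen experts and their mixing weights."""
        logits = matvec(self.W_gate, x)
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = ranked[: self.top_k]
        # Assumption: normalize over the selected experts only.
        weights = softmax([logits[i] for i in chosen])
        return chosen, weights

    def __call__(self, x):
        chosen, weights = self.route(x)
        out = [0.0] * len(x)
        for idx, weight in zip(chosen, weights):
            y = self.experts[idx](x)  # experts not chosen are never evaluated
            out = [o + weight * yi for o, yi in zip(out, y)]
        return out, chosen


moe = SparseMoE(d_model=4, d_hidden=8, n_experts=4, top_k=2)
token = [0.5, -1.0, 0.25, 2.0]
output, used = moe(token)
print("experts used:", used)
print("output:", output)
```

With top_k=1 the routing step resembles the single-expert routing described for Switch Transformers, although that paper's exact gating and auxiliary losses are not reproduced here. The sparsity comes from the loop in __call__: the cost per input grows with top_k, not with the total number of experts.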


Brief History and Some Methods


References in Green

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.

Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., & Chen, Z. (2020). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.

Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., & Fedus, W. (2022). ST-MoE: Designing Stable and Transferable Sparse Expert Models.
