Mixture of Experts
An ensemble learning (meta-learning) technique in the field of neural networks.
Note reference: Mixture of Experts
Keywords: MoE, Gating Model, Ensemble Learning, Meta-learning, Divide and Conquer, Stacked Generalization
Workflow
- Divide the task into subtasks.
- Develop an expert for each subtask. (e.g., a mixture density network)
- Use a gating model to decide which expert to use. (a ‘classifier selection algorithm’)
- Use expectation maximization (EM) to train the experts and the gating model together, so the gate learns dynamic, input-dependent confidence in each expert.
- Pool the expert predictions, weighted by the gating model output, to make the final prediction. (A minimal sketch follows this list.)
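A minimal sketch of this workflow in PyTorch, assuming a toy setup: the experts are small FFNs, the gate produces softmax confidences, and the final prediction is the gate-weighted pool of expert outputs. All names and sizes are illustrative, and for brevity the sketch is trained end to end by gradient descent rather than EM.

```python
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    def __init__(self, in_dim, out_dim, num_experts=4, hidden=32):
        super().__init__()
        # One small FFN per expert; each can specialize on a region of the input space.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))
            for _ in range(num_experts)
        )
        # Gating model: one confidence weight per expert for each input.
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)               # (batch, num_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=2)  # (batch, out_dim, num_experts)
        # Pool: expert predictions weighted by the gate's output.
        return (outputs * weights.unsqueeze(1)).sum(dim=-1)

moe = MixtureOfExperts(in_dim=8, out_dim=1)
y = moe(torch.randn(16, 8))  # -> shape (16, 1)
```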
Relationship with other techniques:
- Classification and regression trees (CART): like MoE, a divide-and-conquer approach; a hierarchical mixture of experts can be viewed as a soft decision tree.
- Stacked generalization (stacking): like MoE, a meta-learning model that combines the predictions of several base models.
Further readings
Literature review of MoE in 2012: Twenty Years of Mixture of Experts
Improvement of MoE in 2021: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Mistral AI's open-source Mistral 7B model in 2023: Mistral 7B
Elements
Note reference: Mixture of Experts from HuggingFace.
Sparse MoE layers
The “experts”: each expert is itself a neural network, typically an FFN.
In Switch Transformers, the experts are sparsely activated Switch FFN layers. Reference: Switch Transformers paper.
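As a sketch, a single expert can be as simple as the two-layer FFN below (PyTorch; d_model and d_ff are assumed placeholder sizes, not values from the Switch Transformers paper):

```python
import torch.nn as nn

class ExpertFFN(nn.Module):
    """One expert: a standard two-layer Transformer-style feed-forward block."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        return self.net(x)
```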
Gate network - ‘Router’
Determines which expert(s) each input (e.g., each token) is sent to. The router is composed of learned parameters and is pretrained together with the experts.
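A minimal sketch of a sparse MoE layer with a learned router, reusing the ExpertFFN sketch above and top-1 (Switch-style) selection. Real implementations add load-balancing losses, capacity factors, and efficient batched dispatch, which are omitted here; all names are illustrative.

```python
import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    """Top-1 routing: each token is processed by the single expert the router picks."""
    def __init__(self, d_model=512, num_experts=8):
        super().__init__()
        self.experts = nn.ModuleList(ExpertFFN(d_model) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)  # the learned gate ("router")

    def forward(self, x):                      # x: (num_tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)
        gate, idx = probs.max(dim=-1)          # best expert and its probability, per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():                     # each expert runs only on the tokens routed to it
                out[mask] = gate[mask].unsqueeze(1) * expert(x[mask])
        return out
```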
Challenges
- In training: pre-training is compute-efficient, but the model struggles to generalize during fine-tuning and is vulnerable to overfitting.
- In inference: only a subset of the parameters is used per token, so inference is faster than a dense model with the same total parameter count, but all of the parameters still have to be loaded into memory, so the memory requirement is large. (A rough illustration follows this list.)
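A back-of-the-envelope illustration of this trade-off, with made-up numbers (8 experts, top-2 routing) rather than the figures of any specific model:

```python
# All counts below are assumptions chosen for illustration only.
num_experts, top_k = 8, 2
expert_params = 100e6   # parameters per expert FFN (assumed)
shared_params = 300e6   # attention, embeddings, etc. shared by all tokens (assumed)

total = shared_params + num_experts * expert_params  # must all sit in memory
active = shared_params + top_k * expert_params       # actually used per token
print(f"total params to load: {total/1e9:.1f}B, active per token: {active/1e9:.1f}B")
# -> total params to load: 1.1B, active per token: 0.5B
```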
Brief History and Some Methods
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.
Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., & Chen, Z. (2020). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.
Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., & Fedus, W. (2022). ST-MoE: Designing Stable and Transferable Sparse Expert Models.