Introduction
To follow the evolution of artificial intelligence, we turned to LLM benchmarks to get a sense of which models are dominant. Starting with the best-known benchmark, the Open LLM Leaderboard, we immediately notice that the top 10 is invaded by a host of models with "MoE" in their names?!
Looking for confirmation of this trend, we turn to the Chatbot Arena benchmark, widely regarded as the reference for ranking the best language models. Again, the top 10 is full of MoEs: GPT-4 is suspected of being one, and it is joined by Mistral's model, Mixtral (the "Mix" hinting at its MoE architecture).
Faced with this invasion, it becomes essential to explain what MoEs are and where their performance gains come from.
Mixture of Experts (MoE): expectations
At first glance, Mixtures of Experts seem to be a fabulous solution that promises us:
- Better performance,
- Faster (and therefore cheaper) model training,
- Significantly lower inference costs.
But what is a MoE?
Let's start with how you can improve the performance of an LLM. A machine learning practitioner who has witnessed the historic rise of deep learning would start by simply increasing the number of parameters in the model. Mixture of Experts is a more refined way to do that.
Architecture of the Mixture of Experts
What is an expert? It is simply a neural network with unique weights learned independently.
The second building block at the heart of Mixtures of Experts is the router:
The router takes the tokens given to the model as input and redirects them to the most appropriate expert(s). This is an important detail: each token (roughly a word) can go through a different expert. The outputs of the various experts are then aggregated and normalized to form the output of the model.
More formally, the router is a learned model ($G$), and we can represent the operation of a MoE with the following equation (with $E_i$ the expert networks):
$$y = \sum_{i=1}^{n} G(x)_i \, E_i(x)$$
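To make the equation concrete, here is a minimal sketch of a softmax-gated MoE layer in PyTorch. The dimensions, the choice of simple two-layer MLPs as experts, and the class name `SimpleMoE` are illustrative assumptions, not taken from any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        # The router G is itself a learned model: here, a single linear layer.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert E_i is an independent feed-forward network with its own weights.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model) -- each token gets its own gating weights G(x).
        gates = F.softmax(self.router(x), dim=-1)                          # (n_tokens, n_experts)
        outs = torch.stack([expert(x) for expert in self.experts], dim=1)  # (n_tokens, n_experts, d_model)
        # y = sum_i G(x)_i * E_i(x)
        return (gates.unsqueeze(-1) * outs).sum(dim=1)

tokens = torch.randn(10, 64)                      # 10 tokens, d_model = 64
layer = SimpleMoE(d_model=64, d_hidden=256, n_experts=4)
print(layer(tokens).shape)                        # torch.Size([10, 64])
```

Note that this dense softmax gating still runs every expert on every token; it is the top-K variant below that makes MoEs cheap at inference.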
A simple choice of gating function is the softmax, as in the sketch above. Mixtral's gating function goes further and adds a $\mathrm{TopK}$ operation that keeps only the K best experts for each token (with $\mathrm{SwiGLU}_i$ denoting the expert networks):
$$y = \sum_{i=0}^{n-1} \mathrm{Softmax}\big(\mathrm{TopK}(x \cdot W_g)\big)_i \cdot \mathrm{SwiGLU}_i(x)$$
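As a rough illustration of this equation, the sketch below implements top-K routing with SwiGLU experts in PyTorch. It is a simplified reading of the formula (the dimensions, class names and the per-expert loop are my own choices), not Mixtral's actual implementation, which dispatches tokens to experts far more efficiently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: the SwiGLU feed-forward block w2(silu(w1 x) * w3 x)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)    # W_g
        self.experts = nn.ModuleList(SwiGLUExpert(d_model, d_hidden) for _ in range(n_experts))

    def forward(self, x):
        logits = self.gate(x)                                     # (n_tokens, n_experts)
        topk_vals, topk_idx = torch.topk(logits, self.k, dim=-1)  # keep the K best experts per token
        weights = F.softmax(topk_vals, dim=-1)                    # softmax over the K kept logits only
        y = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                               # expert chosen in this slot, per token
            for e in idx.unique().tolist():
                mask = idx == e                                   # tokens routed to expert e
                y[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return y

tokens = torch.randn(10, 64)
print(TopKMoE(d_model=64, d_hidden=128, n_experts=8, k=2)(tokens).shape)  # torch.Size([10, 64])
```

Only the K selected experts run on each token, which is where the inference savings come from.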
The case of Transformers
An important detail that we have omitted so far is that the MoEs mentioned in the introduction live inside Transformer architectures, and that introduces a few nuances.
The experts and their router are inserted in place of the feed-forward blocks of the Transformer architecture. The attention layers, on the other hand, are shared by all the experts rather than duplicated (for the less mathematical, this is why Mixtral 8x7B is actually a ~47B-parameter model and not 56B; a back-of-the-envelope count follows below). Note also that the experts differ from one Transformer block to the next: each block has its own router and its own set of experts.
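As a rough check of that 47B figure, here is a quick count using the configuration reported in the Mixtral paper (hidden size 4096, 32 layers, 8 SwiGLU experts with hidden dimension 14336, 32 query heads and 8 key-value heads of dimension 128, a 32000-token vocabulary). Layer norms and the tiny router weights are ignored, so the numbers are approximate.

```python
dim, n_layers, n_experts, ffn_dim = 4096, 32, 8, 14336
n_heads, n_kv_heads, head_dim, vocab = 32, 8, 128, 32000

attn = 2 * dim * n_heads * head_dim + 2 * dim * n_kv_heads * head_dim  # Wq, Wo + Wk, Wv (shared by all experts)
expert = 3 * dim * ffn_dim                                             # SwiGLU expert: w1, w2, w3
embeddings = 2 * vocab * dim                                           # input embedding + output head

total = n_layers * (attn + n_experts * expert) + embeddings            # everything that must be stored
active = n_layers * (attn + 2 * expert) + embeddings                   # only the top-2 experts run per token

print(f"total  ≈ {total / 1e9:.1f}B parameters")   # ≈ 46.7B, not 8 x 7B = 56B, because attention is shared
print(f"active ≈ {active / 1e9:.1f}B per token")   # ≈ 12.9B
```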
MoEs: why are they better?
- More efficient inference and pre-training
MoEs are a way to get additional parameters almost "for free". The top-K mechanism ensures that most experts are left idle for any given token, so at inference time only a fraction of the parameters is actually used (the Mixtral parameter sketch above gives an idea of that fraction). Which fraction gets used is decided by the router, which is itself trained.
- Training is cheaper.
MoE training is more efficient. Because each token only passes through its selected experts, the forward and backward passes (and therefore the gradients computed per token) cost roughly as much as those of a much smaller dense model. Training therefore reaches a given quality faster, which saves compute and thus time as well as money!
Is there a specialization of experts?
In common usage, an expert is someone competent in a particular field. Is the same true of our MoE experts?
Experts are not subject-matter experts, as our intuition might suggest. Indeed, on The Pile dataset, no expert receives a significantly larger share of the tokens for any given topic than the others. The only exception is "DM Mathematics", which seems to point to a syntactic specialization.
However, it also seems that some experts specialize in handling certain tokens, such as punctuation or conjunctions. These specialization results remain the exception, though: the examples in the following table were hand-picked, and most others show no clear semantic or syntactic specialization.
The disadvantages of MoEs
MoE models, beyond their different nature, also behave differently from classical dense models when it comes to fine-tuning. Indeed, MoEs are more prone to overfitting. In general, the fewer the experts, the easier the model is to fine-tune. There are exceptions, however, such as the TriviaQA dataset, on which fine-tuned MoEs excel.
Even though MoEs use only a fraction of their parameters per token, the model and all of its experts still have to be loaded into memory. You therefore need machines with enough VRAM (video RAM) to run a MoE.
This constraint sums up when to choose a MoE: when there is enough VRAM available and high inference throughput is required. Otherwise, a dense model (as opposed to a MoE) will be more appropriate.
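As a rough order of magnitude, assuming 16-bit weights and reusing the approximate Mixtral figures from the sketch above:

```python
total_params, active_params = 46.7e9, 12.9e9   # approximate Mixtral 8x7B figures from the earlier sketch
bytes_per_param = 2                            # bf16 / fp16 weights

# All experts must sit in VRAM, even though only a fraction is used per token;
# the KV cache and activations add to this on top.
print(f"weights to load     : ~{total_params * bytes_per_param / 1e9:.0f} GB")   # ~93 GB
print(f"weights used / token: ~{active_params * bytes_per_param / 1e9:.0f} GB")  # ~26 GB
```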
Conclusion
Mixtures of Experts are a simple way to increase the performance of LLMs, with both a lower training cost and a lower inference cost. However, this architecture comes with its own challenges: such models require additional memory (VRAM) to load all the experts, and they lose some adaptability since they are, in principle, harder to fine-tune.
In summary, Mixtures of Experts mark a major advance, offering better performance, but they require significant investments in hardware and expertise to be fully exploited.
References
- Hugging Face blog: https://huggingface.co/blog/moe
- Mistral paper: Mixtral of Experts
- Switch Transformer: arXiv
- MegaBlocks: arXiv, GitHub