diff --git a/posts/2023-11-16-rr.md b/posts/2023-11-16-rr.md
index 950ad5c..06c3715 100644
--- a/posts/2023-11-16-rr.md
+++ b/posts/2023-11-16-rr.md
@@ -1,13 +1,13 @@
 ---
 layout: post
 title: "Reading Roundup"
-categories: [X]
+categories: []
 year: 2023
-type: paper
-author: Gemini Team, Google
-exturl: https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf
+type: blog post
 ---
 
+Short format collection covering a couple of blog posts / papers I've read recently.
+
 ## The Reversal Curse: A Stark Reflection on LLMs' Limitations in Reasoning
 
 The paper on the "Reversal Curse" in Auto-regressive Large Language Models (LLMs) casts a sobering light on the overestimated generalization capabilities of these models. It's a striking revelation that LLMs, long thought to be adept at extrapolating and reasoning beyond their training data, actually demonstrate a significant shortfall in this regard. This failure to generalize is exemplified in the Reversal Curse, where LLMs, trained on statements in a specific order, fail to infer the logical reverse.
diff --git a/posts/2023-12-12-moe.md b/posts/2023-12-12-moe.md
index 04ac546..77ca7b0 100644
--- a/posts/2023-12-12-moe.md
+++ b/posts/2023-12-12-moe.md
@@ -8,8 +8,17 @@ type: post
 
 Mixture of Experts (MoE) models are the flavor of the week following Mistral's release of **Mixtral 7Bx8**. Mistral are just killing it at the moment; love their style of just dropping a torrent link and letting the results speak for themselves. The contrast to Google's announcement of Gemini is hilarious, and it makes sense: Mistral is never going to appear more flashy or have the budget for a huge announcement ceremony, so instead they lean into the 90's hacker vibe. Anyway, as I was saying, Mixture of Experts are in vogue right now, but they are hardly a new discovery, so today I'd like to present a brief overview of their history and hopefully arrive at an understanding of their prevalence in modern LLMs.
 
-## Mixture of Experts - A brief history
+# Mixture of Experts - A brief history
 
 MoEs can be traced back to the early 90's with a paper from none other than one of the _Godfathers of Deep Learning_, Geoffrey Hinton: [Adaptive Mixtures of Local Experts](https://ieeexplore.ieee.org/document/6797059). The original idea was akin to ensemble learning: a system composed of separate networks, each an expert on a different subset of the training data. The experts were chosen by a gating network (typically a linear layer), and these gates are trained together with the expert networks.
 
-As the deep learning revolution took off in the 10's, a couple of important advancements came to MoEs. In [Learning Factored Representations in a Deep Mixture of Experts](https://arxiv.org/abs/1312.4314), the authors present MoE layers as a small part of a larger multilayer network. Previously, MoEs had comprised the entire system, but now they simply became a part of larger networks, enabling MoE models to be both large and effective. People also realized that we can dynamically activate or deactivate components based on the input token, allowing models to scale without impairing inference speeds [1](https://openreview.net/pdf?id=BNYMo3QRxh7PwR1riEDL) [2](https://arxiv.org/abs/1308.3432). This work culminated in the foundation of modern MoEs, a paper again co-authored by Geoffrey Hinton: [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/abs/1701.06538). As the title suggests, this paper introduced a Sparsely-Gated Mixture of Experts layer. These layers would consist of thousands of feed-forward sub-networks with a gating network that determines a _sparse_ combination of the experts to be used for each token. The idea of sparsity is akin to conditional computation: in a dense model all the parameters would be used for every input, but sparsity (or conditional computation), as I explained earlier, allows us to only run parts of the whole system.
+As the deep learning revolution took off in the 10's, a couple of important advancements came to MoEs. In [Learning Factored Representations in a Deep Mixture of Experts](https://arxiv.org/abs/1312.4314), the authors present MoE layers as a small part of a larger multilayer network. Previously, MoEs had comprised the entire system, but now they simply became a part of larger networks, enabling MoE models to be both large and effective. People also realized that we can dynamically activate or deactivate components based on the input token, allowing models to scale without impairing inference speeds [1](https://openreview.net/pdf?id=BNYMo3QRxh7PwR1riEDL) [2](https://arxiv.org/abs/1308.3432). This work culminated in the foundation of modern MoEs, a paper again co-authored by Geoffrey Hinton: [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/abs/1701.06538). As the title suggests, this paper introduced a Sparsely-Gated Mixture of Experts layer. These layers would consist of thousands of feed-forward sub-networks with a gating network that determines a _sparse_ combination of the experts to be used for each token. The idea of sparsity is akin to conditional computation: in a dense model all the parameters would be used for every input, but sparsity (or conditional computation), as I explained earlier, allows us to only run parts of the whole system. If a model is trained with N experts, this scheme lets the user choose how many experts M << N to use at a time, depending on their computational resources.
+
+## Load Balancing
+
+When training a model with MoE layers, it is common to add noise to the gating mechanism to avoid expert occlusion. As one might imagine, the gating network tends to converge to mostly activating a small subset of the experts, making the whole concept of MoEs less efficient. This problem is circular, as favored experts are trained quicker and hence selected more often. In addition to noise, an auxiliary loss is added to encourage giving all experts a roughly equal number of training examples.
+
+Google was one of the first to blend large-scale Transformers with MoEs in a framework they call [GShard](https://arxiv.org/abs/2006.16668). GShard replaced every other FFN layer with an MoE layer using top-2 gating. To maintain a balanced load and efficiency at scale, GShard introduced two additional load-balancing techniques (a rough code sketch of this kind of top-2 routing follows after the list):
+
+- **Random routing**. The top expert is always picked, but the second expert is sampled according to the gating weight probabilities.
+- **Expert capacity**. A threshold for how many tokens can be processed by one expert. If both experts are at capacity, the token is considered overflowed and is sent to the next layer via a skip connection.
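+
+To make the routing above concrete, here is a rough PyTorch sketch of a top-2 sparsely-gated MoE layer: noisy gating, random routing of the second expert, a simple expert-capacity cutoff, and one common form of the auxiliary load-balancing loss. To be clear, this is just my illustration, not the actual GShard or Mixtral code; the layer sizes, noise scale, and capacity factor are made-up values, and a real implementation dispatches tokens to experts in parallel across devices rather than looping in Python.
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+class Top2MoELayer(nn.Module):
+    """Sketch of a sparsely-gated MoE layer with top-2 routing (illustrative only)."""
+
+    def __init__(self, d_model=512, d_hidden=2048, n_experts=8,
+                 capacity_factor=1.25, noise_std=1.0):
+        super().__init__()
+        # Each expert is an ordinary FFN block; the gate is a single linear layer.
+        self.experts = nn.ModuleList([
+            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
+                          nn.Linear(d_hidden, d_model))
+            for _ in range(n_experts)
+        ])
+        self.gate = nn.Linear(d_model, n_experts)
+        self.n_experts = n_experts
+        self.capacity_factor = capacity_factor
+        self.noise_std = noise_std
+
+    def forward(self, x):                        # x: (n_tokens, d_model)
+        n_tokens = x.size(0)
+        logits = self.gate(x)
+        if self.training:                        # noisy gating keeps under-used experts in play
+            logits = logits + torch.randn_like(logits) * self.noise_std
+        probs = F.softmax(logits, dim=-1)        # (n_tokens, n_experts)
+
+        top2_p, top2_i = probs.topk(2, dim=-1)   # top-2 gating: pick the 2 largest gate values
+        # Random routing: keep the 2nd expert only with probability given by its share
+        # of the gate weight; otherwise the token uses just its top-1 expert.
+        keep_second = torch.rand(n_tokens, device=x.device) < top2_p[:, 1] / top2_p.sum(dim=-1)
+
+        # Expert capacity: each expert processes at most this many tokens per batch.
+        capacity = int(self.capacity_factor * 2 * n_tokens / self.n_experts)
+
+        out = torch.zeros_like(x)
+        load = torch.zeros(self.n_experts, dtype=torch.long, device=x.device)
+        weight_sum = top2_p.sum(dim=-1)
+        for slot in range(2):                    # slot 0 = top-1 choices, slot 1 = top-2 choices
+            for e in range(self.n_experts):
+                mask = top2_i[:, slot] == e
+                if slot == 1:
+                    mask = mask & keep_second
+                idx = mask.nonzero(as_tuple=True)[0]
+                idx = idx[: max(capacity - int(load[e]), 0)]   # drop tokens that overflow
+                if idx.numel() == 0:
+                    continue
+                load[e] += idx.numel()
+                w = (top2_p[idx, slot] / weight_sum[idx]).unsqueeze(-1)
+                out[idx] = out[idx] + w * self.experts[e](x[idx])
+        # Overflowed tokens keep out == 0; the Transformer block's residual (skip)
+        # connection around this layer is what carries them on to the next layer.
+
+        # Auxiliary load-balancing loss: fraction of tokens whose top-1 choice is
+        # expert e, times expert e's mean gate probability, summed over experts.
+        frac_tokens = F.one_hot(top2_i[:, 0], self.n_experts).float().mean(dim=0)
+        mean_probs = probs.mean(dim=0)
+        aux_loss = self.n_experts * (frac_tokens * mean_probs).sum()
+        return out, aux_loss
+```
+
+In a Transformer block this would sit in place of the dense FFN: run the layer, add the usual residual connection (which is also what rescues overflowed tokens), and add `aux_loss`, scaled by a small coefficient, to the training loss so the gate is nudged towards spreading tokens evenly.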