Commit aff78c5: /blog bug final
LeonEricsson committed Dec 12, 2023 (1 parent: eb63c5d)
Showing 2 changed files with 15 additions and 6 deletions.
posts/2023-11-16-rr.md (8 changes: 4 additions & 4 deletions)
---
layout: post
title: "Reading Roundup"
categories: []
year: 2023
type: blog post
---

Short format collection covering a couple of blog posts / papers I've read recently.

## The Reversal Curse: A Stark Reflection on LLMs' Limitations in Reasoning

The paper on the "Reversal Curse" in Auto-regressive Large Language Models (LLMs) casts a sobering light on the overestimated generalization capabilities of these models. It's a striking revelation that LLMs, long thought to be adept at extrapolating and reasoning beyond their training data, actually demonstrate a significant shortfall in this regard. This failure to generalize is exemplified in the Reversal Curse, where LLMs, trained on statements in a specific order, fail to infer the logical reverse.
posts/2023-12-12-moe.md (13 changes: 11 additions & 2 deletions)

Mixture of Experts (MoE) is the flavor of the week following Mistral's release of **Mixtral 8x7B**. Mistral are just killing it at the moment; I love their style of simply dropping a torrent link and letting the results speak for themselves. The contrast with Google's Gemini announcement is hilarious, and it makes sense: Mistral is never going to appear flashier or have the budget for a huge announcement ceremony, so they lean into the 90's hacker vibe instead. Anyway, as I was saying, Mixture of Experts is in vogue right now, but it is hardly a new discovery, so today I'd like to present a brief overview of its history and hopefully arrive at an understanding of its prevalence in modern LLMs.

# Mixture of Experts - A brief history

MoEs can be traced back to the early 90's with a paper from none other than one of the _Godfathers of Deep Learning_, Geoffrey Hinton: [Adaptive Mixtures of Local Experts](https://ieeexplore.ieee.org/document/6797059). The original idea was akin to ensemble learning: a system composed of separate networks, each an expert on a different subset of the training data. The experts were chosen by a gating network (typically a linear layer), which is trained together with the expert networks.
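
To make that concrete, here is a minimal sketch of such a dense mixture, assuming PyTorch; the names (`DenseMoE`, `dim`, `num_experts`) and the FFN-style experts are my own illustration, not the paper's actual architecture. Every expert sees every input, and the gate only decides how much each expert's output contributes.

```python
import torch
import torch.nn as nn

class DenseMoE(nn.Module):
    """Toy dense mixture: all experts run on every input, the gate weights their outputs."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)  # the gating network: a single linear layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)                   # (batch, num_experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, num_experts, dim)
        return torch.einsum("be,bed->bd", weights, expert_outs)         # weighted sum of expert outputs
```

In this sketch the gate and the experts sit in the same computation graph, so backpropagation trains them together, mirroring the paper's idea of learning the gating and the experts jointly.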

As the deep learning revolution took off in the 2010s, a couple of important advancements came to MoEs. In [Learning Factored Representations in a Deep Mixture of Experts](https://arxiv.org/abs/1312.4314), the authors present MoE layers as a small part of a larger multilayer network. Previously, MoEs had comprised the entire system, but now they became components of larger networks, enabling MoE models to be both large and effective. People also realized that components could be dynamically activated or deactivated based on the input token, allowing models to scale without impairing inference speed [1](https://openreview.net/pdf?id=BNYMo3QRxh7PwR1riEDL) [2](https://arxiv.org/abs/1308.3432). This work culminated in the foundation of modern MoEs, a paper again co-authored by Geoffrey Hinton: [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/abs/1701.06538). As the title suggests, this paper introduced a Sparsely-Gated Mixture-of-Experts layer. These layers consist of thousands of feed-forward sub-networks with a gating network that determines a _sparse_ combination of the experts to be used for each token. The idea of sparsity is akin to conditional computation: in a dense model all the parameters are used for every input, whereas sparsity (or conditional computation), as I explained earlier, allows us to only run parts of the whole system. If a model is trained with N experts, this scheme lets users choose how many experts M << N they want to use at a time, depending on their computational resources.
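
Here is a rough sketch of that sparse routing, again assuming PyTorch and with made-up names; the per-expert loop is written for clarity rather than speed, but it shows the essential point that only the top_k experts chosen by the gate are ever evaluated for a given token.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Toy sparsely-gated MoE layer: each token is processed by only top_k of the experts."""

    def __init__(self, dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Pick the top_k experts per token and renormalise their gate scores.
        logits = self.gate(x)                                # (tokens, num_experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)  # (tokens, top_k)
        weights = torch.softmax(top_vals, dim=-1)            # softmax over the selected experts only

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                 # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

With N experts and top_k = M, the parameter count grows with N while the per-token compute only grows with M, which is why the model can be scaled up without a proportional increase in inference cost.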

## Load Balancing

When training a model with MoE layers, it is common to add noise to the gating mechanism to stop a few experts from crowding out the rest. As one might imagine, the gating network tends to converge to mostly activating a small subset of the experts, making the whole concept of MoEs less efficient. The problem is circular: favored experts are trained quicker and hence selected more. In addition to noise, an auxiliary loss is added to encourage giving all experts a roughly equal number of training examples.
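
As an illustration of both tricks, here is roughly what noisy gating plus an auxiliary load-balancing loss can look like; this follows a Switch-Transformer-style formulation (fraction of tokens routed to each expert times that expert's mean gate probability), and the exact noise distribution and loss weighting vary from paper to paper.

```python
import torch

def noisy_gate(logits: torch.Tensor, training: bool = True) -> torch.Tensor:
    """Perturb the gate logits during training so routing doesn't lock onto a few experts."""
    if training:
        logits = logits + torch.randn_like(logits)  # simple Gaussian noise; papers differ here
    return torch.softmax(logits, dim=-1)            # (tokens, num_experts)

def load_balancing_loss(probs: torch.Tensor, chosen: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Auxiliary loss that is minimised when tokens (and gate mass) are spread evenly."""
    # f_i: fraction of tokens actually dispatched to expert i (non-differentiable)
    token_fraction = torch.bincount(chosen, minlength=num_experts).float() / chosen.numel()
    # P_i: mean gate probability assigned to expert i (differentiable, carries the gradient)
    prob_fraction = probs.mean(dim=0)
    return num_experts * torch.sum(token_fraction * prob_fraction)

# usage sketch: scale this and add it to the main training loss
# probs = noisy_gate(gate(x)); chosen = probs.argmax(dim=-1)
# loss = lm_loss + 0.01 * load_balancing_loss(probs, chosen, num_experts)
```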

Google was one of the first to blend large-scale Transformers with MoEs in a framework they call [GShard](https://arxiv.org/abs/2006.16668). GShard replaced every other FFN layer with an MoE layer using top-2 gating. To maintain a balanced load and efficiency at scale, GShard introduced two additional load balancing techniques, roughly sketched in the code after this list:

- **Random routing**. The top expert is always picked, but the second expert is sampled according to the gating weight probabilities.
- **Expert capacity**. A threshold for how many tokens can be processed by one expert. If both experts are at capacity, the token is considered overflowed and is sent to the next layer via a skip connection.
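
Here is a toy sketch of how those two mechanisms might fit together, with hypothetical names and a deliberately simplified capacity rule; real GShard does this with batched tensor ops and a capacity factor rather than a Python loop.

```python
import torch

def top2_route_with_capacity(gate_logits: torch.Tensor, capacity: int):
    """Top-2 routing: top expert always used, second expert sampled, both subject to capacity."""
    probs = torch.softmax(gate_logits, dim=-1)          # (tokens, num_experts)
    num_experts = probs.shape[-1]

    first = probs.argmax(dim=-1)                        # the top expert is always picked
    # random routing: sample the second expert from the remaining gate probabilities
    rest = probs.scatter(1, first.unsqueeze(1), 0.0)
    second = torch.multinomial(rest, num_samples=1).squeeze(1)

    load = [0] * num_experts                            # tokens assigned to each expert so far
    assignments, overflowed = [], []
    for tok, (e1, e2) in enumerate(zip(first.tolist(), second.tolist())):
        routed = False
        for e in (e1, e2):
            if load[e] < capacity:                      # expert capacity check
                load[e] += 1
                assignments.append((tok, e))
                routed = True
        if not routed:
            overflowed.append(tok)                      # token skips the MoE layer entirely
    return assignments, overflowed
```

Overflowed tokens are not dropped; as the bullet above says, their representation is simply passed on unchanged through the skip connection to the next layer.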
