
Why Transformer++ #21

Open

sdmorrey opened this issue Jun 15, 2024 · 1 comment

Comments

@sdmorrey

I found this project being discussed on the LocalLLaMA subreddit.
I read the paper but had some questions.

One question that came up and is still gnawing at me: why Transformer++ as your basis of comparison? That model is basically from the Stone Age at this point.

Have you performed any comparisons with more recent SOTA models or against frontier models?

Thanks!

@ridgerchu
Owner

ridgerchu commented Jun 15, 2024

We chose Transformer++ as the baseline for our language model because it serves as the foundation of many modern state-of-the-art models, such as LLaMA 2/3, Mistral, Qwen, and Yi. These models build on the Transformer++ architecture with only minor modifications, which demonstrates its effectiveness and versatility.
Moreover, recent research on linear transformers, such as Mamba and GLA, has utilized Transformer++ as the baseline for comparison. This further highlights the significance and relevance of the Transformer++ architecture in the field of natural language processing.
The perceived underperformance of our model can be attributed to its limited training data compared to other models. For instance, Gemma was trained on 6 trillion tokens and LLaMA 3 on an impressive 15 trillion tokens, whereas we only have enough GPU capacity to train on roughly 100B tokens. Those token counts are clearly reported in the respective papers, but the training data itself is not openly available.
Although we have access to the FineWeb corpus, which contains 15 trillion tokens, training a model on a dataset of that size remains a challenging and resource-intensive task. We estimate that renting the necessary H100 GPUs to train at this scale would cost nearly 1 million dollars.
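For context, here is a rough back-of-envelope sketch of where an estimate like that comes from, using the common ~6·N·D training-FLOPs heuristic. The model size, utilization, and GPU-hour price below are illustrative assumptions, not figures from the paper:

```python
# Rough back-of-envelope training-cost sketch using the common ~6 * N * D FLOPs heuristic.
# All concrete numbers here are illustrative assumptions, not figures from the paper.

params = 2.7e9            # assumed model size (parameters)
tokens = 15e12            # FineWeb-scale token count
flops_needed = 6 * params * tokens   # ~6 FLOPs per parameter per training token

h100_peak_flops = 1e15    # ~1 PFLOP/s dense BF16 peak for an H100 (rounded)
mfu = 0.35                # assumed model FLOPs utilization
effective_flops = h100_peak_flops * mfu

gpu_hours = flops_needed / effective_flops / 3600
price_per_gpu_hour = 4.0  # assumed cloud rental price in USD

print(f"GPU-hours: {gpu_hours:,.0f}")
print(f"Estimated rental cost: ${gpu_hours * price_per_gpu_hour:,.0f}")
```

With these particular assumptions the script prints roughly 193,000 GPU-hours and about $770,000; lower utilization or higher rental prices push the total toward the quoted figure of nearly 1 million dollars.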
We are actively seeking support and contributions from the community to help us train our model on larger datasets and further improve its performance. If you are interested in contributing compute resources, we would be immensely grateful for your support. ^_^
