This project was created to help you understand how large language models (LLMs) work. It provides a basic language model and tokenizer, along with scripts for pretraining, instruction fine-tuning (SFT) with LoRA, and more.
```
.
├── README.md
├── assets
│   └── diy.png
├── model
│   ├── __init__.py
│   ├── decoder.py
│   ├── linear.py
│   ├── mlm_head.py
│   ├── transformer.py
│   └── transformer_block.py
├── requirements.txt
├── settings.py
├── sft
│   ├── __init__.py
│   ├── dataset.py
│   └── train.py
├── tokenizer
│   ├── __init__.py
│   └── tokenizer.py
└── train
    ├── __init__.py
    ├── dataset.py
    └── trainer.py
```
```mermaid
flowchart TD
    subgraph TransformerModel
        A[Input Tokens] --> B[Decoder]
        B --> C[LM Head - Linear]
        C --> D[Output Logits]
    end

    subgraph Decoder [Decoder]
        direction TB
        E[Token Embedding]
        F[Position Embedding]
        E --> G[Sum Embeddings]
        F --> G
        G --> H[Stack of Transformer Blocks]
    end

    subgraph TransformerBlock [Transformer Block]
        direction LR
        I[Multi-Head Self-Attention] --> J[Add & Norm]
        J --> K[Feed-Forward - LoRA Enabled]
        K --> L[Add & Norm]
    end

    H --> I
```
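The diagram corresponds roughly to the classes under `model/`. The following is a minimal PyTorch sketch of that data flow, assuming illustrative class names and GPT-2-like hyperparameters (the actual identifiers and sizes in `decoder.py`, `transformer_block.py`, and `transformer.py` may differ); dropout and weight tying are omitted.

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # Multi-Head Self-Attention -> Add & Norm
        attn_out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        x = self.norm1(x + attn_out)
        # Feed-Forward -> Add & Norm
        return self.norm2(x + self.ff(x))


class Decoder(nn.Module):
    def __init__(self, vocab_size: int, max_len: int, d_model: int,
                 n_heads: int, d_ff: int, n_layers: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(
            TransformerBlock(d_model, n_heads, d_ff) for _ in range(n_layers)
        )

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        seq_len = ids.size(1)
        pos = torch.arange(seq_len, device=ids.device)
        # Sum token and position embeddings
        x = self.tok_emb(ids) + self.pos_emb(pos)
        # Causal mask: each position may attend only to earlier positions
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=ids.device), diagonal=1
        )
        for block in self.blocks:
            x = block(x, mask)
        return x


class TransformerModel(nn.Module):
    def __init__(self, vocab_size: int, max_len: int = 1024, d_model: int = 768,
                 n_heads: int = 12, d_ff: int = 3072, n_layers: int = 12):
        super().__init__()
        self.decoder = Decoder(vocab_size, max_len, d_model, n_heads, d_ff, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # Input Tokens -> Decoder -> LM Head -> Output Logits
        return self.lm_head(self.decoder(ids))
```

The causal mask is what makes this a decoder-only model: at every position, attention can only look at earlier tokens, so the LM head predicts the next token from left to right.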
| Item                 | Value  |
|----------------------|--------|
| Number of parameters | 186M   |
| Model size           | 711 MB |
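As a sanity check, 186M float32 parameters take about 186M × 4 bytes ≈ 744 MB, roughly 710 MiB, which lines up with the checkpoint size above. A small helper like the one below (a sketch, not part of the repo) can be used to verify the figures against the actual model:

```python
import torch.nn as nn


def report_size(model: nn.Module) -> None:
    """Print parameter count and approximate float32 checkpoint size."""
    n_params = sum(p.numel() for p in model.parameters())
    print(f"parameters: {n_params / 1e6:.0f}M")
    print(f"approx. fp32 size: {n_params * 4 / 2**20:.0f} MiB")  # 4 bytes per parameter
```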
LLM DIY KIT offers a minimalist implementation of a language model built from scratch. This repository includes:
- A decoder-only Transformer model.
- A custom GPT-2 based tokenizer.
- Pretraining on a large textual corpus (e.g., Simple Wikipedia).
- Instruction fine-tuning using LoRA for efficient adaptation (see the LoRA sketch below).
- Example training scripts and datasets for both pretraining and fine-tuning.
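LoRA (Low-Rank Adaptation) keeps each pretrained weight frozen and learns only a small low-rank update, which is what makes the fine-tuning stage cheap. The sketch below illustrates the idea; it is not the implementation in `model/linear.py`, and the rank, scaling, and parameter names are assumptions.

```python
import math
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # pretrained weight stays frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.empty(r, in_features))
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)
```

Only `lora_a` and `lora_b` receive gradients, so the number of trainable parameters during SFT is a small fraction of the full 186M.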
Install the required packages:
```bash
pip3 install -r requirements.txt
```
Pretrain the baseline Transformer model by running:
```bash
PYTHONPATH=$(pwd) python3 train/trainer.py
```
This script trains the model on the Simple Wikipedia dataset (loaded automatically via Hugging Face Datasets) and saves the pretrained weights to `baseline_transformer.pth`.
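For reference, loading a Simple English Wikipedia corpus with Hugging Face Datasets looks roughly like the snippet below; the exact dataset identifier and preprocessing used in `train/trainer.py` and `train/dataset.py` are assumptions here.

```python
from datasets import load_dataset

# Dataset identifier is an assumption; the trainer may use a different dump or config.
wiki = load_dataset("wikimedia/wikipedia", "20231101.simple", split="train")
print(wiki[0]["text"][:200])  # raw article text to be tokenized for pretraining
```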
Fine-tune the pretrained model using LoRA on an instruction dataset by running:
```bash
PYTHONPATH=$(pwd) python3 sft/train.py
```
This script loads the pretrained weights, applies LoRA (only the additional low-rank parameters are trainable), and saves the fine-tuned model to `lora_sft_transformer.pth`.
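In outline, the SFT setup amounts to the sketch below. The checkpoint names match this README, while the model class and the `lora_` parameter-name convention are assumptions carried over from the sketches above.

```python
import torch

# Load the pretrained backbone; strict=False tolerates the LoRA parameters
# that do not exist in the baseline checkpoint.
model = TransformerModel(vocab_size=50257)  # sketch class from above
state = torch.load("baseline_transformer.pth", map_location="cpu")
model.load_state_dict(state, strict=False)

# Freeze everything except the low-rank LoRA factors.
for name, param in model.named_parameters():
    param.requires_grad = "lora_" in name

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

# ... run the instruction-tuning loop, then persist the result ...
torch.save(model.state_dict(), "lora_sft_transformer.pth")
```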
The project uses a GPT-2-based tokenizer from the Transformers library. The tokenizer is configured to use the `eos_token` as the pad token so that padding is handled properly during training.
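Assuming a stock GPT-2 tokenizer from Transformers (the exact setup in `tokenizer/tokenizer.py` may differ), the configuration amounts to:

```python
from transformers import AutoTokenizer

# GPT-2 ships without a dedicated pad token, so the EOS token is reused for padding.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["Hello world", "A longer example sentence"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)   # padded to the longest sequence in the batch
print(batch["attention_mask"])    # 0 marks the padded (EOS) positions
```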
- Extend fine-tuning guidelines with further architecture updates.
- Increase model parameters and benchmark performance.
- Provide prediction examples.
- Include detailed model architecture descriptions along with video tutorials.
This project is licensed under the MIT License.