DIY Kit | Generative AI | Language Model


This project was created to help you understand how large language models (LLMs) work. It provides a minimal language model and tokenizer, along with scripts for pretraining, LoRA-based instruction fine-tuning (SFT), and more.


Project Structure

.
├── README.md
├── assets
│   └── diy.png
├── model
│   ├── __init__.py
│   ├── decoder.py
│   ├── linear.py
│   ├── mlm_head.py
│   ├── transformer.py
│   └── transformer_block.py
├── requirements.txt
├── settings.py
├── sft
│   ├── __init__.py
│   ├── dataset.py
│   └── train.py
├── tokenizer
│   ├── __init__.py
│   └── tokenizer.py
└── train
    ├── __init__.py
    ├── dataset.py
    └── trainer.py

Model Architecture

flowchart TD
    subgraph TransformerModel
      A[Input Tokens] --> B[Decoder]
      B --> C[LM Head - Linear]
      C --> D[Output Logits]
    end

    subgraph Decoder [Decoder]
      direction TB
      E[Token Embedding]
      F[Position Embedding]
      E --> G[Sum Embeddings]
      F --> G
      G --> H[Stack of Transformer Blocks]
    end

    subgraph TransformerBlock [Transformer Block]
      direction LR
      I[Multi-Head Self-Attention] --> J[Add & Norm]
      J --> K[Feed-Forward - LoRA Enabled]
      K --> L[Add & Norm]
    end

    H --> I
Item                   Value
Number of parameters   186M
Model size             711MB
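
The feed-forward layer in each block is marked "LoRA Enabled" in the diagram, meaning a low-rank update can be trained on top of the base weights. Below is a minimal PyTorch sketch of how such a block could be wired; the class names, dimensions, and use of nn.MultiheadAttention are illustrative assumptions, not the exact contents of model/transformer_block.py and model/linear.py.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with a low-rank (LoRA) update: W x + scale * B A x."""
    def __init__(self, d_in, d_out, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.lora_a = nn.Linear(d_in, rank, bias=False)
        self.lora_b = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapters start as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class TransformerBlock(nn.Module):
    """Multi-head self-attention -> Add & Norm -> LoRA feed-forward -> Add & Norm."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(LoRALinear(d_model, d_ff), nn.GELU(),
                                LoRALinear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = self.norm1(x + attn_out)    # Add & Norm after attention
        x = self.norm2(x + self.ff(x))  # Add & Norm after feed-forward
        return x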

Introduction

LLM DIY KIT offers a minimalist implementation of a language model built from scratch. This repository includes:

  • A decoder-only Transformer model.
  • A custom GPT-2 based tokenizer.
  • Pretraining on a large textual corpus (e.g., Simple Wikipedia).
  • Instruction fine-tuning using LoRA for efficient adaptation.
  • Example training scripts and datasets for both pretraining and fine-tuning.


Getting Started

Step 1: Install Dependencies

Install the required packages:

pip3 install -r requirements.txt

Step 2: Pretrain the Model

Pretrain the baseline Transformer model by running:

PYTHONPATH=$(pwd) python3 train/trainer.py

This script trains the model on the Simple Wikipedia dataset (loaded automatically via Hugging Face Datasets) and saves the pretrained weights to baseline_transformer.pth.
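
For reference, the heart of pretraining is a next-token-prediction loss. The sketch below shows roughly what one training step looks like; the dataset identifier, sequence length, and the model's call signature are assumptions for illustration, since the authoritative code lives in train/dataset.py and train/trainer.py.

import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no dedicated pad token

# Simple Wikipedia articles; the exact dataset id used by the script may differ.
wiki = load_dataset("wikimedia/wikipedia", "20231101.simple", split="train")

def pretrain_step(model, optimizer, texts, max_length=512):
    """One causal-LM step: predict token t+1 from tokens <= t."""
    batch = tokenizer(texts, truncation=True, max_length=max_length,
                      padding=True, return_tensors="pt")
    ids = batch["input_ids"]
    logits = model(ids[:, :-1])  # model call signature assumed
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           ids[:, 1:].reshape(-1),
                           ignore_index=tokenizer.pad_token_id)  # pad shares the EOS id
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# texts = wiki[:8]["text"]                       # a small batch of articles
# loss = pretrain_step(model, optimizer, texts)  # model/optimizer come from the script
# torch.save(model.state_dict(), "baseline_transformer.pth")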

Step 3: Instruction Fine-Tuning (SFT) with LoRA

Fine-tune the pretrained model using LoRA on an instruction dataset by running:

PYTHONPATH=$(pwd) python3 sft/train.py

This script loads the pretrained weights, applies LoRA (with only the additional low-rank parameters being trainable), and saves the fine-tuned model to lora_sft_transformer.pth.
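
Keeping only the low-rank parameters trainable typically amounts to freezing everything except the LoRA matrices. A rough sketch, assuming the parameter-naming convention from the block sketch above (not necessarily the identifiers used in sft/train.py):

import torch

def prepare_for_lora_sft(model, checkpoint="baseline_transformer.pth"):
    """Load pretrained weights and leave only the LoRA matrices trainable."""
    model.load_state_dict(torch.load(checkpoint, map_location="cpu"))
    for name, param in model.named_parameters():
        param.requires_grad = "lora_" in name  # only the low-rank A/B matrices train
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)

# optimizer = prepare_for_lora_sft(model)
# ... run the instruction-tuning loop ...
# torch.save(model.state_dict(), "lora_sft_transformer.pth")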


Tokenizer

The project uses a GPT-2-based tokenizer from the Hugging Face Transformers library. The tokenizer is configured to use the eos_token as the pad token so that padding is handled correctly during training.
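
A minimal example of that configuration with the Transformers API (the repository's tokenizer/tokenizer.py may wrap it differently):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS (id 50256) for padding

encoded = tokenizer("Hello, world!", padding="max_length", max_length=8)
print(encoded["input_ids"])  # short inputs are padded up to length 8 with the EOS id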


Future Work

  • Extend fine-tuning guidelines with further architecture updates.
  • Increase model parameters and benchmark performance.
  • Provide prediction examples.
  • Include detailed model architecture descriptions along with video tutorials.

License

This project is licensed under the MIT License.
