DIY Kit | Generative AI | Language Model


This project was created to help you understand how large language models (LLMs) work. It provides a minimal language model and tokenizer, along with scripts for pretraining, LoRA-based instruction fine-tuning (SFT), and more.


Project Structure

.
├── README.md
├── assets
│   └── diy.png
├── model
│   ├── __init__.py
│   ├── decoder.py
│   ├── linear.py
│   ├── mlm_head.py
│   ├── transformer.py
│   └── transformer_block.py
├── requirements.txt
├── settings.py
├── sft
│   ├── __init__.py
│   ├── dataset.py
│   └── train.py
├── tokenizer
│   ├── __init__.py
│   └── tokenizer.py
└── train
    ├── __init__.py
    ├── dataset.py
    └── trainer.py

Model Architecture

flowchart TD
    subgraph TransformerModel
      A[Input Tokens] --> B[Decoder]
      B --> C[LM Head - Linear]
      C --> D[Output Logits]
    end

    subgraph Decoder [Decoder]
      direction TB
      E[Token Embedding]
      F[Position Embedding]
      E --> G[Sum Embeddings]
      F --> G
      G --> H[Stack of Transformer Blocks]
    end

    subgraph TransformerBlock [Transformer Block]
      direction LR
      I[Multi-Head Self-Attention] --> J[Add & Norm]
      J --> K[Feed-Forward - LoRA Enabled]
      K --> L[Add & Norm]
    end

    H --> I
Item                   Value
Number of parameters   186M
Model size             711MB
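
The feed-forward layer in each block is marked "LoRA Enabled" in the diagram, meaning a low-rank update can be trained on top of the base weights. Below is a minimal PyTorch sketch of how such a block could be wired; the class names, dimensions, and use of nn.MultiheadAttention are illustrative assumptions, not the exact contents of model/transformer_block.py and model/linear.py.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with a low-rank (LoRA) update: W x + scale * B A x."""
    def __init__(self, d_in, d_out, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.lora_a = nn.Linear(d_in, rank, bias=False)
        self.lora_b = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapters start as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class TransformerBlock(nn.Module):
    """Multi-head self-attention -> Add & Norm -> LoRA feed-forward -> Add & Norm."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(LoRALinear(d_model, d_ff), nn.GELU(),
                                LoRALinear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = self.norm1(x + attn_out)    # Add & Norm after attention
        x = self.norm2(x + self.ff(x))  # Add & Norm after feed-forward
        return x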

Introduction

LLM DIY KIT offers a minimalist implementation of a language model built from scratch. This repository includes:

  • A decoder-only Transformer model.
  • A custom GPT-2 based tokenizer.
  • Pretraining on a large textual corpus (e.g., Simple Wikipedia).
  • Instruction fine-tuning using LoRA for efficient adaptation.
  • Example training scripts and datasets for both pretraining and fine-tuning.


Getting Started

Step 1: Install Dependencies

Install the required packages:

pip3 install -r requirements.txt

Step 2: Pretrain the Model

Pretrain the baseline Transformer model by running:

PYTHONPATH=$(pwd) python3 train/trainer.py

This script trains the model on the Simple Wikipedia dataset (loaded automatically via Hugging Face Datasets) and saves the pretrained weights to baseline_transformer.pth.
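
For reference, the heart of pretraining is a next-token-prediction loss. The sketch below shows roughly what one training step looks like; the dataset identifier, sequence length, and the model's call signature are assumptions for illustration, since the authoritative code lives in train/dataset.py and train/trainer.py.

import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no dedicated pad token

# Simple Wikipedia articles; the exact dataset id used by the script may differ.
wiki = load_dataset("wikimedia/wikipedia", "20231101.simple", split="train")

def pretrain_step(model, optimizer, texts, max_length=512):
    """One causal-LM step: predict token t+1 from tokens <= t."""
    batch = tokenizer(texts, truncation=True, max_length=max_length,
                      padding=True, return_tensors="pt")
    ids = batch["input_ids"]
    logits = model(ids[:, :-1])  # model call signature assumed
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           ids[:, 1:].reshape(-1),
                           ignore_index=tokenizer.pad_token_id)  # pad shares the EOS id
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# texts = wiki[:8]["text"]                       # a small batch of articles
# loss = pretrain_step(model, optimizer, texts)  # model/optimizer come from the script
# torch.save(model.state_dict(), "baseline_transformer.pth")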

Step 3: Instruction Fine-Tuning (SFT) with LoRA

Fine-tune the pretrained model using LoRA on an instruction dataset by running:

PYTHONPATH=$(pwd) python3 sft/train.py

This script loads the pretrained weights, applies LoRA (with only the additional low-rank parameters being trainable), and saves the fine-tuned model to lora_sft_transformer.pth.
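
Keeping only the low-rank parameters trainable typically amounts to freezing everything except the LoRA matrices. A rough sketch, assuming the parameter-naming convention from the block sketch above (not necessarily the identifiers used in sft/train.py):

import torch

def prepare_for_lora_sft(model, checkpoint="baseline_transformer.pth"):
    """Load pretrained weights and leave only the LoRA matrices trainable."""
    model.load_state_dict(torch.load(checkpoint, map_location="cpu"))
    for name, param in model.named_parameters():
        param.requires_grad = "lora_" in name  # only the low-rank A/B matrices train
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)

# optimizer = prepare_for_lora_sft(model)
# ... run the instruction-tuning loop ...
# torch.save(model.state_dict(), "lora_sft_transformer.pth")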


Tokenizer

The project uses a GPT-2-based tokenizer from the Hugging Face Transformers library. The tokenizer is configured to use the eos_token as the pad token so that padding is handled correctly during training.
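
A minimal example of that configuration with the Transformers API (the repository's tokenizer/tokenizer.py may wrap it differently):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS (id 50256) for padding

encoded = tokenizer("Hello, world!", padding="max_length", max_length=8)
print(encoded["input_ids"])  # short inputs are padded up to length 8 with the EOS id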


Future Work

  • Extend fine-tuning guidelines with further architecture updates.
  • Increase model parameters and benchmark performance.
  • Provide prediction examples.
  • Include detailed model architecture descriptions along with video tutorials.

License

This project is licensed under the MIT License.
