Building a Language Model from scratch with PyTorch and training it with dialogues from The Office.


ToukoH/office-gpt

Office GPT

This is a Language Model that generates dialogue in the style of the TV show "The Office". The dataset used for training can be found at https://www.kaggle.com/datasets/nasirkhalid24/the-office-us-complete-dialoguetranscript/data; only the first four seasons were used. I got the idea for this project from https://cs224d.stanford.edu/reports/oguz.pdf and Andrej Karpathy's intro to Language Models.

The model can be trained by running:

python3 scripts/train.py

After this, inference can be run with:

python3 scripts/inference.py START_TEXT MAX_TOKENS
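The START_TEXT and MAX_TOKENS arguments suggest a standard autoregressive sampling loop: encode the prompt, feed it through the model, sample one character from the predicted distribution, append it, and repeat. A minimal sketch of that loop is below; the function and argument names (`generate`, `encode`, `decode`, `block_size`) are illustrative, not taken from the repository's actual scripts.

```python
import torch

def generate(model, encode, decode, start_text, max_tokens, block_size=128):
    """Sketch of character-level autoregressive sampling.
    Assumes model(idx) returns logits of shape (B, T, vocab_size)."""
    idx = torch.tensor([encode(start_text)], dtype=torch.long)  # (1, T)
    for _ in range(max_tokens):
        # Crop context to the model's maximum sequence length.
        logits = model(idx[:, -block_size:])            # (1, T, vocab_size)
        # Turn the logits for the last position into a distribution.
        probs = torch.softmax(logits[:, -1, :], dim=-1)  # (1, vocab_size)
        next_id = torch.multinomial(probs, num_samples=1)  # sample one token
        idx = torch.cat([idx, next_id], dim=1)           # append and continue
    return decode(idx[0].tolist())
```

Sampling with `torch.multinomial` (rather than greedy argmax) is what gives the varied, if incoherent, output shown below.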

The output is pure gibberish, but that is what you get with 10,805,078 parameters and lazy hyperparameter tuning.

Example output:

Michael:
Good morning Jim!

Jim:
I am gonna call you down in the back, you can get your party.

Michael:
Yes, it is everyone with your stripper and it co-rabbed, happy. And for smaking sprays in deposition.  Thank you.

Jan:
Michael?

Michael:
They have no idea.

Jan:
No, I didn't give a cats, I think it is in my own caskward.

Michael:
What did I die to do?

Jan:
I like the party, so...

Michael:
It was just fine for me. 

Technical details

  • Multi-Head Attention: Enhances the model's ability to process different parts of the input sequence in parallel.

  • Layer Normalization: Applied both before the multi-head attention mechanism and before the feed-forward network in each transformer block.

  • Residual Connections: Used in each transformer block to facilitate the flow of information and gradients through the network.

  • Embedding Layers: The model utilizes separate embedding layers for tokens and positional encodings.

  • Character-Level Tokenization: Each character is treated as a distinct token, allowing the model to learn and generate text one character at a time.
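The pieces above can be sketched as a compact PyTorch module: pre-LayerNorm transformer blocks with multi-head self-attention, residual connections around each sub-layer, and separate embedding tables for tokens and positions. This is a minimal sketch, not the repository's actual code; the class names and hyperparameter defaults are illustrative, and it uses `nn.MultiheadAttention` where the repo may implement attention by hand.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One transformer block: LayerNorm applied *before* multi-head
    attention and before the feed-forward network (pre-LN), with a
    residual connection around each sub-layer."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ff = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: each position may only attend to earlier positions.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)                                  # pre-LN
        a, _ = self.attn(h, h, h, attn_mask=mask)        # multi-head attention
        x = x + a                                        # residual connection
        x = x + self.ff(self.ln2(x))                     # pre-LN FFN + residual
        return x

class CharLM(nn.Module):
    """Separate embedding layers for tokens and positions, a stack of
    transformer blocks, and a linear head back to character logits."""
    def __init__(self, vocab_size, n_embd=64, n_head=4, n_layer=2, block_size=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)        # token + positional embeddings
        x = self.blocks(x)
        return self.head(self.ln_f(x))                   # (B, T, vocab_size)
```

With character-level tokenization the vocabulary is just the set of distinct characters in the transcripts, so `vocab_size` stays small (roughly a hundred symbols) while sequences get correspondingly long.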
