Commit

Bonus material: extending tokenizers (#496)
* Bonus material: extending tokenizers

* small wording update
rasbt authored Jan 22, 2025
1 parent dce4603 commit a22d612
Showing 7 changed files with 1,224 additions and 2 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -120,6 +120,7 @@ Several folders contain optional materials as a bonus for interested readers:
- [Converting GPT to Llama](ch05/07_gpt_to_llama)
- [Llama 3.2 From Scratch](ch05/07_gpt_to_llama/standalone-llama32.ipynb)
- [Memory-efficient Model Weight Loading](ch05/08_memory_efficient_weight_loading/memory-efficient-state-dict.ipynb)
- [Extending the Tiktoken BPE Tokenizer with New Tokens](ch05/09_extending-tokenizers/extend-tiktoken.ipynb)
- **Chapter 6: Finetuning for classification**
- [Additional experiments finetuning different layers and using larger models](ch06/02_bonus_additional-experiments)
- [Finetuning different models on 50k IMDB movie review dataset](ch06/03_bonus_imdb-classification)
3 changes: 3 additions & 0 deletions ch02/05_bpe-from-scratch/README.md
@@ -0,0 +1,3 @@
# Byte Pair Encoding (BPE) Tokenizer From Scratch

- [bpe-from-scratch.ipynb](bpe-from-scratch.ipynb) contains optional (bonus) code that explains and shows how the BPE tokenizer works under the hood.
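
For readers of this diff who have not opened the notebook, the core idea of BPE training can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are hypothetical, and it skips the regex pre-tokenization and byte-to-character mapping that GPT-2 style tokenizers use. It starts from raw UTF-8 bytes (IDs 0 to 255) and repeatedly replaces the most frequent adjacent pair with a new ID.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token IDs, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    if not pairs:
        return None
    # On ties, max() keeps the pair that was seen first.
    return max(pairs, key=pairs.get)


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    merges = {}
    next_id = 256  # IDs 0-255 are reserved for single bytes
    for _ in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(ids)     # [258, 100, 258, 97, 99]
print(merges)  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
```

Encoding new text then amounts to replaying the learned merges in the order they were created.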
3 changes: 3 additions & 0 deletions ch05/09_extending-tokenizers/README.md
@@ -0,0 +1,3 @@
# Extending the Tiktoken BPE Tokenizer with New Tokens

- [extend-tiktoken.ipynb](extend-tiktoken.ipynb) contains optional (bonus) code to explain how we can add special tokens to a tokenizer implemented via `tiktoken` and how to update the LLM accordingly
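
The two steps named above (adding tokens, then updating the LLM) come down to simple bookkeeping, sketched here without `tiktoken` or PyTorch so it runs anywhere. Everything in it is an illustrative assumption rather than the notebook's method: the helper names, the token strings `<|new_1|>` and `<|new_2|>`, the toy sizes, and the choice to initialize new embedding rows to the mean of the existing ones.

```python
def assign_new_token_ids(base_vocab_size, new_tokens):
    """Give each new special token the next free ID after the base vocabulary."""
    return {tok: base_vocab_size + i for i, tok in enumerate(new_tokens)}


def extend_embedding_rows(rows, num_new):
    """Return a new matrix with `num_new` extra rows appended.

    Each new row is set to the mean of the existing rows (an assumed
    initialization; random initialization is another common choice).
    """
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    return rows + [list(mean) for _ in range(num_new)]


# Toy numbers for illustration only, not a real model's sizes.
base_vocab_size = 4
embedding = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]

new_ids = assign_new_token_ids(base_vocab_size, ["<|new_1|>", "<|new_2|>"])
extended = extend_embedding_rows(embedding, len(new_ids))

print(new_ids)        # {'<|new_1|>': 4, '<|new_2|>': 5}
print(len(extended))  # 6
print(extended[4])    # [3.0, 4.0]
```

In a real model the output layer typically needs the same growth as the input embedding, since it produces one logit per vocabulary entry.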
