Repository for the Deep Learning project, Fall Semester 2024: On the Effect of Whitespace Tokenization on Language Models
Hannes Büchi, Yahya Emara, Hanno Hiss, Felix Möller
This repository contains the code and resources for the project titled "On the Effect of Whitespace Tokenization on Language Models". The project investigates how whitespace tokenization influences the performance, efficiency, and generalization capabilities of language models.
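To make the research question concrete, here is a minimal sketch of what whitespace pre-tokenization changes for a BPE-style tokenizer. It is an illustration only and is not taken from tokenizer.py. The function names pretokenize_whitespace, pretokenize_none, and char_pairs are hypothetical, and the regular expression is a simplified stand-in for the patterns real tokenizers use.

```python
import re


def pretokenize_whitespace(text: str) -> list[str]:
    """Split text at whitespace boundaries (assumed, simplified scheme).

    Each word keeps at most one leading space; longer whitespace runs
    become their own chunk. Subword merges then happen inside a chunk only.
    """
    return re.findall(r" ?\S+|\s+", text)


def pretokenize_none(text: str) -> list[str]:
    """No pre-tokenization: the whole text is a single chunk,
    so whitespace is treated like any other character."""
    return [text] if text else []


def char_pairs(chunks: list[str]) -> set[tuple[str, str]]:
    """Adjacent character pairs that a BPE-style merge could ever combine."""
    pairs: set[tuple[str, str]] = set()
    for chunk in chunks:
        pairs.update(zip(chunk, chunk[1:]))
    return pairs


if __name__ == "__main__":
    text = "new york city"
    with_ws = pretokenize_whitespace(text)
    without_ws = pretokenize_none(text)
    print(with_ws)
    print(without_ws)
    # Pairs that only exist when merges may cross word boundaries.
    print(sorted(char_pairs(without_ws) - char_pairs(with_ws)))
```

In this sketch, splitting at whitespace removes every pair that joins the end of one word to the following space, so no learned token can span two words. Without the split, such multi-word tokens become possible, which is the kind of difference the project studies.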
Repository Structure
tokenizer.py: Tokenizer script and configuration, including the training and evaluation of tokenization schemes.
train_colab.py: Script to train the language models with the different tokenization strategies.
eval.py: Script to evaluate the performance of the trained models on multiple benchmarks.
results: Collected results and analysis scripts for comparing the outcomes of different experiments.