On the Effect of Whitespace Tokenization on Language Models

Repository for the Deep Learning project, Fall Semester 2024: On the Effect of Whitespace Tokenization on Language Models

Hannes Büchi, Yahya Emara, Hanno Hiss, Felix Möller

This repository contains the code and resources for the project titled "On the Effect of Whitespace Tokenization on Language Models". The project investigates how whitespace tokenization influences the performance, efficiency, and generalization capabilities of language models.
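
To make the term concrete, the following toy sketch contrasts two ways a pre-tokenizer might treat whitespace: attaching it to the following word, or emitting it as separate pieces. It is only an illustration under our own assumptions. The function names `split_prefix_space` and `split_separate_space` are hypothetical and do not come from `tokenizer.py`, and the schemes actually studied in this project may differ.

```python
import re


def split_prefix_space(text: str) -> list[str]:
    """Attach each whitespace run to the word that follows it.

    Trailing whitespace with no following word becomes its own piece.
    (Hypothetical helper, for illustration only.)
    """
    return re.findall(r"\s*\S+|\s+$", text)


def split_separate_space(text: str) -> list[str]:
    """Emit whitespace runs and non-whitespace runs as separate pieces.

    (Hypothetical helper, for illustration only.)
    """
    return re.findall(r"\s+|\S+", text)


if __name__ == "__main__":
    sample = "def f(x):\n    return x"
    for splitter in (split_prefix_space, split_separate_space):
        pieces = splitter(sample)
        # Both schemes are lossless: joining the pieces restores the input.
        assert "".join(pieces) == sample
        print(f"{splitter.__name__}: {len(pieces)} pieces -> {pieces}")
```

Both splits preserve the original text exactly, but they produce different numbers of pieces for the same input, which is one reason the choice of whitespace handling can affect sequence length and therefore efficiency.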

Repository Structure

tokenizer.py: Contains the script and configurations for the tokenizer, including the training and evaluation of tokenization schemes.

train_colab.py: Code to train the language models using various tokenization strategies.

eval.py: Script to evaluate the performance of the trained models on multiple benchmarks.

results: Collected results and analysis scripts for comparing the outcomes of different experiments.

FelixMoeller3/deep_learning

Folders and files

Latest commit

History

Repository files navigation

On the Effect of Whitespace Tokenization in Language Models

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages