FelixMoeller3/deep_learning

On the Effect of Whitespace Tokenization on Language Models

Repository for the Deep Learning project, Fall Semester 2024: On the Effect of Whitespace Tokenization on Language Models

Hannes Büchi, Yahya Emara, Hanno Hiss, Felix Möller

This repository contains the code and resources for the project "On the Effect of Whitespace Tokenization on Language Models". The project investigates how whitespace tokenization affects the performance, efficiency, and generalization of language models.
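To make the question concrete, here is a minimal sketch of two ways a pre-tokenizer can treat whitespace. It is an illustration only and is not taken from tokenizer.py; the function names `split_on_whitespace` and `keep_leading_space` are hypothetical, and the schemes actually compared in this project may differ.

```python
import re


def split_on_whitespace(text: str) -> list[str]:
    """Hypothetical scheme A: discard whitespace and keep only the words."""
    return text.split()


def keep_leading_space(text: str) -> list[str]:
    """Hypothetical scheme B: attach each whitespace run to the word after it.

    Trailing whitespace with no following word becomes its own piece,
    so joining the pieces reproduces the input exactly.
    """
    return re.findall(r"\s*\S+|\s+$", text)


sample = "Deep  learning is fun"  # note the double space

# Scheme A is lossy: the double space cannot be recovered.
print(split_on_whitespace(sample))

# Scheme B is lossless: "".join(pieces) == sample.
pieces = keep_leading_space(sample)
print(pieces)
print("".join(pieces) == sample)
```

The trade-off shown here is that dropping whitespace gives cleaner word units but loses information about spacing, while keeping it preserves the original text at the cost of tokens that carry leading spaces.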

Repository Structure

tokenizer.py: Script and configuration for the tokenizer, including training and evaluating tokenization schemes.

train_colab.py: Code to train the language models using various tokenization strategies.

eval.py: Script to evaluate the performance of the trained models on multiple benchmarks.

results: Collected results and analysis scripts for comparing the outcomes of different experiments.
