# Dataset Preparation

## Option 1: Download Pre-Tokenized Datasets (Recommended)

Our Llama-2 tokenized datasets are available for download from Google Cloud Storage buckets:

```
gsutil -m cp -r gs://llama-2-pile/* llama-2-pile/
gsutil -m cp -r gs://llama-2-books3/* llama-2-books3/
```

Once downloaded, set the `dataset_path` flag in `train.py` to the directory containing the `tokenizer_name-meta-llama` folder so the dataloader can locate the data.
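For example, a minimal sketch, assuming the downloaded `llama-2-pile/` directory contains the `tokenizer_name-meta-llama` folder and that `train.py` accepts `dataset_path` on the command line (the exact flag syntax may differ in your setup):

```bash
# Illustrative only: dataset_path points at the directory that contains
# the tokenizer_name-meta-llama folder downloaded above.
python train.py --dataset_path=llama-2-pile/
```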

## Option 2: Tokenize Datasets Yourself

Since the raw Pile and Books3 datasets are no longer publicly available on Hugging Face, we recommend acquiring them by contacting their authors or from the community.

Before tokenization, set `raw_json_path` and `cache_dir` in `tokenization.py` to the path where the raw dataset (in JSON format) is stored and the directory where you want the tokenized dataset to be written, respectively.
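As a minimal sketch (the variable names come from this README; the paths are placeholders to replace with your own):

```python
# In ttt/dataloader/tokenization.py -- placeholder paths, adjust to your machine.
raw_json_path = "/data/pile/raw"     # location of the raw dataset in JSON format
cache_dir = "/data/pile/tokenized"   # where the tokenized dataset will be written
```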

Our tokenization script is based on FlashAttention. Tokenize the raw datasets using the commands below.

Pile:

```
export PYTHONPATH=$PWD:$PYTHONPATH
pytest -q -s ttt/dataloader/tokenization.py -k "pile"
```

This takes around 20 hours on a 64-core CPU. The processed dataset is 716 GB.

Books3:

```
export PYTHONPATH=$PWD:$PYTHONPATH
pytest -q -s ttt/dataloader/tokenization.py -k "books"
```

This takes around 3 hours on a 64-core CPU. The processed dataset is 61 GB.