Our Llama-2 tokenized datasets are available for download from Google Cloud Storage buckets:
gsutil -m cp -r gs://llama-2-pile/* llama-2-pile/
gsutil -m cp -r gs://llama-2-books3/* llama-2-books3/
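If the local target directories do not already exist, create them first (the names below simply mirror the bucket names above):
mkdir -p llama-2-pile llama-2-books3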
Once downloaded, set the dataset_path flag in train.py to the directory containing the tokenizer_name-meta-llama folder. This will allow the dataloader to find the correct path.
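For example, assuming the Pile download above lives at /data/llama-2-pile (a placeholder path) and that dataset_path is passed on the command line like the other train.py flags, the invocation would look roughly like:
python train.py --dataset_path /data/llama-2-pile
See train.py for the full set of required flags.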
Since the raw Pile and Books3 datasets are no longer publicly available on Hugging Face, we recommend obtaining them by contacting their authors or from the community.
Before tokenization, set raw_json_path and cache_dir in tokenization.py to the path where the raw dataset (in JSON format) is stored and the directory where you want the tokenized dataset to be written, respectively.
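Before launching the runs below, it can help to sanity-check both locations (the paths here are placeholders for whatever you assign to raw_json_path and cache_dir):
ls /data/pile/raw/                # should contain the raw JSON file(s), i.e. raw_json_path
mkdir -p /data/pile/tokenized     # cache_dir; harmless if the script also creates it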
Our tokenization script is based on FlashAttention's training code. Tokenize the raw datasets using the commands below.
Pile:
export PYTHONPATH=$PWD:$PYTHONPATH
pytest -q -s ttt/dataloader/tokenization.py -k "pile"
This takes around 20 hours on a 64-core CPU. The processed dataset is 716 GB.
Books3:
export PYTHONPATH=$PWD:$PYTHONPATH
pytest -q -s ttt/dataloader/tokenization.py -k "books"
This takes around 3 hours on a 64-core CPU. The processed dataset is 61 GB.
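As a rough sanity check after either run, the size of the tokenized output under cache_dir should be close to the numbers above (placeholder path):
du -sh /data/pile/tokenized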