- Clone this repository
- Have Python installed
- Install the requirements from `requirements.txt`
  - This can be done with anaconda (`conda install tqdm`) or pip (e.g. `pip install tqdm`)
  - (I used a conda environment `gpt-2` that was a clone of the basic python env: `conda create --name gpt-2 --clone base`)
- Install tensorflow (for CPU: `pip install tensorflow`, for GPU: `pip install tensorflow-gpu`)
  - If you want to use the GPU you also need to install CUDA (TensorFlow 1.14 did not find files from CUDA 10.1 because it relies on CUDA 10.0; this link helped with the missing file) and cuDNN 7.6.1 (https://developer.nvidia.com/cuda-downloads)
  - On Windows you also need Visual Studio 2017 (I installed 2019 first, which may also work), but the CUDA installation will tell you that (only the "C++ for Windows Development" component is needed in the install menu)
  - I had the issue that the CUDA/cuDNN files were not found when running; add these directories to your PATH variable if the cuDNN setup did not do so already:
    - C:\tools\cuda\bin
    - C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin
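To check that the GPU setup actually works, a quick sanity check like the following can help (a minimal sketch, assuming TensorFlow 1.14 is installed):

```bash
# show the CUDA toolkit version that is on the PATH
nvcc --version
# ask TensorFlow 1.x whether it can see a GPU
python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"
```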
B) With Docker (more detail in Docker Chapter)
- Install Docker
- Clone this repository
- Build the docker image: `docker build --tag=transformers .`
- Run an interactive and detached container: `docker run -it -d transformers`
- To get the running containers: `docker ps` (`docker ps -a` also shows stopped containers)
- To copy files to the running container: `docker cp wordvectors/ <container-name>:/gpt-2`
- To copy files from the running container to the host: `docker cp <container-name>:/gpt-2 .`
- To enter the running container: `docker exec -it <container-name> /bin/bash`
- Go into the gpt-2 directory
- (Not needed anymore) For Windows: `set PYTHONPATH=src`, for Linux: `export PYTHONPATH=src`
- If using an environment (with anaconda): activate the conda environment (e.g. `activate gpt-2`)
- Download the model (only needs to be done once): `python download_model.py 117M`
- Encode your dataset as Byte-Pair Encoding (only needs to be done once per dataset):
  `python encode.py --model_name 117M data\<yourData>.txt data\<yourData>.npz`
  - Then only use the npz file for training
  - Note: when using another model this needs to be done again
  - Note: it is possible to give the encoder a single file or a whole directory; it will then go through every file in the directory (see the example below). (The parameters for training a model are well described in train.py)
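For example, to encode every text file in a directory into one dataset (a sketch, assuming `encode.py` accepts a directory the same way as a single file, as the note above states; the paths are placeholders):

```bash
python encode.py --model_name 117M data/ data/allTexts.npz
```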
- Train the 117M model:
  `python train.py --model_name 117M --run_name <yourModelName> --dataset data\<yourData>.npz --batch_size 1 --top_p 0.9 --save_every 2000 --sample_every 1000`
- Train the 345M model:
  `python train.py --model_name 345M --run_name <yourModelName> --dataset data\<yourData>.npz --batch_size 1 --top_p 0.9 --save_every 2000 --sample_every 1000`
- If training the 345M model does not work due to OOM issues, it is possible to use SGD instead of Adam:
  `python train.py --model_name 345M --run_name <yourModelName> --dataset data\<yourData>.npz --optimizer sgd --learning_rate 0.001 --batch_size 1 --top_p 0.9 --save_every 2000 --sample_every 1000`
  - TODO: try with memory_saving_gradients
- To resume from the latest checkpoint (there will be a `checkpoint` folder) just run the training command again
- To resume from a specific checkpoint:
  `python train.py --restore_from path/to/checkpoint --model_name 117M --dataset data\<yourData>.npz --batch_size 1 --top_p 0.9 --save_every 2500 --sample_every 1000`
- To start fresh either delete the folder or run:
  `python train.py --restore_from 'fresh' --model_name 117M --dataset data\<yourData>.npz --batch_size 1 --top_p 0.9 --save_every 2500 --sample_every 1000`
- Create a folder in `models` for your trained model (e.g. `trained`)
- Go to the checkpoints of your model and copy `checkpoint`, `model-xxx.data-00000-of-00001`, `model-xxx.index` and `model-xxx.meta` into the new `trained` folder
  - Note: the model specified in the top line of the `checkpoint` file is the one that is used
- Go to `models/117M` (or `345M` if trained with it) and copy `encoder.json`, `hparams.json` and `vocab.bpe` to your `trained` folder (a sketch of these copy steps follows below)
- Go to gpt-2 again
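A minimal sketch of these copy steps, assuming the checkpoints were written to `checkpoint/<yourModelName>` and using Linux `cp` (on Windows use `copy` with backslashes):

```bash
mkdir -p models/trained
# copy the trained weights (replace model-xxx with the checkpoint step you want to keep)
cp checkpoint/<yourModelName>/checkpoint models/trained/
cp checkpoint/<yourModelName>/model-xxx.data-00000-of-00001 models/trained/
cp checkpoint/<yourModelName>/model-xxx.index models/trained/
cp checkpoint/<yourModelName>/model-xxx.meta models/trained/
# reuse the encoder and hyperparameters from the base model
cp models/117M/encoder.json models/117M/hparams.json models/117M/vocab.bpe models/trained/
```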
- Generate unconditioned samples:
  `python src/generate_unconditional_samples.py --top_p 0.9 --model_name <yourModelName> --nsamples 3`
- Generate conditioned samples:
  `python src\interactive_conditional_samples.py --top_p 0.9 --model_name <yourModelName>`
- Generate conditioned samples using a text file:
  `python src\conditional_samples_with_input.py --top_p 0.9 --model_name 117M --nsamples 3 --length 80 < input.txt`
  - NOTE: `--length 100` limits the output of the samples to 100 tokens (not characters or words)
  - NOTE 2: input.txt contains the text that is used to condition the model
  - NOTE 3: the checkpoint specified as `model_checkpoint_path` in the checkpoint file is the one that is used
Important: don't run your python files directly on the cluster; instead use a job script in which you specify the python script to run (explained below)

- Get access to the cluster (Link to Sonic)
- Use PuTTY or ssh to connect to the cluster (an explanation of how to do this is given in the link above)
  - Check what modules are available: `module avail` (these can be loaded in the script with `module load <module-name>`)
- Use conda to create a virtual environment: `conda create --name gpt-2 --clone base`, then activate it: `conda activate gpt-2`
  - If not already done: `conda config --add channels conda-forge`
  - Install the requirements: `conda install --file requirements.txt` or `pip install -r requirements.txt`
  - Install tensorflow: `conda install tensorflow-gpu`
- Create a data folder (`mkdir data`) and put your dataset in it
  - If it is not yet Byte-Pair-Encoded, do that by adapting the last line of the script `encodeDataset.sh` to your configuration (see Training)
- Create a .sh script to submit a job to the cluster with specifications about the run (an example script is given in `finetuneGPT345M.sh`; a minimal sketch follows below)
  - Submit a job to the cluster: `sbatch myjob.sh` (gives back a jobid)
  - To use a GPU: `sbatch --partition=csgpu myjob.sh`
    - Also make sure that you specify `#SBATCH --gres=gpu:1`, otherwise your job will end up in the queue but not start
  - Check running jobs: `squeue`
  - Cancel a running job: `scancel <jobid>`
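A minimal sketch of such a job script, assuming a SLURM setup as described above (the job name, time limit, module name, run name and dataset path are placeholders or assumptions; see `finetuneGPT345M.sh` in the repository for the actual example):

```bash
#!/bin/bash
#SBATCH --job-name=gpt2-finetune   # job name shown in squeue (assumption)
#SBATCH --partition=csgpu          # GPU partition mentioned above
#SBATCH --gres=gpu:1               # request one GPU, otherwise the job queues but never starts
#SBATCH --time=24:00:00            # assumed time limit

# load the needed modules and environment (module name is an assumption; check `module avail`)
module load cuda/10.0
source activate gpt-2

cd "$SLURM_SUBMIT_DIR"
python train.py --model_name 345M --run_name myRun --dataset data/myData.npz --batch_size 1 --top_p 0.9 --save_every 2000 --sample_every 1000
```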
- `docker build --tag=transformers .` creates an image with the specified packages from the Dockerfile (the requirements) --> only needs to be done once
- `docker image ls` (lists all installed images)
- `docker ps -a` (shows all containers)
- `docker rm <container-name>` (the container name is at the end of the `docker ps` output)
- `docker run <image-name>`
  - `-d` = detached/background
  - `-it` = interactive shell
  - `--rm` = removes the container after exit
  - `-m 32g` = allows the container to use 32 GB of RAM (doesn't seem to work with the current Docker version under Windows)
  - `--ipc=host` (needed to make multiprocessing possible)
- `docker exec -it <container-name> /bin/bash` (enter a running container)
- `docker cp . <container-name>:/gpt2` (copies files from the host to the container)
- `docker stop <container-name>` (stops the container)
- `docker container prune` (removes all stopped containers)
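Putting those flags together, a typical run could look like this (the container name and flag combination are just an illustration):

```bash
# build the image once, then start an interactive, detached container with 32 GB RAM and host IPC
docker build --tag=transformers .
docker run -it -d --rm -m 32g --ipc=host --name gpt2 transformers
```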
Example workflow (e.g. for creating the wordvectors with `probabilities.py`):

- `docker run -it -d transformers`
- Get the container name from the `docker ps` command
- Copy files from the host to the container: `docker cp . <container-name>:/gpt2`
- Enter the running container again: `docker exec -it <container-name> /bin/bash`
- Run your script: `python probabilities.py`
- Get the created wordvectors from the container: `docker cp <container-name>:/gpt2/wordvectors/. wordvectors\`