This is the code we used in our paper
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
Zhilin Yang*, Zihang Dai*, Ruslan Salakhutdinov, William W. Cohen (*: equal contribution)
Preprint 2017
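For orientation, here is a minimal pure-Python sketch of the mixture-of-softmaxes (MoS) output layer that the paper proposes: the next-token distribution is a weighted sum of several softmaxes, and both the mixture weights and the per-expert contexts are computed from the same hidden state. The function and argument names are illustrative, biases are omitted, and the correspondence between these dimensions and the `--n_experts`, `--nhidlast`, and `--emsize` flags used in the commands further down is an assumption. This is a sketch, not the repository's implementation.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(mat, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]


def mos_probs(h, prior_w, latent_ws, emb):
    """Mixture-of-softmaxes distribution over the vocabulary.

    h:         hidden state, length `hidden`
    prior_w:   n_experts x hidden matrix producing the mixture logits
    latent_ws: n_experts matrices, each emsize x hidden
    emb:       vocab x emsize output embedding matrix
    """
    prior = softmax(matvec(prior_w, h))  # mixture weights, sum to 1
    mixed = [0.0] * len(emb)
    for pi_k, w_k in zip(prior, latent_ws):
        # Expert-specific context vector, squashed with tanh.
        h_k = [math.tanh(x) for x in matvec(w_k, h)]
        # Each expert contributes its own softmax over the vocabulary.
        for i, p in enumerate(softmax(matvec(emb, h_k))):
            mixed[i] += pi_k * p
    return mixed


def random_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix of uniform random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_experts, hidden, emsize, vocab = 3, 4, 5, 7  # toy sizes for illustration
h = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
probs = mos_probs(
    h,
    random_matrix(n_experts, hidden, rng),
    [random_matrix(emsize, hidden, rng) for _ in range(n_experts)],
    random_matrix(vocab, emsize, rng),
)
print(len(probs), round(sum(probs), 6))
```

Because the mixing happens in probability space rather than logit space, the resulting log-probability matrix is not restricted to the rank of a single softmax, which is the point of the paper's title.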
Python 3.6, PyTorch 0.4.1
The original implementation and tuning were based on PyTorch 0.2.0. The code base has since been upgraded to be compatible with 0.4.1. To reproduce the results in our paper exactly, use PyTorch 0.2.0 and run
git checkout 4c43dee3f8a0aacea759c07f10d8f80dc0bb9bb2
to roll back to the previous version.
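Before rerunning the paper's experiments, it may help to confirm which PyTorch release is installed. The sketch below is a small, hypothetical helper (the names `parse_version` and `matches_paper_setup` are not part of this repo) that compares a version string against 0.2.0. It assumes the installed version is exposed as `torch.__version__`.

```python
import re


def parse_version(version):
    """Turn a string such as '0.4.1' or '0.2.0_4' into a tuple of ints."""
    parts = []
    # Drop any local build suffix such as '+cu90' before splitting.
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at non-numeric components such as 'post2'
        parts.append(int(match.group()))
    return tuple(parts)


def matches_paper_setup(version, required=(0, 2, 0)):
    """True if `version` starts with the release used for the paper."""
    return parse_version(version)[: len(required)] == required


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        status = "matches" if matches_paper_setup(torch.__version__) else "differs from"
        print("PyTorch", torch.__version__, status, "the paper setup (0.2.0)")
```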
Below are the results of the current version on Penn Treebank, as reported in #9. Further tuning may be needed to match the original results.
MoS w/o finetune: Valid 58.34 Test 56.18
MoS: Valid 56.83 Test 54.64
MoS + dynamic evaluation: Valid 49.03 Test 48.43
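Perplexity is the exponential of the average per-token cross-entropy, so the numbers above can be converted to loss values when comparing against training logs. The sketch below assumes the logs report loss in nats; the helper names are illustrative.

```python
import math

# (validation perplexity, test perplexity) as listed above
results = {
    "MoS w/o finetune": (58.34, 56.18),
    "MoS": (56.83, 54.64),
    "MoS + dynamic evaluation": (49.03, 48.43),
}


def ppl_to_nats(ppl):
    """Average per-token cross-entropy in nats for a given perplexity."""
    return math.log(ppl)


def ppl_to_bits(ppl):
    """Average per-token cross-entropy in bits for a given perplexity."""
    return math.log2(ppl)


for name, (valid, test) in results.items():
    print(
        "{}: valid {:.3f} nats, test {:.3f} nats ({:.3f} bits)".format(
            name, ppl_to_nats(valid), ppl_to_nats(test), ppl_to_bits(test)
        )
    )
```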
First, train the model on Penn Treebank
python --data data/penn --dropouti 0.4 --dropoutl 0.29 --dropouth 0.225 --seed 28 --batch_size 12 --lr 20.0 --epoch 1000 --nhid 960 --nhidlast 620 --emsize 280 --n_experts 15 --save PTB --single_gpu
Second, finetune the model
python --data data/penn --dropouti 0.4 --dropoutl 0.29 --dropouth 0.225 --seed 28 --batch_size 12 --lr 25.0 --epoch 1000 --nhid 960 --emsize 280 --n_experts 15 --save PATH_TO_FOLDER --single_gpu
Here, PATH_TO_FOLDER is the folder created by the first step (the concatenation of PTB with a timestamp).
Third, run dynamic evaluation
python --model PATH_TO_FOLDER/ --lamb 0.075
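Since PATH_TO_FOLDER in the commands above is the `--save` prefix combined with a timestamp, it can be convenient to locate the newest run folder programmatically. This sketch assumes run folders live in the current directory and are named as the prefix followed by a hyphen and the timestamp. The exact naming pattern and the helper name `latest_run_dir` are assumptions, so adjust the glob pattern to whatever the training step actually creates.

```python
import glob
import os


def latest_run_dir(prefix, root="."):
    """Return the most recently modified '<prefix>-*' directory, or None."""
    pattern = os.path.join(glob.escape(root), glob.escape(prefix) + "-*")
    candidates = [p for p in glob.glob(pattern) if os.path.isdir(p)]
    if not candidates:
        return None
    # The newest modification time is taken as the latest run.
    return max(candidates, key=os.path.getmtime)


if __name__ == "__main__":
    print(latest_run_dir("PTB"))
```

The returned path can then be passed wherever the commands expect PATH_TO_FOLDER.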
First, train the model on WikiText-2 (single GPU)
python --epochs 1000 --data data/wikitext-2 --save WT2 --dropouth 0.2 --seed 1882 --n_experts 15 --nhid 1150 --nhidlast 650 --emsize 300 --batch_size 15 --lr 15.0 --dropoutl 0.29 --small_batch_size 5 --max_seq_len_delta 20 --dropouti 0.55 --single_gpu
Second, finetune the model
python --epochs 1000 --data data/wikitext-2 --save PATH_TO_FOLDER --dropouth 0.2 --seed 1882 --n_experts 15 --nhid 1150 --emsize 300 --batch_size 15 --lr 20.0 --dropoutl 0.29 --small_batch_size 5 --max_seq_len_delta 20 --dropouti 0.55 --single_gpu
Third, run dynamic evaluation
python --data data/wikitext-2 --model PATH_TO_FOLDER/ --epsilon 0.002
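Dynamic evaluation, the third step in the recipes above, adapts the model's parameters to the evaluation stream while scoring it: each segment is scored first and only then used for an update. The toy sketch below illustrates the idea with a one-parameter model, a plain gradient step, and a decay term that pulls the parameter back toward its trained value. The names `lr` and `lamb` are illustrative, and mapping them onto the script's `--lamb` or `--epsilon` flags is an assumption, not something this README specifies; the actual script may use a different update rule.

```python
def dynamic_eval(stream, init_param, lr, lamb):
    """Score a stream with a one-parameter model, adapting after each item.

    The toy model predicts every value with the single number `param` and is
    scored with squared error; a language model would use log-loss instead.
    """
    param = init_param
    total_loss = 0.0
    for target in stream:
        # Score first, so each item is evaluated before the model has seen it.
        total_loss += (param - target) ** 2
        grad = 2.0 * (param - target)
        # Gradient step plus a decay term pulling back toward the trained value.
        param = param - lr * grad + lamb * (init_param - param)
    return total_loss


stream = [1.0] * 20  # test data whose statistics differ from the trained value 0.0
static_loss = dynamic_eval(stream, init_param=0.0, lr=0.0, lamb=0.0)
adapted_loss = dynamic_eval(stream, init_param=0.0, lr=0.1, lamb=0.01)
print(static_loss, adapted_loss)
```

With `lr=0.0` and `lamb=0.0` the function reduces to ordinary static evaluation, which makes the two losses directly comparable.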
Training on WikiText-2 with three GPUs yields the same results as using a single GPU, but runs faster.
First, train the model
CUDA_VISIBLE_DEVICES=0,1,2 python --epochs 1000 --data data/wikitext-2 --save WT2 --dropouth 0.2 --seed 1882 --n_experts 15 --nhid 1150 --nhidlast 650 --emsize 300 --batch_size 15 --lr 15.0 --dropoutl 0.29 --small_batch_size 15 --max_seq_len_delta 20 --dropouti 0.55
Second, finetune the model
CUDA_VISIBLE_DEVICES=0,1,2 python --epochs 1000 --data data/wikitext-2 --save PATH_TO_FOLDER --dropouth 0.2 --seed 1882 --n_experts 15 --nhid 1150 --emsize 300 --batch_size 15 --lr 20.0 --dropoutl 0.29 --small_batch_size 15 --max_seq_len_delta 20 --dropouti 0.55
Third, run dynamic evaluation
python --data data/wikitext-2 --model PATH_TO_FOLDER/ --epsilon 0.002
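The single-GPU and three-GPU WikiText-2 training commands differ in `--small_batch_size` (5 versus 15) and in the visible devices. One plausible reading, which is an assumption here rather than something this README states, is that each batch of 15 sequences is processed in chunks of `--small_batch_size` with gradients accumulated, and that each chunk is divided evenly across the visible GPUs. Under that reading every GPU handles 5 sequences at a time in both setups. The helper name below is hypothetical.

```python
def per_gpu_schedule(batch_size, small_batch_size, n_gpus):
    """Return, for each sequential chunk, the number of sequences per GPU."""
    schedule = []
    start = 0
    while start < batch_size:
        end = min(start + small_batch_size, batch_size)
        chunk = end - start
        # Split the chunk as evenly as possible across the GPUs.
        base, extra = divmod(chunk, n_gpus)
        schedule.append([base + (1 if g < extra else 0) for g in range(n_gpus)])
        start = end
    return schedule


single = per_gpu_schedule(batch_size=15, small_batch_size=5, n_gpus=1)
multi = per_gpu_schedule(batch_size=15, small_batch_size=15, n_gpus=3)
print(single)  # sequential chunks on one GPU
print(multi)   # one chunk spread over three GPUs
```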
A large portion of this repo is borrowed from the following repos: and