This repository contains code for analysing phonotactics. It accompanies the paper: Phonotactic Complexity and Its Trade-offs (Pimentel et al., TACL 2020).
The paper studies the phonotactics of languages and how a language's phonotactic complexity trades off against other features of the language, such as word length.
Create a conda environment with:
$ source config/conda.sh
Then install the version of PyTorch appropriate for your system.
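For example, a CPU-only setup with the PyTorch version this project was tested with (see the dependency list at the end of this README) might look like the following; the pip-based install is an assumption, so follow the official PyTorch instructions for your CUDA version if needed:
$ source config/conda.sh
$ pip install torch==1.0.1.post2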
First, download the NorthEuraLex data from the NorthEuraLex website and put it in the datasets/northeuralex folder.
Then, parse it using the following command:
$ python data_layer/parse.py --data northeuralex
You can train the base models (without shared embeddings) with the commands:
$ python learn_layer/train_base.py --model <model> [--opt]
$ python learn_layer/train_base_bayes.py --model <model>
$ python learn_layer/train_base_cv.py --model <model> [--opt]
The commands differ as follows:
- train_base: trains a model with the default data split;
- train_base_bayes: trains a model using Bayesian optimization with the default data split;
- train_base_cv: trains cross-validated models.
Model can be:
- lstm: LSTM with default one-hot embeddings
- phoible: LSTM with PHOIBLE embeddings
- phoible-lookup: LSTM with both one-hot and PHOIBLE embeddings
The optional --opt flag tells the script to use Bayes-optimized hyper-parameters; it can only be used after training the model with train_base_bayes.
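For example, to train the base LSTM with Bayes-optimized hyper-parameters, run the Bayesian optimization first and then retrain with --opt:
$ python learn_layer/train_base_bayes.py --model lstm
$ python learn_layer/train_base.py --model lstm --opt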
You can train models with shared embeddings using the commands:
$ python learn_layer/train_shared.py --model <model> [--opt]
$ python learn_layer/train_shared_bayes.py --model <model>
$ python learn_layer/train_shared_cv.py --model <model> [--opt]
Model can be:
- shared-lstm: LSTM with shared one-hot embeddings
- shared-phoible: LSTM with shared PHOIBLE embeddings
- shared-phoible-lookup: LSTM with both one-hot and PHOIBLE shared embeddings
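Assuming --opt follows the same rule as for the base models (run the Bayesian optimization first), cross-validating the shared one-hot LSTM with optimized hyper-parameters would look like:
$ python learn_layer/train_shared_bayes.py --model shared-lstm
$ python learn_layer/train_shared_cv.py --model shared-lstm --opt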
You can train n-gram models with the following commands:
$ python learn_layer/train_ngram.py --model ngram
$ python learn_layer/train_unigram.py --model unigram
$ python learn_layer/train_ngram_cv.py --model ngram
$ python learn_layer/train_unigram_cv.py --model unigram
Model can be:
- ngram: n-gram model (a trigram by default)
- unigram: unigram model
You can train models on artificial data using the commands:
$ python learn_layer/train_artificial.py --artificial-type <artificial-type>
$ python learn_layer/train_artificial_bayes.py --artificial-type <artificial-type>
$ python learn_layer/train_artificial_cv.py --artificial-type <artificial-type>
$ python learn_layer/train_artificial_ngram.py --model ngram --artificial-type <artificial-type>
Artificial type can be:
- harmony: Artificial data with vowel harmony removed;
- devoicing: Artificial data with final obstruent devoicing removed.
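For example, to train an LSTM and the matching n-gram baseline on the data with vowel harmony removed:
$ python learn_layer/train_artificial.py --artificial-type harmony
$ python learn_layer/train_artificial_ngram.py --model ngram --artificial-type harmony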
You can also call a script to train all models sequentially (it might take a while):
$ source learn_layer/train_multi.sh
Compile the results data and language inventories:
$ python analysis_layer/compile_results.py
$ python analysis_layer/get_lang_inventory.py
Plot all results with the following commands:
$ mkdir plot
$ python visualization_layer/plot_lstm.py
$ python visualization_layer/plot_full.py
$ python visualization_layer/plot_inventory.py
$ python visualization_layer/plot_kde.py
$ python visualization_layer/plot_artificial_scatter.py
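Putting it all together, a minimal end-to-end run (parse the data, train one base LSTM, compile the results, and plot) would be the sequence below; note that the fuller plots (e.g. plot_full.py) may require results from all models, so expect gaps unless you run the complete training:
$ python data_layer/parse.py --data northeuralex
$ python learn_layer/train_base.py --model lstm
$ python analysis_layer/compile_results.py
$ mkdir plot
$ python visualization_layer/plot_lstm.py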
If this code or the paper was useful to you, consider citing it:
@article{pimentel-etal-2020-phonotactics,
  title={Phonotactic Complexity and its Trade-offs},
  author={Pimentel, Tiago and Roark, Brian and Cotterell, Ryan},
  journal={Transactions of the Association for Computational Linguistics},
  volume={8},
  pages={1--18},
  year={2020},
  publisher={MIT Press},
  doi={10.1162/tacl\_a\_00296},
  url={https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00296}
}
This project was tested with the following library versions:
numpy==1.15.4
pandas==0.24.1
scikit-learn==0.20.2
tqdm==4.31.1
matplotlib==2.0.2
seaborn==0.9.0
torch==1.0.1.post2
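Assuming a pip-based environment, the pinned versions above can be installed in one command (newer versions may also work, but are untested):
$ pip install numpy==1.15.4 pandas==0.24.1 scikit-learn==0.20.2 tqdm==4.31.1 matplotlib==2.0.2 seaborn==0.9.0 torch==1.0.1.post2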
To ask questions or report problems, please open an issue.