Added phonetisaurus-based g2p scripts #2730
@@ -0,0 +1,62 @@
#!/bin/bash
# Copyright 2014 Johns Hopkins University (Author: Yenda Trmal)
# Copyright 2016 Xiaohui Zhang
#           2018 Ruizhe Huang
# Apache 2.0

# This script applies a trained Phonetisaurus G2P model to
# synthesize pronunciations for missing words (i.e., words in
# transcripts but not in the lexicon), and outputs the expanded lexicon.
# Begin configuration section.
stage=0
nbest=   # Generate up to $nbest variants
pmass=   # Generate as many variants as needed to produce $pmass amount (e.g. 90%) of the prob mass
# End configuration section.

Review comment: @huangruizhe Can you add "thresh" as an option here? Please refer to /export/b19/xzhang/tedlium/s5_r2/steps/dict/apply_g2p.sh (Sorry, I just realized today that I already wrote a script like the current one two years ago.) Also, please explain a bit more about the nbest and pmass options, also by referring to the above script. Thanks!
Reply: Fixed.
echo "$0 $@"  # Print the command line for logging

[ -f ./path.sh ] && . ./path.sh; # source the path.
. utils/parse_options.sh || exit 1;

set -u
set -e
if [ $# != 3 ]; then
  echo "Usage: $0 [options] <g2p-model> <word-list> <lexicon-out>"
  echo "... where <g2p-model> is the trained g2p model."
  echo "          <word-list> is a list of words whose pronunciations are to be generated."
  echo "          <lexicon-out> is the output lexicon, whose format is <word>\t<prob>\t<pronunciation> per line."
  echo "e.g.: $0 exp/g2p/model.fst exp/g2p/oov_words.txt data/local/dict_nosp/lexicon.txt"
  echo ""
  echo "main options (for others, see top of script file)"
  echo "  --nbest <int>   # Maximum number of hypotheses to produce. By default, nbest=1."

Review comment: the default value is 20 I think

  echo "  --pmass <float> # Select the maximum number of hypotheses summing to a total mass of pmass amount, within [0, 1], for a word."
  echo "  --nbest <int> --pmass <float> # When specified together, we generate the intersection of these two options."

Review comment: remove this line

  exit 1;
fi
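For reference, each line of the output lexicon follows the <word>\t<prob>\t<pronunciation> format described above. A hypothetical fragment (words, probabilities, and phones invented purely for illustration, with <TAB> marking the tab separators) could look like:

    hello<TAB>0.95<TAB>hh ah l ow
    hello<TAB>0.72<TAB>hh eh l ow
    tensor<TAB>1.00<TAB>t eh n s er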
model=$1
word_list=$2
out_lexicon=$3

Review comment: check whether phonetisaurus is installed here. Please refer to /export/b19/xzhang/tedlium/s5_r2/steps/dict/apply_g2p.sh also.
Reply: Fixed.
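A minimal sketch of the kind of installation check the reviewer asks for, assuming only that phonetisaurus-apply should be reachable on the PATH (the error wording is illustrative, not taken from the final script):

    # fail early if Phonetisaurus is not available
    if ! command -v phonetisaurus-apply >/dev/null 2>&1; then
      echo "$0: phonetisaurus-apply was not found in your PATH."
      echo "    Please install Phonetisaurus and/or make sure path.sh points to it."
      exit 1
    fi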
[ ! -z $nbest ] && [[ ! $nbest =~ ^[0-9]+$ ]] && echo "$0: nbest should be a positive integer." && exit 1
[ ! -z $pmass ] && ! { [[ $pmass =~ ^[0-9]+\.?[0-9]*$ ]] && [ $(bc <<< "$pmass >= 0") -eq 1 -a $(bc <<< "$pmass <= 1") -eq 1 ]; } \
  && echo "$0: pmass should be within [0, 1]." && exit 1

Review comment: doesn't have to check pmass here, since phonetisaurus checks it inside.

[ -z $pmass ] && [ -z $nbest ] && nbest=1

Review comment: don't allow this case. If the user specified nothing, just throw an error.
if [ -z $pmass ]; then

Review comment: this part of the code would get simpler if you set nbest=20 at the beginning of the script. I don't claim I know the right answers, just thinking aloud.
Review comment: actually there doesn't seem to be a default. I agree 20 is too much -- normally 3 would be a reasonable limit.
Review comment: Actually the reason we used 20 by default is that, when the user only sets pmass, in order to get correct pron-probs we need to specify an "nbest" value large enough, since phonetisaurus only computes pron-probs on the nbest list. When the user wants to rely on nbest, we always leave the responsibility for setting the proper value to the user, and that's why we didn't set it at the beginning. In summary, we want to allow all three ways of specifying constraints (pmass, nbest, or both), and let the user determine the proper values needed.
Reply: Thanks! I will look into the phonetisaurus code again and make sure why we chose to set these default values. Will also consider how to make the code simpler.
Reply: The code has been made simpler, but we keep the "nbest=20, pmass=1.0" values. The justification is as follows (an elaboration of what Xiaohui said above): users have three options here: only nbest, only pmass, or both. What we meant by "default" was a bit misleading; these are actually implicitly-set values, due to implementation reasons.
echo "Synthesizing pronunciations for words in $word_list based on nbest=$nbest" | ||
options="--nbest $nbest --pmass 1.0" | ||
elif [ -z $nbest ]; then | ||
echo "Synthesizing pronunciations for words in $word_list based on pmass=$pmass" | ||
options="--pmass $pmass --nbest 20" | ||
else | ||
echo "Synthesizing pronunciations for words in $word_list based on nbest=$nbest and pmass=$pmass" | ||
options="--pmass $pmass --nbest $nbest" | ||
fi | ||

phonetisaurus-apply $options --model $model --thresh 5 --accumulate --verbose --prob --word_list $word_list 1>$out_lexicon

echo "Finished. Synthesized lexicon for new words is in $out_lexicon"
Review comment: @huangruizhe Can you address Yenda's earlier comment: generate a list of failed words in a file and point the user to it in the echo message? The warning message from phonetisaurus is not consolidated into a file, so the user may miss it and want to find those words in a file. Actually I noticed your "out_lexicon_failed" is not used at all.
Reply: Fixed.
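A hedged sketch of how those failed words could be collected into a file, assuming only that the first column of $out_lexicon holds every word that did receive a pronunciation (the .failed suffix is an illustrative choice, not necessarily what the final script uses):

    # collect words from the word list that got no pronunciation in the output lexicon
    awk 'NR==FNR{seen[$1]=1; next} !($1 in seen)' $out_lexicon $word_list > ${out_lexicon}.failed
    echo "$(wc -l < ${out_lexicon}.failed) word(s) could not be synthesized; see ${out_lexicon}.failed"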
exit 0
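To make the nbest/pmass interaction concrete, here is how the three modes discussed above might be invoked. The script path follows the name mentioned later in this review (steps/dict/apply_g2p_phonetisaurus.sh); the data paths and the chosen values are only illustrative:

    # nbest only: up to 3 pronunciation variants per word (pmass is implicitly 1.0)
    steps/dict/apply_g2p_phonetisaurus.sh --nbest 3 exp/g2p/model.fst exp/g2p/oov_words.txt exp/g2p/oov_lexicon.txt
    # pmass only: keep variants until 90% of the probability mass is covered (nbest is implicitly capped at 20)
    steps/dict/apply_g2p_phonetisaurus.sh --pmass 0.9 exp/g2p/model.fst exp/g2p/oov_words.txt exp/g2p/oov_lexicon.txt
    # both: the intersection of the two constraints
    steps/dict/apply_g2p_phonetisaurus.sh --nbest 3 --pmass 0.9 exp/g2p/model.fst exp/g2p/oov_words.txt exp/g2p/oov_lexicon.txt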
@@ -0,0 +1,82 @@
#!/bin/bash

# Copyright 2017 Intellisist, Inc. (Author: Navneeth K)
#           2017 Xiaohui Zhang
#           2018 Ruizhe Huang
# Apache License 2.0

# This script trains a G2P model using Phonetisaurus.
stage=0
encoding='utf-8'
only_words=true
silence_phones=

echo "$0 $@"  # Print the command line for logging

[ -f ./path.sh ] && . ./path.sh; # source the path.
. utils/parse_options.sh || exit 1;

set -u
set -e
if [ $# != 2 ]; then
  echo "Usage: $0 [options] <lexicon-in> <work-dir>"
  echo " where <lexicon-in> is the training lexicon (one pronunciation per"
  echo " word per line, with lines like 'hello h uh l ow') and"
  echo " <work-dir> is the directory where the models will be stored"
  echo "e.g.: $0 --silence-phones data/local/dict/silence_phones.txt data/local/dict/lexicon.txt exp/g2p/"
  echo ""
  echo "main options (for others, see top of script file)"
  echo "  --cmd (utils/run.pl|utils/queue.pl <queue opts>)  # how to run jobs."
  echo "  --silence-phones <silphones-list>          # e.g. data/local/dict/silence_phones.txt."
  echo "                                             # A list of silence phones, one or more per line."
  echo "                                             # Relates to the --only-words option."
  echo "  --only-words (true|false) (default: true)  # If true, exclude silence words, i.e."
  echo "                                             # words whose phones are all silence phones."
  exit 1;
fi
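As an illustration of the --only-words / --silence-phones behaviour described in the usage message above (phone names and words are made up for the example): if silence_phones.txt contains

    SIL
    SPN

then, with --only-words true, lexicon entries such as '<sil> SIL' or '<unk> SPN' (whose pronunciations consist entirely of silence phones) are dropped before G2P training, while ordinary entries like 'hello h uh l ow' are kept.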
lexicon=$1
wdir=$2

[ ! -f $lexicon ] && echo "Cannot find $lexicon" && exit 1
isuconv=`which uconv`
if [ -z $isuconv ]; then
  echo "uconv was not found. You must install the icu4c package."
  exit 1;
fi

mkdir -p $wdir
# For the input lexicon, remove pronunciations containing non-utf-8-encodable characters,
# and optionally remove words that are mapped to a single silence phone from the lexicon.
if [ $stage -le 0 ]; then
  if $only_words && [ ! -z "$silence_phones" ]; then
    awk 'NR==FNR{a[$1] = 1; next} {s=$2;for(i=3;i<=NF;i++) s=s" "$i; if(!(s in a)) print $1" "s}' \
      $silence_phones $lexicon | \
      awk '{printf("%s\t",$1); for (i=2;i<NF;i++){printf("%s ",$i);} printf("%s\n",$NF);}' | \
      uconv -f utf-8 -t utf-8 -x Any-NFC - | awk 'NF > 0' > $wdir/lexicon_tab_separated.txt

Review comment: -f "$encoding" -t "$encoding" -- I guess it's a matter of design decisions whether we want to put NFC there; IMO the user should be responsible for that. Also, not sure how that would work for any encodings other than the unicode ones.
Review comment: I agree that the unicode normalization should be the user's responsibility at the data preparation stage, before this script gets called.
Reply: Fixed.
  else
    awk '{printf("%s\t",$1); for (i=2;i<NF;i++){printf("%s ",$i);} printf("%s\n",$NF);}' $lexicon | \
      uconv -f utf-8 -t utf-8 -x Any-NFC - | awk 'NF > 0' > $wdir/lexicon_tab_separated.txt

Review comment: ditto
Reply: Fixed.

  fi
fi
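To see what the tab-separation step does, one can run the awk one-liner from the script on a single made-up lexicon entry (the word and phones are purely illustrative):

    # a hypothetical one-line lexicon entry becomes word<TAB>pronunciation
    echo "hello h uh l ow" | awk '{printf("%s\t",$1); for (i=2;i<NF;i++){printf("%s ",$i);} printf("%s\n",$NF);}'
    # prints: hello<TAB>h uh l ow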
Review comment: check whether phonetisaurus is installed here. Please refer to /export/b19/xzhang/tedlium/s5_r2/steps/dict/train_g2p.sh also.
Reply: Fixed.
if [ $stage -le 1 ]; then
  # Align lexicon stage. The lexicon is assumed to have its first column tab-separated.
  phonetisaurus-align --input=$wdir/lexicon_tab_separated.txt --ofile=${wdir}/aligned_lexicon.corpus || exit 1;
fi

if [ $stage -le 2 ]; then
  # Convert the aligned lexicon to arpa using make_kn_lm.py, a re-implementation of srilm's ngram-count functionality.
  ./utils/lang/make_kn_lm.py -ngram-order 7 -text ${wdir}/aligned_lexicon.corpus -lm ${wdir}/aligned_lexicon.arpa
fi

if [ $stage -le 3 ]; then
  # Convert the arpa file to FST.
  phonetisaurus-arpa2wfst --lm=${wdir}/aligned_lexicon.arpa --ofile=${wdir}/model.fst
fi
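Putting the two scripts together, an end-to-end run might look like the sketch below. The apply script name follows the one mentioned later in this review (steps/dict/apply_g2p_phonetisaurus.sh); the training script name and all data paths are assumptions for illustration only:

    # train a G2P model on an existing lexicon (script name assumed)
    steps/dict/train_g2p_phonetisaurus.sh --silence-phones data/local/dict/silence_phones.txt \
      data/local/dict/lexicon.txt exp/g2p
    # synthesize pronunciations for a list of OOV words with the trained model
    steps/dict/apply_g2p_phonetisaurus.sh --nbest 3 exp/g2p/model.fst \
      exp/g2p/oov_words.txt exp/g2p/oov_lexicon.txt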
Review comment: another design decision, I guess -- copy the lexicon into a separate local/ script and, instead of generating a single file, generate a new dict directory -- I think that would make a nice and coherent interface. BTW, "expanded lexicon" has a specific meaning for the babel scripts and we have even published a paper with that nomenclature, so maybe some other word would be more suitable to prevent confusion?
Review comment: how about "extended"?
Review comment: also, did you mean doing line 114-116 inside steps/dict/apply_g2p_phonetisaurus.sh? That's indeed nice in most cases, but in some cases we just want to generate prons for a word list rather than producing a valid dict dir. What do you think? @jtrmal
Review comment: I'm leaning slightly towards "extended", but feel free to decide on your own. Regarding the second question -- perhaps it's OK as it is.
Reply: Naming: fixed. Dict directory issue: keep it as it is.