Added phonetisaurus-based g2p scripts #2730

Merged
merged 5 commits into from
Oct 9, 2018
42 changes: 0 additions & 42 deletions egs/multi_en/s5/local/g2p/apply_g2p.sh

This file was deleted.

67 changes: 0 additions & 67 deletions egs/multi_en/s5/local/g2p/train_g2p.sh

This file was deleted.

27 changes: 23 additions & 4 deletions egs/multi_en/s5/run.sh
@@ -58,8 +58,8 @@ if [ $stage -le 1 ]; then
# We prepare the basic dictionary in data/local/dict_combined.
local/prepare_dict.sh $swbd $tedlium2
(
local/g2p/train_g2p.sh --stage 0 --silence-phones \
"data/local/dict_combined/silence_phones.txt" data/local/dict_combined exp/g2p || touch exp/g2p/.error
steps/dict/train_g2p_phonetisaurus.sh --stage 0 --silence-phones \
"data/local/dict_combined/silence_phones.txt" data/local/dict_combined/lexicon.txt exp/g2p || touch exp/g2p/.error
) &
fi

@@ -114,8 +114,27 @@ if [ $stage -le 4 ]; then
mkdir -p $dict_dir
rm $dict_dir/lexiconp.txt 2>/dev/null || true
cp data/local/dict_combined/{extra_questions,nonsilence_phones,silence_phones,optional_silence}.txt $dict_dir
local/g2p/apply_g2p.sh --var-counts 1 exp/g2p/model.fst data/local/g2p_phonetisarus \
data/local/dict_combined/lexicon.txt $dict_dir/lexicon.txt || exit 1;

echo 'Gathering missing words...'

lexicon=data/local/dict_combined/lexicon.txt
g2p_tmp_dir=data/local/g2p_phonetisarus
mkdir -p $g2p_tmp_dir

# awk command from http://stackoverflow.com/questions/2626274/print-all-but-the-first-three-columns
cat data/*/train/text | \
local/count_oovs.pl $lexicon | \
awk '{if (NF > 3 ) {for(i=4; i<NF; i++) printf "%s ",$i; print $NF;}}' | \
perl -ape 's/\s/\n/g;' | \
sort | uniq > $g2p_tmp_dir/missing.txt
cat $g2p_tmp_dir/missing.txt | \
grep "^[a-z]*$" > $g2p_tmp_dir/missing_onlywords.txt

steps/dict/apply_g2p_phonetisaurus.sh --nbest 1 exp/g2p/model.fst $g2p_tmp_dir/missing_onlywords.txt $g2p_tmp_dir/missing_lexicon.txt || exit 1;
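
The `grep "^[a-z]*$"` step above keeps only words made of lowercase ASCII letters before they are passed to G2P. A minimal sketch of that filter follows, assuming a hypothetical helper name `filter_plain_words` (not part of this PR). It uses `+` rather than `*`, so empty lines are dropped as well, which is a small deviation from the line in the diff.

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR): reads one word per line on stdin
# and keeps only non-empty words made of lowercase ASCII letters.
filter_plain_words() {
  # LC_ALL=C makes [a-z] mean exactly the 26 ASCII letters.
  # "|| true" keeps an empty result from looking like a failure under "set -e";
  # note that it also hides genuine grep errors.
  LC_ALL=C grep -E '^[a-z]+$' || true
}

# Prints "hello" and "world"; the other three lines are filtered out.
printf '%s\n' hello "it's" '<unk>' '' world | filter_plain_words
```

Words with apostrophes, digits or markup such as `<unk>` are excluded by this filter and therefore never receive a synthesized pronunciation.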

expanded_lexicon=$dict_dir/lexicon.txt
Contributor

another design decision I guess -- copy the lexicon into a separate local/ script and instead of generating a single file, a new dict directory could be generated -- I think that would make a nice and coherent interface

BTW, expanded lexicon has a specific meaning for babel scripts and we have even published a paper with that nomenclature, so maybe some other word would be more suitable to prevent confusion?

Contributor

how about "extended"?

Contributor

Also, did you mean doing lines 114-116 inside steps/dict/apply_g2p_phonetisaurus.sh? That's indeed nice in most cases, but in some cases we just want to generate prons for a word list rather than produce a valid dict dir. What do you think? @jtrmal

Contributor

I'm leaning slightly towards "extended", but feel free to decide on your own.
Regarding the second question: perhaps it's OK as it is.

Contributor Author

Naming: fixed.
Dict directory issue: keep it as it is.

echo "Adding new pronunciations to get expanded lexicon $expanded_lexicon"
cat <(cut -f 1,3 $g2p_tmp_dir/missing_lexicon.txt) $lexicon | sort | uniq > $expanded_lexicon
fi
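
The merge step above drops the score column from the G2P output and unions the result with the base lexicon. A minimal sketch of that step, assuming the three-column `<word>\t<prob>\t<pronunciation>` format described in the usage message of `apply_g2p_phonetisaurus.sh`; the helper name `merge_lexicons`, the sample entries, and the use of `LC_ALL=C sort -u` in place of `sort | uniq` are illustrative assumptions:

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR).
# $1: G2P output, tab-separated: <word> <prob> <pronunciation>
# $2: existing lexicon: <word> <pronunciation>
merge_lexicons() {
  local g2p_lexicon="$1" base_lexicon="$2"
  # Fields 1 and 3 are the word and its pronunciation; field 2 (the score) is dropped.
  { cut -f 1,3 "$g2p_lexicon"; cat "$base_lexicon"; } | LC_ALL=C sort -u
}

tmp=$(mktemp -d)
printf 'zzxq\t9.5\tz k\n' > "$tmp/missing_lexicon.txt"
printf 'hello h ah l ow\nworld w er l d\n' > "$tmp/lexicon.txt"
# Prints the two base entries followed by the new "zzxq" entry.
merge_lexicons "$tmp/missing_lexicon.txt" "$tmp/lexicon.txt"
rm -rf "$tmp"
```

The new entries keep the tab after the word, while base entries keep whatever separator the base lexicon used. Tools that split on any whitespace are unaffected, but a strict tab-based reader would need the file normalized first.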

# We'll do multiple iterations of pron/sil-prob estimation. So the structure of
62 changes: 62 additions & 0 deletions egs/wsj/s5/steps/dict/apply_g2p_phonetisaurus.sh
@@ -0,0 +1,62 @@
#!/bin/bash
# Copyright 2014 Johns Hopkins University (Author: Yenda Trmal)
# Copyright 2016 Xiaohui Zhang
# 2018 Ruizhe Huang
# Apache 2.0

# This script applies a trained Phonetisaurus G2P model to
# synthesize pronunciations for missing words (i.e., words in
# transcripts but not the lexicon), and outputs the expanded lexicon.

# Begin configuration section.
stage=0
nbest= # Generate up to $nbest variants
pmass= # Generate as many variants as needed to cover $pmass (e.g. 0.9, i.e. 90%) of the prob mass
# End configuration section.
Contributor

@huangruizhe Can you add "thresh" as an option here? Please refer to /export/b19/xzhang/tedlium/s5_r2/steps/dict/apply_g2p.sh (sorry, I just realized today that I already wrote a script like the current one 2 years ago). Also, please explain the nbest and pmass options a bit more, again by referring to the above script. Thanks!

Contributor Author

Fixed.


echo "$0 $@" # Print the command line for logging

[ -f ./path.sh ] && . ./path.sh; # source the path.
. utils/parse_options.sh || exit 1;

set -u
set -e

if [ $# != 3 ]; then
echo "Usage: $0 [options] <g2p-model> <word-list> <lexicon-out>"
echo "... where <g2p-model> is the trained g2p model."
echo " <word-list> is a list of words whose pronunciation is to be generated."
echo " <lexicon-out> output lexicon, whose format is <word>\t<prob>\t<pronunciation> for each line."
echo "e.g.: $0 exp/g2p/model.fst exp/g2p/oov_words.txt data/local/dict_nosp/lexicon.txt"
echo ""
echo "main options (for others, see top of script file)"
echo " --nbest <int> # Maximum number of hypotheses to produce. By default, nbest=1."
Contributor

the default value is 20 I think

echo " --pmass <float> # Select the maximum number of hypotheses summing to a total mass of pmass amount, within [0, 1], for a word."
echo " --nbest <int> --pmass <float> # When specified together, we generate the intersection of these two options."
Contributor

remove this line

exit 1;
fi

model=$1
word_list=$2
out_lexicon=$3

Contributor

check whether phonetisaurus is installed here. Please refer to /export/b19/xzhang/tedlium/s5_r2/steps/dict/apply_g2p.sh also.

Contributor Author

Fixed.
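
One way such an installation check could look is sketched below. This is an illustration only: the helper name `require_tools` is hypothetical, and the check actually added in this PR may differ.

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR): succeeds only if every named
# executable can be found on PATH; reports each missing one on stderr.
require_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$0: $tool not found on PATH." >&2
      missing=1
    fi
  done
  return "$missing"
}

# In the script this would be used as, e.g.:
#   require_tools phonetisaurus-apply || exit 1
require_tools sh && echo "sh is available"
```

Testing with `command -v` inside an `if` keeps the check safe under `set -e`, since a failing lookup in a condition does not abort the script before the error message is printed.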

[ -n "$nbest" ] && [[ ! $nbest =~ ^[1-9][0-9]*$ ]] && echo "$0: nbest should be a positive integer." && exit 1
[ ! -z $pmass ] && ! { [[ $pmass =~ ^[0-9]+\.?[0-9]*$ ]] && [ $(bc <<< "$pmass >= 0") -eq 1 -a $(bc <<< "$pmass <= 1") -eq 1 ]; } \
&& echo "$0: pmass should be within [0, 1]." && exit 1
Contributor

doesn't have to check pmass here, since phonetisaurus checks it inside.

[ -z $pmass ] && [ -z $nbest ] && nbest=1
Contributor

don't allow this case. if the user specified nothing, just throw an error


if [ -z $pmass ]; then
Contributor

this part of the code would get simpler if you set nbest=20 in the beginning of the script.
Things to consider:
nbest=20 seems too much (I think Dan knows about some thesis showing that extra variants are actually harmful)
pmass=1.0 might be too big, causing too large a graph during the generation of the variants; perhaps 0.95 would be good enough?

I don't claim I know the right answers, just thinking aloud.

Contributor

actually there doesn't seem to be a default. I agree 20 is too much-- normally 3 would be a reasonable limit.

Contributor

Actually, the reason we used 20 by default is that when the user only sets pmass, we need to specify an "nbest" value large enough to get correct pron-probs, since phonetisaurus only computes pron-probs on the nbest list. When the user wants to rely on nbest, we always leave the responsibility for setting the proper value to the user, and that's why we didn't set it at the beginning. In summary, we want to allow all three ways of specifying constraints (pmass, nbest, or both), and let the user determine the proper values.

Contributor Author

Thanks! I will look into the phonetisaurus code again and confirm why we chose these default values. I will also consider how to make the code simpler.

Contributor Author

The code has been made simpler, but we keep the "nbest=20, pmass=1.0" stuff. The justification is as follows (an elaboration of what Xiaohui said above):

Users have three options here:

  1. only set nbest, e.g. nbest=3
    In this case, pmass needs to be implicitly set to 1.0 (instead of 0.0 by phonetisaurus's default), which would never affect our option nbest=3.

  2. only set pmass, e.g. pmass=0.95
    In this case, nbest is implicitly set to 20, because phonetisaurus computes probability over nbest list (PhonetisaurusScript.h: 166~186), thus we need to specify a large enough nbest value here.

  3. set both nbest and pmass
    In this case, the user takes the intersections of the two options and has full control of them.

What we meant by "default" was a bit misleading. They are actually some implicitly-set values, due to implementation reasons.
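
The three cases above can be sketched as a small function. This is an illustration, not the merged code: the helper name `resolve_g2p_options` is hypothetical, the implicit values (nbest=20, pmass=1.0) are taken from the explanation above, and rejecting the case where neither option is set follows the reviewer suggestion earlier in this review rather than anything confirmed about the final script.

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR): turns the user-supplied nbest and
# pmass values (either may be empty) into options for phonetisaurus-apply.
resolve_g2p_options() {
  local nbest="$1" pmass="$2"
  if [ -z "$nbest" ] && [ -z "$pmass" ]; then
    echo "resolve_g2p_options: specify nbest, pmass, or both" >&2
    return 1
  fi
  # Implicit values from the explanation above: pmass=1.0 leaves an nbest-only
  # request unaffected; nbest=20 gives a pmass-only request a long enough
  # n-best list to compute probabilities over.
  echo "--nbest ${nbest:-20} --pmass ${pmass:-1.0}"
}

resolve_g2p_options 3 ""     # case 1: only nbest
resolve_g2p_options "" 0.95  # case 2: only pmass
resolve_g2p_options 3 0.95   # case 3: both
```

Filling in the implicit values with `${var:-default}` collapses the three branches into one line while keeping the user's own values untouched whenever they are given.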

echo "Synthesizing pronunciations for words in $word_list based on nbest=$nbest"
options="--nbest $nbest --pmass 1.0"
elif [ -z $nbest ]; then
echo "Synthesizing pronunciations for words in $word_list based on pmass=$pmass"
options="--pmass $pmass --nbest 20"
else
echo "Synthesizing pronunciations for words in $word_list based on nbest=$nbest and pmass=$pmass"
options="--pmass $pmass --nbest $nbest"
fi
phonetisaurus-apply $options --model $model --thresh 5 --accumulate --verbose --prob --word_list $word_list 1>$out_lexicon

echo "Finished. Synthesized lexicon for new words is in $out_lexicon"

Contributor

@huangruizhe Can you address Yenda's earlier comment: write the list of failed words to a file and point the user to it in the echo message? The warning messages from phonetisaurus are not consolidated into a file, so the user may miss them and then want to find those words in a file. Actually, I noticed your "out_lexicon_failed" is not used at all.

Contributor Author

Fixed.
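
A list of failed words can be derived by comparing the input word list with the words that actually appear in the output lexicon. The sketch below shows one possible approach, not necessarily the one adopted in the fix; the helper name `list_failed_words` and the sample data are hypothetical, and the lexicon is assumed to have the word in its first column.

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR): prints every word from the word
# list ($1) that has no entry in the output lexicon ($2).
list_failed_words() {
  local word_list="$1" lexicon="$2"
  # FILENAME == ARGV[1] selects lines of the lexicon; unlike the common
  # NR == FNR idiom this still works when the lexicon file is empty.
  awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }
       NF > 0 && !($1 in seen) { print $1 }' "$lexicon" "$word_list"
}

tmp=$(mktemp -d)
printf 'hello\nzzxq\nworld\n' > "$tmp/words.txt"
printf 'hello\t12.3\th ah l ow\nworld\t10.1\tw er l d\n' > "$tmp/lexicon.txt"
# Prints "zzxq", the only word without a synthesized pronunciation.
list_failed_words "$tmp/words.txt" "$tmp/lexicon.txt"
rm -rf "$tmp"
```

Redirecting this output to a file next to the output lexicon would give the final echo message a concrete path to point the user to.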

exit 0
82 changes: 82 additions & 0 deletions egs/wsj/s5/steps/dict/train_g2p_phonetisaurus.sh
@@ -0,0 +1,82 @@
#!/bin/bash

# Copyright 2017 Intellisist, Inc. (Author: Navneeth K)
# 2017 Xiaohui Zhang
# 2018 Ruizhe Huang
# Apache License 2.0

# This script trains a g2p model using Phonetisaurus.

stage=0
encoding='utf-8'
only_words=true
silence_phones=

echo "$0 $@" # Print the command line for logging

[ -f ./path.sh ] && . ./path.sh; # source the path.
. utils/parse_options.sh || exit 1;

set -u
set -e

if [ $# != 2 ]; then
echo "Usage: $0 [options] <lexicon-in> <work-dir>"
echo " where <lexicon-in> is the training lexicon (one pronunciation per "
echo " word per line, with lines like 'hello h uh l ow') and"
echo " <work-dir> is the directory where the models will be stored"
echo "e.g.: $0 --silence-phones data/local/dict/silence_phones.txt data/local/dict/lexicon.txt exp/g2p/"
echo ""
echo "main options (for others, see top of script file)"
echo " --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs."
echo " --silence-phones <silphones-list> # e.g. data/local/dict/silence_phones.txt."
echo " # A list of silence phones, one or more per line"
echo " # Relates to --only-words option"
echo " --only-words (true|false) (default: true) # If true, exclude silence words, i.e."
echo " # words with one or multiple phones which are all silence."
exit 1;
fi

lexicon=$1
wdir=$2

[ ! -f "$lexicon" ] && echo "Cannot find $lexicon" && exit 1

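# Test for uconv inside the condition so that "set -e" does not abort the
# script before the error message below is printed.
if ! command -v uconv >/dev/null 2>&1; then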
echo "uconv was not found. You must install the icu4c package."
exit 1;
fi

mkdir -p $wdir


# For input lexicon, remove pronunciations containing non-utf-8-encodable characters,
# and optionally remove words that are mapped to a single silence phone from the lexicon.
if [ $stage -le 0 ]; then
if $only_words && [ ! -z "$silence_phones" ]; then
awk 'NR==FNR{a[$1] = 1; next} {s=$2;for(i=3;i<=NF;i++) s=s" "$i; if(!(s in a)) print $1" "s}' \
$silence_phones $lexicon | \
awk '{printf("%s\t",$1); for (i=2;i<NF;i++){printf("%s ",$i);} printf("%s\n",$NF);}' | \
uconv -f utf-8 -t utf-8 -x Any-NFC - | awk 'NF > 0'> $wdir/lexicon_tab_separated.txt
Contributor

-f "$encoding" -t "$encoding"

I guess it's a matter of design decisions if we want to put NFC there -- IMO the user should be responsible for that. Also, not sure how that would work for any other encodings than the unicode ones.

Contributor

I agree that Unicode normalization should be the user's responsibility at the data preparation stage, before this script gets called.

Contributor Author

Fixed.

else
awk '{printf("%s\t",$1); for (i=2;i<NF;i++){printf("%s ",$i);} printf("%s\n",$NF);}' $lexicon | \
uconv -f utf-8 -t utf-8 -x Any-NFC - | awk 'NF > 0'> $wdir/lexicon_tab_separated.txt
Contributor

ditto

Contributor Author

Fixed.

fi
fi
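
The awk stage in the block above rewrites each lexicon line so that the word is separated from its pronunciation by a tab and the phones by single spaces. A minimal sketch of just that reformatting follows, leaving out the silence filtering and the `uconv` step. The helper name `to_tab_separated` and the sample entries are hypothetical, and skipping lines that have a word but no pronunciation (`NF > 1`) is an added assumption rather than the behaviour of the line in the diff.

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR): reads "word p1 p2 ..." lines on
# stdin and writes "word<TAB>p1 p2 ..." lines on stdout.
to_tab_separated() {
  # NF > 1 skips empty lines and words that have no pronunciation.
  awk 'NF > 1 {
         printf("%s\t", $1)
         for (i = 2; i < NF; i++) printf("%s ", $i)
         printf("%s\n", $NF)
       }'
}

# Prints two lines: "hello<TAB>h ah l ow" and "world<TAB>w er l d".
printf 'hello  h ah l ow\n\nnopron\nworld w er l d\n' | to_tab_separated
```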

Contributor

check whether phonetisaurus is installed here. Please refer to /export/b19/xzhang/tedlium/s5_r2/steps/dict/train_g2p.sh also.

Contributor Author

Fixed.

if [ $stage -le 1 ]; then
# Align lexicon stage. Lexicon is assumed to have first column tab separated
phonetisaurus-align --input=$wdir/lexicon_tab_separated.txt --ofile=${wdir}/aligned_lexicon.corpus || exit 1;
fi

if [ $stage -le 2 ]; then
# Convert aligned lexicon to arpa using make_kn_lm.py, a re-implementation of srilm's ngram-count functionality.
./utils/lang/make_kn_lm.py -ngram-order 7 -text ${wdir}/aligned_lexicon.corpus -lm ${wdir}/aligned_lexicon.arpa
fi

if [ $stage -le 3 ]; then
# Convert the arpa file to FST.
phonetisaurus-arpa2wfst --lm=${wdir}/aligned_lexicon.arpa --ofile=${wdir}/model.fst
fi
