Suggestions on training on English to Vietnamese translation #458
Hi @raghava14, unfortunately I can't give exact answers to your questions without proper experimentation, but here are some thoughts:
Hope that helps...
Hi @edunov,
I think my BLEU score is better than the one claimed in tensorflow/tensor2tensor#611, so I am closing the issue now. As a next step, I want to see whether I can increase the BLEU score with Understanding Back-Translation at Scale. But unlike the monolingual corpus you obtained from WMT18, there aren't many resources available for monolingual Vietnamese data. I am trying to use a Wikipedia dump, but I am not sure how well it would work. If you have any suggestions or thoughts, please share them so I can see whether the BLEU score can be improved. Thanks again for the help.
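As a rough sketch of the back-translation step from that paper (this assumes a reverse VI-EN model has already been trained, and that the Wikipedia dump has been cleaned, BPE-encoded, and binarized; all paths here are hypothetical):

```
# Translate monolingual Vietnamese into synthetic English using
# sampling rather than beam search, as the paper recommends.
CUDA_VISIBLE_DEVICES=0 python generate.py data-bin/wiki_vi_mono \
  --path checkpoints/vi-en/checkpoint_best.pt \
  --beam 1 --sampling --max-tokens 4096
```

The sampled English hypotheses are then paired with the original Vietnamese sentences and mixed into the parallel data before retraining the EN-VI model.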
Hi @raghava14, I tried the IWSLT_DE_EN architecture for translation from English to Vietnamese and was able to produce a BLEU score of 26.86. Did you use a joint dictionary or separate dictionaries?
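As a rough sketch, a run with the IWSLT transformer preset might look like the following (hyperparameters follow fairseq's IWSLT'14 De-En recipe; the data-bin path and save directory are hypothetical):

```
CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt_en_vi \
  --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --dropout 0.3 --weight-decay 0.0001 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens 4096 --save-dir checkpoints/en-vi
```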
Hi @jiachangliu, I used `--joined-dictionary`.
@raghava14 Thank you very much. I used separate dictionaries. I will try `--joined-dictionary`.
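For reference, a minimal sketch of binarizing with a joined dictionary using fairseq's preprocess.py (the file prefixes and destination directory are hypothetical; omit `--joined-dictionary` to build separate dictionaries):

```
python preprocess.py --source-lang en --target-lang vi \
  --trainpref train.bpe --validpref valid.bpe --testpref test.bpe \
  --joined-dictionary \
  --destdir data-bin/iwslt_en_vi
```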
Hi,
I am trying English to Vietnamese translation using the IWSLT data from Stanford NLP, which has 133K sentence pairs. I want to replicate the results presented in tensorflow/tensor2tensor#611, where the Transformer base architecture is used. I have a few quick questions:
1. Can data preparation be done the same way as in prepare-iwslt14.sh, or is it better to run `learn_bpe.py` on a larger Wikipedia dump and `apply_bpe.py` on the smaller corpus mentioned above? Are 133K sentences alone sufficient to learn BPE, so that I can just use the preprocessing script as-is, or should I go for the latter option? Also, since Moses doesn't support Vietnamese, how should I proceed with tokenization? (A BPE sketch follows this list.)
2. Whether to use `--joined-dictionary` or not during preprocessing. As I see here, it is better to use a joined dictionary if the alphabets are shared. Vietnamese seems to have more diacritics than German, where a joined dictionary is used without ambiguity.
3. I used the command below on EN-DE to get good results with 4.5 million sentence pairs from WMT:
```
CUDA_VISIBLE_DEVICES=0 python train.py data-bin/wmt16_en_de_bpe32k \
  --arch transformer_wmt_en_de --share-all-embeddings \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
  --lr 0.0007 --min-lr 1e-09 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --weight-decay 0.0 --max-tokens 4096 --save-dir checkpoints/en-de \
  --update-freq 8 --no-progress-bar --log-format simple \
  --keep-interval-updates 20
```
Since this dataset is small, can I use the same architecture, or should I use the IWSLT architecture for DE-EN? In either case, do I need to change any hyperparameters?
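For question 1, a minimal BPE sketch along the lines of prepare-iwslt14.sh, assuming the subword-nmt scripts and hypothetical file names; 10000 merge operations is only an illustrative value for a 133K-pair corpus:

```
# Learn a joint BPE model on the concatenated training data;
# substitute a Wikipedia dump here to try the larger-corpus variant.
cat train.en train.vi > train.en-vi
python learn_bpe.py -s 10000 < train.en-vi > bpe.codes

# Apply the learned codes to each side of the corpus.
python apply_bpe.py -c bpe.codes < train.en > train.bpe.en
python apply_bpe.py -c bpe.codes < train.vi > train.bpe.vi
```

Since the codes are learned jointly over both languages, this pairs naturally with `--joined-dictionary` in question 2.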
I know the questions are naive, but I think answers would also help new users who are trying Vietnamese. I am working on a single-GPU setup. Please give some suggestions or leads on this kind of task and dataset.
Thanks.