
# Offline ST systems for IWSLT 2021

The goal of the Offline Speech Translation Task is to examine automatic methods for translating audio speech in one language into text in the target language, and to answer the question: is the cascaded solution still the dominant technology in spoken language translation? (official website)

Here, we release our systems submitted to IWSLT 2021 and show how to evaluate them. For more details about the model structure and training data, see our system report.

```bibtex
@inproceedings{zhao2021iwslt,
  author       = {Chengqi Zhao and Zhicheng Liu and Jian Tong and Tao Wang
                  and Mingxuan Wang and Rong Ye and Qianqian Dong and Jun Cao and Lei Li},
  booktitle    = {Proceedings of the 18th International Conference on Spoken Language Translation},
  title        = {The Volctrans Neural Speech Translation System for IWSLT 2021},
  year         = {2021},
}
```

## Results & Models

The major training data for the offline ST task is MuST-C V2. We report results on the dev/test sets of both MuST-C V1 and V2 for reference.

### ASR

| Testset | Transformer ASR (WER) [asr.tgz] |
|---|---|
| MuST-C v2 dev | 5.2 [hypo] |
| MuST-C v2 tst-COM | 5.7 [hypo] |
| MuST-C v1 dev | 10.6 [hypo] |
| MuST-C v1 tst-COM | 7.4 [hypo] |
| iwslt.tst2020 | - [hypo] |
| iwslt.tst2021 | - [hypo] |

### MT

We report detokenized BLEU (computed with the sacreBLEU toolkit) for the MT models. Here "w/o punc. & lc" denotes input without punctuation and lowercased, and "w/ punc. & tc" denotes input with punctuation and truecased.

| System | MuST-C v2 dev | MuST-C v2 tst-COM | MuST-C v1 dev | MuST-C v1 tst-COM |
|---|---|---|---|---|
| MT (w/o punc. & lc) [mt1.tgz] | 32.0 [hypo] [hypo_notag] [bleu] | 34.1 [hypo] [hypo_notag] [bleu] | 32.2 [hypo] [hypo_notag] [bleu] | 34.0 [hypo] [hypo_notag] [bleu] |
| MT (w/ punc. & tc) [mt1.tgz] | 33.8 [hypo] [hypo_notag] [bleu] | 36.2 [hypo] [hypo_notag] [bleu] | 33.7 [hypo] [hypo_notag] [bleu] | 35.9 [hypo] [hypo_notag] [bleu] |
| ensemble MT (w/o punc. & lc) [mt1.tgz, mt2.tgz, mt3.tgz, mt4.tgz] | 33.8 [hypo] [hypo_notag] [bleu] | 35.2 [hypo] [hypo_notag] [bleu] | 33.8 [hypo] [hypo_notag] [bleu] | 35.3 [hypo] [hypo_notag] [bleu] |
| ensemble MT (w/ punc. & tc) [mt1.tgz, mt2.tgz, mt3.tgz, mt4.tgz] | 34.7 [hypo] [hypo_notag] [bleu] | 36.7 [hypo] [hypo_notag] [bleu] | 34.6 [hypo] [hypo_notag] [bleu] | 36.2 [hypo] [hypo_notag] [bleu] |

### ST

We report detokenized BLEU (computed with the sacreBLEU toolkit) for the ST models.

The BLEU scores on iwslt.tst2020 and iwslt.tst2021 were provided by the IWSLT 2021 organizers. Note that tst2021 has two references, so the iwslt.tst2021 results are reported as "BLEU ref2 / BLEU ref1 / BLEU both".
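Such multi-reference scores can be recomputed with the sacreBLEU CLI, which accepts several reference files as positional arguments. A minimal sketch, with hypothetical file names for the hypothesis and the two 2021 references:

```bash
# Hypothetical file names: tst2021.hypo.notag.txt is a detokenized hypothesis,
# ref1.de.txt / ref2.de.txt are the two 2021 references.
sacrebleu ref2.de.txt ref1.de.txt -i tst2021.hypo.notag.txt  # "BLEU both"
sacrebleu ref2.de.txt -i tst2021.hypo.notag.txt              # "BLEU ref2"
sacrebleu ref1.de.txt -i tst2021.hypo.notag.txt              # "BLEU ref1"
```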

| # | System | MuST-C v2 dev | MuST-C v2 tst-COM | MuST-C v1 dev | MuST-C v1 tst-COM | iwslt.tst2020 | iwslt.tst2021 |
|---|---|---|---|---|---|---|---|
| 1 | cascade (ASR -> MT) | 29.9 [hypo] [hypo_notag] [bleu] | 32.1 [hypo] [hypo_notag] [bleu] | 28.4 [hypo] [hypo_notag] [bleu] | 31.3 [hypo] [hypo_notag] [bleu] | 21.0 [hypo] [hypo_notag] | 20.3/16.4/27.7 [hypo] [hypo_notag] |
| 2 | cascade (ASR -> ensemble MT) | 31.7 [hypo] [hypo_notag] [bleu] | 33.3 [hypo] [hypo_notag] [bleu] | 30.1 [hypo] [hypo_notag] [bleu] | 32.3 [hypo] [hypo_notag] [bleu] | 22.2 [hypo] [hypo_notag] | 21.8/17.1/29.5 [hypo] [hypo_notag] |
| 3 | direct ST base [st0.tgz] | 23.9 [hypo] [hypo_notag] [bleu] | 23.9 [hypo] [hypo_notag] [bleu] | - | - | - | - |
| 4 | direct ST [st1.tgz] | 28.9 [hypo] [hypo_notag] [bleu] | 29.9 [hypo] [hypo_notag] [bleu] | 27.9 [hypo] [hypo_notag] [bleu] | 29.5 [hypo] [hypo_notag] [bleu] | - [hypo] [hypo_notag] | - [hypo] [hypo_notag] |
| 5 | direct ST++ [st2.tgz] | 29.6 [hypo] [hypo_notag] [bleu] | 30.4 [hypo] [hypo_notag] [bleu] | 28.3 [hypo] [hypo_notag] [bleu] | 29.7 [hypo] [hypo_notag] [bleu] | 21.6 [hypo] [hypo_notag] | 20.4/17.0/28.1 [hypo] [hypo_notag] |
| 6 | direct ST++* [st3.tgz] | 30.0 [hypo] [hypo_notag] [bleu] | 30.2 [hypo] [hypo_notag] [bleu] | 28.2 [hypo] [hypo_notag] [bleu] | 29.6 [hypo] [hypo_notag] [bleu] | - [hypo] [hypo_notag] | - [hypo] [hypo_notag] |
| 7 | ensemble (4, 5, 6) | 30.4 [hypo] [hypo_notag] [bleu] | 31.2 [hypo] [hypo_notag] [bleu] | 29.0 [hypo] [hypo_notag] [bleu] | 30.6 [hypo] [hypo_notag] [bleu] | 22.4 [hypo] [hypo_notag] | 21.1/17.5/29.2 [hypo] [hypo_notag] |
| 8 | direct ST + fbank2vec-512 [f2v_st.tgz] | 28.7 [hypo] [hypo_notag] [bleu] | 29.1 [hypo] [hypo_notag] [bleu] | 26.7 [hypo] [hypo_notag] [bleu] | 27.6 [hypo] [hypo_notag] [bleu] | - | - |
| 9 | PMTL-ST + fbank2vec-768 [f2v_pmtl.tgz] | 29.6 [hypo] [hypo_notag] [bleu] | 29.6 [hypo] [hypo_notag] [bleu] | 26.9 [hypo] [hypo_notag] [bleu] | 28.1 [hypo] [hypo_notag] [bleu] | - | - |
| 10 | PMTL-ST + fbank2vec-768 ++ [f2v_pmtlplus.tgz] | 30.8 [hypo] [hypo_notag] [bleu] | 31.1 [hypo] [hypo_notag] [bleu] | 28.8 [hypo] [hypo_notag] [bleu] | 30.1 [hypo] [hypo_notag] [bleu] | - | - |
| 11 | PMTL-ST + fbank2vec-768 ++* [f2v_pmtlplus2.tgz] | 30.9 [hypo] [hypo_notag] [bleu] | 31.1 [hypo] [hypo_notag] [bleu] | 28.8 [hypo] [hypo_notag] [bleu] | 30.1 [hypo] [hypo_notag] [bleu] | 23.5 [hypo] [hypo_notag] | 21.6/18.2/30.6 [hypo] [hypo_notag] |
| 12 | ensemble (10, 11) | 31.0 [hypo] [hypo_notag] [bleu] | 31.1 [hypo] [hypo_notag] [bleu] | 28.8 [hypo] [hypo_notag] [bleu] | 30.1 [hypo] [hypo_notag] [bleu] | - | - |
| 13 | ensemble (9, 10, 11) | 31.4 [hypo] [hypo_notag] [bleu] | 31.5 [hypo] [hypo_notag] [bleu] | 29.3 [hypo] [hypo_notag] [bleu] | 30.6 [hypo] [hypo_notag] [bleu] | - | - |
| 14 | ensemble (8, 9, 10, 11) | 31.6 [hypo] [hypo_notag] [bleu] | 31.8 [hypo] [hypo_notag] [bleu] | 29.5 [hypo] [hypo_notag] [bleu] | 30.8 [hypo] [hypo_notag] [bleu] | 24.3 [hypo] [hypo_notag] | 21.7/18.7/31.3 [hypo] [hypo_notag] |

## How to reproduce

Here we describe only how to reproduce the hypotheses and BLEU scores above. For more details about the model structure and training data, see our system report; for how to train end-to-end ST models with NeurST, see the speech-to-text recipe.

### MT

Step 1: download and untar the checkpoint for MT (e.g., mt1/).
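For example, assuming the mt1.tgz archive from the table above has been downloaded to the working directory:

```bash
# Unpack the MT checkpoint archive; this produces the mt1/ directory
# expected by the evaluation script in Step 2.
tar -xzf mt1.tgz
```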

Step 2: run

```bash
./scripts/evaluate_mt.sh mustc-v2-dev mt1/ ./
```

It will automatically download the test files, translate them, and generate the following files:

- `mustc_v2.0_en-de.dev.de.hypo.txt`: the translations
- `mustc_v2.0_en-de.dev.de.hypo.notag.txt`: the translations without tags such as applause, laughter, etc.
- `mustc_v2.0_en-de.dev.bleu.txt`: the BLEU scores
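The BLEU file can also be cross-checked by hand with the sacreBLEU CLI. A minimal sketch; the reference file name below is an assumption (use whatever dev reference the script downloaded):

```bash
# Hypothetical reference file name: re-score the untagged hypothesis
# against the MuST-C v2 dev reference with sacreBLEU.
sacrebleu mustc_v2.0_en-de.dev.de.txt -i mustc_v2.0_en-de.dev.de.hypo.notag.txt
```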

### ST (Cascade & E2E)

Step 1: download and untar the checkpoints for ST (e.g., asr/ and mt1/ for the cascade system, st3/ for the end-to-end system)

Step 2: run

```bash
# cascade
./scripts/evaluate_cascade.sh mustc-v2-dev asr/ mt1/ ./

# end-to-end
./scripts/evaluate_e2e.sh mustc-v2-dev st3/ ./
```

It will also generate the hypothesis files and BLEU scores. The available testsets are listed below (see the usage example after the list):

- mustc-v2-dev
- mustc-v2-tst
- mustc-v1-dev
- mustc-v1-tst
- tst2020
- tst2021
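For example, to score the end-to-end model on the MuST-C v1 tst-COMMON set instead of the v2 dev set:

```bash
# Same script as above, different testset identifier.
./scripts/evaluate_e2e.sh mustc-v1-tst st3/ ./
```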

For the IWSLT official testsets (tst2020 & tst2021), only hypothesis files are produced.

Additionally, for model ensembles, we can simply provide multiple checkpoint paths separated by commas, e.g., st1/,st2/,st3/ for the ensemble ST model.
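For instance, to reproduce the ensemble of systems 4, 5, and 6 from the ST table:

```bash
# Ensemble decoding: pass the checkpoint directories as a single
# comma-separated argument (no spaces around the commas).
./scripts/evaluate_e2e.sh mustc-v2-dev st1/,st2/,st3/ ./
```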