Issue with English-Vietnamese dataset alignment #6

lengockyquang · 2019-07-14T15:26:35Z

Hi @stefan-it, I just downloaded the IWSLT 15 English-Vietnamese dataset and saw some blank lines in both files, so I tried removing all blank lines with Notepad++. After that, the number of sentences in train.en and train.vi was no longer equal: 133168 sentences in train.en and 133205 in train.vi.

stefan-it · 2019-07-14T22:26:59Z

Hi @lengockyquang,

I checked the training files, and wc -l train.en yields a line count of 133,317 (the same for train.vi). I think something is wrong with how Notepad++ displays the files (maybe an issue with line breaks).

But could you just give some examples of empty lines? I'll check it then :)
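One way to test the line-break theory is to count lines under two different definitions and see whether they disagree. The sketch below is only an illustration: the helper name `count_lines` and the sample string are made up, and whether Notepad++ actually treats any of these characters as a line break is not confirmed in this thread.

```python
def count_lines(text: str) -> dict:
    """Count lines in `text` under two different definitions of a line break."""
    return {
        # Roughly what `wc -l` reports: the number of '\n' characters.
        "newline_chars": text.count("\n"),
        # str.splitlines() also breaks on '\r', '\x0b', '\x0c', '\x1c',
        # '\x1d', '\x1e', '\x85', U+2028 and U+2029.
        "splitlines": len(text.splitlines()),
    }


# Made-up sample: three '\n'-terminated lines, one of which contains U+2028.
sample = "first line\nsecond\u2028still second\nthird line\n"
counts = count_lines(sample)
print(counts)
```

If the two numbers differ on the real train.en or train.vi, the files contain characters that some tools count as line breaks and others do not.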

lengockyquang · 2019-07-15T02:42:54Z

I've checked some of the empty lines and found some odd cases where the source line is empty but the corresponding target line is not.

I think this is why removing blank lines from each file separately leads to misalignment between them.
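To make the effect concrete, here is a minimal sketch on invented toy data (not taken from the dataset). The helper name `find_one_sided_blanks` is hypothetical; it lists the positions where exactly one side of a pair is blank, which are the lines that break the alignment once blanks are stripped from each file on its own.

```python
def find_one_sided_blanks(src_lines, tgt_lines):
    """Return 0-based indices where exactly one side of the pair is blank."""
    return [
        i
        for i, (s, t) in enumerate(zip(src_lines, tgt_lines))
        if bool(s.strip()) != bool(t.strip())
    ]


# Toy data (invented): pair 1 has an empty source but a non-empty target.
src = ["hello", "", "thank you", ""]
tgt = ["xin chào", "vâng", "cảm ơn", ""]

print(find_one_sided_blanks(src, tgt))

# Removing blanks from each file separately shifts every later pair.
src_only = [s for s in src if s.strip()]
tgt_only = [t for t in tgt if t.strip()]
print(len(src_only), len(tgt_only))
```

Here the two filtered lists end up with different lengths, so "thank you" would no longer sit next to its translation.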

huybik · 2021-09-25T12:47:06Z

Hello, thanks lengockyquang. Once we know the cause, the fix is easy.

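def align(inpt, trgt):
    # inpt and trgt hold the full text of the source and target files.
    x = inpt.split('\n')
    y = trgt.split('\n')
    # Both sides must start with the same number of lines,
    # otherwise y[i] below can run out of range.
    assert len(x) == len(y)

    i = 0
    while i < len(x):
        # Drop the pair if either side is empty (or a single stray character such as '\r').
        if len(x[i]) < 2 or len(y[i]) < 2:
            x.pop(i)
            y.pop(i)
        else:
            i += 1

    return x, y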

x,y = align(inpt, trgt)
print(x[-3], y[-3])
>> thank you very much for your time  rất cảm ơn đã lắng nghe
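For files too large to hold in memory comfortably, the same idea can be applied line by line. This is a sketch under assumptions, not the method used in the repository: the function name `filter_parallel_files` and all file names are hypothetical, the files are assumed to be UTF-8, and a line counts as blank when it contains only whitespace (slightly different from the `len(...) < 2` test above).

```python
import os
import tempfile
from itertools import zip_longest


def filter_parallel_files(src_path, tgt_path, src_out, tgt_out):
    """Write only the pairs where both sides are non-blank; return pairs kept."""
    kept = 0
    # newline="\n" keeps Python from treating a bare '\r' as a line break.
    with open(src_path, encoding="utf-8", newline="\n") as fs, \
            open(tgt_path, encoding="utf-8", newline="\n") as ft, \
            open(src_out, "w", encoding="utf-8", newline="\n") as out_s, \
            open(tgt_out, "w", encoding="utf-8", newline="\n") as out_t:
        for s, t in zip_longest(fs, ft):
            if s is None or t is None:
                raise ValueError("input files have different line counts")
            if s.strip() and t.strip():
                out_s.write(s.rstrip("\r\n") + "\n")
                out_t.write(t.rstrip("\r\n") + "\n")
                kept += 1
    return kept


# Demo on throwaway files with invented content.
with tempfile.TemporaryDirectory() as tmp:
    paths = [os.path.join(tmp, n) for n in ("a.en", "a.vi", "out.en", "out.vi")]
    with open(paths[0], "w", encoding="utf-8", newline="\n") as f:
        f.write("hello\n\nthank you\n")
    with open(paths[1], "w", encoding="utf-8", newline="\n") as f:
        f.write("xin chào\nvâng\ncảm ơn\n")
    kept = filter_parallel_files(*paths)
    with open(paths[2], encoding="utf-8") as f:
        out_en = f.read()
    with open(paths[3], encoding="utf-8") as f:
        out_vi = f.read()

print(kept, out_en.splitlines(), out_vi.splitlines())
```

Raising on unequal line counts is a deliberate choice here: silently truncating to the shorter file would hide exactly the kind of mismatch this issue is about.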
