This repository has been archived by the owner on May 2, 2024. It is now read-only.

Issue with English-Vietnamese dataset alignment #6

Open
lengockyquang opened this issue Jul 14, 2019 · 3 comments

Comments

@lengockyquang

lengockyquang commented Jul 14, 2019

Hi @stefan-it, I just downloaded the IWSLT 15 English-Vietnamese dataset and saw some blank lines in both files, so I tried to remove all blank lines with Notepad++. After that, the number of sentences in train.en and train.vi was no longer equal: 133168 sentences for train.en and 133205 for train.vi.

@stefan-it
Owner

Hi @lengockyquang,

I checked the training files: wc -l train.en yields a line count of 133317, and train.vi gives the same number. I think something is wrong with how Notepad++ displays the files (maybe an issue with line breaks).

But could you give some examples of the empty lines? I'll check them then :)
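One way to collect such examples without relying on an editor's display is a short Python sketch. It counts newline bytes, which is what wc -l reports, and lists the 1-based numbers of blank lines. The helper names and the sample bytes below are hypothetical; to check the real data, pass in the bytes read from train.en or train.vi.

```python
def count_newlines(data: bytes) -> int:
    # Like `wc -l`: count newline bytes rather than "visible" lines.
    return data.count(b"\n")


def blank_line_numbers(data: bytes) -> list:
    # 1-based numbers of lines that are empty or whitespace-only.
    lines = data.decode("utf-8").split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a trailing newline does not start another line
    return [n for n, line in enumerate(lines, start=1) if not line.strip()]


# Toy stand-in for a file's contents (made-up data, not from the dataset).
sample = "hello\n\nworld\n \n".encode("utf-8")
print(count_newlines(sample))      # 4
print(blank_line_numbers(sample))  # [2, 4]
```

Running both helpers on each file gives line numbers that can be compared side by side.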

@lengockyquang
Author

I've checked some of the empty lines and found some odd cases where the source sentence is an empty line but the corresponding target sentence is not.

I think this is the reason: when we remove the blank lines from each file separately, the two files become misaligned.
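The effect can be reproduced on a tiny example. The sketch below uses made-up toy lines (not taken from the dataset) and hypothetical helper names: it finds positions where exactly one side is blank, then shows that filtering each side independently leaves lists of different lengths.

```python
def one_sided_blanks(src_lines, tgt_lines):
    # Indices where exactly one of the two paired lines is blank.
    return [
        i
        for i, (s, t) in enumerate(zip(src_lines, tgt_lines))
        if bool(s.strip()) != bool(t.strip())
    ]


def drop_blanks(lines):
    # What "remove all blank lines" does to a single file.
    return [line for line in lines if line.strip()]


# Made-up parallel data: index 1 is blank only on the source side.
src = ["a", "", "c", ""]
tgt = ["A", "B", "C", ""]

print(one_sided_blanks(src, tgt))                    # [1]
print(len(drop_blanks(src)), len(drop_blanks(tgt)))  # 2 3
```

Every one-sided blank shifts all later pairs by one line, which would explain why the two counts diverge after independent cleaning.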

@huybik

huybik commented Sep 25, 2021

Hello, thanks @lengockyquang. Once the cause is known, the fix is easy.

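def align(inpt, trgt):
    # inpt, trgt: full text of the source and target files, one sentence per line
    x = inpt.split('\n')
    y = trgt.split('\n')
    # Line i of x pairs with line i of y, so the files must start out parallel
    assert len(x) == len(y)

    # Drop a pair whenever either side is (nearly) empty, so both lists stay in sync
    kept = [(s, t) for s, t in zip(x, y) if len(s) >= 2 and len(t) >= 2]
    x = [s for s, _ in kept]
    y = [t for _, t in kept]
    return x, y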

x,y = align(inpt, trgt)
print(x[-3], y[-3])
>> thank you very much for your time  rất cảm ơn đã lắng nghe 
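The same idea might also be applied line by line, writing the cleaned pairs straight back to disk. This is only a sketch under assumptions, not the code used above: the function name and the paths passed to it are hypothetical, and it treats whitespace-only lines as empty (via str.strip()) rather than using the len < 2 test.

```python
from itertools import zip_longest


def filter_parallel_files(src_path, tgt_path, src_out, tgt_out):
    """Copy only the pairs in which both sides are non-blank; return the count kept."""
    kept = 0
    # newline="\n" keeps Python from translating line endings on any platform.
    with open(src_path, encoding="utf-8", newline="\n") as src, \
         open(tgt_path, encoding="utf-8", newline="\n") as tgt, \
         open(src_out, "w", encoding="utf-8", newline="\n") as out_src, \
         open(tgt_out, "w", encoding="utf-8", newline="\n") as out_tgt:
        for s, t in zip_longest(src, tgt):
            if s is None or t is None:
                # One file ran out first, so the inputs were not parallel.
                raise ValueError("input files have different line counts")
            if s.strip() and t.strip():
                out_src.write(s.rstrip("\n") + "\n")
                out_tgt.write(t.rstrip("\n") + "\n")
                kept += 1
    return kept
```

Because a pair is either kept or dropped as a unit, the two output files always end up with the same number of lines.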
