Added universal propositions bank for French and German #1866
I've tested with the following code and the output does not look fully correct:
from flair.datasets import UP_GERMAN
# load corpus
corpus = UP_GERMAN()
print(corpus)
# print first sentence in train data
print(corpus.train[0])
This prints:
Sentence: "sent_id Sehr gute Beratung , schnelle Behebung der Probleme , so stelle ich mir Kundenservice vor ." [− Tokens: 17 − Token-Labels: "sent_id Sehr gute Beratung , schnelle Behebung der Probleme , so <AM-MNR> stelle ich <A0> mir Kundenservice <A1> vor ."]
Two problems here:
(1) the sentence is read as "sent_id Sehr gute Beratung , schnelle Behebung der Probleme , so stelle ich mir Kundenservice vor .", but sent_id should not be part of the sentence. The problem is that the UP files contain comment lines. These lines are prefixed by a # symbol and should be skipped. You can get this behavior by setting the comment_symbol parameter in the ColumnCorpus class.
(2) the frames are not annotated. The annotation is printed as "sent_id Sehr gute Beratung , schnelle Behebung der Probleme , so <AM-MNR> stelle ich <A0> mir Kundenservice <A1> vor ."
But the frame annotation should be on the verbs, so the corpus object currently selects the wrong column as frame.
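To make both problems concrete, here is a minimal pure-Python sketch of a column reader. It is not flair's implementation: read_tokens and make_row are hypothetical helpers, the rows are invented and only mimic the shape of the UP files, and "stellen.01" is a made-up frame label. It assumes whitespace-separated columns with the frame at index 9 (counting from 0) and the argument labels at index 10.

```python
def make_row(idx, form, frame="_", arg="_"):
    # Hypothetical row layout: 0=id, 1=form, 2-8=unused, 9=frame, 10=argument.
    return "\t".join([str(idx), form] + ["_"] * 7 + [frame, arg])


def read_tokens(lines, column_format, comment_symbol=None):
    """Pick the configured columns from each line, optionally skipping comments."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Without a comment symbol, "# sent_id = 1" is parsed like a token line.
        if comment_symbol is not None and line.startswith(comment_symbol):
            continue
        fields = line.split()
        rows.append({
            name: fields[index] if index < len(fields) else "_"
            for index, name in column_format.items()
        })
    return rows


sample = [
    "# sent_id = 1",
    make_row(1, "so", arg="AM-MNR"),
    make_row(2, "stelle", frame="stellen.01"),
    make_row(3, "ich", arg="A0"),
]

# Comments not skipped, wrong column: "sent_id" becomes a token, arguments become "frames".
buggy = read_tokens(sample, {1: "text", 10: "frame"})
# Comments skipped, frame read from index 9: the label sits on the verb.
fixed = read_tokens(sample, {1: "text", 9: "frame"}, comment_symbol="#")

print([row["text"] for row in buggy])
print([(row["text"], row["frame"]) for row in fixed])
```

In the buggy call the second whitespace-separated field of the comment line is picked up as a token, which matches the stray sent_id in the printed sentence.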
flair/datasets/sequence_labeling.py
Outdated
base_path: Path = Path(base_path)

# column format
columns = {1: "text", 10: "frame"}
The columns are wrong: it seems that column 10 does not hold the frame information.
It was column 9 (counting from 0), sorry, my mistake. It is fixed now.
train_file="de-up-train.conllu",
test_file="de-up-dev.conllu",
dev_file="de-up-test.conllu",
in_memory=in_memory,
The comment_symbol parameter is missing here (the UP and UD datasets have comments that should not be read).
I have added comment_symbol="#" in both classes, thank you
Looks good, but test and dev splits are switched! Can you change this?
flair/datasets/sequence_labeling.py
Outdated
encoding="utf-8",
train_file="de-up-train.conllu",
test_file="de-up-dev.conllu",
dev_file="de-up-test.conllu",
This is switched: you are loading the dev split as test_file and the test split as dev_file.
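A swap like this is easy to miss by eye. As a rough sketch, a small check can flag split arguments whose file name does not mention the split they are assigned to. mismatched_splits is a hypothetical helper, not part of flair, and it assumes the "<lang>-up-<split>.conllu" naming used in this diff.

```python
def mismatched_splits(train_file, dev_file, test_file):
    """Return the split names whose file name does not contain '-<split>.'."""
    given = {"train": train_file, "dev": dev_file, "test": test_file}
    return [split for split, name in given.items() if f"-{split}." not in name]


# File names as in the diff above: dev and test are swapped.
swapped = mismatched_splits(
    train_file="de-up-train.conllu",
    dev_file="de-up-test.conllu",
    test_file="de-up-dev.conllu",
)

# File names after swapping them back.
corrected = mismatched_splits(
    train_file="de-up-train.conllu",
    dev_file="de-up-dev.conllu",
    test_file="de-up-test.conllu",
)

print(swapped, corrected)
```

The same call with the fr-up-* file names would cover the French class.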
Oh. Merde. Sorry about that. Such sloppy work. I am going to fix that in a couple of minutes, thank you!
Changed it, thank you!
flair/datasets/sequence_labeling.py
Outdated
encoding="utf-8",
train_file="fr-up-train.conllu",
test_file="fr-up-dev.conllu",
dev_file="fr-up-test.conllu",
Same here
@Dabendorf thanks for adding this!
I have added the German and the French data from the Universal Proposition Bank (https://github.com/System-T/UniversalPropositions)