Missing CorrectForm and Typo annotations in multi-word tokens #443
Comments
Yes, per https://universaldependencies.org/u/overview/typos.html#misspelled-multiword-token it should be placed on the internal word if the multiword token is concatenative.
Yes please!
Here's the list. There are certainly going to be some valid cases in this list, as I'm using an automated validation check to identify unknown multi-word token values (along with the corresponding words it splits into) and there will be multi-word tokens I don't have entries for.
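The validation script itself is not shown in this thread. As a rough illustration of the kind of check described above, here is a minimal Python sketch that looks up a multi-word token form in a table of known forms and compares the words it splits into. The `KNOWN_MWTS` table and the `check_mwt` function are hypothetical names invented for this example, and the three entries are only a tiny sample.

```python
# Hypothetical lookup table: lowercased multi-word token form -> expected word forms.
# A real check would need a much larger table; these entries are illustrative only.
KNOWN_MWTS = {
    "don't": ("do", "n't"),
    "can't": ("ca", "n't"),
    "i'm": ("i", "'m"),
}


def check_mwt(form, words):
    """Classify a multi-word token as 'ok', 'mismatch', or 'unknown'.

    'unknown' means the form has no entry in KNOWN_MWTS, so it may be a typo
    or simply a valid form that the table does not cover yet.
    """
    expected = KNOWN_MWTS.get(form.lower())
    if expected is None:
        return "unknown"
    actual = tuple(word.lower() for word in words)
    return "ok" if actual == expected else "mismatch"


print(check_mwt("Don't", ["Do", "n't"]))  # known form, expected split
print(check_mwt("dont", ["do", "nt"]))    # not in the table, so reported
```

A check of this kind reports every form it has no entry for, which is why a list produced this way can contain valid cases alongside genuine typos.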
Great, so it looks like most of these are contractions with missing apostrophes. Is it possible to make a script to autofix these, and then the few miscellaneous ones can be fixed by hand?
It should technically be possible, I think. I don't currently have the bandwidth to implement such a script.
OK, I implemented some regexes to fix most of these. @rhdunn, would you mind spot-checking the corrections and rerunning the script to see if there are any remaining issues?
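The regexes actually used for the fix are not included in this thread. The following Python sketch only illustrates the general idea of proposing a CorrectForm value for a contraction written without its apostrophe. The patterns and the `suggest_correct_form` function are assumptions made for this example, and they should only be applied to forms already annotated as multi-word tokens, since ordinary words such as "were" or "ill" would otherwise match.

```python
import re

# Hypothetical patterns, not the ones used in the treebank fix.
# Negated auxiliaries: "dont" -> "don't", "cant" -> "can't", "wont" -> "won't".
NEGATION = re.compile(
    r"^(do|does|did|is|are|was|were|have|has|had|would|could|should|ca|wo)nt$",
    re.IGNORECASE,
)
# Pronoun + clitic: "im" -> "i'm", "theyre" -> "they're", "youll" -> "you'll".
PRONOUN_CLITIC = re.compile(
    r"^(i|you|we|they|he|she|it)(m|ll|ve|re|d)$",
    re.IGNORECASE,
)


def suggest_correct_form(form):
    """Return a suggested CorrectForm for an apostrophe-less contraction, or None."""
    match = NEGATION.match(form)
    if match:
        return match.group(1) + "n't"
    match = PRONOUN_CLITIC.match(form)
    if match:
        return match.group(1) + "'" + match.group(2)
    return None


for form in ["dont", "Cant", "theyre", "hello"]:
    print(form, "->", suggest_correct_form(form))
```

Anything the patterns do not cover returns `None`, which matches the plan above of fixing the remaining miscellaneous cases by hand.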
Thanks. I've rerun the script on the current dev branch with the following results:
Note: the |
Thanks, most of these are now fixed. Some of these are established colloquial forms marked as |
@rhdunn does your script show any issues that still need addressing, or should I close this?
I'm still getting the following:
The others are the colloquial forms you mentioned earlier, so are fine.
For:
there is a CorrectForm annotation on the internal word of the multi-word token, but there is no corresponding Typo=Yes + CorrectForm annotation on the multi-word token itself. Is this intentional? This makes it difficult to extract the correct form when only viewing the tokens. It also makes validation of multi-word forms difficult, as the repaired (corrected) text in the word stream differs from the token stream.

I've also noticed several missing annotations in the data (token and word) for multi-word tokens, e.g.:
I can create a full list of sentences with these issues.
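As a rough sketch of how such a list could be produced, the Python below scans CoNLL-U text and reports multi-word tokens where an internal word carries CorrectForm in MISC but the range line (e.g. `2-3`) does not. The helper names (`parse_field`, `find_unmarked_mwts`, `row`) and the sample sentence are made up for illustration and are not taken from the treebank. The code assumes well-formed ten-column CoNLL-U input.

```python
def parse_field(field):
    """Turn a CoNLL-U FEATS or MISC column into a dict ('_' means empty)."""
    if field == "_":
        return {}
    return dict(item.split("=", 1) for item in field.split("|") if "=" in item)


def find_unmarked_mwts(conllu_text):
    """Return (mwt_form, word_form, correct_form) for each internal word that has
    CorrectForm in MISC while its multi-word token range line has none."""
    issues = []
    mwt = None  # (start, end, form, misc) of the most recent range line
    for line in conllu_text.splitlines():
        if not line:
            mwt = None  # a blank line ends the sentence
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], parse_field(cols[9])
        if "-" in tok_id:
            start, end = map(int, tok_id.split("-"))
            mwt = (start, end, form, misc)
        elif tok_id.isdigit() and mwt and mwt[0] <= int(tok_id) <= mwt[1]:
            if "CorrectForm" in misc and "CorrectForm" not in mwt[3]:
                issues.append((mwt[2], form, misc["CorrectForm"]))
    return issues


def row(tok_id, form, lemma="_", upos="_", feats="_", misc="_"):
    """Build one tab-separated CoNLL-U line; unused columns are left as '_'."""
    return "\t".join([tok_id, form, lemma, upos, "_", feats, "_", "_", "_", misc])


# Made-up sample sentence, not copied from the corpus.
SAMPLE = "\n".join([
    "# text = I dont know",
    row("1", "I", "I", "PRON"),
    row("2-3", "dont"),
    row("2", "do", "do", "AUX"),
    row("3", "nt", "not", "PART", feats="Typo=Yes", misc="CorrectForm=n't"),
    row("4", "know", "know", "VERB"),
])

print(find_unmarked_mwts(SAMPLE))  # [('dont', 'nt', "n't")]
```

The sketch checks MISC only. A fuller version would also check for Typo=Yes in the FEATS column of the range line.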