-
-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add lexicon tools for use with XRI datasets #621
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @bhartmoore and @ddaspit)
silnlp/common/count_words.py
line 11 at r1 (raw file):
def is_word(word: str) -> bool: val = False
It looks this function defines a string as a word as long as it has a single alphabetic character. If that's the intended functionality, it can stop once any alphabetic character is found, to be more efficient, e.g.
Code snippet:
return any(char.isalpha() for char in word)
silnlp/common/compare_lex.py
line 7 at r1 (raw file):
from ..common.environment import SIL_NLP_ENV def get_all_words(src_file: str) -> list:
The List
type from from typing import List
should be used for consistency with the rest of the repo
@mshannon-sil Thank you for the review. I've made changes to address your comments. |
I missed that it was only called once. In that case, I wouldn't keep it as a function. |
Looking back, I realized that it was because we may want to update the definition of a word over time (e.g. in the newest set, there are alpha words in the source that are translated as numerals in the target, so I probably need to allow for those), and we may want to make it more sophisticated than just looking at whitespace. So if you're ok with it, I may keep it as a function even though it's reduced to one (very elegant!) line. |
I think it could be worthwhile to keep the single-line |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, it's fine to keep it as is then.
Reviewable status: 0 of 2 files reviewed, 2 unresolved discussions (waiting on @bhartmoore and @ddaspit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion (waiting on @bhartmoore and @ddaspit)
These two scripts run checks on XRI datasets. EITL team uses them regularly and I'd like to add them to silnlp.
count_words.py
prints total and unique word counts for the source and target datasets and gives stats on the diversity of target renderings for the most common source words. (May need a more precise name.)compare_lex.py
compares lexicon data for two different source-target sets. It can be used to find what percent coverage each dataset has for the other, e.g. what percent of Bible words are represented in the training data. It defines "word" naively based on whitespace. Can be run on source side, target side, or both.I've cleaned these up per a few helpful suggestions from Matthew:
I expect there maybe concerns with style or functionality. In particular,
count_words.py
uses numpy, pandas, and collections.Counter, which may be cumbersome, and takes a long time to run. I'm glad to address all comments. Thanks.This change is![Reviewable](https://camo.githubusercontent.com/1541c4039185914e83657d3683ec25920c672c6c5c7ab4240ee7bff601adec0b/68747470733a2f2f72657669657761626c652e696f2f7265766965775f627574746f6e2e737667)