Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add lexicon tools for use with XRI datasets #621

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Conversation

bhartmoore
Copy link
Collaborator

@bhartmoore bhartmoore commented Jan 8, 2025

These two scripts run checks on XRI datasets. EITL team uses them regularly and I'd like to add them to silnlp.

  • count_words.py prints total and unique word counts for the source and target datasets and gives stats on the diversity of target renderings for the most common source words. (May need a more precise name.)

  • compare_lex.py compares lexicon data for two different source-target sets. It can be used to find what percent coverage each dataset has for the other, e.g. what percent of Bible words are represented in the training data. It defines "word" naively based on whitespace. Can be run on source side, target side, or both.

I've cleaned these up per a few helpful suggestions from Matthew:

  • lines < 120 chars
  • uses SIL_NLP_ENV for paths to remain platform agnostic
  • placed in silnlp/common

I expect there maybe concerns with style or functionality. In particular, count_words.py uses numpy, pandas, and collections.Counter, which may be cumbersome, and takes a long time to run. I'm glad to address all comments. Thanks.


This change is Reviewable

@bhartmoore bhartmoore self-assigned this Jan 8, 2025
Copy link
Collaborator

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @bhartmoore and @ddaspit)


silnlp/common/count_words.py line 11 at r1 (raw file):

def is_word(word: str) -> bool:
    val = False

It looks this function defines a string as a word as long as it has a single alphabetic character. If that's the intended functionality, it can stop once any alphabetic character is found, to be more efficient, e.g.

Code snippet:

return any(char.isalpha() for char in word)

silnlp/common/compare_lex.py line 7 at r1 (raw file):

from ..common.environment import SIL_NLP_ENV

def get_all_words(src_file: str) -> list:

The List type from from typing import List should be used for consistency with the rest of the repo

@bhartmoore
Copy link
Collaborator Author

@mshannon-sil Thank you for the review. I've made changes to address your comments.
Should is_word() still be a function now that it's a single line? It's only called once.
You're right about the functionality; it's meant to get rid of numbers and stray punctuation that were showing up in the lexicon.

@mshannon-sil
Copy link
Collaborator

I missed that it was only called once. In that case, I wouldn't keep it as a function.

@bhartmoore
Copy link
Collaborator Author

I missed that it was only called once. In that case, I wouldn't keep it as a function.

Looking back, I realized that it was because we may want to update the definition of a word over time (e.g. in the newest set, there are alpha words in the source that are translated as numerals in the target, so I probably need to allow for those), and we may want to make it more sophisticated than just looking at whitespace. So if you're ok with it, I may keep it as a function even though it's reduced to one (very elegant!) line.

@benjaminking
Copy link
Collaborator

I think it could be worthwhile to keep the single-line is_word function here, because the name of the function can act as a bit of documentation, especially since the purpose of that line might not be immediately obvious otherwise.

Copy link
Collaborator

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, it's fine to keep it as is then.

Reviewable status: 0 of 2 files reviewed, 2 unresolved discussions (waiting on @bhartmoore and @ddaspit)

Copy link
Collaborator

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion (waiting on @bhartmoore and @ddaspit)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants