Add lexicon tools for use with XRI datasets #621

bhartmoore · 2025-01-08T20:28:01Z

These two scripts run checks on XRI datasets. EITL team uses them regularly and I'd like to add them to silnlp.

count_words.py prints total and unique word counts for the source and target datasets and gives stats on the diversity of target renderings for the most common source words. (May need a more precise name.)
compare_lex.py compares lexicon data for two different source-target sets. It can be used to find what percent coverage each dataset has for the other, e.g. what percent of Bible words are represented in the training data. It defines "word" naively based on whitespace. Can be run on source side, target side, or both.

I've cleaned these up per a few helpful suggestions from Matthew:

lines < 120 chars
uses SIL_NLP_ENV for paths to remain platform agnostic
placed in silnlp/common

I expect there maybe concerns with style or functionality. In particular, count_words.py uses numpy, pandas, and collections.Counter, which may be cumbersome, and takes a long time to run. I'm glad to address all comments. Thanks.

This change is

mshannon-sil

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @bhartmoore and @ddaspit)

silnlp/common/count_words.py line 11 at r1 (raw file):

def is_word(word: str) -> bool:
    val = False

It looks this function defines a string as a word as long as it has a single alphabetic character. If that's the intended functionality, it can stop once any alphabetic character is found, to be more efficient, e.g.

Code snippet:

return any(char.isalpha() for char in word)

silnlp/common/compare_lex.py line 7 at r1 (raw file):

from ..common.environment import SIL_NLP_ENV

def get_all_words(src_file: str) -> list:

The List type from from typing import List should be used for consistency with the rest of the repo

bhartmoore · 2025-01-09T21:18:45Z

@mshannon-sil Thank you for the review. I've made changes to address your comments.
Should is_word() still be a function now that it's a single line? It's only called once.
You're right about the functionality; it's meant to get rid of numbers and stray punctuation that were showing up in the lexicon.

mshannon-sil · 2025-01-09T21:59:39Z

I missed that it was only called once. In that case, I wouldn't keep it as a function.

bhartmoore · 2025-01-10T13:23:20Z

I missed that it was only called once. In that case, I wouldn't keep it as a function.

Looking back, I realized that it was because we may want to update the definition of a word over time (e.g. in the newest set, there are alpha words in the source that are translated as numerals in the target, so I probably need to allow for those), and we may want to make it more sophisticated than just looking at whitespace. So if you're ok with it, I may keep it as a function even though it's reduced to one (very elegant!) line.

benjaminking · 2025-01-10T14:04:28Z

I think it could be worthwhile to keep the single-line is_word function here, because the name of the function can act as a bit of documentation, especially since the purpose of that line might not be immediately obvious otherwise.

mshannon-sil

Makes sense, it's fine to keep it as is then.

Reviewable status: 0 of 2 files reviewed, 2 unresolved discussions (waiting on @bhartmoore and @ddaspit)

mshannon-sil

Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion (waiting on @bhartmoore and @ddaspit)

Add lexicon tools for use with XRI datasets

6d0367b

bhartmoore requested review from ddaspit and mshannon-sil January 8, 2025 20:28

bhartmoore self-assigned this Jan 8, 2025

Clean up file name handling

9ff05e7

mshannon-sil requested changes Jan 9, 2025

View reviewed changes

Address review comments about List type and efficiency.

1ab9d01

mshannon-sil reviewed Jan 10, 2025

View reviewed changes

mshannon-sil approved these changes Jan 10, 2025

View reviewed changes

Correct type on --num arg.

012d04b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add lexicon tools for use with XRI datasets #621

Add lexicon tools for use with XRI datasets #621

bhartmoore commented Jan 8, 2025 •

edited by ddaspit

Loading

mshannon-sil left a comment

bhartmoore commented Jan 9, 2025

mshannon-sil commented Jan 9, 2025

bhartmoore commented Jan 10, 2025

benjaminking commented Jan 10, 2025

mshannon-sil left a comment

mshannon-sil left a comment

Add lexicon tools for use with XRI datasets #621

Are you sure you want to change the base?

Add lexicon tools for use with XRI datasets #621

Conversation

bhartmoore commented Jan 8, 2025 • edited by ddaspit Loading

mshannon-sil left a comment

Choose a reason for hiding this comment

bhartmoore commented Jan 9, 2025

mshannon-sil commented Jan 9, 2025

bhartmoore commented Jan 10, 2025

benjaminking commented Jan 10, 2025

mshannon-sil left a comment

Choose a reason for hiding this comment

mshannon-sil left a comment

Choose a reason for hiding this comment

bhartmoore commented Jan 8, 2025 •

edited by ddaspit

Loading