Implement __contains__ for vocab. #75
Yep, Python philosophy: why not require a proper implementation if you can do it incredibly slowly ;).
let embeds = self.embeddings.borrow();
Ok(embeds
    .vocab()
    .idx(&word)
Reminder: we should implement word(&self) -> Option<...> and subword(&self) -> Option<...> on WordIndex, so that these cases can be simplified to:
.vocab().idx(&word).and_then(WordIndex::word).is_some()
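A minimal sketch of what the proposed accessors could look like. The `WordIndex` enum below is a stand-in that only mirrors the two variants visible in the diff; its payload types and the `is_in_vocab` helper are assumptions made for illustration, not the finalfusion-rust API. Because the accessors take `&self`, the sketch calls `as_ref()` before `and_then`.

```rust
/// Stand-in for the `WordIndex` type in the diff; payload types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum WordIndex {
    /// The word itself is part of the vocabulary.
    Word(usize),
    /// The word is unknown, but subword indices could be generated for it.
    Subword(Vec<usize>),
}

impl WordIndex {
    /// Returns the word index, or `None` if this is a subword lookup.
    pub fn word(&self) -> Option<usize> {
        match self {
            WordIndex::Word(idx) => Some(*idx),
            WordIndex::Subword(_) => None,
        }
    }

    /// Returns the subword indices, or `None` if this is a word lookup.
    pub fn subword(&self) -> Option<&[usize]> {
        match self {
            WordIndex::Word(_) => None,
            WordIndex::Subword(indices) => Some(indices),
        }
    }
}

/// Hypothetical helper: true only when the lookup hit an actual vocabulary word.
pub fn is_in_vocab(idx: Option<WordIndex>) -> bool {
    // `as_ref` turns Option<WordIndex> into Option<&WordIndex>,
    // which matches the `&self` receiver of `WordIndex::word`.
    idx.as_ref().and_then(WordIndex::word).is_some()
}
```

With accessors like these, the `match` plus `unwrap_or(false)` in the binding collapses into a single chain.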
Do you want to open the issue at finalfusion-rust?
        WordIndex::Word(_) => true,
        WordIndex::Subword(_) => false,
    })
    .unwrap_or(false))
is_some()
Good catch
Actually, is_some() conveys that the word is in-vocab even when a WordIndex::Subword is returned. We'd only return false if no subword indices can be generated for a word. I'd like to return false for any word that isn't part of the actual vocabulary.
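The difference between the two checks can be sketched with a toy subword vocabulary. Everything here (`ToyVocab`, its n-gram fallback, `contains_loose`, `contains_strict`) is hypothetical and only imitates the behaviour described in this thread; it is not how finalfusion computes subword indices.

```rust
use std::collections::HashMap;

/// Stand-in for the `WordIndex` type in the diff; payload types are assumptions.
#[derive(Debug, PartialEq)]
pub enum WordIndex {
    Word(usize),
    Subword(Vec<usize>),
}

/// Toy vocabulary: known words, plus a fake subword fallback for
/// unknown words that have at least `min_n` characters.
pub struct ToyVocab {
    words: HashMap<String, usize>,
    min_n: usize,
}

impl ToyVocab {
    pub fn new(words: &[&str], min_n: usize) -> Self {
        let words = words
            .iter()
            .enumerate()
            .map(|(i, w)| (w.to_string(), i))
            .collect();
        ToyVocab { words, min_n }
    }

    pub fn idx(&self, word: &str) -> Option<WordIndex> {
        if let Some(&i) = self.words.get(word) {
            return Some(WordIndex::Word(i));
        }
        // Unknown word: pretend every n-gram of length `min_n` gets an index.
        let n_chars = word.chars().count();
        if n_chars < self.min_n {
            return None; // too short, no subword indices can be generated
        }
        let n_ngrams = n_chars - self.min_n + 1;
        let offset = self.words.len();
        Some(WordIndex::Subword((0..n_ngrams).map(|i| offset + i).collect()))
    }
}

/// `is_some()` alone: also true for words that only have subword indices.
pub fn contains_loose(vocab: &ToyVocab, word: &str) -> bool {
    vocab.idx(word).is_some()
}

/// The check from the diff: true only for actual vocabulary words.
pub fn contains_strict(vocab: &ToyVocab, word: &str) -> bool {
    vocab
        .idx(word)
        .map(|idx| match idx {
            WordIndex::Word(_) => true,
            WordIndex::Subword(_) => false,
        })
        .unwrap_or(false)
}
```

For a vocabulary built from "apple" and "pear" with `min_n = 3`, the unknown word "apples" passes the loose check but fails the strict one, which is the case the comment above is concerned with.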
Ah, right. But this can be made nice with is_some() once we have the new WordIndices changes in finalfusion-rust.
Only with a bumped dependency ;)
I was playing around with some alternative vocab implementations last night. I thought it'd be nice to index the vocab both with [...] Then I tried to implement an iterator on the vocabulary since we don't get that for free on [...] Implementing [...] If I go ahead and implement [...] I opened an issue at PyO3/pyo3#611 regarding this. Do you have more experience with this or any idea what's going on here?
Python performs a linear search over sequences if __contains__ is not explicitly implemented. This is painfully slow over large vocabularies, so implement __contains__ explicitly.

Fixes #74

Thanks @twuebi and @ketxd for pointing this out.
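The cost difference behind this change can be sketched in plain Rust, as an analogy rather than the actual binding code. `contains_linear` imitates the fallback membership test by walking the sequence and counting comparisons, while `contains_hashed` does a single hash lookup. All names are hypothetical.

```rust
use std::collections::HashMap;

/// Imitates the fallback membership test: compare against every item in turn.
/// Returns whether the word was found and how many items were compared.
pub fn contains_linear(words: &[String], query: &str) -> (bool, usize) {
    let mut compared = 0;
    for word in words {
        compared += 1;
        if word == query {
            return (true, compared);
        }
    }
    (false, compared)
}

/// Builds a word -> index map once, so later membership tests are hash lookups.
pub fn build_index(words: &[String]) -> HashMap<&str, usize> {
    words
        .iter()
        .enumerate()
        .map(|(i, w)| (w.as_str(), i))
        .collect()
}

/// Explicit membership test: one hash lookup, independent of vocabulary size.
pub fn contains_hashed(index: &HashMap<&str, usize>, query: &str) -> bool {
    index.contains_key(query)
}
```

In this sketch a miss costs one comparison per vocabulary entry in the linear version, while the hashed version does a constant amount of work per query on average.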