Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement __contains__ for vocab. #75

Merged
merged 1 commit into from
Oct 5, 2019
Merged

Implement __contains__ for vocab. #75

merged 1 commit into from
Oct 5, 2019

Conversation

sebpuetz
Copy link
Member

@sebpuetz sebpuetz commented Oct 3, 2019

Python performs linear search over sequences if __contains__ is
not explicitly implemented. This is painfully slow over large
vocabularies, therefore implement __contains__ explicitly.

Fixes #74

thanks @twuebi and @ketxd for pointing this out.

@sebpuetz sebpuetz requested a review from danieldk October 3, 2019 12:07
@danieldk
Copy link
Member

danieldk commented Oct 3, 2019

Yep, Python philosophy: why not require a proper implementation if you can do it incredibly slowly ;).

let embeds = self.embeddings.borrow();
Ok(embeds
.vocab()
.idx(&word)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder, we should implement word(&self) -> Option<...> and subword(&self) -> Option<...> on WordIndex, so that these cases can be simplified to:

.vocab().idx(&word).and_then(WordIndex::word).is_some()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to open the issue at finalfusion-rust?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WordIndex::Word(_) => true,
WordIndex::Subword(_) => false,
})
.unwrap_or(false))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_some()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, is_some() conveys that the word is in-vocab if there's a WordIndex::Subword returned. We'd only return false if no subword indices can be generated for a word. I'd like to return false for any word that isn't part of the actual vocabulary

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right. But this can be made nice with is_some() with the new WordIndices changes in finalfusion-rust.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only with a bumped dependency ;)

@sebpuetz
Copy link
Member Author

sebpuetz commented Oct 4, 2019

I was playing around with some alternative vocab implementations last night. I thought it'd be nice to index the vocab both with String and int, since we could return the word of the given index and also the index for a given word. This is doable through PyMappingProtocol and an enum, where extract() constructs the correct variant depending on the type coming from python.

Then I tried to implement an iterator on the vocabulary since we don't get that for free on PyMappingProtocol by implementing __getitem__ and __len__, which is when things got weird:

Implementing PyMappingProtocol::__iter__ for the vocab worked fine, the method returns some iterator struct, implementing PyIterProtocol - similar to what we do in PyEmbeddings. Although, for some reason in Python that doesn't make the vocab iterable, e.g. for _ in vocab fails, but vocab.__iter__() works.

If I go ahead and implement PyIteratorProtocol for the vocab, the vocab becomes iterable (for _ in vocab works) but ignores both PyMappingProtocol::__iter__ and PyMappingProtocol::__contains__ and uses PyIteratorProtocol::__iter__ for a linear search for something like "a" in vocab. vocab.__contains__("a") on the other hand calls PyMappingProtocol::__contains__.

I opened an issue at PyO3/pyo3#611 regarding this. Do you have more experience with this or any idea what's going on here?

Python performs linear search over sequences if __contains__ is
not explicitly implemented. This is painfully slow over large
vocabularies, therefore implement __contains__ explicitly.
@sebpuetz sebpuetz merged commit 18c3a2b into master Oct 5, 2019
@sebpuetz sebpuetz deleted the contains branch October 20, 2019 08:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

__contains__ on vocab slow
2 participants