Implement __contains__ for vocab. #75

sebpuetz · 2019-10-03T12:06:26Z

Python falls back to a linear search over a sequence if __contains__
is not explicitly implemented. This is painfully slow for large
vocabularies, so implement __contains__ explicitly.
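
To make the cost difference concrete, here is a minimal sketch in plain Rust. ToyVocab and both method names are hypothetical stand-ins and not the finalfusion types; the linear variant only approximates what a sequence-style fallback amounts to, while the hashed variant is what an explicit membership test can do.

```rust
use std::collections::HashMap;

/// Toy vocabulary: words in insertion order plus a word -> index map.
/// Hypothetical stand-in, not the finalfusion vocabulary type.
struct ToyVocab {
    words: Vec<String>,
    indices: HashMap<String, usize>,
}

impl ToyVocab {
    fn new(words: &[&str]) -> Self {
        let words: Vec<String> = words.iter().map(|w| w.to_string()).collect();
        let indices = words
            .iter()
            .enumerate()
            .map(|(idx, word)| (word.clone(), idx))
            .collect();
        ToyVocab { words, indices }
    }

    /// Compare against every word in turn: O(n) in the vocabulary size.
    fn contains_linear(&self, word: &str) -> bool {
        self.words.iter().any(|w| w.as_str() == word)
    }

    /// A single hash lookup: O(1) on average.
    fn contains_hashed(&self, word: &str) -> bool {
        self.indices.contains_key(word)
    }
}
```

Both methods return the same answer; only the amount of work per query differs, which is what matters for vocabularies with many entries.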

Fixes #74

Thanks @twuebi and @ketxd for pointing this out.

danieldk · 2019-10-03T12:48:47Z

Yep, Python philosophy: why require a proper implementation if you can do it incredibly slowly ;).

danieldk · 2019-10-03T12:52:16Z

src/vocab.rs

+        let embeds = self.embeddings.borrow();
+        Ok(embeds
+            .vocab()
+            .idx(&word)


Reminder, we should implement word(&self) -> Option<...> and subword(&self) -> Option<...> on WordIndex, so that these cases can be simplified to:

.vocab().idx(&word).and_then(WordIndex::word).is_some()

Do you want to open the issue at finalfusion-rust?

finalfusion/finalfusion-rust#76
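
A rough sketch of what the proposed helpers could look like. The WordIndex below is a local stand-in with assumed variant payloads, not the actual finalfusion-rust definition, and is_known_word is a hypothetical name. The comment writes word(&self); this sketch takes self by value so that and_then(WordIndex::word) type-checks directly on an Option<WordIndex>. With &self, an as_ref() before and_then would be needed.

```rust
/// Stand-in for finalfusion's WordIndex; the variant payloads are assumed.
#[derive(Debug, Clone, PartialEq)]
enum WordIndex {
    Word(usize),
    Subword(Vec<usize>),
}

impl WordIndex {
    /// Index of an in-vocabulary word, None for a subword lookup.
    fn word(self) -> Option<usize> {
        match self {
            WordIndex::Word(idx) => Some(idx),
            WordIndex::Subword(_) => None,
        }
    }

    /// Subword indices of an out-of-vocabulary word, None for a known word.
    fn subword(self) -> Option<Vec<usize>> {
        match self {
            WordIndex::Word(_) => None,
            WordIndex::Subword(indices) => Some(indices),
        }
    }
}

/// The simplified membership check, applied to the result of a lookup
/// such as vocab().idx(&word).
fn is_known_word(lookup: Option<WordIndex>) -> bool {
    lookup.and_then(WordIndex::word).is_some()
}
```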

danieldk · 2019-10-03T12:54:27Z

src/vocab.rs

+                WordIndex::Word(_) => true,
+                WordIndex::Subword(_) => false,
+            })
+            .unwrap_or(false))


is_some()

Actually, is_some() would report a word as in-vocab whenever a WordIndex::Subword is returned, so we'd only return false if no subword indices can be generated for the word. I'd like to return false for any word that isn't part of the actual vocabulary.

Ah right. But this can be made nice with is_some() with the new WordIndices changes in finalfusion-rust.

Only with a bumped dependency ;)
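
The difference discussed in this thread can be sketched with a toy vocabulary that has a subword fallback. Everything here is hypothetical: ToySubwordVocab, min_n, and the made-up n-gram indices only imitate the behaviour described above, and contains_strict follows the shape of the match in the diff rather than reproducing the actual code.

```rust
use std::collections::HashMap;

/// Stand-in for finalfusion's WordIndex; the variant payloads are assumed.
#[derive(Debug, Clone, PartialEq)]
enum WordIndex {
    Word(usize),
    Subword(Vec<usize>),
}

/// Hypothetical vocabulary with a subword fallback.
struct ToySubwordVocab {
    indices: HashMap<String, usize>,
    /// Minimum n-gram length; shorter unknown words get no subword indices.
    min_n: usize,
}

impl ToySubwordVocab {
    fn idx(&self, word: &str) -> Option<WordIndex> {
        if let Some(&idx) = self.indices.get(word) {
            return Some(WordIndex::Word(idx));
        }
        // Made-up fallback: one fake index per character n-gram of length min_n.
        let n_chars = word.chars().count();
        if n_chars >= self.min_n {
            let n_ngrams = n_chars - self.min_n + 1;
            Some(WordIndex::Subword((0..n_ngrams).collect()))
        } else {
            None
        }
    }

    /// is_some() on the lookup: true for known words and for any word
    /// that yields subword indices.
    fn contains_loose(&self, word: &str) -> bool {
        self.idx(word).is_some()
    }

    /// Match on the variant: true only for words in the actual vocabulary.
    fn contains_strict(&self, word: &str) -> bool {
        self.idx(word)
            .map(|idx| match idx {
                WordIndex::Word(_) => true,
                WordIndex::Subword(_) => false,
            })
            .unwrap_or(false)
    }
}
```

With a vocabulary containing only "haus" and min_n set to 3, "hausboot" is reported as present by contains_loose but not by contains_strict, while "ab" is rejected by both.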

sebpuetz · 2019-10-04T09:17:12Z

I was playing around with some alternative vocab implementations last night. I thought it'd be nice to index the vocab with both String and int, so that we can return the word for a given index as well as the index for a given word. This is doable through PyMappingProtocol and an enum, where extract() constructs the correct variant depending on the type coming from Python.
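
A minimal sketch of the enum idea without any PyO3 involved. VocabKey, VocabItem, and BiVocab are hypothetical names, and the From impls merely stand in for the extract() step mentioned above, which would pick the variant based on the Python type.

```rust
use std::collections::HashMap;

/// Hypothetical key type: one variant per accepted argument type.
#[derive(Debug, Clone, PartialEq)]
enum VocabKey {
    Word(String),
    Index(usize),
}

impl From<&str> for VocabKey {
    fn from(word: &str) -> Self {
        VocabKey::Word(word.to_string())
    }
}

impl From<usize> for VocabKey {
    fn from(idx: usize) -> Self {
        VocabKey::Index(idx)
    }
}

/// Result of a lookup: an index for a word key, a word for an index key.
#[derive(Debug, Clone, PartialEq)]
enum VocabItem {
    Index(usize),
    Word(String),
}

/// Hypothetical vocabulary that can be indexed in both directions.
struct BiVocab {
    words: Vec<String>,
    indices: HashMap<String, usize>,
}

impl BiVocab {
    fn new(words: &[&str]) -> Self {
        let words: Vec<String> = words.iter().map(|w| w.to_string()).collect();
        let indices = words
            .iter()
            .enumerate()
            .map(|(idx, word)| (word.clone(), idx))
            .collect();
        BiVocab { words, indices }
    }

    /// Dispatch on the key variant; None if the word or index is unknown.
    fn get(&self, key: impl Into<VocabKey>) -> Option<VocabItem> {
        match key.into() {
            VocabKey::Word(word) => self.indices.get(&word).copied().map(VocabItem::Index),
            VocabKey::Index(idx) => self.words.get(idx).cloned().map(VocabItem::Word),
        }
    }
}
```

Calling get("die") on a vocabulary built from "der", "die", "das" yields the index of the word, and get(2usize) yields the word stored at that index.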

Then I tried to implement an iterator on the vocabulary since we don't get that for free on PyMappingProtocol by implementing __getitem__ and __len__, which is when things got weird:

Implementing PyMappingProtocol::__iter__ for the vocab worked fine: the method returns an iterator struct that implements PyIterProtocol, similar to what we do in PyEmbeddings. For some reason, though, that doesn't make the vocab iterable in Python: for _ in vocab fails, but vocab.__iter__() works.

If I go ahead and implement PyIterProtocol for the vocab itself, the vocab becomes iterable (for _ in vocab works), but Python then ignores both PyMappingProtocol::__iter__ and PyMappingProtocol::__contains__ and uses PyIterProtocol::__iter__ to run a linear search for something like "a" in vocab. vocab.__contains__("a"), on the other hand, calls PyMappingProtocol::__contains__.

I opened an issue at PyO3/pyo3#611 regarding this. Do you have more experience with this or any idea what's going on here?

sebpuetz requested a review from danieldk October 3, 2019 12:07

danieldk requested changes Oct 3, 2019

sebpuetz force-pushed the contains branch from bef350d to af104ec October 3, 2019 13:00

Implement __contains__ for vocab.

cfdb094

Python performs linear search over sequences if __contains__ is not explicitly implemented. This is painfully slow over large vocabularies, therefore implement __contains__ explicitly.

sebpuetz force-pushed the contains branch from af104ec to cfdb094 October 5, 2019 07:40

sebpuetz mentioned this pull request Oct 5, 2019

Update finalfusion dependency. #76

Merged

danieldk approved these changes Oct 5, 2019

sebpuetz merged commit 18c3a2b into master Oct 5, 2019

sebpuetz deleted the contains branch October 20, 2019 08:28
