Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Analogy and Similarity Queries. #97

Merged
merged 1 commit into from
Jun 2, 2020
Merged

Conversation

sebpuetz
Copy link
Member


tests are taken from the Rust version.

Might make sense to also use a namedtuples for the iterators and the embedding_with_norm methods...

@sebpuetz sebpuetz requested a review from danieldk May 31, 2020 11:47
@sebpuetz sebpuetz changed the title Implement Similarity Queries. Implement Analogy and Similarity Queries. May 31, 2020
@sebpuetz sebpuetz linked an issue May 31, 2020 that may be closed by this pull request
skip for skip in (self.vocab.word_index.get(skip)
for skip in skips) if skip is not None
]
indices = sims.argsort()[::-1]
Copy link
Member

@danieldk danieldk Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does an O(n log n) sort. A good binary heap implementation should be faster, since insertions are O(1) in the average case. Also, the worst case would only be O(log k) if the heap is at most k items. So, inserting all items in the heap should be O(n) on average or O(n log k) worst case.

Another approach that would be faster is something like quickselect, where you'd only guarantee a particular item or set of items to be sorted. numpy seems to have a function for this:

https://numpy.org/doc/stable/reference/generated/numpy.argpartition.html

I think you could either use argpartition with an array, or you could use it to get the top-k elements and then sort, so that the running time of the sort is only O(k log k).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with argpartition and sorting a heap after partioning.

for skip in skips) if skip is not None
]
indices = sims.argsort()[::-1]
mask = np.isin(indices, skip_indices, invert=True, assume_unique=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the cost of this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a bunch of sorting and array allocations, I was expecting some equivalent to

[idx for idx in indices if idx not in skip_indices]

@sebpuetz
Copy link
Member Author

sebpuetz commented Jun 2, 2020

Thanks for the review, I pushed an update!

@sebpuetz sebpuetz merged commit 79be265 into master Jun 2, 2020
@sebpuetz sebpuetz deleted the sim-queries branch June 2, 2020 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Implement Analogies & Similarities
2 participants