Scoring document similarity #267

Closed
aih opened this issue Mar 26, 2021 · 1 comment

Comments

aih (Collaborator) commented Mar 26, 2021

Current scores are based on the matched text within sections. The scores are not normalized across documents, so they are hard to interpret. Ideally, we would have a 'similarity match' score where 1 is a complete match and 0 is completely different. Cosine similarity appears to work well for this, and the algorithm in the branch feature/golang-process, FlatGov/server_py/flatgov/common/bill_similarity_scikit.py, generally works (a document compared with itself scores 1, similar documents score around 0.4, and the rest score about 0.05). However, it is not clear how to efficiently create the initial vectors for a large number of documents, or how to search those vectors efficiently.
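For reference, here is a minimal sketch of the kind of score described above: cosine similarity over simple word-count vectors, using only the standard library. It is not the code in bill_similarity_scikit.py; the tokenizer, the function names, and the sample texts are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count word tokens (a simple bag-of-words vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample texts, not taken from real bills.
bill_a = term_counts("A bill to amend the Internal Revenue Code to extend energy credits")
bill_b = term_counts("A bill to amend the Internal Revenue Code to repeal certain credits")
bill_c = term_counts("Designating a post office building in Springfield")

print(cosine_similarity(bill_a, bill_a))  # same document: 1, up to rounding
print(cosine_similarity(bill_a, bill_b))  # similar documents: noticeably higher
print(cosine_similarity(bill_a, bill_c))  # unrelated documents: close to 0
```

Because count vectors have no negative components, the result always falls between 0 and 1, which gives the interpretable range asked for above. Comparing every pair of documents this way is quadratic in the number of documents, which is the scaling concern raised in the comment.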

The best approach is probably to build the search into Elasticsearch, using its cosine similarity function in a script_score query. See:

https://www.elastic.co/guide/en/elasticsearch/reference/7.6/query-dsl-script-score-query.html#vector-functions
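A rough sketch of what such a request body could look like, built as plain Python dictionaries. The field name `text_vector` and the helper names are hypothetical, and the sketch assumes each document has already been indexed with a `dense_vector` field. The exact way the vector field is passed to `cosineSimilarity` has varied between 7.x releases, so the script source should be checked against the linked page for the version in use.

```python
import json


def dense_vector_mapping(dims, vector_field="text_vector"):
    """Index mapping with a dense_vector field (field name is a placeholder)."""
    return {
        "mappings": {
            "properties": {
                vector_field: {"type": "dense_vector", "dims": dims},
            }
        }
    }


def cosine_script_score_query(query_vector, vector_field="text_vector"):
    """script_score query that ranks documents by cosine similarity to query_vector."""
    return {
        "query": {
            "script_score": {
                # match_all scores every document; a cheaper filter could narrow this.
                "query": {"match_all": {}},
                "script": {
                    # Adding 1.0 keeps the score non-negative, as Elasticsearch requires.
                    "source": f"cosineSimilarity(params.query_vector, '{vector_field}') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        }
    }


body = cosine_script_score_query([0.12, 0.0, 0.57])
print(json.dumps(body, indent=2))
```

With the `+ 1.0` offset the returned `_score` ranges from 0 to 2, so it would need to be shifted back before being reported as a 0 to 1 similarity.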

aih (Collaborator, Author) commented May 22, 2021

The section-level scores are now normalized to 100, and document similarity is measured separately using n-grams. Closing.
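The comment does not say how the normalization or the n-gram comparison is implemented. As one possible illustration, the sketch below scales raw section scores so the top score becomes 100 and measures n-gram similarity as the Jaccard overlap of word 4-grams. Both choices, and all function names, are assumptions rather than the project's actual method.

```python
def word_ngrams(text, n=4):
    """Set of word n-grams in the text (whitespace tokenization, lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_similarity(text_a, text_b, n=4):
    """Jaccard overlap of the two n-gram sets: 1.0 if identical, 0.0 if disjoint."""
    a, b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def normalize_to_100(scores):
    """Scale raw scores so the highest one becomes 100 (assumed scheme)."""
    top = max(scores.values(), default=0)
    if top <= 0:
        return {key: 0.0 for key in scores}
    return {key: 100.0 * value / top for key, value in scores.items()}


# Hypothetical raw section scores keyed by section id.
print(normalize_to_100({"sec-1": 42.0, "sec-2": 10.5}))
print(ngram_similarity(
    "the secretary shall submit a report to congress",
    "the secretary shall submit a report to the committee",
))
```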

aih closed this as completed May 22, 2021