Scoring document similarity #267

aih · 2021-03-26T16:35:36Z

Current scores are based on the matched text within sections. Because the score is not normalized across documents, it is hard to interpret. Ideally, we would have a 'similarity match' score where 1 is a complete match and 0 is completely different. Cosine similarity appears to work well for this: the algorithm in the branch feature/golang-process, in FlatGov/server_py/flatgov/common/bill_similarity_scikit.py, generally works (a document compared with itself scores 1, similar documents score around 0.4, and unrelated ones about 0.05). However, it is not clear how to efficiently build the initial vectors for a large number of documents, or how to search those vectors efficiently.
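For reference, here is a minimal sketch of the kind of cosine score described above. It is not the code in bill_similarity_scikit.py: it uses raw term counts and only the standard library, and the function names `term_vector` and `cosine_similarity` are made up for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map a document to a bag-of-words vector of raw term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Return the cosine of the angle between two term-count vectors.

    With non-negative counts the result lies in [0, 1]: 1 for documents
    with identical term distributions, 0 when no terms are shared.
    """
    vec_a, vec_b = term_vector(text_a), term_vector(text_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as matching nothing
    return dot / (norm_a * norm_b)
```

Because each vector is divided by its own length, the score does not grow with document size, which is what makes it comparable across documents. The pairwise form shown here is also where the scaling concern comes from: comparing every document with every other one is quadratic in the number of documents.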

The best approach is probably to build the search into Elasticsearch, using its built-in cosine similarity function in a script_score query. See:

https://www.elastic.co/guide/en/elasticsearch/reference/7.6/query-dsl-script-score-query.html#vector-functions
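As a rough sketch of what such a query could look like, the helper below builds a script_score request body as a plain Python dict. The helper name `build_cosine_query` and the field name `text_vector` are hypothetical, and the field is assumed to be mapped as `dense_vector`. The script string follows the form I believe the 7.6 page linked above uses; earlier 7.x releases passed `doc['field']` rather than the bare field name, so check it against the docs for the version in use.

```python
def build_cosine_query(query_vector, vector_field="text_vector"):
    """Build a script_score query body that ranks documents by cosine similarity.

    `vector_field` is a hypothetical dense_vector field holding one
    precomputed vector per document.
    """
    return {
        "query": {
            "script_score": {
                # Score every document; a narrower filter query could go here.
                "query": {"match_all": {}},
                "script": {
                    # Elasticsearch scores must not be negative, hence the + 1.0.
                    "source": (
                        f"cosineSimilarity(params.query_vector, '{vector_field}') + 1.0"
                    ),
                    "params": {"query_vector": list(query_vector)},
                },
            }
        }
    }
```

The returned `_score` would then be the cosine similarity plus 1, so subtracting 1 recovers the 0-to-1 value discussed above. Note that this still scores every document matched by the inner query, so it moves the computation into Elasticsearch without making the search sublinear.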

aih · 2021-05-22T18:06:27Z

The section-level scores are now normalized to 100, and document similarity is measured separately, using n-grams, to determine similarity levels. Closing.

aih closed this as completed May 22, 2021
