Current scores are based on the matched section text. The score is not normalized across documents, so it is hard to interpret. Ideally we would have a 'similarity match' score where 1 means a complete match and 0 means completely different. Cosine similarity metrics appear to work well for this: the algorithm in the feature/golang-process branch (FlatGov/server_py/flatgov/common/bill_similarity_scikit.py) generally behaves as expected (the same document scores 1.0, similar documents around ~0.4, and unrelated ones around 0.05). However, it is not clear how to efficiently build the initial vectors for a large number of documents, or how to search over those vectors efficiently.
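To make the 1.0 / ~0.4 / ~0.05 behavior concrete, here is a minimal, dependency-free sketch of cosine similarity over bag-of-words counts. The actual script in bill_similarity_scikit.py uses scikit-learn vectorizers, so this is an illustration of the metric, not that implementation:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts using raw term counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

doc = "the secretary shall submit a report to congress"
print(cosine_similarity(doc, doc))                      # identical text scores 1.0
print(cosine_similarity(doc, "unrelated budget text"))  # disjoint vocabularies score 0.0
```

Comparing a document with itself always yields 1.0, which is exactly the normalization property the current match scores lack.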
The best approach is probably to build the search into Elasticsearch, using its cosine similarity script scoring. See:
https://www.elastic.co/guide/en/elasticsearch/reference/7.6/query-dsl-script-score-query.html#vector-functions
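Following the linked docs, the query would index each section's vector in a `dense_vector` field and score with the `cosineSimilarity` script function. A sketch of what the mapping and query body might look like (field name `section_vector` and the dimensionality are assumptions, not decided here):

```python
DIMS = 128  # dimensionality of the section vectors (assumption)

# Hypothetical index mapping: one dense_vector field per section document.
mapping = {
    "mappings": {
        "properties": {
            "section_text": {"type": "text"},
            "section_vector": {"type": "dense_vector", "dims": DIMS},
        }
    }
}

def similarity_query(query_vector):
    """script_score query per the ES 7.6 vector-functions docs.

    The +1.0 offset keeps _score non-negative, as Elasticsearch requires;
    subtract 1.0 from the returned score to recover the raw cosine value.
    """
    return {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, doc['section_vector']) + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        }
    }

print(similarity_query([0.1] * DIMS))
```

This would push both vector storage and the similarity search into Elasticsearch, avoiding the open question of building and scanning one large vector matrix ourselves.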