TermFrequency sqrt(field.count(term))
InverseDocFrequency log( doc_count / ( num_docs_which_contain_term + 1 ) ) + 1
fieldNorm 1 / sqrt( num_chars_in_field )
Weight of term in document field = TF * IDF * fieldNorm
Weight of term in query = IDF * field_boost_exponent
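The weighting scheme above can be sketched in a few lines of Python. Function names, the whitespace tokenizer, and the argument shapes here are illustrative assumptions, not the repo's actual API:

```python
import math

def tf(term, field_tokens):
    # TermFrequency: sqrt of the raw count of the term in the field
    return math.sqrt(field_tokens.count(term))

def idf(doc_count, num_docs_which_contain_term):
    # InverseDocFrequency: log(doc_count / (docfreq + 1)) + 1
    return math.log(doc_count / (num_docs_which_contain_term + 1)) + 1

def field_norm(field_text):
    # fieldNorm: shorter fields score higher for the same term count
    return 1 / math.sqrt(len(field_text))

def doc_term_weight(term, field_text, doc_count, num_docs_which_contain_term,
                    use_field_norms=True):
    # Weight of term in document field = TF * IDF * fieldNorm
    tokens = field_text.lower().split()  # naive tokenizer, assumed for the sketch
    weight = tf(term, tokens) * idf(doc_count, num_docs_which_contain_term)
    return weight * field_norm(field_text) if use_field_norms else weight
```

The `+ 1` inside the log keeps the denominator nonzero for unseen terms, and the outer `+ 1` keeps IDF positive even when a term appears in every document.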
- Multiply each term_score in the query with the matching term_score of doc[n] to produce the dot_product of query_vector <-> doc[n]_vector.
- If a term exists in more than one field within the same document, pick the field with the highest-scoring match.
- dot_product * ( matching_terms / num_terms_in_query ), which penalizes matches that are missing some query terms.
- dot_product / ( norm( query_vector ) * norm( doc_vector ) )
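The scoring steps above amount to cosine similarity with a coordination factor. A minimal sketch, assuming query and document vectors are plain `{term: weight}` dicts (a shape chosen for illustration, not taken from the repo):

```python
import math

def score(query_weights, doc_weights):
    # Dot product over the terms present in both vectors
    matching = [t for t in query_weights if t in doc_weights]
    dot = sum(query_weights[t] * doc_weights[t] for t in matching)

    # Coordination factor: penalize docs missing some query terms
    coord = len(matching) / len(query_weights)

    # Cosine normalization: divide by the product of vector norms
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    return dot * coord / (q_norm * d_norm)
```

Note the parentheses in the normalization: the dot product is divided by the *product* of the two norms, which bounds the raw cosine term to [0, 1] for non-negative weights.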
The easiest way to get started: clone the repo and run python -i search_engine.py
products = json.load( open('movies.json') )
search_engine = SearchEngine(docs=products, use_field_norms=False)
field_boosts = {'title': 1.1}
query = search_engine.query("gi joe ww2 documentary", field_boosts=field_boosts, num_results=10)
- use_field_norms (boolean): Factor field length into TF-IDF scoring
- num_results (int): Total results to display
- field_boosts (dict[string, float]): Boost certain fields by an exponent