TermFrequency sqrt(field.count(term))
InverseDocFrequency log( doc_count / ( num_docs_which_contain_term + 1 ) ) + 1
fieldNorm 1 / sqrt( num_chars_in_field )
Weight of term in document field = TF * IDF * fieldNorm
Weight of term in query = IDF * field_boost_exponent
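The weighting scheme above can be sketched in a few lines of Python. Function names, the whitespace tokenizer, and the argument shapes here are illustrative assumptions, not the repo's actual API:

```python
import math

def tf(term, field_tokens):
    # TermFrequency: sqrt of the raw count of the term in the field
    return math.sqrt(field_tokens.count(term))

def idf(doc_count, num_docs_which_contain_term):
    # InverseDocFrequency: log(doc_count / (docfreq + 1)) + 1
    return math.log(doc_count / (num_docs_which_contain_term + 1)) + 1

def field_norm(field_text):
    # fieldNorm: shorter fields score higher for the same term count
    return 1 / math.sqrt(len(field_text))

def doc_term_weight(term, field_text, doc_count, num_docs_which_contain_term,
                    use_field_norms=True):
    # Weight of term in document field = TF * IDF * fieldNorm
    tokens = field_text.lower().split()  # naive tokenizer, assumed for the sketch
    weight = tf(term, tokens) * idf(doc_count, num_docs_which_contain_term)
    return weight * field_norm(field_text) if use_field_norms else weight
```

The `+ 1` inside the log keeps the denominator nonzero for unseen terms, and the outer `+ 1` keeps IDF positive even when a term appears in every document.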
- Multiply each term_score in the query with the matching term_score of doc[n] to produce the dot_product of query_vector <-> doc[n]_vector.
- If a term exists in more than one field within the same document, pick the field with the highest-scoring match.
- dot_product * ( matching_terms / num_terms_in_query ), which penalizes matches that are missing some query terms.
- dot_product / ( norm( query_vector ) * norm( doc_vector ) )
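The scoring steps above amount to cosine similarity with a coordination factor. A minimal sketch, assuming query and document vectors are plain `{term: weight}` dicts (a shape chosen for illustration, not taken from the repo):

```python
import math

def score(query_weights, doc_weights):
    # Dot product over the terms present in both vectors
    matching = [t for t in query_weights if t in doc_weights]
    dot = sum(query_weights[t] * doc_weights[t] for t in matching)

    # Coordination factor: penalize docs missing some query terms
    coord = len(matching) / len(query_weights)

    # Cosine normalization: divide by the product of vector norms
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    return dot * coord / (q_norm * d_norm)
```

Note the parentheses in the normalization: the dot product is divided by the *product* of the two norms, which bounds the raw cosine term to [0, 1] for non-negative weights.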
The easiest way to get started: clone the repo and run python -i search_engine.py
products = json.load( open('movies.json') )
search_engine = SearchEngine(docs=products, use_field_norms=False)
field_boosts = {'title': 1.1}
query = search_engine.query("gi joe ww2 documentary", field_boosts=field_boosts, num_results=10)
- use_field_norms (boolean): Factor field length into TF-IDF scoring
- num_results (int): Total results to display
- field_boosts (dict[string, float]): Boost certain fields by an exponent