
Add smart information retrieval system for TfidfModel. Fix #1785 #1791

Merged
merged 34 commits, Jan 15, 2018
Changes from 3 commits

Commits (34)
5e1830b
fixing appveyor
markroxor Oct 20, 2016
6cef4b1
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
markroxor Dec 2, 2016
e8a3f16
verify weights
markroxor Dec 15, 2017
648bf21
verify weights
markroxor Dec 15, 2017
a6f1afb
smartirs ready
markroxor Dec 15, 2017
d091138
change old tests
markroxor Dec 15, 2017
951c549
remove lambdas
markroxor Dec 15, 2017
40c0558
address suggestions
markroxor Dec 16, 2017
b35344c
minor fix
markroxor Dec 19, 2017
0917e75
pep8 fix
markroxor Dec 19, 2017
bef79cc
numpy style doc strings
markroxor Dec 19, 2017
d3d431c
fix pickle problem
menshikh-iv Dec 21, 2017
0e6f21e
flake8 fix
markroxor Dec 21, 2017
7ee7560
fix bug in docstring
menshikh-iv Dec 21, 2017
f2251a4
Merge branch 'develop' of github.com:markroxor/gensim into develop
markroxor Dec 22, 2017
b2def84
added few tests
markroxor Dec 22, 2017
5b2d37a
fix normalize issue for pickling
markroxor Dec 22, 2017
ac4b154
fix normalize issue for pickling
markroxor Dec 22, 2017
0bacc08
test without sklearn api
markroxor Dec 22, 2017
51e0eb9
Merge branch 'smartirs' of github.com:markroxor/gensim into smartirs
markroxor Dec 22, 2017
3039732
hanging idents and new tests
markroxor Dec 25, 2017
99e6a6f
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
markroxor Dec 26, 2017
7d63d9c
Merge branch 'smartirs' of github.com:markroxor/gensim into smartirs
markroxor Dec 26, 2017
e5140f8
add docstring
markroxor Dec 26, 2017
4afbadd
add docstring
markroxor Dec 26, 2017
d2fe235
Merge branch 'smartirs' of github.com:markroxor/gensim into smartirs
markroxor Dec 26, 2017
52ee3c4
better way cmparing floats
markroxor Dec 27, 2017
48e84f7
old way of cmp floats
markroxor Jan 8, 2018
6d2f47b
Merge remote-tracking branch 'upstream/develop' into develop
menshikh-iv Jan 10, 2018
607ba61
Merge branch 'develop' into smartirs
menshikh-iv Jan 10, 2018
d0878a4
doc fix[1]
menshikh-iv Jan 10, 2018
b544c9c
doc fix[2]
menshikh-iv Jan 11, 2018
c4e3656
fix description TODOs
markroxor Jan 11, 2018
98ffde5
fix irksome comparision
markroxor Jan 15, 2018
83 changes: 56 additions & 27 deletions gensim/models/tfidfmodel.py
@@ -11,24 +11,27 @@
from gensim import interfaces, matutils, utils
from six import iteritems

import numpy as np

logger = logging.getLogger(__name__)


def df2idf(docfreq, totaldocs, log_base=2.0, add=0.0):
"""
Compute default inverse-document-frequency for a term with document frequency `doc_freq`::
def resolve_weights(smartirs):
Contributor: docstrings needed too (for all stuff here)

Contributor Author: I think that "Checks for validity of smartirs parameter." is enough. Do you have anything else in mind as well?

Contributor: @markroxor need to add "Parameters" (type, description), "Raises" (type, reason), "Returns" (type, description)

if not isinstance(smartirs, str) or len(smartirs) != 3:
raise ValueError('Expected a string of length 3, but got ' + smartirs)

idf = add + log(totaldocs / doc_freq)
"""
return add + math.log(1.0 * totaldocs / docfreq, log_base)
w_tf, w_df, w_n = smartirs

if w_tf not in 'nlabL':
raise ValueError('Expected term frequency weight to be one of nlabL, but got ' + w_tf)
Contributor: better to use 'nlabL' instead of nlabL (readability), same for got ...


if w_df not in 'ntp':
raise ValueError('Expected inverse document frequency weight to be one of ntp, but got ' + w_df)

def precompute_idfs(wglobal, dfs, total_docs):
"""Precompute the inverse document frequency mapping for all terms."""
# not strictly necessary and could be computed on the fly in TfidfModel__getitem__.
# this method is here just to speed things up a little.
return {termid: wglobal(df, total_docs) for termid, df in iteritems(dfs)}
if w_n not in 'ncb':
raise ValueError('Expected normalization weight to be one of ncb, but got ' + w_n)

return w_tf, w_df, w_n
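
For readers unfamiliar with SMART notation: the three letters select the term-frequency, document-frequency, and normalization schemes, respectively. A self-contained sketch of the validation above (mirroring the diff, with the error messages slightly tidied) behaves like this:

```python
def resolve_weights(smartirs):
    """Split a SMART notation string like 'ntc' into its
    (term frequency, document frequency, normalization) letters,
    raising ValueError for anything malformed."""
    if not isinstance(smartirs, str) or len(smartirs) != 3:
        raise ValueError("Expected a string of length 3, but got %r" % (smartirs,))
    w_tf, w_df, w_n = smartirs
    if w_tf not in 'nlabL':
        raise ValueError("Expected term frequency weight to be one of 'nlabL', but got %r" % w_tf)
    if w_df not in 'ntp':
        raise ValueError("Expected inverse document frequency weight to be one of 'ntp', but got %r" % w_df)
    if w_n not in 'ncb':
        raise ValueError("Expected normalization weight to be one of 'ncb', but got %r" % w_n)
    return w_tf, w_df, w_n

print(resolve_weights('ntc'))  # → ('n', 't', 'c'), the classic tf-idf with cosine normalization
```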


class TfidfModel(interfaces.TransformationABC):
@@ -49,8 +52,8 @@ class TfidfModel(interfaces.TransformationABC):
Model persistency is achieved via its load/save methods.
"""

def __init__(self, corpus=None, id2word=None, dictionary=None,
wlocal=utils.identity, wglobal=df2idf, normalize=True):
def __init__(self, corpus=None, id2word=None, dictionary=None, smartirs="ntc",
wlocal=None, wglobal=None, normalize=None):
Contributor: Better to support backward compatibility; why did you change the default values? A good solution: keep the default behavior with smartirs=None by default, but if the user sets smartirs, use it (and ignore wlocal, wglobal, ...)

Contributor Author:
  • It is backward compatible. Please check the smartirs value. :)
  • This can be followed, but I refrained because I removed the df2idf function totally, since it is redundant. Are you sure I should add it back?

"""
Compute tf-idf by multiplying a local component (term frequency) with a
Contributor: Can you convert all docstrings in this file to numpy-style, according to my previous comment #1780 (comment)

global component (inverse document frequency), and normalizing
@@ -78,10 +81,41 @@ def __init__(self, corpus=None, id2word=None, dictionary=None,
and it will be used to directly construct the inverse document frequency
mapping (then `corpus`, if specified, is ignored).
"""
self.normalize = normalize
self.id2word = id2word
self.wlocal, self.wglobal = wlocal, wglobal
self.wlocal, self.wglobal, self.normalize = wlocal, wglobal, normalize
self.num_docs, self.num_nnz, self.idfs = None, None, None
n_tf, n_df, n_n = smartirs
self.smartirs = smartirs

if self.wlocal is None:
if n_tf == "n":
self.wlocal = lambda tf, mean=None, _max=None: tf
Contributor: better to use a simple def (instead of a lambda) to avoid pickle problems (here and everywhere)

elif n_tf == "l":
self.wlocal = lambda tf, mean=None, _max=None: 1 + math.log(tf)
elif n_tf == "a":
self.wlocal = lambda tf, mean=None, _max=None: 0.5 + (0.5 * tf / _max)
elif n_tf == "b":
self.wlocal = lambda tf, mean=None, _max=None: 1 if tf > 0 else 0
elif n_tf == "L":
self.wlocal = lambda tf, mean=None, _max=None: (1 + math.log(tf)) / (1 + math.log(mean))
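
The reviewer's point about lambdas aside, the five local (term-frequency) schemes can be sketched as plain functions (the names tf_n through tf_L are mine, not gensim's), where `_max` and `mean` are the maximum and mean raw frequency within the current document:

```python
import math

def tf_n(tf, mean=None, _max=None):
    return tf                                  # 'n': natural (raw count)

def tf_l(tf, mean=None, _max=None):
    return 1 + math.log(tf)                    # 'l': logarithmic damping

def tf_a(tf, mean=None, _max=None):
    return 0.5 + 0.5 * tf / _max               # 'a': augmented by the document's max frequency

def tf_b(tf, mean=None, _max=None):
    return 1 if tf > 0 else 0                  # 'b': boolean presence

def tf_L(tf, mean=None, _max=None):
    return (1 + math.log(tf)) / (1 + math.log(mean))   # 'L': log average

print(tf_a(2, _max=4))  # → 0.75
```

Module-level defs like these also pickle cleanly, which is exactly why the reviewer suggests them over lambdas.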

if self.wglobal is None:
if n_df == "n":
self.wglobal = utils.identity
elif n_df == "t":
self.wglobal = lambda docfreq, totaldocs: math.log(1.0 * totaldocs / docfreq, 10)
elif n_df == "p":
self.wglobal = lambda docfreq, totaldocs: math.log((float(totaldocs) - docfreq) / docfreq)
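
For reference, the three global (document-frequency) branches above can also be written as standalone functions (the names idf_n, idf_t, idf_p are mine; 'n' mirrors utils.identity applied to the document frequency). Note that the probabilistic variant 'p' goes negative once a term appears in more than half of the documents:

```python
import math

def idf_n(docfreq, totaldocs):
    return docfreq                                           # 'n': no idf weighting (identity)

def idf_t(docfreq, totaldocs):
    return math.log(1.0 * totaldocs / docfreq, 10)           # 't': classic idf, base-10 log

def idf_p(docfreq, totaldocs):
    return math.log((float(totaldocs) - docfreq) / docfreq)  # 'p': probabilistic idf

print(round(idf_t(1, 1000), 6))  # → 3.0
```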

if self.normalize is None or isinstance(self.normalize, bool):
if n_n == "n" or self.normalize is False:
self.normalize = lambda x: x
elif n_n == "c" or self.normalize is True:
self.normalize = matutils.unitvec
# TODO write byte-size normalisation
# elif n_n == "b":
# self.normalize = matutils.unitvec
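
The 'c' branch is cosine (unit-length) normalization: matutils.unitvec scales a sparse vector so its Euclidean norm is 1. A minimal equivalent over (termid, weight) pairs, assuming a plain-Python sparse representation:

```python
import math

def unit_length(vector):
    """Scale a sparse [(termid, weight), ...] vector to unit Euclidean length."""
    norm = math.sqrt(sum(weight ** 2 for _, weight in vector))
    if norm == 0.0:
        return vector  # zero vector: nothing to scale
    return [(termid, weight / norm) for termid, weight in vector]

print(unit_length([(0, 3.0), (1, 4.0)]))  # → [(0, 0.6), (1, 0.8)]
```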

if dictionary is not None:
# user supplied a Dictionary object, which already contains all the
# statistics we need to construct the IDF mapping. we can skip the
@@ -92,7 +126,7 @@ def __init__(self, corpus=None, id2word=None, dictionary=None,
)
self.num_docs, self.num_nnz = dictionary.num_docs, dictionary.num_nnz
self.dfs = dictionary.dfs.copy()
self.idfs = precompute_idfs(self.wglobal, self.dfs, self.num_docs)

if id2word is None:
self.id2word = dictionary
elif corpus is not None:
@@ -113,6 +147,7 @@ def initialize(self, corpus):
logger.info("collecting document frequencies")
dfs = {}
numnnz, docno = 0, -1

for docno, bow in enumerate(corpus):
if docno % 10000 == 0:
logger.info("PROGRESS: processing document #%i", docno)
@@ -127,11 +162,6 @@ def initialize(self, corpus):

# and finally compute the idf weights
n_features = max(dfs) if dfs else 0
logger.info(
Contributor: why did you remove this?

Contributor Author: This showed the progress of the pre-computation step. Now that the values are no longer being pre-computed, I think we need to log this in a different way. How do you want it to go?

"calculating IDF weights for %i documents and %i features (%i matrix non-zeros)",
self.num_docs, n_features, self.num_nnz
)
self.idfs = precompute_idfs(self.wglobal, self.dfs, self.num_docs)

def __getitem__(self, bow, eps=1e-12):
"""
@@ -144,17 +174,16 @@ def __getitem__(self, bow, eps=1e-12):

# unknown (new) terms will be given zero weight (NOT infinity/huge weight,
# as strict application of the IDF formula would dictate)

vector = [
(termid, self.wlocal(tf) * self.idfs.get(termid))
for termid, tf in bow if self.idfs.get(termid, 0.0) != 0.0
(termid, self.wlocal(tf, mean=np.mean(np.array(bow), axis=1), _max=np.max(bow, axis=1)) * self.wglobal(self.dfs[termid], self.num_docs))
Owner: This looks wasteful (creating arrays, only to throw them away). What are the performance implications of these changes? Do you have a benchmark before/after?

Contributor Author: Changed the approach.

for termid, tf in bow if self.wglobal(self.dfs[termid], self.num_docs) != 0.0
]

# and finally, normalize the vector either to unit length, or use a
# user-defined normalization function
if self.normalize is True:
vector = matutils.unitvec(vector)
elif self.normalize:
vector = self.normalize(vector)

vector = self.normalize(vector)

# make sure there are no explicit zeroes in the vector (must be sparse)
vector = [(termid, weight) for termid, weight in vector if abs(weight) > eps]
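
Putting the pieces together for the default 'ntc' scheme (natural tf, base-10 idf, cosine normalization), the whole __getitem__ pipeline can be sketched as one self-contained function (tfidf_vector is my name, and the mean/_max bookkeeping for the 'a' and 'L' schemes is deliberately omitted):

```python
import math

def tfidf_vector(bow, dfs, num_docs, eps=1e-12):
    """Sketch of TfidfModel.__getitem__ for the 'ntc' scheme.
    bow: [(termid, raw_count), ...]; dfs: {termid: document frequency}."""
    vector = [
        (termid, tf * math.log(1.0 * num_docs / dfs[termid], 10))
        for termid, tf in bow
        if dfs.get(termid, 0) != 0 and dfs[termid] != num_docs  # skip unseen terms and zero idf
    ]
    # cosine ('c') normalization
    norm = math.sqrt(sum(w ** 2 for _, w in vector))
    if norm > 0.0:
        vector = [(termid, w / norm) for termid, w in vector]
    # keep the result sparse: drop explicit near-zero entries
    return [(termid, w) for termid, w in vector if abs(w) > eps]
```

For example, with dfs = {0: 1, 1: 2} over num_docs = 2 documents, term 1 (present in every document) drops out, and the single remaining weight normalizes to 1.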
7 changes: 4 additions & 3 deletions gensim/sklearn_api/tfidf.py
@@ -21,14 +21,15 @@ class TfIdfTransformer(TransformerMixin, BaseEstimator):
Base Tf-Idf module
"""

def __init__(self, id2word=None, dictionary=None, wlocal=gensim.utils.identity,
wglobal=gensim.models.tfidfmodel.df2idf, normalize=True):
def __init__(self, id2word=None, dictionary=None, smartirs="ntc", wlocal=None,
wglobal=None, normalize=True):
"""
Sklearn wrapper for Tf-Idf model.
"""
self.gensim_model = None
self.id2word = id2word
self.dictionary = dictionary
self.smartirs = smartirs
self.wlocal = wlocal
self.wglobal = wglobal
self.normalize = normalize
@@ -38,7 +39,7 @@ def fit(self, X, y=None):
Fit the model according to the given training data.
"""
self.gensim_model = TfidfModel(
corpus=X, id2word=self.id2word, dictionary=self.dictionary,
corpus=X, id2word=self.id2word, dictionary=self.dictionary, smartirs=self.smartirs,
wlocal=self.wlocal, wglobal=self.wglobal, normalize=self.normalize
)
return self
7 changes: 3 additions & 4 deletions gensim/test/test_sklearn_api.py
@@ -498,7 +498,6 @@ def testPersistence(self):
original_matrix = self.model.transform(original_bow)
passed = numpy.allclose(loaded_matrix, original_matrix, atol=1e-1)
self.assertTrue(passed)

def testModelNotFitted(self):
Contributor: Need to add more tests (for new functionality)

Contributor Author: I have that in my checklist but before that I need to pass the already present tests.

lsi_wrapper = LsiTransformer(id2word=dictionary, num_topics=2)
texts_new = ['graph', 'eulerian']
@@ -973,13 +972,13 @@ def testTransform(self):

def testSetGetParams(self):
Contributor: Don't forget to add more tests (also, check situations where you pass both smartirs and wlocal, for example)

# updating only one param
self.model.set_params(normalize=False)
self.model.set_params(smartirs='nnn')
model_params = self.model.get_params()
self.assertEqual(model_params["normalize"], False)
self.assertEqual(model_params["smartirs"], 'nnn')

# verify that the attributes values are also changed for `gensim_model` after fitting
self.model.fit(self.corpus)
self.assertEqual(getattr(self.model.gensim_model, 'normalize'), False)
self.assertEqual(getattr(self.model.gensim_model, 'smartirs'), 'nnn')

def testPipeline(self):
with open(datapath('mini_newsgroup'), 'rb') as f: