Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem while passing wlocal function in tfidf model #2444

Closed
xmedved1 opened this issue Apr 11, 2019 · 7 comments
Closed

Problem while passing wlocal function in tfidf model #2444

xmedved1 opened this issue Apr 11, 2019 · 7 comments

Comments

@xmedved1
Copy link

xmedved1 commented Apr 11, 2019

Problem description

I tried to use math.sqrt function on term frequency when computing TF-IDF model as you declare in script documentation:

wlocals : function, optional
            Function for local weighting, default for `wlocal` is :func:`~gensim.utils.identity`
            (other options: :func:`math.sqrt`, :func:`math.log1p`, etc).

========================================
Gensim implementation:

tf_array = self.wlocal(np.array(tf_array))
vector = [
   (termid, tf * self.idfs.get(termid))
   for termid, tf in zip(termid_array, tf_array) if abs(self.idfs.get(termid, 0.0)) > self.eps
]

Error:

  File "/home/xmedved1/.local/lib/python3.6/site-packages/gensim/models/tfidfmodel.py", line 432, in __getitem__
    tf_array = self.wlocal(np.array(tf_array))
TypeError: only size-1 arrays can be converted to Python scalars

======================================

Fix if I want to use math.sqrt:

tf_array = np.array(tf_array)
vector = [
    (termid, self.wlocal(tf) * self.idfs.get(termid))
    for termid, tf in zip(termid_array, tf_array) if abs(self.idfs.get(termid, 0.0)) > self.eps
]

==========================================
gensim version 3.7.2

==========================================
I don't know if this is a problem to fix, because in Gensim implementation I can do some normalization that includes all items in list (something like tf = n_i_j / sum(n_k_j) where i,k is token and i!=k and j is the document –> this is not allowed in my fix). So I thing the problem is the documentation of wlocal parameter.

Best regards M

@piskvorky
Copy link
Owner

@Witiko could you please have a look? Many thanks.

@Witiko
Copy link
Contributor

Witiko commented Apr 11, 2019

@piskvorky This appears to be a regression from 0bfb9da, which is part of #1791 (Gensim 3.3.0). Since then, wlocal has been applied to a numpy array of term frequencies rather than individual term frequencies, see the corresponding diff.

One way to regain the original behaviour would be to use numpy.vectorize, but a rewrite of the SMART implementation would also be required, since it depends on the ability to compute statistics of all term frequencies (see the smartirs_wlocal function).

An alternate solution would be to document the new behavior, although this is arguably a breaking change.

@piskvorky
Copy link
Owner

piskvorky commented Apr 11, 2019

I don't really understand what that means, what are these corners @xmedved1 pointed out in his fix. What are the actual code contracts and invariants now?

@Witiko Can you please suggest a fix to the documentation (+a unit test to capture any future regressions), if this is indeed a problem with documentation? CC @markroxor as author of #1791.

TFIDF is a simple model, I'm strongly -1 on introducing any bloat or complexity that obscures its basic use-cases because of some obscure useful-maybe-if options.

@Witiko
Copy link
Contributor

Witiko commented Apr 11, 2019

@xmedved1's fix and changing tf_array = self.wlocal(np.array(tf_array)) to tf_array = np.vectorize(self.wlocal)(np.array(tf_array)) seem equivalent, although mine makes fewer changes. However, @markroxor's SMART implementation from #1791 (and also my #2420) assumes that wlocal accepts numpy arrays, not individual frequencies (see the smartirs_wlocal function), so this would require a larger rewrite, @xmedved1's fix is not enough.

I agree with @xmedved1 that it would be better to document rather than fix this, since a wlocal function benefits from access to the entire term frequency vector rather than just a single term frequency. It allows for a) more efficient transformations and b) transformations that require all term frequencies (see the a and L transformations in the smartirs_wlocal function).

As for the update of the documentation, changing the current text:

wlocals : function, optional

Function for local weighting, default for `wlocal` is :func:`~gensim.utils.identity`
(other options: :func:`math.sqrt`, :func:`math.log1p`, etc).

to the following:

wlocal : function, optional

Function for local weighting, default for `wlocal` is :func:`~gensim.utils.identity`
(other options: :func:`numpy.sqrt`, :func:`numpy.log1p`, etc).

should clear any misunderstanding. However, it could be considered a breaking change, since it is a departure from the behaviour of Gensim before 3.3.0. Semver and consideration for other developers dictate that a breaking change should be postponed until Gensim 4.

A unit test can consist of a simple wlocals function that asserts the type of the parameter (a numpy array), and performs some simple transformation (array + 1). We would then initialize a TfidfModel, transform a BOW vector using the model, and check if we receive the expected transformed BOW vector + 1.

@piskvorky
Copy link
Owner

piskvorky commented Apr 12, 2019

Reading that as a user, that documentation update (math => numpy) still doesn't tell me much. What should my own function look like?

@xmedved1 @Witiko Can we make the documented API contract stronger? Especially if it potentially breaks code somewhere else (smartirs something).

@Witiko
Copy link
Contributor

Witiko commented Apr 12, 2019

Reading that as a user, that documentation update (math => numpy) still doesn't tell me much. What should my own function look like?

Perhaps we can show one of the SMART functions as an example:

wlocal : callable, optional

Function for local weighting, default for `wlocal` is :func:`~gensim.utils.identity`
(other options: :func:`numpy.sqrt`, `lambda tf: 0.5 + (0.5 * tf / tf.max())`, etc.).

@xmedved1 @Witiko Can make the documented API contract stronger? Especially if it potentially breaks code somewhere else (smartirs something).

Currently, wlocal is a callable with the following signature: numpy.array -> iterable. We can make this contract explicit by putting it into the documentation. We can also incorporate this into the unit test described above and make wlocal return iter(tf + 1) instead of tf + 1, so that TfidfModel is required to handle iterators, not just numpy arrays.

@piskvorky Would you like me to make the above changes in #2420?

@piskvorky
Copy link
Owner

piskvorky commented Apr 20, 2019

Good practical examples are definitely preferable to (though not precluding) complex type descriptions.

My view here is entirely pragmatic: I don't understand myself how I should to make use of this functionality. This hints at a missing tutorial / separate "example usage" section (with focus on motivation and context, not just parameter types), rather than only extending a function docstring here and there.

More tests are always welcome, yes. Thanks.

Witiko added a commit to Witiko/gensim that referenced this issue Apr 23, 2019
@mpenkov mpenkov closed this as completed in 11eb5df Jul 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants