
Unpredictable ZeroDivisionErrors in direct_confirmation_measure #2181

Closed
alexeyev opened this issue Sep 12, 2018 · 14 comments
Labels
need info Not enough information to reproduce the issue; more info needed from author

Comments

@alexeyev

Hello.

Thank you for your great work!

It seems that problem #1064 is back. I am getting a lot of errors like this:

  File "train.py", line 430, in <module>
    logger.info(" ".join(["Coherence:", coh_type, "\t\t", str(coherence.get_coherence())]))
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/coherencemodel.py", line 435, in get_coherence
    confirmed_measures = self.get_coherence_per_topic()
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/coherencemodel.py", line 425, in get_coherence_per_topic
    return measure.conf(segmented_topics, self._accumulator, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gensim/topic_coherence/direct_confirmation_measure.py", line 71, in log_conditional_probability
    m_lc_i = np.log(((co_occur_count / num_docs) + EPSILON) / (w_star_count / num_docs))
ZeroDivisionError: float division by zero

when I use version 3.5.0 installed from PyPI.

Thanks in advance.

What other info should I provide?

@menshikh-iv
Contributor

Hi @alexeyev, thanks for the report. Please provide a minimal code example, with data, that reproduces your issue.

menshikh-iv added the need info label Sep 13, 2018
@menshikh-iv
Contributor

Non-reproducible (not enough information)

@johann-petrak
Contributor

I am getting this with version 3.4.0 (conda 3.4.0-py36h14c3975_0) when running
ldamodel.top_topics(corpusreader, processes=-1).
The corpusreader is exactly the same as the one used for training the model.

@menshikh-iv
Contributor

Hi @johann-petrak, please provide a code example for reproducing your issue (including the data that you used for training).

@johann-petrak
Contributor

Sadly, the dataset is dozens of gigabytes in size, and its license does not allow me to share the data.

If this is the code producing the error:

m_lc_i = np.log(((co_occur_count / num_docs) + EPSILON) / (w_star_count / num_docs))

then the only ways to get that error are that either num_docs is zero or w_star_count is zero. I can try to figure out which it is, and maybe a little about how it happens, here...
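
For reference, written out, that line computes the UMass log conditional probability

$$m_{lc}(S_i) = \log \frac{P(W', W^{*}) + \epsilon}{P(W^{*})}, \qquad P(W) = \frac{\mathrm{count}(W)}{\mathrm{num\_docs}}$$

so it blows up exactly when P(W*) = 0. A minimal sketch of the failure with made-up counts (EPSILON stands in for the small smoothing constant defined in direct_confirmation_measure):

import numpy as np

EPSILON = 1e-12       # stand-in for the module's smoothing constant
num_docs = 1000       # total documents seen by the accumulator
co_occur_count = 0.0  # joint count of (w_prime, w_star)
w_star_count = 0.0    # w_star never occurs in the corpus

# (w_star_count / num_docs) == 0.0, so the outer division raises
# ZeroDivisionError: float division by zero before np.log is reached
m_lc_i = np.log(((co_occur_count / num_docs) + EPSILON) / (w_star_count / num_docs))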

@menshikh-iv
Contributor

@johann-petrak that would be really nice, please do 👍

@johann-petrak
Contributor

As I expected, this happens because w_star_count is 0. I do not know, though, why the accumulator entry for w_star is 0 or how that can happen.

@menshikh-iv
Contributor

menshikh-iv commented Oct 15, 2018

@johann-petrak so, can you try to run your code with the debugger and see step by step how this attribute changes?

@hallelujahdrive

I have the same problem. I found that it is caused by the LDA training, not by CoherenceModel.
The problem can occur when the LDA model's vocabulary includes words that are not present in the corpus.
I generated a dictionary from more documents than were used to generate the corpus, and the problem occurred.
It does not occur if the same documents are used to generate both the dictionary and the corpus.
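
A toy sketch of the mismatch (made-up texts, not my actual data):

from gensim.corpora.dictionary import Dictionary

texts = [['human', 'interface'], ['graph', 'trees']]
dictionary = Dictionary(texts)            # vocabulary covers both documents
corpus = [dictionary.doc2bow(texts[0])]   # corpus built from the first text only

# words that exist in the dictionary (and can end up in LDA topics)
# but never occur in the corpus, so their document counts stay zero:
seen = {word_id for doc in corpus for word_id, _ in doc}
print([dictionary[word_id] for word_id in dictionary.keys() if word_id not in seen])
# -> ['graph', 'trees']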

@menshikh-iv
Contributor

@hallelujahdrive hello, please give us a minimal reproducible example (code + data that produces the current issue).

@johann-petrak
Contributor

Is the method ldamodel.top_topics(corpusreader) meant to work on a corpus that is not
absolutely identical to the one that was used for training the model?
The documentation says nothing about any restriction, but I can confirm that this error occurs when the corpus does not contain all words from the model.

@hallelujahdrive

I reproduced the problem with the following code.
I used sklearn.datasets.fetch_20newsgroups as the dataset.

from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel
from polyglot.text import Text
from sklearn.datasets import fetch_20newsgroups


def docs2bow(docs):
    # keep only the nouns and pronouns of each document
    texts = []
    for doc in docs:
        tokens = Text(doc, hint_language_code='en')
        bow = [token[0] for token in tokens.pos_tags if token[1] in ('NOUN', 'PRON')]
        texts.append(bow)
    return texts


if __name__ == '__main__':
    twenty_train = fetch_20newsgroups(subset='train')
    texts = docs2bow(twenty_train.data[0:999])

    # the dictionary is built from all 999 documents, but the corpus only
    # from the last 99, so the dictionary contains words that never occur
    # in the corpus -- this is the mismatch that triggers the error
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts[900:999]]

    lda = LdaModel(corpus, num_topics=50)

    cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary, coherence='u_mass')

    coherence = cm.get_coherence()
    print(coherence)
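
For comparison, if the corpus is built from the same texts that produced the dictionary (assuming the mismatch above is indeed the trigger), the error goes away:

corpus = [dictionary.doc2bow(text) for text in texts]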

@johann-petrak
Contributor

Could somebody in the know please explain what the intended use of this function is (see my previous comment), i.e. is it supposed to work for a corpus that is not absolutely identical to the one used for training?

As pointed out, this seems to happen whenever a word present in the topics does not occur in the corpus passed to the method.

What I also find very confusing is that the function returns a list of pairs (topicrepresentation, coherence), where the first element is the actual word distribution for the topic. Why does it not return the index of that topic instead? Are the topics returned here supposed to be identical to the ones stored with the trained LDA model and accessible through get_topics? If so, it would be much easier to match them up if the top_topics(corpus) method, or some other method top_topicidxs(corpus), returned the indices of the topics, i.e. a list of pairs (topicindex, coherence).
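
Something like this hypothetical helper (not existing gensim API, just a sketch of what I mean) could recover the indices by matching each returned topic's top words against show_topic:

def top_topic_indices(lda, scored_topics, topn=20):
    # scored_topics is the output of lda.top_topics(...): a list of
    # (topic_words, coherence) pairs, where topic_words is [(prob, word), ...]
    indices = []
    for topic_words, coherence in scored_topics:
        words = [word for _, word in topic_words]
        for idx in range(lda.num_topics):
            # show_topic returns [(word, prob), ...] for one model topic
            if [word for word, _ in lda.show_topic(idx, topn=topn)] == words:
                indices.append((idx, coherence))
                break
    return indices

# usage: topn must match the topn passed to top_topics
# scored = lda.top_topics(corpus, topn=20)
# print(top_topic_indices(lda, scored, topn=20))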

@menshikh-iv
Contributor

BTW, this should be fixed by #2259.
