
Unpredictable ZeroDivisionErrors in direct_confirmation_measure #2181

Closed
alexeyev opened this issue Sep 12, 2018 · 14 comments
Labels
need info Not enough information to reproduce the issue; more info needed from author

Comments

@alexeyev

Hello.

Thank you for your great work!

It seems that problem #1064 is back. I am getting a lot of errors like this:

  File "train.py", line 430, in <module>
    logger.info(" ".join(["Coherence:", coh_type, "\t\t", str(coherence.get_coherence())]))
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/coherencemodel.py", line 435, in get_coherence
    confirmed_measures = self.get_coherence_per_topic()
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/coherencemodel.py", line 425, in get_coherence_per_topic
    return measure.conf(segmented_topics, self._accumulator, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gensim/topic_coherence/direct_confirmation_measure.py", line 71, in log_conditional_probability
    m_lc_i = np.log(((co_occur_count / num_docs) + EPSILON) / (w_star_count / num_docs))
ZeroDivisionError: float division by zero

when I use version 3.5.0 installed from PyPI.

Thanks in advance.

What other info should I provide?

@menshikh-iv
Contributor

Hi @alexeyev, thanks for the report. Please provide a minimal code example, with data, that reproduces your issue.

menshikh-iv added the need info label Sep 13, 2018
@menshikh-iv
Contributor

Non-reproducible (not enough information)

@johann-petrak
Contributor

I am getting this with version 3.4.0 (conda 3.4.0-py36h14c3975_0) when running
ldamodel.top_topics(corpusreader, processes=-1).
The corpusreader is exactly the same as the one used for training the model.

@menshikh-iv
Contributor

Hi @johann-petrak, please provide a code example for reproducing your issue (including the data that you used for training).

@johann-petrak
Contributor

Sadly, the dataset is dozens of gigabytes in size, and its license does not allow me to share the data.

If this is the code producing the error:

m_lc_i = np.log(((co_occur_count / num_docs) + EPSILON) / (w_star_count / num_docs))

then the only ways to get that error are that either num_docs is zero or w_star_count is zero. I can try to figure out which it is, and maybe a little about how it happens, here...
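
For reference, written out, that line computes the UMass log conditional probability

$$m_{lc}(S_i) = \log \frac{P(W', W^{*}) + \epsilon}{P(W^{*})}, \qquad P(W) = \frac{\mathrm{count}(W)}{\mathrm{num\_docs}}$$

so it blows up exactly when P(W*) = 0. A minimal sketch of the failure with made-up counts (EPSILON stands in for the small smoothing constant defined in direct_confirmation_measure):

import numpy as np

EPSILON = 1e-12       # stand-in for the module's smoothing constant
num_docs = 1000       # total documents seen by the accumulator
co_occur_count = 0.0  # joint count of (w_prime, w_star)
w_star_count = 0.0    # w_star never occurs in the corpus

# (w_star_count / num_docs) == 0.0, so the outer division raises
# ZeroDivisionError: float division by zero before np.log is reached
m_lc_i = np.log(((co_occur_count / num_docs) + EPSILON) / (w_star_count / num_docs))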

@menshikh-iv
Contributor

@johann-petrak that would be really nice, please do 👍

@johann-petrak
Contributor

As I expected, this happens because w_star_count is 0. I do not know, though, why the accumulator entry for w_star is 0 or how that can happen.

@menshikh-iv
Contributor

menshikh-iv commented Oct 15, 2018

@johann-petrak so, can you try to run your code with the debugger and see step by step how this attribute changes?

@hallelujahdrive

I have the same problem. I found that it is caused by the LDA training, not by CoherenceModel.
The problem can occur when the LDA model's vocabulary includes words that are not present in the corpus.
I generated a dictionary from more documents than were used to generate the corpus, and the problem occurred.
It does not occur if the same documents are used to generate both the dictionary and the corpus.
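
A toy sketch of the mismatch (made-up texts, not my actual data):

from gensim.corpora.dictionary import Dictionary

texts = [['human', 'interface'], ['graph', 'trees']]
dictionary = Dictionary(texts)            # vocabulary covers both documents
corpus = [dictionary.doc2bow(texts[0])]   # corpus built from the first text only

# words that exist in the dictionary (and can end up in LDA topics)
# but never occur in the corpus, so their document counts stay zero:
seen = {word_id for doc in corpus for word_id, _ in doc}
print([dictionary[word_id] for word_id in dictionary.keys() if word_id not in seen])
# -> ['graph', 'trees']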

@menshikh-iv
Contributor

@hallelujahdrive hello, please give us a minimal reproducible example (code + data that produces the current issue).

@johann-petrak
Contributor

Is the method ldamodel.top_topics(corpusreader) meant to work on a corpus that is not
absolutely identical to the one that was used for training the model?
The documentation says nothing about any restriction, but I can confirm that this error occurs when the corpus does not contain all words from the model.

@hallelujahdrive

I reproduced the problem with the following code.
I used sklearn.datasets.fetch_20newsgroups as the dataset.

from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel
from polyglot.text import Text
from sklearn.datasets import fetch_20newsgroups


def docs2bow(docs):
    # keep only the nouns and pronouns of each document
    texts = []
    for doc in docs:
        tokens = Text(doc, hint_language_code='en')
        bow = [token[0] for token in tokens.pos_tags if token[1] in ('NOUN', 'PRON')]
        texts.append(bow)
    return texts


if __name__ == '__main__':
    twenty_train = fetch_20newsgroups(subset='train')
    texts = docs2bow(twenty_train.data[0:999])

    # the dictionary is built from all 999 documents, but the corpus only
    # from the last 99, so the dictionary contains words that never occur
    # in the corpus -- this is the mismatch that triggers the error
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts[900:999]]

    lda = LdaModel(corpus, num_topics=50)

    cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary, coherence='u_mass')

    coherence = cm.get_coherence()
    print(coherence)
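
For comparison, if the corpus is built from the same texts that produced the dictionary (assuming the mismatch above is indeed the trigger), the error goes away:

corpus = [dictionary.doc2bow(text) for text in texts]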

@johann-petrak
Contributor

Could somebody in the know please explain what the intended use of this function is (see my previous comment), i.e. is it supposed to work for a corpus that is not absolutely identical to the one used for training?

As pointed out, this seems to happen whenever a word present in the topics does not occur in the corpus passed to the method.

What I also find very confusing is that the function returns a list of pairs (topicrepresentation, coherence), where the first element is the actual word distribution for the topic. Why does it not return the index of that topic instead? Are the topics returned here supposed to be identical to the ones stored with the trained LDA model and accessible through get_topics? If so, it would be much easier to match them up if the top_topics(corpus) method, or some other method top_topicidxs(corpus), returned the indices of the topics, i.e. a list of pairs (topicindex, coherence).
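
Something like this hypothetical helper (not existing gensim API, just a sketch of what I mean) could recover the indices by matching each returned topic's top words against show_topic:

def top_topic_indices(lda, scored_topics, topn=20):
    # scored_topics is the output of lda.top_topics(...): a list of
    # (topic_words, coherence) pairs, where topic_words is [(prob, word), ...]
    indices = []
    for topic_words, coherence in scored_topics:
        words = [word for _, word in topic_words]
        for idx in range(lda.num_topics):
            # show_topic returns [(word, prob), ...] for one model topic
            if [word for word, _ in lda.show_topic(idx, topn=topn)] == words:
                indices.append((idx, coherence))
                break
    return indices

# usage: topn must match the topn passed to top_topics
# scored = lda.top_topics(corpus, topn=20)
# print(top_topic_indices(lda, scored, topn=20))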

@menshikh-iv
Contributor

BTW, this should be fixed by #2259.
