Fix overflow error for *Vec corpusfile-based training #2239
Conversation
Force-pushed: b06292f → 8ecb3eb
Oh wow, how did that slip through? Thanks a lot for spotting & reporting! @menshikh-iv @persiyanov Any other places where we use counters or data structures that might overflow with large corpora?
How can we test that this works as expected? I'd like to avoid regressions in the future.
gensim/models/doc2vec_corpusfile.pyx (outdated)

@@ -153,7 +153,8 @@ def d2v_train_epoch_dbow(model, corpus_file, offset, start_doctag, _cython_vocab

     cdef int i, j, document_len
     cdef int effective_words = 0
-    cdef int total_effective_words = 0, total_documents = 0, total_words = 0
+    cdef int total_effective_words = 0, total_documents = 0
How about changing these to uint64 too (I prefer explicit sizes to long long, but it's not a big deal), to be on the safe side? Especially total_effective_words. Any reason not to?
There are long longs in many other places. Do you want explicit sizes everywhere?
I think it makes sense, but it's not critical.
More important to me is that any variable that holds a potentially large count (token frequency, number of tokens in a corpus, number of sentences in a corpus, …) is at least 64 bit.
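By way of illustration, explicit 64-bit widths could look roughly like this in the Cython module, using the stdint typedefs bundled with Cython (a sketch only, mirroring the counter names in the diff above; this is not the code that was merged):

    # Sketch: 64-bit counter declarations with explicit widths,
    # using the stdint ctypedefs that ship with Cython.
    from libc.stdint cimport int64_t, uint64_t

    cdef long long total_words = 0           # "long long": at least 64 bits, exact width platform-defined
    cdef int64_t total_effective_words = 0   # explicitly 64-bit signed
    cdef uint64_t total_documents = 0        # explicitly 64-bit unsigned, as suggested above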
@piskvorky If "as expected" means accepting bigger values, a test could pass a big number and assert there is no OverflowError.
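A minimal sketch of such a test, using gensim's public corpus_file training API (hypothetical: the test name and corpus contents are made up, and whether this exact call exercises the fixed conversion is an assumption rather than something confirmed in this thread):

    # Hypothetical regression-test sketch: declare a word count beyond the
    # 32-bit signed range and assert corpus_file training does not raise.
    import gensim

    INT32_MAX = 2 ** 31 - 1  # 2,147,483,647

    def test_corpusfile_training_accepts_large_word_count(tmp_path):
        corpus_path = str(tmp_path / "tiny_corpus.txt")
        with open(corpus_path, "w") as f:
            f.write("human interface computer response survey\n" * 50)

        model = gensim.models.Word2Vec(min_count=1)
        model.build_vocab(corpus_file=corpus_path)
        # Before this fix, a count that does not fit a C int could fail with
        # "OverflowError: value too large to convert to int".
        model.train(
            corpus_file=corpus_path,
            total_words=INT32_MAX + 1,  # deliberately larger than a 32-bit int
            epochs=1,
        )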
Note that CI failed due to flake8 errors on files not modified in this PR. A new version of flake8 was released a few days ago and it's likely that the CI would also fail on the target branch.
Force-pushed: 9086321 → 95fc778
Big thanks @bm371613, can you please allow me to push into your branch (https://help.github.com/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork/#enabling-repository-maintainer-permissions-on-existing-pull-requests)? I need that to fix CI.
@menshikh-iv It seems to be enabled already; are your push attempts being rejected?
@bm371613 yes :(
@bm371613 can you manually give me permission to commit to your fork repo?
@menshikh-iv ok, done
@bm371613 it works, thank you!
@bm371613 awesome, thanks for the fix, congrats on your first contribution 🥇
Fix #2258

Corpus size is currently limited, in terms of total word count, to what fits in a 32-bit signed integer. This can result in an OverflowError: value too large to convert to int for a big corpus. This PR replaces int with long long to allow for bigger corpora.
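Roughly, the change has this shape (an illustrative before/after of one counter declaration, not a complete list of the lines touched in the corpusfile .pyx modules):

    # before: 32-bit signed counter, caps a corpus at 2**31 - 1 (~2.1 billion) words
    cdef int total_words = 0

    # after: at least 64 bits, large enough for real-world corpora
    cdef long long total_words = 0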