Fix overflow error for *Vec corpusfile-based training #2239
Conversation
Force-pushed: b06292f → 8ecb3eb
Oh wow, how did that slip through? Thanks a lot for spotting & reporting! @menshikh-iv @persiyanov Any other places where we use counters or data structures that might overflow with large corpora?
How can we test that this works as expected? I'd like to avoid regressions in the future.
gensim/models/doc2vec_corpusfile.pyx (outdated)

@@ -153,7 +153,8 @@ def d2v_train_epoch_dbow(model, corpus_file, offset, start_doctag, _cython_vocab

     cdef int i, j, document_len
     cdef int effective_words = 0
-    cdef int total_effective_words = 0, total_documents = 0, total_words = 0
+    cdef int total_effective_words = 0, total_documents = 0
How about changing these to uint64 too (I prefer explicit sizes to long long, but it's not a big deal), to be on the safe side? Especially total_effective_words. Any reason not to?
There are long longs in many other places. Do you want explicit sizes everywhere?
I think it makes sense, but it's not critical.
More important to me is that any variable that holds a potentially large count (token frequency, number of tokens in a corpus, number of sentences in a corpus, …) is at least 64 bit.
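By way of illustration, explicit 64-bit widths could look roughly like this in the Cython module, using the stdint typedefs bundled with Cython (a sketch only, mirroring the counter names in the diff above; this is not the code that was merged):

    # Sketch: 64-bit counter declarations with explicit widths,
    # using the stdint ctypedefs that ship with Cython.
    from libc.stdint cimport int64_t, uint64_t

    cdef long long total_words = 0           # "long long": at least 64 bits, exact width platform-defined
    cdef int64_t total_effective_words = 0   # explicitly 64-bit signed
    cdef uint64_t total_documents = 0        # explicitly 64-bit unsigned, as suggested above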
@piskvorky If "as expected" means accepting bigger values, a test could pass a big number and assert there is no OverflowError.
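A minimal sketch of such a test, using gensim's public corpus_file training API (hypothetical: the test name and corpus contents are made up, and whether this exact call exercises the fixed conversion is an assumption rather than something confirmed in this thread):

    # Hypothetical regression-test sketch: declare a word count beyond the
    # 32-bit signed range and assert corpus_file training does not raise.
    import gensim

    INT32_MAX = 2 ** 31 - 1  # 2,147,483,647

    def test_corpusfile_training_accepts_large_word_count(tmp_path):
        corpus_path = str(tmp_path / "tiny_corpus.txt")
        with open(corpus_path, "w") as f:
            f.write("human interface computer response survey\n" * 50)

        model = gensim.models.Word2Vec(min_count=1)
        model.build_vocab(corpus_file=corpus_path)
        # Before this fix, a count that does not fit a C int could fail with
        # "OverflowError: value too large to convert to int".
        model.train(
            corpus_file=corpus_path,
            total_words=INT32_MAX + 1,  # deliberately larger than a 32-bit int
            epochs=1,
        )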
Note that CI failed due to flake8 errors on files not modified in this PR. A new version of flake8 was released a few days ago and it's likely that the CI would also fail on the target branch.
Force-pushed: 9086321 → 95fc778
Big thanks @bm371613, can you please allow me to push into your branch (https://help.github.com/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork/#enabling-repository-maintainer-permissions-on-existing-pull-requests)? I need that to fix CI.
@menshikh-iv It seems to be enabled already; are your push attempts being rejected?
@bm371613 yes :(
@bm371613 can you manually give me permission to commit to your fork repo?
@menshikh-iv ok, done
@bm371613 it works, thank you!
@bm371613 awesome, thanks for the fix, congrats on your first contribution 🥇
Fix #2258

Corpus size is currently limited, in terms of total word count, to what fits in a 32-bit signed integer. This can result in an OverflowError: value too large to convert to int for a big corpus. This PR replaces int with long long to allow for bigger corpora.
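Roughly, the change has this shape (an illustrative before/after of one counter declaration, not a complete list of the lines touched in the corpusfile .pyx modules):

    # before: 32-bit signed counter, caps a corpus at 2**31 - 1 (~2.1 billion) words
    cdef int total_words = 0

    # after: at least 64 bits, large enough for real-world corpora
    cdef long long total_words = 0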