Fix overflow error for *Vec corpusfile-based training #2239

Merged 6 commits on Jan 11, 2019

bm371613 (Contributor) commented Oct 22, 2018

Fix #2258

Corpus size is currently limited: the total word count must fit in a 32-bit signed integer. A big corpus can therefore trigger an OverflowError: value too large to convert to int. This PR replaces int with long long to allow for bigger corpora.
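
To illustrate the limit described above, here is a minimal sketch using Python's standard array module, whose 'i' and 'q' typecodes store C int and C long long values. It is only an analogy for the Cython counters touched by this PR, not the gensim code itself. The helper name `fits` and the word count are made up for illustration, and the sketch assumes a platform where C int is 32 bits wide.

```python
from array import array

INT32_MAX = 2**31 - 1           # largest value a 32-bit signed int can hold
big_word_count = 3_000_000_000  # hypothetical corpus size, larger than INT32_MAX


def fits(typecode, value):
    """Return True if `value` can be stored in a C integer of the given array typecode.

    Hypothetical helper: 'i' is C int, 'q' is C long long.
    """
    try:
        array(typecode, [value])
        return True
    except OverflowError:
        return False


print(fits("i", INT32_MAX))       # the 32-bit maximum still fits in a C int
print(fits("i", big_word_count))  # the big count does not fit in a C int
print(fits("q", big_word_count))  # the big count fits in a C long long
```

The same kind of OverflowError is what surfaces when a Python integer that is too large gets converted to a C int.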

@bm371613 bm371613 force-pushed the long-long-word-count branch from b06292f to 8ecb3eb Compare October 23, 2018 07:29
piskvorky (Owner) commented Oct 27, 2018

Oh wow, how did that slip through. Thanks a lot for spotting & reporting!

@menshikh-iv @persiyanov Any other places where we use counters or data structures that might overflow with large corpora?

@piskvorky piskvorky added the bug and performance labels Oct 27, 2018
piskvorky (Owner) left a comment

How can we test this works as expected? I'd like to avoid regressions in the future.

@@ -153,7 +153,8 @@ def d2v_train_epoch_dbow(model, corpus_file, offset, start_doctag, _cython_vocab

     cdef int i, j, document_len
     cdef int effective_words = 0
-    cdef int total_effective_words = 0, total_documents = 0, total_words = 0
+    cdef int total_effective_words = 0, total_documents = 0
piskvorky (Owner) commented Oct 27, 2018

How about changing these to uint64 too (I prefer explicit sizes to long long, but it's not a big deal), to be on the safe side? Especially total_effective_words. Any reason not to?

bm371613 (Contributor, Author) replied

There are long longs in many other places. Do you want explicit sizes everywhere?

piskvorky (Owner) replied

I think it makes sense, but is not critical.

More important to me is that any variable that holds a potentially large count (token frequency, number of tokens in a corpus, number of sentences in a corpus, …) is at least 64 bit.
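
The difference between long long and an explicit 64-bit type can be checked from Python with the standard ctypes module. This is only a rough sketch for the current platform, not part of the PR. The names `widths` and `max_value` are illustrative assumptions.

```python
import ctypes

# Width in bits of each C type on the current platform.
widths = {
    "int": ctypes.sizeof(ctypes.c_int) * 8,
    "long long": ctypes.sizeof(ctypes.c_longlong) * 8,
    "uint64": ctypes.sizeof(ctypes.c_uint64) * 8,
}


def max_value(bits, signed):
    """Largest integer representable in `bits` bits (hypothetical helper)."""
    return 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1


for name, bits in widths.items():
    signed = name != "uint64"  # uint64 is the only unsigned type listed
    print(f"{name}: {bits} bits, max {max_value(bits, signed):,}")
```

The C standard guarantees that long long is at least 64 bits wide, while uint64 is exactly 64 bits, so either choice satisfies the "at least 64 bit" requirement for counters.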

bm371613 (Contributor, Author) commented Oct 29, 2018

@piskvorky If "as expected" means accepting bigger values, a test could pass a big number and assert there is no OverflowError. Is that what you meant?
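
A regression test along these lines might look like the sketch below. It does not exercise gensim: `store_total_words` is a hypothetical stand-in that stores the count in a C long long through the standard array module, where the real test would call the corpus-file training code.

```python
import unittest
from array import array


def store_total_words(total_words):
    """Hypothetical stand-in for a counter declared as C long long (typecode 'q')."""
    return array("q", [total_words])[0]


class TestBigWordCount(unittest.TestCase):
    def test_count_above_int32_does_not_overflow(self):
        big = 2**31  # one past the 32-bit signed maximum
        try:
            result = store_total_words(big)
        except OverflowError:
            self.fail("OverflowError raised for a count above 2**31 - 1")
        # The value must also round-trip unchanged, not just avoid the error.
        self.assertEqual(result, big)
```

Saved as a module, this can be run with python -m unittest. Checking the round-tripped value as well as the absence of OverflowError guards against a silent wrap-around.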

@piskvorky piskvorky removed the performance label Oct 29, 2018
bm371613 (Contributor, Author) commented Oct 30, 2018

Note that CI failed due to flake8 errors on files not modified in this PR. A new version of flake8 was released a few days ago and it's likely that the CI would also fail on the target branch.

@bm371613 bm371613 force-pushed the long-long-word-count branch from 9086321 to 95fc778 Compare October 30, 2018 10:05
menshikh-iv (Contributor) commented

Big thanks @bm371613, can you please allow me to push to your branch (https://help.github.com/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork/#enabling-repository-maintainer-permissions-on-existing-pull-requests)? I need that to fix CI.

@menshikh-iv menshikh-iv changed the title from "Long long word count" to "Fix overflow error for *Vec corpusfile-based training" Jan 11, 2019

bm371613 (Contributor, Author) commented

@menshikh-iv It seems to be already enabled, are your push attempts rejected?

menshikh-iv (Contributor) commented

@bm371613 yes :(

(asdadsf) ivan@P50:~/release/bm371613/gensim$ git push
ERROR: Permission to bm371613/gensim.git denied to menshikh-iv.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

menshikh-iv (Contributor) commented

@bm371613 can you give me permissions manually to commit into your fork repo?

bm371613 (Contributor, Author) commented

@menshikh-iv ok done

menshikh-iv (Contributor) commented

@bm371613 it works, thank you!

menshikh-iv (Contributor) commented

@bm371613 awesome, thanks for the fix, and congrats on your first contribution 🥇

@menshikh-iv menshikh-iv merged commit 13b52a2 into piskvorky:develop Jan 11, 2019
@menshikh-iv menshikh-iv deleted the long-long-word-count branch January 11, 2019 16:40
tcrick added a commit to tcrick/gensim that referenced this pull request Aug 22, 2020
Labels: bug, fasttext