
Integer overflow during FastText training with corpus_file #2258

Closed
joelkuiper opened this issue Nov 5, 2018 · 5 comments
Assignees
Labels
bug (Issue described a bug) · difficulty easy (Easy issue: required small fix) · fasttext (Issues related to the FastText model)

Comments


joelkuiper commented Nov 5, 2018

Description

model = FastText(corpus_file="sentences_norm.txt.gz", workers=14, iter=5, size=200, sg=1, hs=1)

with the following sizes

2018-11-05 16:57:52,809 : INFO : collected 6532860 word types from a corpus of 4728738902 raw words and 238627116 sentences
2018-11-05 16:57:52,809 : INFO : Loading a fresh vocabulary
2018-11-05 16:58:00,788 : INFO : effective_min_count=5 retains 1887156 unique words (28% of original 6532860, drops 4645704)
2018-11-05 16:58:00,788 : INFO : effective_min_count=5 leaves 4721157112 word corpus (99% of original 4728738902, drops 7581790)
2018-11-05 16:58:07,437 : INFO : deleting the raw counts dictionary of 6532860 items
2018-11-05 16:58:07,615 : INFO : sample=0.001 downsamples 26 most-common words
2018-11-05 16:58:07,615 : INFO : downsampling leaves estimated 3749158657 word corpus (79.4% of prior 4721157112)
2018-11-05 16:58:11,281 : INFO : constructing a huffman tree from 1887156 words
2018-11-05 16:59:36,077 : INFO : built huffman tree with maximum node depth 30
2018-11-05 17:00:17,300 : INFO : estimated required memory for 1887156 words, 1929637 buckets and 200 dimensions: 7871448352 bytes
2018-11-05 17:00:17,398 : INFO : resetting layer weights
2018-11-05 17:01:43,333 : INFO : Total number of ngrams is 1929637
2018-11-05 17:02:11,990 : INFO : training model with 14 workers on 1887156 vocabulary and 200 features, using sg=1 hs=1 sample=0.001 negative=5 window=5

yields

Exception in thread Thread-2120:
Traceback (most recent call last):
  File "/home/joelkuiper/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/joelkuiper/anaconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/joelkuiper/anaconda3/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 175, in _worker_loop_corpusfile
    total_examples=total_examples, total_words=total_words, **kwargs)
  File "/home/joelkuiper/anaconda3/lib/python3.6/site-packages/gensim/models/fasttext.py", line 561, in _do_train_epoch
    total_examples, total_words, work, neu1)
  File "gensim/models/fasttext_corpusfile.pyx", line 126, in gensim.models.fasttext_corpusfile.train_epoch_sg
OverflowError: value too large to convert to int

on all workers. Note that the sg and hs parameters appear to be unrelated; the error also occurs without them.
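For scale, the counts in the training log above can be compared against the signed 32-bit C int limit; all of them exceed it, which is consistent with the OverflowError raised when the Cython code converts them. A minimal illustration (the numbers are taken directly from the log; C_INT_MAX is the standard signed 32-bit maximum):

```python
# Largest value a signed 32-bit C int can hold.
C_INT_MAX = 2**31 - 1  # 2147483647

# Word counts reported in the training log above.
raw_words = 4_728_738_902
retained_words = 4_721_157_112
downsampled_words = 3_749_158_657

# Every one of these exceeds the C int range, so any Cython variable
# declared as `int` cannot receive them.
for count in (raw_words, retained_words, downsampled_words):
    print(count > C_INT_MAX)  # True for each
```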

Steps to reproduce

model = FastText(corpus_file="sentences_norm.txt.gz", workers=14, iter=5, size=200)

Expected Results

The model should train to completion.

Actual Results

Exception thrown, no further output.

Traceback (most recent call last):
  File "/home/joelkuiper/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/joelkuiper/anaconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/joelkuiper/anaconda3/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 175, in _worker_loop_corpusfile
    total_examples=total_examples, total_words=total_words, **kwargs)
  File "/home/joelkuiper/anaconda3/lib/python3.6/site-packages/gensim/models/fasttext.py", line 561, in _do_train_epoch
    total_examples, total_words, work, neu1)
  File "gensim/models/fasttext_corpusfile.pyx", line 126, in gensim.models.fasttext_corpusfile.train_epoch_sg
OverflowError: value too large to convert to int

Versions

Python 3.6.6 |Anaconda, Inc.| (default, Oct 9 2018, 12:34:16)
[GCC 7.3.0]
NumPy 1.15.3
SciPy 1.1.0
gensim 3.6.0

On Ubuntu 16.04

Edit: training seems to work fine when passing in a LineSentence object instead of corpus_file.

@joelkuiper joelkuiper changed the title OverflowError: value too large to convert to int on FastText training OverflowError: value too large to convert to int on FastText training on corpus_file Nov 5, 2018
@joelkuiper joelkuiper changed the title OverflowError: value too large to convert to int on FastText training on corpus_file OverflowError: value too large to convert to int on FastText training with corpus_file Nov 5, 2018

CuriousG102 commented Nov 22, 2018

I see a similar error in Doc2Vec. I can verify that total_words is larger than a 32-bit integer. There is no easy fix here, since training on a corpus_file throws a different exception if total_words isn't provided.

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/Volumes/Backblaze_MacEx1TB50506065/cs221/project/CS221/venv/lib/python3.5/site-packages/gensim/models/base_any2vec.py", line 175, in _worker_loop_corpusfile
    total_examples=total_examples, total_words=total_words, **kwargs)
  File "/Volumes/Backblaze_MacEx1TB50506065/cs221/project/CS221/venv/lib/python3.5/site-packages/gensim/models/doc2vec.py", line 686, in _do_train_epoch
    doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
  File "gensim/models/doc2vec_corpusfile.pyx", line 280, in gensim.models.doc2vec_corpusfile.d2v_train_epoch_dm

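Until the type declarations are fixed, a corpus can at least be checked up front against the same limit. A hedged sketch (fits_in_c_int is a hypothetical helper, not part of gensim; total_words is whatever count your own corpus scan produces):

```python
C_INT_MAX = 2**31 - 1  # signed 32-bit C int maximum

def fits_in_c_int(total_words):
    """Return True if a word count can safely pass through a C `int`."""
    return 0 <= total_words <= C_INT_MAX

print(fits_in_c_int(1_000_000))      # True: small corpora are unaffected
print(fits_in_c_int(4_721_157_112))  # False: counts at this scale overflow
```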
@menshikh-iv menshikh-iv added bug Issue described a bug difficulty easy Easy issue: required small fix labels Dec 13, 2018
menshikh-iv (Contributor) commented

Thanks for the report @joelkuiper!

@menshikh-iv menshikh-iv changed the title OverflowError: value too large to convert to int on FastText training with corpus_file Integer overflow during FastText training with corpus_file Dec 13, 2018
@mpenkov mpenkov added the fasttext Issues related to the FastText model label Dec 15, 2018
mpenkov (Collaborator) commented Dec 15, 2018

@menshikh-iv Since this is tagged "easy", I'm guessing the fix is to replace the int declaration here with something like a long?

menshikh-iv (Contributor) commented Dec 15, 2018

@mpenkov yes, something like this (int -> longest_int_type for all variables that can be "too large") in all *_corpusfile.pyx files
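As an illustration of what the proposed int -> wider-type change buys, ctypes can mimic the two C widths: a 64-bit integer holds the retained word count from this issue exactly, while a 32-bit int silently wraps it (this is only a Python sketch of the C-level behaviour; the actual fix lives in the .pyx type declarations):

```python
import ctypes

total_words = 4_721_157_112  # retained word corpus from the log in this issue

# A 32-bit C int wraps modulo 2**32, silently corrupting the count.
print(ctypes.c_int32(total_words).value)  # 426189816

# A 64-bit C long long preserves it.
print(ctypes.c_int64(total_words).value)  # 4721157112
```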


lifengjin commented Dec 28, 2018

I am experiencing this same bug when training Word2Vec with a large corpus. A pull request for this bug has been open for a couple of months. Could you please fix this one? Thanks.
