
Integer overflow during FastText training with corpus_file #2258

Closed
joelkuiper opened this issue Nov 5, 2018 · 5 comments
Assignees
Labels
bug (Issue described a bug) · difficulty easy (Easy issue: required small fix) · fasttext (Issues related to the FastText model)

Comments


joelkuiper commented Nov 5, 2018

Description

model = FastText(corpus_file="sentences_norm.txt.gz", workers=14, iter=5, size=200, sg=1, hs=1)

with the following sizes

2018-11-05 16:57:52,809 : INFO : collected 6532860 word types from a corpus of 4728738902 raw words and 238627116 sentences
2018-11-05 16:57:52,809 : INFO : Loading a fresh vocabulary
2018-11-05 16:58:00,788 : INFO : effective_min_count=5 retains 1887156 unique words (28% of original 6532860, drops 4645704)
2018-11-05 16:58:00,788 : INFO : effective_min_count=5 leaves 4721157112 word corpus (99% of original 4728738902, drops 7581790)
2018-11-05 16:58:07,437 : INFO : deleting the raw counts dictionary of 6532860 items
2018-11-05 16:58:07,615 : INFO : sample=0.001 downsamples 26 most-common words
2018-11-05 16:58:07,615 : INFO : downsampling leaves estimated 3749158657 word corpus (79.4% of prior 4721157112)
2018-11-05 16:58:11,281 : INFO : constructing a huffman tree from 1887156 words
2018-11-05 16:59:36,077 : INFO : built huffman tree with maximum node depth 30
2018-11-05 17:00:17,300 : INFO : estimated required memory for 1887156 words, 1929637 buckets and 200 dimensions: 7871448352 bytes
2018-11-05 17:00:17,398 : INFO : resetting layer weights
2018-11-05 17:01:43,333 : INFO : Total number of ngrams is 1929637
2018-11-05 17:02:11,990 : INFO : training model with 14 workers on 1887156 vocabulary and 200 features, using sg=1 hs=1 sample=0.001 negative=5 window=5

yields

Exception in thread Thread-2120:
Traceback (most recent call last):
  File "/home/joelkuiper/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/joelkuiper/anaconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/joelkuiper/anaconda3/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 175, in _worker_loop_corpusfile
    total_examples=total_examples, total_words=total_words, **kwargs)
  File "/home/joelkuiper/anaconda3/lib/python3.6/site-packages/gensim/models/fasttext.py", line 561, in _do_train_epoch
    total_examples, total_words, work, neu1)
  File "gensim/models/fasttext_corpusfile.pyx", line 126, in gensim.models.fasttext_corpusfile.train_epoch_sg
OverflowError: value too large to convert to int

on all workers. Note that the sg and hs parameters appear to be unrelated; the error also occurs without them.
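For scale, the counts in the training log above can be compared against the signed 32-bit C int limit; all of them exceed it, which is consistent with the OverflowError raised when the Cython code converts them. A minimal illustration (the numbers are taken directly from the log; C_INT_MAX is the standard signed 32-bit maximum):

```python
# Largest value a signed 32-bit C int can hold.
C_INT_MAX = 2**31 - 1  # 2147483647

# Word counts reported in the training log above.
raw_words = 4_728_738_902
retained_words = 4_721_157_112
downsampled_words = 3_749_158_657

# Every one of these exceeds the C int range, so any Cython variable
# declared as `int` cannot receive them.
for count in (raw_words, retained_words, downsampled_words):
    print(count > C_INT_MAX)  # True for each
```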

Steps to reproduce

model = FastText(corpus_file="sentences_norm.txt.gz", workers=14, iter=5, size=200)

Expected Results

The model should train to completion.

Actual Results

Exception thrown, no further output.

Traceback (most recent call last):
  File "/home/joelkuiper/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/joelkuiper/anaconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/joelkuiper/anaconda3/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 175, in _worker_loop_corpusfile
    total_examples=total_examples, total_words=total_words, **kwargs)
  File "/home/joelkuiper/anaconda3/lib/python3.6/site-packages/gensim/models/fasttext.py", line 561, in _do_train_epoch
    total_examples, total_words, work, neu1)
  File "gensim/models/fasttext_corpusfile.pyx", line 126, in gensim.models.fasttext_corpusfile.train_epoch_sg
OverflowError: value too large to convert to int

Versions

Python 3.6.6 |Anaconda, Inc.| (default, Oct 9 2018, 12:34:16)
[GCC 7.3.0]
NumPy 1.15.3
SciPy 1.1.0
gensim 3.6.0

On Ubuntu 16.04

Edit: training seems to work fine when passing in a LineSentence object instead of corpus_file.

@joelkuiper joelkuiper changed the title OverflowError: value too large to convert to int on FastText training OverflowError: value too large to convert to int on FastText training on corpus_file Nov 5, 2018
@joelkuiper joelkuiper changed the title OverflowError: value too large to convert to int on FastText training on corpus_file OverflowError: value too large to convert to int on FastText training with corpus_file Nov 5, 2018

CuriousG102 commented Nov 22, 2018

I see a similar error in Doc2Vec. I can verify that total_words is larger than a 32-bit integer. There is no easy fix here, since training on a corpus_file throws a different exception if total_words isn't provided.

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/Volumes/Backblaze_MacEx1TB50506065/cs221/project/CS221/venv/lib/python3.5/site-packages/gensim/models/base_any2vec.py", line 175, in _worker_loop_corpusfile
    total_examples=total_examples, total_words=total_words, **kwargs)
  File "/Volumes/Backblaze_MacEx1TB50506065/cs221/project/CS221/venv/lib/python3.5/site-packages/gensim/models/doc2vec.py", line 686, in _do_train_epoch
    doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
  File "gensim/models/doc2vec_corpusfile.pyx", line 280, in gensim.models.doc2vec_corpusfile.d2v_train_epoch_dm

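Until the type declarations are fixed, a corpus can at least be checked up front against the same limit. A hedged sketch (fits_in_c_int is a hypothetical helper, not part of gensim; total_words is whatever count your own corpus scan produces):

```python
C_INT_MAX = 2**31 - 1  # signed 32-bit C int maximum

def fits_in_c_int(total_words):
    """Return True if a word count can safely pass through a C `int`."""
    return 0 <= total_words <= C_INT_MAX

print(fits_in_c_int(1_000_000))      # True: small corpora are unaffected
print(fits_in_c_int(4_721_157_112))  # False: counts at this scale overflow
```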
@menshikh-iv menshikh-iv added bug Issue described a bug difficulty easy Easy issue: required small fix labels Dec 13, 2018
menshikh-iv (Contributor) commented

Thanks for the report @joelkuiper!

@menshikh-iv menshikh-iv changed the title OverflowError: value too large to convert to int on FastText training with corpus_file Integer overflow during FastText training with corpus_file Dec 13, 2018
@mpenkov mpenkov added the fasttext Issues related to the FastText model label Dec 15, 2018
mpenkov (Collaborator) commented Dec 15, 2018

@menshikh-iv Since this is tagged "easy", I'm guessing the fix is to replace the int declaration here with something like a long?

menshikh-iv (Contributor) commented Dec 15, 2018

@mpenkov yes, something like this (int -> longest_int_type for all variables that can be "too large") in all *_corpusfile.pyx files
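As an illustration of what the proposed int -> wider-type change buys, ctypes can mimic the two C widths: a 64-bit integer holds the retained word count from this issue exactly, while a 32-bit int silently wraps it (this is only a Python sketch of the C-level behaviour; the actual fix lives in the .pyx type declarations):

```python
import ctypes

total_words = 4_721_157_112  # retained word corpus from the log in this issue

# A 32-bit C int wraps modulo 2**32, silently corrupting the count.
print(ctypes.c_int32(total_words).value)  # 426189816

# A 64-bit C long long preserves it.
print(ctypes.c_int64(total_words).value)  # 4721157112
```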


lifengjin commented Dec 28, 2018

I am experiencing this same bug when training Word2Vec with a large corpus. A pull request for this bug has been open for a couple of months. Could you please fix this one? Thanks.
