
load_facebook_model memory footprint #2724

Closed
philipphager opened this issue Jan 8, 2020 · 3 comments
@philipphager
Contributor

Problem description

Hey everyone,
I encountered an issue when loading a pre-trained Facebook FastText model. Loading a 7.24 GB pretrained model blows up to more than 20 GB of RAM on my machine when loading with Gensim, so my computer keeps swapping memory like crazy and never finishes loading the model. It would be awesome if we could lower the memory footprint of Gensim's FastText loading mechanism. Is this a known problem, and is anyone aware of how to fix it?

Steps/code/corpus to reproduce

  1. Download a pre-trained FastText model (e.g., cc.en.300.bin) from: https://fasttext.cc/docs/en/crawl-vectors.html
  2. Try to load the model using load_facebook_model('cc.en.300.bin')

Versions

Output of the version check:

Darwin-19.0.0-x86_64-i386-64bit
Python 3.7.6 | packaged by conda-forge | (default, Dec 26 2019, 23:46:52) 
[Clang 9.0.0 (tags/RELEASE_900/final)]
NumPy 1.17.2
SciPy 1.4.1
gensim 3.8.1
@philipphager
Contributor Author

I guess this is a duplicate of #2502.

@gojomo
Collaborator

gojomo commented Jan 9, 2020

There's some bonkers nonsense in the current FT implementation that uses more memory than necessary. Your load might be most affected by the allocation of a vectors_vocab_lockf array that's vector_size times larger than necessary (it should be one float per word, not one float per word per dimension), and the allocation of a vectors_ngrams_lockf array that's similarly oversized and probably doesn't need to exist at all.
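To put rough numbers on that oversizing, here's a back-of-the-envelope sketch. The figures (~2M vocabulary words, 300 dimensions, 4-byte float32 lock factors, matching what's published for cc.en.300.bin) are my assumptions, not measurements from this issue:

```python
# Sketch of the lockf oversizing described above.
# Assumed figures: ~2M vocab words, 300 dims, float32 lock factors.
VOCAB_WORDS = 2_000_000
VECTOR_SIZE = 300
FLOAT_BYTES = 4  # float32

# One lock factor per word is all that's needed:
needed = VOCAB_WORDS * FLOAT_BYTES                   # 8,000,000 bytes, ~7.6 MiB
# One lock factor per word per dimension, as actually allocated:
allocated = VOCAB_WORDS * VECTOR_SIZE * FLOAT_BYTES  # 2,400,000,000 bytes, ~2.2 GiB

print(f"needed:    {needed / 2**20:.1f} MiB")
print(f"allocated: {allocated / 2**30:.2f} GiB")
```

If the ngram hash table is the same size (also an assumption), a similarly oversized vectors_ngrams_lockf array adds another ~2.2 GiB, which goes a long way toward explaining a 7 GB file ballooning past 20 GB in RAM.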

I'm trying to fix this, among other things, in a big clean-up of related code, but that work isn't yet stable and may not appear in a release anytime soon.

If you just need vector-lookup, and not to continue training, using the load_facebook_vectors() method (instead of load_facebook_model()) might avoid some of the nonsense in the meantime, but I'm not sure.

You should still expect some expansion from the on-disk size, both on load, and on the first operation (such as most_similar()) that might trigger calculation & caching of unit-normed word-vectors.

@gojomo
Collaborator

gojomo commented Oct 20, 2020

Closing as dup of #2502. (But also note: essentially fixed in 4.0.0 work.)
