
load_facebook_model memory footprint #2724

Closed
philipphager opened this issue Jan 8, 2020 · 3 comments
@philipphager
Contributor

Problem description

Hey everyone,
I encountered an issue when loading a pre-trained Facebook FastText model. Loading a 7.24 GB pretrained model blows up to more than 20 GB of RAM on my machine when loading with Gensim, so my computer keeps swapping memory like crazy and never finishes loading the model. It would be awesome if we could lower the memory footprint of Gensim's FastText loading mechanism. Is this a known problem, and is anyone aware of how to fix it?

Steps/code/corpus to reproduce

  1. Download a pre-trained FastText model (e.g., cc.en.300.bin) from: https://fasttext.cc/docs/en/crawl-vectors.html
  2. Try to load the model using load_facebook_model('cc.en.300.bin')

Versions

Output of the version check:

Darwin-19.0.0-x86_64-i386-64bit
Python 3.7.6 | packaged by conda-forge | (default, Dec 26 2019, 23:46:52) 
[Clang 9.0.0 (tags/RELEASE_900/final)]
NumPy 1.17.2
SciPy 1.4.1
gensim 3.8.1
@philipphager
Contributor Author

I guess this is a duplicate of #2502.

@gojomo
Collaborator

gojomo commented Jan 9, 2020

There's some bonkers nonsense in the current FT implementation that uses more memory than necessary. Your load might be most affected by the allocation of a vectors_vocab_lockf array that's vector_size times larger than necessary (it should be one float per word, not one float per word per dimension), and the allocation of a vectors_ngrams_lockf array that's similarly oversized and probably doesn't need to exist at all.
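To put rough numbers on that oversizing, here's a back-of-the-envelope sketch. The figures (~2M vocabulary words, 300 dimensions, 4-byte float32 lock factors, matching what's published for cc.en.300.bin) are my assumptions, not measurements from this issue:

```python
# Sketch of the lockf oversizing described above.
# Assumed figures: ~2M vocab words, 300 dims, float32 lock factors.
VOCAB_WORDS = 2_000_000
VECTOR_SIZE = 300
FLOAT_BYTES = 4  # float32

# One lock factor per word is all that's needed:
needed = VOCAB_WORDS * FLOAT_BYTES                   # 8,000,000 bytes, ~7.6 MiB
# One lock factor per word per dimension, as actually allocated:
allocated = VOCAB_WORDS * VECTOR_SIZE * FLOAT_BYTES  # 2,400,000,000 bytes, ~2.2 GiB

print(f"needed:    {needed / 2**20:.1f} MiB")
print(f"allocated: {allocated / 2**30:.2f} GiB")
```

If the ngram hash table is the same size (also an assumption), a similarly oversized vectors_ngrams_lockf array adds another ~2.2 GiB, which goes a long way toward explaining a 7 GB file ballooning past 20 GB in RAM.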

I'm trying to fix this, among other things, in a big clean-up of related code, but that work isn't yet stable and may not appear in a release anytime soon.

If you just need vector-lookup, and not to continue training, using the load_facebook_vectors() method (instead of load_facebook_model()) might avoid some of the nonsense in the meantime, but I'm not sure.

You should still expect some expansion from the on-disk size, both on load, and on the first operation (such as most_similar()) that might trigger calculation & caching of unit-normed word-vectors.

@gojomo
Collaborator

gojomo commented Oct 20, 2020

Closing as dup of #2502. (But also note: essentially fixed in 4.0.0 work.)
