Memory management for training on large data sets #137

Closed
alanakbik opened this issue Oct 10, 2018 · 1 comment
Labels
feature A new feature

Comments

@alanakbik (Collaborator)

In use cases where training data sets are large or little RAM is available, language model embeddings cannot be stored in memory (see #135).

Current solution: The only way to train a model in such cases is to set the embeddings_in_memory flag to False in the trainer classes (TextClassifierTrainer or SequenceTaggerTrainer). With this flag set, embeddings are generated on the fly at each epoch and discarded immediately after use. This solves the memory issue but is computationally expensive, since already computed embeddings are never re-used.
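
For illustration, a minimal sketch of passing this flag to a trainer. It assumes a flair setup from around this time; the import paths, corpus setup, and all train() arguments other than embeddings_in_memory are assumptions, not a verbatim API reference:

```python
# Sketch only: everything except the embeddings_in_memory flag discussed above
# is an assumption about the surrounding flair API of this era.
from flair.embeddings import CharLMEmbeddings
from flair.models import SequenceTagger
from flair.trainers import SequenceTaggerTrainer

corpus = ...  # a tagged corpus loaded elsewhere
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

tagger = SequenceTagger(hidden_size=256,
                        embeddings=CharLMEmbeddings('news-forward'),
                        tag_dictionary=tag_dictionary,
                        tag_type='ner')

trainer = SequenceTaggerTrainer(tagger, corpus)

# With embeddings_in_memory=False, language model embeddings are recomputed
# on the fly each epoch instead of being kept in RAM.
trainer.train('resources/taggers/example-ner',
              embeddings_in_memory=False)
```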

Idea: Use a key-value store to persist embeddings to disk and enable quick lookup of already computed embeddings. A nice side effect is that if we run several experiments on the same dataset, embeddings from earlier runs can be re-used, speeding up parameter-sweep experiments. A sketch of the idea follows below.
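
A generic sketch of the proposed mechanism, using Python's built-in sqlite3 as the key-value store; this is not the implementation that later landed in flair, and the names (EmbeddingCache, cache path) are hypothetical:

```python
# Disk-backed key-value cache for computed embeddings, keyed by sentence text.
import pickle
import sqlite3


class EmbeddingCache:
    """Persist computed embeddings to disk and look them up by key."""

    def __init__(self, path: str = 'embedding_cache.sqlite'):
        self.db = sqlite3.connect(path)
        self.db.execute(
            'CREATE TABLE IF NOT EXISTS embeddings (key TEXT PRIMARY KEY, value BLOB)'
        )

    def get(self, key: str):
        row = self.db.execute(
            'SELECT value FROM embeddings WHERE key = ?', (key,)
        ).fetchone()
        return pickle.loads(row[0]) if row else None

    def put(self, key: str, embedding) -> None:
        self.db.execute(
            'INSERT OR REPLACE INTO embeddings (key, value) VALUES (?, ?)',
            (key, pickle.dumps(embedding)),
        )
        self.db.commit()


def embed_with_cache(sentence_text: str, compute_embedding, cache: EmbeddingCache):
    """Return the cached embedding if present; otherwise compute and persist it."""
    cached = cache.get(sentence_text)
    if cached is not None:
        return cached
    embedding = compute_embedding(sentence_text)  # expensive language model forward pass
    cache.put(sentence_text, embedding)
    return embedding
```

Because the cache lives on disk, a second experiment over the same dataset hits the cache instead of recomputing embeddings, which is where the parameter-sweep speed-up mentioned above would come from.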

@alanakbik (Collaborator, Author)

Will be part of release 0.3 and activated by default for CharLMEmbeddings (it can still be turned off to save disk space).
