
GH-1309: WordEmbeddings replacement that stores vectors in sqlite database #1318

Conversation

timnon (Contributor) commented Dec 10, 2019

Note: reopened this PR, since I accidentally based the old one on my master branch

#1309

The NER tagger of flair eats a few gigabytes of memory when run, e.g., as the backend of a simple Flask application. This is not necessary, since the main memory consumers are the word embeddings, which are simple vector lookups. These can be externalized to an indexed database, e.g. sqlite, though any database would do the job.
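As a rough sketch of that idea (table and function names here are made up for illustration, not the PR's actual implementation), word vectors can be packed into BLOBs in an indexed sqlite table and fetched one word at a time, so the full embedding matrix never has to sit in RAM:

```python
import sqlite3
from array import array

# Minimal sketch: an indexed sqlite table as an on-disk word -> vector map.
# Names (store_vector, lookup_vector) are illustrative, not flair API.
con = sqlite3.connect(":memory:")  # a file path would keep vectors on disk
con.execute("CREATE TABLE embedding (word TEXT PRIMARY KEY, vector BLOB)")

def store_vector(word, vec):
    # pack the float32 vector into raw bytes for the BLOB column
    con.execute("INSERT OR REPLACE INTO embedding VALUES (?, ?)",
                (word, array("f", vec).tobytes()))

def lookup_vector(word, dim=3):
    row = con.execute("SELECT vector FROM embedding WHERE word = ?",
                      (word,)).fetchone()
    if row is None:
        return [0.0] * dim  # zero vector for out-of-vocabulary words
    return list(array("f", row[0]))

store_vector("Bär", [0.25, 0.5, 0.75])
print(lookup_vector("Bär"))  # [0.25, 0.5, 0.75]
```

The PRIMARY KEY index makes each lookup a cheap B-tree search, which is what keeps per-query latency tolerable compared with an in-memory dict.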

I did this by creating a class WordEmbeddingsStore which can replace a WordEmbeddings instance in a flair model via duck typing. Using this, I was able to reduce our NER server's memory consumption from 6 GB to 600 MB (a 10x decrease) by adding a few lines of code. It can be tested using the following lines (also in the docstring). First, create a headless version of a model without word embeddings:

from flair.inference_utils import WordEmbeddingsStore
from flair.models import SequenceTagger
import pickle
tagger = SequenceTagger.load("multi-ner-fast")
WordEmbeddingsStore.create_stores(tagger)
pickle.dump(tagger, open("multi-ner-fast-headless.pickle", "wb"))

and then to run the stored headless model without word embeddings, use:

from flair.data import Sentence
tagger = pickle.load(open("multi-ner-fast-headless.pickle", "rb"))
WordEmbeddingsStore.load_stores(tagger)
text = "Schade um den Ameisenbären. Lukas Bärfuss veröffentlicht Erzählungen aus zwanzig Jahren."
sentence = Sentence(text)
tagger.predict(sentence)

Please advise where to add this functionality; there seems to be no place for inference-related code, so I created inference_utils.py, but some other place might be more appropriate.

I like the current structure because it is a very light integration via duck typing, so no flair base classes need to change, but a tighter integration might also be possible.
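The duck-typing idea above can be sketched with a toy stand-in (method names here are assumptions for illustration, not flair's real interface): any object that answers the same calls the model makes can take the place of an in-memory embedding table, and no base class has to change.

```python
# Toy sketch of duck typing: both classes expose the same `vector` method,
# so code written against one works unchanged with the other. These names
# are illustrative assumptions, not flair's actual interface.
class InMemoryEmbeddings:
    def __init__(self, table):
        self.table = table  # the whole table lives in RAM

    def vector(self, word):
        return self.table.get(word, [0.0, 0.0])

class EmbeddingStore:
    """Same interface, but fetches through a callable, e.g. a sqlite lookup."""
    def __init__(self, fetch):
        self.fetch = fetch  # e.g. a closure around a database query

    def vector(self, word):
        return self.fetch(word) or [0.0, 0.0]  # zero vector for unknown words

def embed_tokens(embeddings, tokens):
    # works with either class: only the `vector` method is required
    return [embeddings.vector(t) for t in tokens]

table = {"Bär": [1.0, 2.0]}
print(embed_tokens(InMemoryEmbeddings(table), ["Bär", "xyz"]))
print(embed_tokens(EmbeddingStore(table.get), ["Bär", "xyz"]))
```

Both calls produce the same result, which is exactly why the swap needs no changes to the consuming code.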

Commit: to save memory, word embedding vectors are stored in external sqlite database
timnon (Contributor, Author) commented Dec 13, 2019

One more thing: the outsourcing is not restricted to sqlite. This could be handled by any database, or also redis, so a general "outsourcing connector" to different backends would be appropriate.
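Such an "outsourcing connector" could look roughly like this (all names are hypothetical, invented for this sketch): each backend only has to implement one lookup method, so sqlite, redis, or a plain dict could sit behind the same interface.

```python
from abc import ABC, abstractmethod

# Hypothetical pluggable-backend sketch: the model only ever calls
# get_vector, so the storage backend can be swapped freely.
class VectorBackend(ABC):
    @abstractmethod
    def get_vector(self, word):
        """Return the vector for `word`, or None if unknown."""

class DictBackend(VectorBackend):
    # In-memory stand-in for testing; a SqliteBackend or RedisBackend would
    # implement the same method with a SELECT or a redis GET respectively.
    def __init__(self, table):
        self.table = table

    def get_vector(self, word):
        return self.table.get(word)

def lookup(backend, word, dim=2):
    vec = backend.get_vector(word)
    return vec if vec is not None else [0.0] * dim  # zero vector for OOV

backend = DictBackend({"Bär": [1.0, 2.0]})
print(lookup(backend, "Bär"))  # [1.0, 2.0]
print(lookup(backend, "xyz"))  # [0.0, 0.0]
```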

Commit: `db` --> `self.db`, zero tensor --> zero list to avoid warning.
alanakbik (Collaborator)

@timnon thanks a lot! This is a good solution to reduce the memory footprint of WordEmbeddings, which many people will find useful. I've made a few small changes: the db variable was not always self.db, which caused errors, and I've changed the zero tensor to a zero list to avoid a torch warning.

I think a more integrated solution of this would be great, but I'm not sure how to best do this. I'll think a bit about this and let you know! Thanks again!

timnon (Contributor, Author) commented Jan 7, 2020

Cool, thanks for the fixes; I tested and everything seems to be good. As written above, any backend to store vectors would work, but the elegance of sqlite is that everything is kept in files, which suits the current storage mechanism.

alanakbik (Collaborator)

Great, will merge now! Thanks again!

alanakbik (Collaborator)

👍

@alanakbik alanakbik merged commit 1626111 into flairNLP:master Jan 7, 2020
@timnon timnon deleted the CH-1309-WordEmbeddings-replacement-with-database branch January 7, 2020 18:14