Phraser requires unnecessary memory #2189

piskvorky · 2018-09-19T14:55:18Z

Currently, Phraser objects (= the trimmed-down version of the full bigram finder Phrases) contain the actual bigrams in an internal attribute called phrasegrams. This is the biggest and most memory-intensive part of a Phraser object.

phrasegrams is a dict of {tuple of strings => (frequency [int], score [float])}. But the int (the frequency count of that particular bigram) is unused. This means we're constructing that int, plus the wrapping tuple, for no good reason, inflating the necessary RAM. See also mailing list discussion.

Task:

Drop the int from Phraser values, leaving only the float.
And while at it, rename (deprecate) the .vocab attribute of Phrases to something more appropriate, for example bigram_counts.
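To illustrate the overhead described above, here is a rough sketch that compares the two value layouts on toy data. The helper `value_overhead` and the sample bigrams are made up for illustration and are not part of gensim; `sys.getsizeof` is shallow and CPython-specific, so treat the numbers as an estimate only.

```python
import sys


def value_overhead(phrasegrams):
    """Estimate the bytes used by the dict's values (not the keys or the dict itself)."""
    total = 0
    for value in phrasegrams.values():
        total += sys.getsizeof(value)
        if isinstance(value, tuple):
            # A tuple's own size excludes the objects it points to, so add them.
            total += sum(sys.getsizeof(item) for item in value)
    return total


# Hypothetical toy data in the current layout: bigram -> (frequency, score).
with_counts = {
    ("new", "york"): (1200, 15.3),
    ("san", "francisco"): (800, 22.1),
}

# Proposed layout: keep only the score, dropping the int and the wrapping tuple.
scores_only = {bigram: score for bigram, (_count, score) in with_counts.items()}

print("values with counts:", value_overhead(with_counts), "bytes")
print("values scores only:", value_overhead(scores_only), "bytes")
```

The saving per entry is the tuple object plus the int; on a real Phraser this repeats for every stored bigram.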


gojomo · 2018-09-19T17:29:16Z

Note the vocab dict of a Phrases also contains unigram counts. So if renamed, a name like frequencies might make sense.

But since renaming could break user code (or necessitate more compatibility-maintaining cruft), I wouldn't rename it unless part of a major refactoring/cleanup. The name vocab isn't that bad in its context-of-initial-assignment (where it has a helpful comment), and is roughly in line with the naming of similar frequency-dicts elsewhere in gensim (such as the word2vec-and-related spaghetti-mountains). And there's a bunch of other variables/methods with _vocab in them in phrases.py, all somewhat related to the vocab dict's role, which may equally need better-names – but those better-names aren't necessarily just search-and-replacing vocab with whatever new name is chosen for the dict-property.

jenishah · 2018-09-21T09:53:46Z

I would like to take this up. Should I rename the .vocab attribute?

menshikh-iv · 2018-09-24T03:50:24Z

I agree with @gojomo: .vocab is really OK. If you only want to rename it, just add a getter/setter with the new name: this doesn't break anything and allows both names to be used (backward compatible). I'm also +1 for the variant where we don't rename or add getters at all, so as not to increase "compatibility-maintaining cruft".
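A minimal sketch of the getter/setter idea, assuming the new name is frequencies (as suggested earlier in the thread). `PhrasesSketch` is a hypothetical stand-in class, not gensim's actual Phrases implementation.

```python
class PhrasesSketch:
    """Hypothetical stand-in for Phrases, showing only the attribute alias."""

    def __init__(self):
        # The existing attribute keeps its name, so current user code still works.
        self.vocab = {}

    @property
    def frequencies(self):
        # New name: reads go straight through to the same dict object.
        return self.vocab

    @frequencies.setter
    def frequencies(self, value):
        # Writes through the new name replace the underlying vocab dict.
        self.vocab = value


phrases = PhrasesSketch()
phrases.frequencies = {"new_york": 3}
phrases.vocab["san_francisco"] = 2
print(phrases.frequencies)
```

Both names refer to one dict, so nothing is copied; the cost is the extra property code that has to be maintained.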

piskvorky added the feature, difficulty easy, and performance labels Sep 19, 2018

souravsingh mentioned this issue Sep 22, 2018

Rename vocab attribute to bigram_counts #2195

Closed

rcortx assigned rcortx and unassigned rcortx Sep 23, 2018

menshikh-iv added the Hacktoberfest label Sep 28, 2018

jenishah mentioned this issue Oct 3, 2018

Reduce Phraser memory usage (drop frequencies) #2208

Merged

menshikh-iv closed this as completed in #2208 Jan 11, 2019
