Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phraser requires unnecessary memory #2189

Closed
piskvorky opened this issue Sep 19, 2018 · 3 comments
Closed

Phraser requires unnecessary memory #2189

piskvorky opened this issue Sep 19, 2018 · 3 comments
Labels
difficulty easy Easy issue: required small fix feature Issue described a new feature Hacktoberfest Issues marked for hacktoberfest performance Issue related to performance (in HW meaning)

Comments

@piskvorky
Copy link
Owner

piskvorky commented Sep 19, 2018

Currently, Phraser objects (= the trimmed-down version of the full bigram finder Phrases) contains the actual bigrams in an internal attribute called phrasegrams. This is the biggest and most memory-intense part of a Phraser object.

phrasegrams is a dict of {tuple of strings => (frequency [int], score [float])}. But the int (the frequency count of that particular bigram) is unused. This means we're constructing that int, plus the wrapping tuple, for no good reason, inflating the necessary RAM. See also mailing list discussion.

Task:

  • Drop the int from Phraser values, leaving only the float.
  • And while at it, rename (deprecate) the .vocab attribute of Phrases to something more appropriate, for example bigram_counts.
@piskvorky piskvorky added feature Issue described a new feature difficulty easy Easy issue: required small fix performance Issue related to performance (in HW meaning) labels Sep 19, 2018
@gojomo
Copy link
Collaborator

gojomo commented Sep 19, 2018

Note the vocab dict of a Phrases also contains unigram counts. So if renamed, a name like frequencies might make sense.

But since renaming could break user code (or necessitate more compatibility-maintaining cruft), I wouldn't rename it unless part of a major refactoring/cleanup. The name vocab isn't that bad in its context-of-initial-assignement (where it has a helpful comment), and is roughly in line with the naming of similar frequency-dicts elsewhere in gensim (such as the word2vec-and-related spaghetti-mountains). And there's a bunch of other variables/methods with _vocab in them in phrases.py, all somewhat related to the vocab dict's role, which may equally need better-names – but those better-names aren't necessarily just search-and-replacing vocab with whatever new name is chosen for the dict-property.

@jenishah
Copy link
Contributor

I would like to take this up. Should I rename the .vocab attribute ?

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Sep 24, 2018

I agree with @gojomo: .vocab are really ok. If you only want to rename it - just add an getter/setter with new name: this don't break anything and allows to use both (backward compatible). I'm also +1 for variant if we don't rename / add getters to don't increase "compatibility-maintaining cruft".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty easy Easy issue: required small fix feature Issue described a new feature Hacktoberfest Issues marked for hacktoberfest performance Issue related to performance (in HW meaning)
Projects
None yet
Development

No branches or pull requests

5 participants