-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lda models load/save backward compatibility across Python versions #1039
Conversation
…ving LDA models across Pythong verions
* Moved LDA model test data files to test_data/ folder. * Saving id2word dictionary in binary format.
@@ -995,7 +996,7 @@ def __getitem__(self, bow, eps=None): | |||
""" | |||
return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics) | |||
|
|||
def save(self, fname, ignore=['state', 'dispatcher'], *args, **kwargs): | |||
def save(self, fname, ignore=['state', 'dispatcher'], separately = None, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: no separately=None
(no spaces in argument params).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
# Save the dictionary separately in json. | ||
id2word_fname = utils.smart_extension(fname, '.bin') | ||
try: | ||
with utils.smart_open(id2word_fname, 'w', encoding='utf-8') as fout: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to open the file as binary and explicitly write binary (utf8) strings into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was saving the id2word dictionary separately in json format(following this. I've changed it now, and added id2word in the separately list, while saving the model. Though I still need to add tests to check this. Will add them first, and let you know.
with utils.smart_open(id2word_fname, 'w', encoding='utf-8') as fout: | ||
json.dump(id2word, fout) | ||
except Exception as e: | ||
logging.warning("failed to save id2words dictionary in %s: %s", id2word_fname, e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a warning, or exception? What's the user contract on storing this "id2words" dictionary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, it would be better to have it as an exception. Removed this now. It should return an exception if it fails to save the model. I still need to add tests to check this.
@@ -0,0 +1 @@ | |||
{"0": "interface", "1": "computer", "2": "human", "3": "response", "4": "time", "5": "survey", "6": "system", "7": "user", "8": "eps", "9": "trees", "10": "graph", "11": "minors"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look like a binary (extension .bin
?) file.
return _pickle.loads(f.read()) | ||
|
||
if sys.version_info > (3, 0): | ||
return _pickle.load(f, encoding='latin1') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely not! What is this latin1
?
The content is (and should be read as) binary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works as a fix for when loading objects in Python 3 which were pickled in Python 2, which gives an exception.
Basically, Python 3 attempts to convert the pickled py2 object into a str object, when we need it to be bytes and gives an exception. I used the latin1 encoding for as a work around for that. (Asked on Stackoverflow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a code comment to explain this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this hack needs to be marked and explained thoroughly in a comment.
I'm not familiar with such py2/py3 pickling work arounds, but isn't there a cleaner way to achieve the same effect? This sticks out like a sore thumb. @tmylk @anmol01gulati
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@piskvorky Umm, I had actually searched quite a lot, and tried various things on my system. This is the only way(a hack actually), I found, through which it works. By the way, I felt, we would not want to have this functionality in the future and could do away with the backward compatibility, if majority of the users shift to one Python 3 later (it's not the case right now though).
I'll open up a new PR to add a comment in the code though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am coding something entirely different and this solution is the only thing that worked for loading python2 pickles in python3... The creators claim that pickle is backwards compatible but apparently only if I pass latin1... Any other way just breaks and burns.
…lity of id2word dictionary in loading a model
858fe87
to
5b606e1
Compare
I've made the necessary changes now. The id2word dictionary is saved in binary format. |
A code comment on `latin`` is needed in a separate PR. |
I've branched this off PR #913 . This PR is specifically for loading LDA models. There was some still some issue in loading(across python vecrsion) word2vec models in PR #913, so thought to separately tackle that issue. This PR looks ready to be merged for now.
@tmylk @piskvorky Please review.