Lda models load/save backward compatibility across Python versions #1039

anmolgulati · 2016-12-07T17:40:31Z

I've branched this off PR #913. This PR is specifically for loading LDA models. There was still some issue in loading word2vec models across Python versions in PR #913, so I thought it better to tackle that issue separately. This PR looks ready to be merged for now.
@tmylk @piskvorky Please review.


piskvorky · 2016-12-08T05:10:45Z

gensim/models/ldamodel.py

@@ -995,7 +996,7 @@ def __getitem__(self, bow, eps=None):
        """
        return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics)

-    def save(self, fname, ignore=['state', 'dispatcher'], *args, **kwargs):
+    def save(self, fname, ignore=['state', 'dispatcher'], separately = None, *args, **kwargs):


PEP8: use `separately=None` (no spaces around `=` in argument defaults).
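For reference, a minimal sketch of the spacing this comment asks for. The function is a hypothetical stand-in that only echoes its arguments, not gensim's actual `save` method; only the parameter names are taken from the diff above.

```python
import inspect


def save(fname, ignore=['state', 'dispatcher'], separately=None, *args, **kwargs):
    """Hypothetical stand-in for the method in the diff; it only echoes its arguments."""
    # PEP 8: no spaces around "=" when it marks a keyword argument
    # or the default value of an unannotated parameter.
    return fname, list(ignore), separately


# The default is still None; only the whitespace differs from the reviewed line.
default = inspect.signature(save).parameters['separately'].default
```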

piskvorky · 2016-12-08T05:12:14Z

gensim/models/ldamodel.py

+        # Save the dictionary separately in json.
+        id2word_fname = utils.smart_extension(fname, '.bin')   
+        try:
+            with utils.smart_open(id2word_fname, 'w', encoding='utf-8') as fout:


I'd prefer to open the file as binary and explicitly write binary (utf8) strings into it.

I was saving the id2word dictionary separately in JSON format (following this). I've changed that now and added id2word to the separately list when saving the model. I still need to add tests to check this, though. I'll add them first and let you know.
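A minimal sketch of what "open the file as binary and explicitly write binary (utf8) strings" could look like for the JSON approach discussed here. It uses the built-in `open` rather than `utils.smart_open`, and the helper names `save_id2word` and `load_id2word` are hypothetical, not part of gensim.

```python
import json
import os
import tempfile


def save_id2word(id2word, path):
    """Serialize the mapping to JSON and write it as explicit UTF-8 bytes."""
    payload = json.dumps(id2word, ensure_ascii=False).encode('utf-8')
    with open(path, 'wb') as fout:  # binary mode: no implicit text encoding
        fout.write(payload)


def load_id2word(path):
    """Read the raw bytes back and decode them explicitly."""
    with open(path, 'rb') as fin:
        raw = json.loads(fin.read().decode('utf-8'))
    # JSON object keys are always strings, so restore the integer ids.
    return {int(key): value for key, value in raw.items()}


id2word = {0: 'interface', 1: 'computer', 2: 'human'}
path = os.path.join(tempfile.mkdtemp(), 'id2word.json')
save_id2word(id2word, path)
restored = load_id2word(path)
```

Because the bytes on disk are produced by an explicit `.encode('utf-8')`, the file content does not depend on the platform's default text encoding or on the Python version that wrote it.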

piskvorky · 2016-12-08T05:13:13Z

gensim/models/ldamodel.py

+            with utils.smart_open(id2word_fname, 'w', encoding='utf-8') as fout:
+                json.dump(id2word, fout)
+        except Exception as e:
+            logging.warning("failed to save id2words dictionary in %s: %s", id2word_fname, e)


Should this be a warning, or exception? What's the user contract on storing this "id2words" dictionary?

Yup, it would be better to have it as an exception. I've removed this now, so saving should raise an exception if it fails to save the model. I still need to add tests to check this.
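The difference between the two contracts discussed above can be sketched as follows. Both helpers are hypothetical illustrations written with the standard library only; they are not the code from this PR.

```python
import logging
import os
import pickle
import tempfile

logger = logging.getLogger(__name__)


def save_strict(obj, path):
    """Let any I/O or pickling error propagate to the caller."""
    with open(path, 'wb') as fout:
        pickle.dump(obj, fout, protocol=2)


def save_lenient(obj, path):
    """Log and swallow the error; the caller cannot tell that nothing was saved."""
    try:
        save_strict(obj, path)
    except Exception as e:
        logger.warning("failed to save %s: %s", path, e)


# A path inside a directory that does not exist, so opening it must fail.
bad_path = os.path.join(tempfile.mkdtemp(), 'no_such_dir', 'model.id2word')

save_lenient({0: 'interface'}, bad_path)  # returns normally, yet no file exists

try:
    save_strict({0: 'interface'}, bad_path)
    strict_failed = False
except OSError:
    strict_failed = True
```

With the lenient variant a later load would fail (or silently lack the dictionary) far from the place where the real problem occurred, which is the concern behind the "user contract" question.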

piskvorky · 2016-12-08T05:13:47Z

gensim/test/test_data/ldamodel_python_2_7.bin

@@ -0,0 +1 @@
+{"0": "interface", "1": "computer", "2": "human", "3": "response", "4": "time", "5": "survey", "6": "system", "7": "user", "8": "eps", "9": "trees", "10": "graph", "11": "minors"}


This doesn't look like a binary (extension .bin?) file.

piskvorky · 2016-12-08T05:14:40Z

gensim/utils.py

-        return _pickle.loads(f.read())
-
+        if sys.version_info > (3, 0):
+            return _pickle.load(f, encoding='latin1')


Absolutely not! What is this latin1?

The content is (and should be read as) binary.

This works as a fix for loading objects in Python 3 that were pickled in Python 2, which otherwise raises an exception.
Basically, Python 3 attempts to convert the pickled py2 object into a str object when we need it to be bytes, and raises an exception. I used the latin1 encoding as a workaround for that. (Asked on Stack Overflow)

Can you add a code comment to explain this?

Yes, this hack needs to be marked and explained thoroughly in a comment.

I'm not familiar with such py2/py3 pickling workarounds, but isn't there a cleaner way to achieve the same effect? This sticks out like a sore thumb. @tmylk @anmol01gulati

@piskvorky Umm, I actually searched quite a lot and tried various things on my system. This is the only way (a hack, actually) I found that works. By the way, I felt we would not want to keep this functionality in the future and could drop the backward compatibility if the majority of users shift to Python 3 later (that's not the case right now, though).
I'll open a new PR to add a comment in the code, though.

I am coding something entirely different and this solution is the only thing that worked for loading python2 pickles in python3... The creators claim that pickle is backwards compatible but apparently only if I pass latin1... Any other way just breaks and burns.
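A small, self-contained sketch of why `encoding='latin1'` makes the difference in the thread above. Python 3's `pickle` decodes Python 2 `str` objects as ASCII by default, which fails on any byte above 0x7F. latin1 maps each of the 256 byte values to one code point, so decoding never fails and is reversible. The standard library documentation notes that `encoding='latin1'` is required for unpickling NumPy arrays pickled by Python 2. The pickle bytes below are written by hand for illustration and are not taken from a gensim model.

```python
import pickle

# Hand-written protocol-2 pickle of the Python 2 byte string '\xe9\x00\xff':
# PROTO 2, SHORT_BINSTRING (length 3), BINPUT 0, STOP.
PY2_PICKLE = b'\x80\x02U\x03\xe9\x00\xffq\x00.'

try:
    pickle.loads(PY2_PICKLE)  # default encoding='ASCII'
    default_failed = False
except UnicodeDecodeError:
    default_failed = True

# latin1 turns every byte into the code point with the same value.
as_text = pickle.loads(PY2_PICKLE, encoding='latin1')

# encoding='bytes' is the other documented option: keep py2 str as bytes.
as_bytes = pickle.loads(PY2_PICKLE, encoding='bytes')
```

`as_text.encode('latin1')` recovers the original bytes exactly, which is why binary payloads survive the round trip. The `encoding='bytes'` alternative avoids the decoding step, but it also turns every Python 2 `str` in the pickle, including attribute names, into `bytes`, so it is not a drop-in replacement for pickled object instances.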


anmolgulati · 2016-12-15T15:29:34Z

I've made the necessary changes now. The id2word dictionary is saved in binary format.
We only create a new file and save the id2word dictionary when it's not marked as 'ignored'. So, when loading the dictionary from the file (in ldamodel), we check whether the file exists, and only then load the dictionary.
To accommodate this, I had to change the testfile() function in test_ldamodel so that a different file is created for each test. Does this sound good?
@piskvorky @tmylk We could merge it then, if it looks fine.

tmylk · 2016-12-22T01:42:59Z

A code comment on `latin1` is needed in a separate PR.

anmolgulati added 10 commits October 29, 2016 21:18

Modified load/save methods to maitain compatibility in loading and saving LDA models across Pythong verions

a4d214f

Added saved LDA models in Python 2.7 and 3.5 environments for testing compatibility

04a4634

Added test for LDA Model compatibility between Python versions

aaae5ff

Modified unpickle method to allow unpickling python 2 objects in python 3

8b2cc42

Created and saved LDAModels with same random_seed in both Python 2.7 and 3.5

c4c1289

* Fixed PEP8 fixes.

8fb383c

* Moved LDA model test data files to test_data/ folder. * Saving id2word dictionary in binary format.

Removed old LDA model files

99cd080

Merge remote-tracking branch 'rare/develop' into lda-pickle-worker

66d5f5e

Fixed numpy as np in test_ldamodel.py

2ecde2c

Recreated lda model files in python 3.5

237eff4

piskvorky requested changes Dec 8, 2016

Added id2word in 'Separately' and created lda models again

35f2dcc

anmolgulati changed the title ~~Lda models load/save backward compatibility across Python versions~~ [WIP] Lda models load/save backward compatibility across Python versions Dec 8, 2016

Pickling id2word Dictionary separately. Also added test to check equality of id2word dictionary in loading a model

5b606e1

anmolgulati force-pushed the lda-pickle-worker branch from 858fe87 to 5b606e1 Compare December 9, 2016 02:06

anmolgulati added 4 commits December 9, 2016 07:50

Removed commented code

3937e62

Minor change.

dac55bc

Changes made

615b91e

Refactored testfile() function

b24620f

anmolgulati changed the title ~~[WIP] Lda models load/save backward compatibility across Python versions~~ Lda models load/save backward compatibility across Python versions Dec 15, 2016

tmylk merged commit e08af7b into piskvorky:develop Dec 22, 2016

This was referenced Dec 22, 2016

Unpickling models across python3 and python2 #853

Closed

Loading and Saving LDA Models across Python 2 and 3. #913

Closed

anmolgulati deleted the lda-pickle-worker branch February 14, 2017 21:51

tmylk mentioned this pull request May 15, 2017

RandomState (Fix to issue #113) breaks backwards compatibility with old LDA models #1082

Closed

menshikh-iv mentioned this pull request Oct 2, 2017

Switch to dill//cloudpickle #558

Closed


piskvorky Dec 8, 2016

anmolgulati Dec 8, 2016

piskvorky Dec 8, 2016

anmolgulati Dec 8, 2016

piskvorky Dec 8, 2016

anmolgulati Dec 8, 2016

piskvorky Dec 8, 2016

piskvorky Dec 8, 2016

anmolgulati Dec 15, 2016

tmylk Dec 22, 2016

piskvorky Dec 27, 2016

anmolgulati Dec 27, 2016

JermellB Sep 13, 2017
