Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloudpickle breaks isinstance #195

Closed
bluenote10 opened this issue Sep 4, 2018 · 6 comments
Closed

cloudpickle breaks isinstance #195

bluenote10 opened this issue Sep 4, 2018 · 6 comments

Comments

@bluenote10
Copy link

bluenote10 commented Sep 4, 2018

Follow up to dask/distributed#2228

In [1]: import cloudpickle

In [2]: class MyObject(object):
   ...:     pass
   ...: 
   ...: 

In [3]: x = MyObject()

In [4]: x2 = cloudpickle.loads(cloudpickle.dumps(x))

In [5]: isinstance(x2, MyObject)
Out[5]: False

Using cloudpickle==0.5.5, Python 2.7.12, Ubuntu 16.04.

@bluescarni
Copy link

Was about to post essentially the same issue:

In [1]: import cloudpickle as cp

In [2]: class foo:
   ...:     pass
   ...: 
   ...: 

In [3]: type(cp.loads(cp.dumps(foo()))) == foo
Out[3]: False

@ssanderson
Copy link

ssanderson commented Sep 6, 2018

I believe this is only an issue for classes defined in __main__ (which is the module that gets used for classes defined in a REPL or a Jupyter Notebook). Classes defined in a regular Python file don't exhibit this issue:

# example_module.py

class Foo(object):
    pass
# example_main.py
import cloudpickle

from example_module import Foo

assert cloudpickle.loads(cloudpickle.dumps(Foo())).__class__ is Foo

print("Finished successfully!")
$ python -m example_main
Finished successfully!

This works normally because pickle's default strategy for serializing classes (which cloudpickle inherits in most cases) is to serialize classes by just storing a module name and attribute, and to deserialize by importing the corresponding module and getting the attribute. This has the nice property that serializing and deserializing a type gives back the same type object, which is why isinstance works as you'd expect.

The reason this doesn't work for classes defined in __main__ is that many cloudpickle users use it for interactive distributed computing, where users want to be able to send instances of their interactively-defined objects to a remote worker. In that context, __main__ on the local machine isn't the same as __main__ on a remote machine, so we can't just tell the remote machine "import __main__ and grab its Foo attribute", because there's no Foo attribute on __main__ on the other machine. A similar failure happens in normal pickle if you save a pickle of an interactively-defined class and then try to reload that pickle in a new REPL session:

In [4]: class Foo(object):
   ...:     pass
   ...:

In [5]: f = open('Foo.pkl', 'w')

In [6]: pickle.dump(Foo(), f)

In [7]:
Do you really want to exit ([y]/n)? y
$ ipython  # Start a new session.

In [1]: import pickle

In [2]: pickle.load(open('Foo.pkl'))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-190d58823df4> in <module>()
----> 1 pickle.load(open('Foo.pkl'))

/usr/lib/python2.7/pickle.pyc in load(file)
   1382
   1383 def load(file):
-> 1384     return Unpickler(file).load()
   1385
   1386 def loads(str):

/usr/lib/python2.7/pickle.pyc in load(self)
    862             while 1:
    863                 key = read(1)
--> 864                 dispatch[key](self)
    865         except _Stop, stopinst:
    866             return stopinst.value

/usr/lib/python2.7/pickle.pyc in load_global(self)
   1094         module = self.readline()[:-1]
   1095         name = self.readline()[:-1]
-> 1096         klass = self.find_class(module, name)
   1097         self.append(klass)
   1098     dispatch[GLOBAL] = load_global

/usr/lib/python2.7/pickle.pyc in find_class(self, module, name)
   1130         __import__(module)
   1131         mod = sys.modules[module]
-> 1132         klass = getattr(mod, name)
   1133         return klass
   1134

AttributeError: 'module' object has no attribute 'Foo'

@ssanderson
Copy link

ssanderson commented Sep 6, 2018

One way we could potentially try to solve this would be to change the deserialization algorithm for __main__-defined classes to first try to get an appropriately-named attribute from __main__, falling back to the current copying behavior. That would solve the confusion here while still supporting the distributed development case, but it would make our model for serializing classes more complex (and it's already pretty complex...). I'd be curious to hear thoughts from other cloudpickle devs on this.

Another option might be to try to trigger a warning if an isinstance call is made on a type that was fully-serialized/deserialized by cloudpickle.

@ogrisel
Copy link
Contributor

ogrisel commented Feb 5, 2019

The problem is that most of the time you actually want to redefine the class: cloudpickle is typically used to add support for cluster computing (PySpark / dask / ray) to interactive programming environment (e.g. a Python shell, script or Jupyter Notebook):

If the user iteratively refines the implementation of a function or a class, you actually want the new code to be taken into account in subsequent calls that involve new instances of the class whose code has been changed. I don't see how this is compatible with what is asked by OP.

@ogrisel
Copy link
Contributor

ogrisel commented Feb 16, 2019

@bluenote10 @bluescarni @ssanderson I implemented a fix for this in #246. Feel free to give it a try.

@ogrisel
Copy link
Contributor

ogrisel commented May 15, 2019

cloudpickle 1.1.0 is out with a fix for this problem.

@ogrisel ogrisel closed this as completed May 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants