Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write a WeakValueDictionary with safer key removal #13394

Closed
nbruin opened this issue Aug 24, 2012 · 138 comments
Closed

Write a WeakValueDictionary with safer key removal #13394

nbruin opened this issue Aug 24, 2012 · 138 comments

Comments

@nbruin
Copy link
Contributor

nbruin commented Aug 24, 2012

On ticket #12313 we found that the use of WeakValueDictionaries as caches can cause removal callbacks in rather harsh environments. Normal WeakValueDictionaries remove keys with dead values by looking up the key. This involves Python equality testing on the key, which can cause any kind of operation in Sage. We need a dictionary where we can delete entries without key comparisons. See below for possible strategies.

To the release manager:

Apply attachment: trac13394-weak_value_dictionary.patch

Upstream: None of the above - read trac for reasoning.

CC: @simon-king-jena

Component: memleak

Author: Simon King, Nils Bruin

Reviewer: Nils Bruin, Simon King

Merged: sage-5.13.beta3

Issue created by migration from https://trac.sagemath.org/ticket/13394

@nbruin nbruin added this to the sage-5.11 milestone Aug 24, 2012
@nbruin

This comment has been minimized.

@nbruin
Copy link
Contributor Author

nbruin commented Aug 24, 2012

comment:2

The standard Python WeakValueDictionary is built on top of a normal dictionary, built on top of a normal dictionary. Let's say we have

W = WeakValueDictionary()

There is an underlying D = dict. If we do

W[k] = v

then it really executes

D[k] = KeyedRef(v, k, remove_callback)

This is a weak reference to v, so v can still get collected. If it does, the callback can ensure that

del D[k]

gets executed. This locates the entry by computing the hash of h, finding the right bucket and finding an equal key by comparison. When we create the entry, we have better lookup information, though: We already have hash(k) and id(v). These two don't necessarily pin down the entry in the dictionary, but in our applications they do and for a WeakValueDictionary identical v would be the same (dead) KeyedRef anyway, so it wouldn't hurt to remove them all. We need one more method on our dictionary:

D.remove_entry_by_hash_and_value_id(h,i):
    """
    remove all entries { k : v } from the dictionary that have
    hash(k) == h   and  id(v) == i
    """ 

Then the callback can be

function remove(ref):
    D.remove_entry_by_hash_and_value_id(hash(ref.key),id(ref))

and the KeyedRef that gets stored would be

KeyedRef(v,hash(k),remove)

(normal precautions to put the dict reference in a closure or a class should apply here in defining remove)

Or one could just immediately integrate the KeyedRefs in the design. See also #13387 for discussions about dictionary designs.

In either case, the main effect is that removal does not cause any hashing or comparison methods to be called on k anymore.

@jdemeyer jdemeyer modified the milestones: sage-5.11, sage-5.12 Aug 13, 2013
@simon-king-jena
Copy link
Member

Attachment: weak_value_dictionary.pyx.gz

Proof of concept

@simon-king-jena
Copy link
Member

comment:5

I have attached a proof of concept. I think the class WVD defined in the attachment is feature-complete. According to my benchmarks, it is faster than weakref.WeakValueDictionary, and it is safer.

Performance

sage: def test_dict(D,L):
....:     for k,v in L:
....:         D[k] = v
....:     for k,v in L:
....:         if k not in D:
....:             raise RuntimeError("containment")
....:     for k,v in L:
....:         D[k] = v
....:     for k in D.iterkeys():
....:         if k not in D:
....:             raise RuntimeError("second containment")
....:     for k,v in L:
....:         assert D.get(k)==v
....:         del D[k]
....:     for k,v in L:
....:         if k in D:
....:             raise RuntimeError("anti-containment")
....:         
sage: L = [(p,GF(p)) for p in prime_range(2*10^4)]
sage: from weakref import WeakValueDictionary
sage: D = WeakValueDictionary()
sage: %timeit test_dict(D,L)
10 loops, best of 3: 43.2 ms per loop
sage: D = WVD()
sage: %timeit test_dict(D,L)
10 loops, best of 3: 29.4 ms per loop

Hence, the WVD is faster than the weakref.WeakValueDictionary.

Safety

sage: class Vals(object): pass
sage: class Keys:                       
....:     def __init__(self, val):
....:         self.val = weakref.ref(val)
....:     def __hash__(self):
....:         return hash(self.val())
....:     def __eq__(self, other):
....:         return self.val() == other.val()
....: 
sage: ValList = [Vals() for _ in range(10)]
sage: D = WeakValueDictionary()
sage: for v in ValList:
....:     D[Keys(v)] = v
....:     
sage: len(D)
10
sage: del ValList
Exception KeyError: (<__main__.Keys instance at 0xdcfc96c>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfc90c>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfcc2c>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfc9ec>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfca4c>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfc9cc>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfc98c>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfccac>,) in <function remove at 0xdcb9df4> ignored
Exception KeyError: (<__main__.Keys instance at 0xdcfca0c>,) in <function remove at 0xdcb9df4> ignored
sage: len(D)
10

Hence, the weakref.WeakValueDictionary is unsafe.

sage: ValList = [Vals() for _ in range(10)]
sage: D = WVD()
sage: for v in ValList:
....:     D[Keys(v)] = v
....:     
sage: len(D)
10
sage: del ValList
sage: len(D)
1
sage: del v
sage: len(D)
0

Hence, the WVD is safe (or at least: safer...)

I suppose I should add tests and documentation to the proof of concept, put into sage.misc and rename WVD into sage.misc.weak_dict.WeakValueDictionary. And then we should see if we can really smoothly replace weakref.WeakValueDictionary in Sage.

@nbruin
Copy link
Contributor Author

nbruin commented Oct 28, 2013

comment:6

Excellent work! I like how you found a way to reuse python dictionaries. I think there is one fairly easy optimization that you can make that is fairly similar to how we ended up implementing the buckets for MonoDict and TripleDict: Presently, your bucket is a list of tuples. That provides an extra layer of indirection, meaning both slower access and more memory use and allocation.

If instead you "inline" the tuples by making the bucket a list with the layout [key1, weakref_to_value1, key2, weakref_to_value2,...] you can save yourself some overhead. Normally these lists should only contain one key-value pair anyway.

It may be worth browsing through MonoDict anyway. There may more more little tricks that we used there that may apply here and that I don't remember right now.

@simon-king-jena
Copy link
Member

comment:7

Replying to @nbruin:

If instead you "inline" the tuples by making the bucket a list with the layout [key1, weakref_to_value1, key2, weakref_to_value2,...] you can save yourself some overhead.

Good point.

In addition, I think that weakref_to_value_n.key used for the callback should be improved: Storing hash(key_n) should be enough. Namely, it allows to find the correct hash bucket, and then the hash bucket is searched to find weakref_to_value_n that is subject to the callback. The latter is simply comparing memory addresses.

@simon-king-jena
Copy link
Member

Branch: u/SimonKing/ticket/13394

@simon-king-jena
Copy link
Member

comment:9

I have posted an initial version, that still lacks documentation and tests (and with the next commit I will also fix some trailing whitespace).

I am afraid that the current version of a weak value dictionary is not faster than what I have described in the proof of concept, although I used the improved bucket layout and am using C-API functions rather thoroughly.

Anyway. I guess part of the documentation will be benchmarks for each separate task. Then we will see what should be further improved.

@simon-king-jena
Copy link
Member

Changed branch from u/SimonKing/ticket/13394 to none

@simon-king-jena
Copy link
Member

New commits:

[changeset:f0ed60f]Initial version of a safer and faster WeakValueDictionary

@simon-king-jena
Copy link
Member

Commit: f0ed60f

@simon-king-jena
Copy link
Member

Branch: u/SimonKing/ticket/13394

@simon-king-jena
Copy link
Member

comment:11

I wonder: Is it really a good idea to have a dictionary storing one list for each hash value? Granted, it does solve the problem of defunct keys in the callback of weak values. However, storing the hash buckets similar to what we do for TripleDict might turn out to be faster.

Well, I guess one should first have a version that works and is faster and safer than weakref.WeakValueDictionary. And then we can further optimise.

@nbruin
Copy link
Contributor Author

nbruin commented Oct 28, 2013

comment:12

Replying to @simon-king-jena:

I wonder: Is it really a good idea to have a dictionary storing one list for each hash value?

Well, it's not entirely optimal of course. Python's own dict already has a mechanism to deal with hash collisions, so having the lists stored is, strictly speaking, an unnecessary indirection. However, breaking open python's dict implementation to access the hash buckets there is going to be a very hard to maintain solution (you'd basically be patching python and implement an extra "delete by hash and value id" routine. Entirely doable, but we'd be stuck with a patched python to eternity).

There are two solutions:

  • benefit from Python's insanely optimized dict and swallow the cost of an extra indirection
  • abandon Python's dict entirely and implement your own hash table.

We followed the latter for MonoDict and TripleDict because the key semantics there are so different that borrowing python's dict wasn't really an option. For WeakValueDictionary it is an option to borrow from dict. I suspect you'd have to work pretty hard to come close to that performance.

Note that almost all of your lists are just going to be pairs. Python internally is making such things all the time: arguments tend to be packaged and copied as tuples.

I guess that brings us to a third possibility: Finish the proof of concept, show the python community what the problem is with WeakValueDictionary, and suggest the extra hook we need to make it safe. Then we might end up with a safe and fast WeakValueDictionary in python proper. That would only be in Python3+, though, so we'd have to backport to Python2.7.

There are already some issues reported: http://bugs.python.org/issue7105 http://bugs.python.org/issue17816. You might want to check if your implementation fixes those.

@simon-king-jena
Copy link
Member

comment:13

Replying to @nbruin:

There are already some issues reported:
http://bugs.python.org/issue7105

I did not address this one: I do not switch garbage collection off during iteration. I could, of course. Later perhaps.

http://bugs.python.org/issue17816.

This is fixed in my implementation. A slight variation of the example that issue 17816 is proposing, showing that the weakref version of WeakValueDict is a mess:

sage: import weakref
sage: import sage.misc.weak_dict
sage: class A(object):
....:     def __init__(self,n):
....:         self.n=n
....:     def __repr__(self):
....:         return "A(%d)"%self.n
....:     
sage: def mess(D, n):
....:     L=[A(i) for i in range(n)]
....:     for i in range(n-1):
....:         j=(i+10)%n
....:         D[L[i]]=L[j]
....:     return D
....: 
sage: D = weakref.WeakValueDictionary()
sage: D = mess(D,10000)
sage: len(D)
sage: D.clear()
Exception KeyError: (A(6760),) in <function remove at 0xbe5110c> ignored
Exception KeyError: (A(6761),) in <function remove at 0xbe5110c> ignored
...
Exception KeyError: (A(968),) in <function remove at 0xbe5110c> ignored
Exception KeyError: (A(958),) in <function remove at 0xbe5110c> ignored
sage: len(D)   # in spite of the errors, the items *are* removed
0

Instead, sage.misc.weak_dict.WeakValueDictionary works just fine:

sage: D = sage.misc.weak_dict.WeakValueDictionary()
sage: D = mess(D,10000)
sage: len(D)
9000
sage: D.clear()
sage: len(D)
0

@simon-king-jena
Copy link
Member

comment:14

PS: In a yet-to-be-pushed commit, I am replacing all weakref.WeakValueDictionary (e.g., used in CachedRepresentation) by sage.misc.weak_dict.WeakValueDictionary. I am now running doctests. Keep your fingers crossed...

@simon-king-jena
Copy link
Member

comment:15

Here is an example concerning "garbage collection during iteration". Features of the example:

  • Many hash collisions
  • Reference cycles, so that Python's cyclic garbage collector comes into play
  • The garbage collection is postponed until after the first iteration.

First, with weakref.WeakValueDictionary:

sage: class Cycle(object):
....:     def __init__(self, n):
....:         self.n = n
....:         self.ref = self
....:     def __cmp__(self, other):
....:         c = cmp(type(self),type(other))
....:         if c:
....:             return c
....:         return cmp(self.n, other.n)
....:     def __repr__(self):
....:         return "Cyc(%s)"%self.n
....:     
sage: class Keys(object):
....:     def __init__(self, n):
....:         self.n = n
....:     def __hash__(self):
....:         return self.n%5
....:     def __cmp__(self, other):
....:         c = cmp(type(self),type(other))
....:         if c:
....:             return c
....:         return cmp(self.n, other.n)
....:     def __repr__(self):
....:         return "Key(%s)"%self.n
....:     
sage: C = [Cycle(n) for n in range(100)]
sage: K = [Keys(n) for n in range(100)]
sage: import gc
sage: import sage.misc.weak_dict
sage: import weakref
sage: D = weakref.WeakValueDictionary()
sage: for k,c in zip(K,C):
....:     D[k] = c
....:     
sage: del c
sage: gc.disable()
sage: del C
sage: for k,c in D.iteritems():
....:     print k,c
....:     gc.enable()
....:     _ = gc.collect()
....:     
Key(0) Cyc(0)
Traceback (most recent call last):
...
RuntimeError: dictionary changed size during iteration

And with sage.misc.weak_dict.WeakValueDictionary:

sage: D = sage.misc.weak_dict.WeakValueDictionary(zip(K,C))
sage: len(D)
100
sage: gc.disable()
sage: del C
sage: for k,c in D.iteritems():
....:     print k,c
....:     gc.enable()
....:     _ = gc.collect()
....:     
Key(0) Cyc(0)
Traceback (most recent call last):
...
IndexError: list index out of range

So, there is an error, too. Perhaps one could improve the error message.

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Oct 28, 2013

Changed commit from f0ed60f to c3dba98

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Oct 28, 2013

Branch pushed to git repo; I updated commit sha1. New commits:

[changeset:c3dba98]Replace weakref.WeakValueDictionary by sage.misc.weak_dict.WeakValueDictionary
[changeset:17b0236]Documentation for WeakValueDictionary

@simon-king-jena
Copy link
Member

Author: Simon King

@simon-king-jena
Copy link
Member

Upstream: None of the above - read trac for reasoning.

@simon-king-jena
Copy link
Member

comment:17

With the current commits, weakref.WeakValueDict is replaced by sage.misc.weak_dict.WeakValueDict everywhere in Sage. All doc tests pass. The doctest coverage of the new module is 100%.

Hence, I make it "needs review", and we will see whether we will report upstream.

Next, I'll try to construct finer grained benchmarks, to see if there are aspects in which the implementation in weakref is still better.


New commits:

[changeset:c3dba98]Replace weakref.WeakValueDictionary by sage.misc.weak_dict.WeakValueDictionary
[changeset:17b0236]Documentation for WeakValueDictionary

@simon-king-jena simon-king-jena modified the milestones: sage-6.0, sage-5.13 Nov 2, 2013
@jdemeyer

This comment has been minimized.

@nbruin
Copy link
Contributor Author

nbruin commented Nov 2, 2013

comment:102

Replying to @simon-king-jena:

Nils, from the discussion at cython-users, it seems that the reason for defining dict.tp_print is that for large dicts one would not like to create a giant string and return it, but would like to print it bit by bit.

I'm pretty sure that's why it was originally designed. It seems that subsequently tp_print wasn't found to be all that useful and hence effectively removed. I haven't checked, but Stefan's comments suggest that it's not used in Python3. I can see how that would happen: If one has a huge data structure, one doesn't write it to a file in one go. One dumps it more carefully.

And this may give rise to a suggestion: Python could allow __str__ resp. __repr__ to become iterators returning a list of strings, and then it could print them bit by bit.

It's an elegant idea and probably how tp_print would be implemented if iterators had been around earlier, but I'm pretty sure one wouldn't want to slow down __str__ etc. (internally they are C-speed for many types, because they are slotted!) in general to gain a use-case that has been shown to very rarely occur.

@jdemeyer
Copy link
Contributor

jdemeyer commented Nov 2, 2013

comment:103

Please check that the documentation builds fine. I get:

docstring of sage.misc.weak_dict.WeakValueDictionary:45: WARNING: Literal block expected; none found.

(but I'm not entirely sure this was with the latest version of the patch)

@simon-king-jena
Copy link
Member

comment:104

Replying to @jdemeyer:

(but I'm not entirely sure this was with the latest version of the patch)

Well, there is only one version of the patch...
But you are right, I get this error as well.

@simon-king-jena
Copy link
Member

Work Issues: Docbuild

@simon-king-jena
Copy link
Member

Attachment: trac13394-weak_value_dictionary.patch.gz

Mercuarial patch in old folder layout

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Nov 2, 2013

Changed commit from 11bd210 to 1aedffe

@sagetrac-git
Copy link
Mannequin

sagetrac-git mannequin commented Nov 2, 2013

Branch pushed to git repo; I updated commit sha1. New commits:

[changeset:1aedffe]Fix doc format

@simon-king-jena
Copy link
Member

comment:106

I have updated both the hg patch and the git branch (it was only needed to remove one ":"). To me, the documentation looks good.

Apply trac13394-weak_value_dictionary.patch

@simon-king-jena
Copy link
Member

Changed work issues from Docbuild to none

@nthiery
Copy link
Contributor

nthiery commented Nov 3, 2013

comment:107

Thanks everybody for your work on this ticket. Since #10963 now depends on it, I hope it will get in soon!

@nbruin
Copy link
Contributor Author

nbruin commented Nov 3, 2013

comment:108

Patch applies to 5.12 and results in a functional weak_dict.pyx and diffs pretty close to something that was already reviewed, so back to positive.

@jdemeyer
Copy link
Contributor

jdemeyer commented Nov 3, 2013

comment:109

Removed git branch to reduce confusion.

@jdemeyer
Copy link
Contributor

jdemeyer commented Nov 3, 2013

Changed commit from 1aedffe to none

@jdemeyer
Copy link
Contributor

jdemeyer commented Nov 3, 2013

Changed branch from u/SimonKing/ticket/13394 to none

@jdemeyer
Copy link
Contributor

jdemeyer commented Nov 6, 2013

Merged: sage-5.13.beta3

@saraedum
Copy link
Member

comment:111

The WeakValueDictionary behaves differently from regular dicts for unhashable keys:

sage: {}[[]]
TypeError: unhashable type: 'list'
sage: sage.misc.weak_dict.WeakValueDictionary()[[]]
KeyError: []

This is because http://docs.python.org/2/c-api/dict.html#PyDict_GetItem does not throw a TypeError. Was this done on purpose, or should I open a ticket for it?

@saraedum
Copy link
Member

comment:112

This is now #15956.

Replying to @saraedum:

The WeakValueDictionary behaves differently from regular dicts for unhashable keys:

sage: {}[[]]
TypeError: unhashable type: 'list'
sage: sage.misc.weak_dict.WeakValueDictionary()[[]]
KeyError: []

This is because http://docs.python.org/2/c-api/dict.html#PyDict_GetItem does not throw a TypeError. Was this done on purpose, or should I open a ticket for it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants