Sending a Map across a process may degrade hashing performance #283
Labels:

- bug: Defects, unintended behaviour, etc
- performance: Changes related to improving performance
- std: Changes related to the standard library
Inko's Map implementation caches the hash value of its pairs, and reuses these
hash codes in several places. When the keys of a Map are regular objects (e.g.
a custom defined `User` object), the hash code is derived from the object's
memory address.
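As a rough analogue, an address-derived identity hash can be sketched in Python, where `id()` returns an object's address in CPython (the helper name and truncation are hypothetical, not Inko's actual derivation):

```python
class User:
    """Stand-in for a custom defined object with no custom hash."""


def identity_hash(obj):
    # Hypothetical sketch: derive the hash code from the object's memory
    # address (CPython's id() returns the address), truncated to 32 bits.
    return id(obj) & 0xFFFFFFFF


a = User()
b = User()
# Two distinct live objects occupy distinct addresses, so their
# identity-derived hashes will (almost always) differ.
```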
When a process A sends a map M to process B, all data is copied into B's heap.
As part of this process, the addresses of the copied objects change. Because the
addresses change, so will the hash values.
This may degrade performance of Map objects sent across processes, as the cached
hash values will no longer match the true hash values.
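The failure mode can be sketched with a toy Python analogue (not Inko's actual implementation): a map that caches hashes at insertion time places pairs in buckets derived from the old address, so a lookup that hashes the key's new address probes the wrong bucket.

```python
class Pair:
    def __init__(self, key, value, hash_code):
        self.key = key
        self.value = value
        self.hash = hash_code  # cached at insertion time


class CachedHashMap:
    """Toy open-addressing map that caches each pair's hash code."""

    def __init__(self, capacity=8):
        self.slots = [None] * capacity

    def insert(self, key, value, hash_code):
        index = hash_code % len(self.slots)
        while self.slots[index] is not None:  # linear probing
            index = (index + 1) % len(self.slots)
        self.slots[index] = Pair(key, value, hash_code)

    def lookup(self, key, hash_code):
        index = hash_code % len(self.slots)
        for _ in range(len(self.slots)):
            pair = self.slots[index]
            if pair is None:
                return None
            if pair.key == key:
                return pair.value
            index = (index + 1) % len(self.slots)
        return None


table = CachedHashMap()
# Insert while the key lives at (pretend) address 0x1001 in process A.
table.insert("user", 42, 0x1001)
# After copying into process B the key lives at 0x2004; hashing the new
# address lands in a different bucket, and the lookup finds nothing.
```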
This may not be a problem in practice, though. Hash collisions would require
process B to at some point reuse the memory (that once contained the source of
the copied object) released by process A. But when this does happen, there may
be an increase in hash collisions.
The cached hash value is used in two places:

- `Map.insert_pair`
- `Map.rehash_pair`

The method `Map.insert_pair` is only used when adding new keys, so we can pass
the hash value as an additional argument. `Map.rehash_pair` is used for
rehashing every pair in the Map. Calculating hash values again may end up being
expensive, especially for large maps.
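A Python sketch of the two call sites (the names mirror `Map.insert_pair` and `Map.rehash_pair`, but the code is a hypothetical analogue): insertion can simply accept the hash the caller already computed, while rehashing must produce a hash for every pair, which is the potentially expensive pass.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Pair:
    key: Any
    value: Any
    hash: int  # cached hash code


def insert_pair(slots, pair, hash_code):
    # New keys arrive with a hash the caller just computed, so it can be
    # passed in as an extra argument instead of being derived again.
    pair.hash = hash_code
    index = hash_code % len(slots)
    while slots[index] is not None:  # linear probing
        index = (index + 1) % len(slots)
    slots[index] = pair


def rehash(slots, hash_func):
    # Rehashing recomputes every pair's hash; for a large map this full
    # pass over all keys is the potentially expensive part.
    new_slots = [None] * (len(slots) * 2)
    for pair in slots:
        if pair is not None:
            insert_pair(new_slots, pair, hash_func(pair.key))
    return new_slots
```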
One possible solution is to store the owning process in a Map. When rehashing,
we check if this equals the current process. If not, we first recompute all
hash values, then update this field to point to the current process. For this
to work, we'd have to optimise `std::process.current` to not allocate a new
process object on every call; instead it should reuse the same object within
the same process. Without this optimisation, the check to perform
(`cached process == current process`) would be too expensive.

With that said, this solution is very specific to Map objects. I can see there
being more cases where copying an object may require recomputing/invalidating
some data. We could introduce some sort of trait (e.g. `AfterCopy`) that
provides a method, called when receiving a message. This method can then be
used to do whatever is necessary. This however will introduce additional
overhead for every message received, and many (if not almost all) messages
will not make use of this hook, meaning the overhead is a waste.
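Such a hook could look roughly like this (hedged Python sketch; `AfterCopy`, `after_copy`, and `receive_message` are hypothetical names, and `id()` stands in for an address-derived hash):

```python
class AfterCopy:
    """Hypothetical trait: implementors are notified after their data is
    copied into the receiving process's heap."""

    def after_copy(self):
        raise NotImplementedError


class Map(AfterCopy):
    def __init__(self):
        self.pairs = []  # (key, value, cached_hash) triples

    def insert(self, key, value):
        self.pairs.append((key, value, id(key)))

    def after_copy(self):
        # Recompute every cached hash from the keys' new addresses.
        self.pairs = [(k, v, id(k)) for k, v, _ in self.pairs]


def receive_message(message):
    # The runtime would run this check for *every* message, which is the
    # overhead the hook imposes even on messages that don't need it.
    if isinstance(message, AfterCopy):
        message.after_copy()
    return message
```

Here the per-message `isinstance` check stands in for the dispatch cost described above: it is paid on every receive, even though most messages never implement the hook.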
I'm currently not sure what the best (and least complex/over-engineered)
solution to this problem is.