Optimize _dfs_convert by exiting early for common types. #429
Note to self: this PR needs a few updates, along with tests/changelog. Tests are failing because of #431, which should be solved in #433.
Codecov Report
@@ Coverage Diff @@
## master #429 +/- ##
=======================================
Coverage 76.59% 76.59%
=======================================
Files 45 45
Lines 7079 7081 +2
=======================================
+ Hits 5422 5424 +2
Misses 1657 1657
The issue of numpy float64s also being floats is definitely something to look into. If that's a problem, then you should use |
@vyasr I agree with your comments in general. A review of the specific implementation in this PR would be helpful. The short answer on NumPy float64 was this:
@bdice makes sense, thanks. I didn't look at the code at all, I just responded to your previous comment without realizing that you had already addressed it. I'll aim to review in the next couple of days (before the weekend).
LGTM
Description
The internal function `_dfs_convert` recursively converts containers into their synced equivalents. That is, when reading a state point or document, the `_dfs_convert` function will change a `dict` into a synced dictionary.

Observation: Most signac jobs have state points and documents that are relatively flat. That is, most values in a state point or document are common non-container types (string, float, int, bool) and only a few are containers (dict, list).

I noticed that the code path taken for non-container types must exhaustively test whether the value is a `Mapping`, tuple, list, or NumPy type. Only at the very end does it return the raw value.

Matching the "fast path" to the expected distribution of data is a micro-optimization, but it pays off. Benchmarks are below.
Motivation and Context
Speeds up all operations in signac that involve synced structures (e.g. reading data from a state point or document).
Benchmark:
I measured the total time for checking status, running, and submitting a sample workflow of 1000 jobs in signac flow with 3 operations with pre/post conditions. The state points and documents had no nested dicts/lists.
Before: 86.0 seconds total, 7.749 seconds spent in `_dfs_convert`.

After: 78.3 seconds total, 1.378 seconds spent in `_dfs_convert`.

This suggests a speedup of ~5x in that function, but it will depend heavily on the data sizes and types used.
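For reference, the arithmetic behind these figures can be checked with a few lines of Python. The numbers are the ones reported above; the variable names are only illustrative.

```python
# Timings reported in the benchmark above (seconds).
total_before, total_after = 86.0, 78.3
func_before, func_after = 7.749, 1.378

# Speedup of the function itself.
func_speedup = func_before / func_after

# Time saved overall and inside the function.
total_saved = total_before - total_after
func_saved = func_before - func_after

# Relative reduction in total workflow time.
total_reduction_pct = 100 * total_saved / total_before

print(f"function speedup: {func_speedup:.2f}x")
print(f"total time saved: {total_saved:.1f} s ({total_reduction_pct:.1f}%)")
print(f"saved inside the function: {func_saved:.3f} s")
```

The function-level speedup comes out at about 5.6x, and the total workflow time drops by roughly 9%. The time saved inside the function (about 6.4 s) accounts for most, but not all, of the 7.7 s difference in total time, so some of the gap may be run-to-run variation.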
Types of Changes
[1] The change breaks (or has the potential to break) existing functionality.