Optimize _dfs_convert by exiting early for common types. #429

bdice · 2020-12-13T21:56:25Z

Description

The internal function _dfs_convert recursively converts containers into their synced equivalents. That is, when reading a state point or document, the _dfs_convert function will change a dict into a synced dictionary.

Observation: Most signac jobs have state points and documents that are relatively flat. That is, most values in a state point or document are common non-container types (string, float, int, bool) and only a few are (dict, list).

I noticed that the code path taken for non-container types must exhaustively test whether the value is a Mapping, tuple, list, or NumPy type. Only at the very end it returns the raw value.

Matching the "fast path" to the expected distribution of data is a micro-optimization, but it pays off. Benchmarks are below.

Motivation and Context

Speeds up all operations in signac that involved synced structures (e.g. reading data from a state point or document).

Benchmark:

I measured the total time for checking status, running, and submitting a sample workflow of 1000 jobs in signac flow with 3 operations with pre/post conditions. The state points and documents had no nested dicts/lists.

Before: 86.0 seconds total, 7.749 seconds spent in _dfs_convert.
After: 78.3 seconds total, 1.378 seconds spent in _dfs_convert.

This suggests a speedup of ~5x in that function, but it will depend heavily on the data sizes and types used.

Types of Changes

Documentation update
Bug fix
New feature
Breaking change¹

¹The change breaks (or has the potential to break) existing functionality.

Checklist:

I am familiar with the Contributing Guidelines.
I agree with the terms of the Contributor Agreement.
My name is on the list of contributors.
My code follows the code style guideline of this project.
The changes introduced by this pull request are covered by existing or newly introduced tests.

If necessary:

I have updated the API documentation as part of the package doc-strings.
I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

bdice · 2020-12-14T07:27:15Z

Note to self: this PR needs a few updates, along with tests/changelog. Tests are failing because of #431, which should be solved in #433.

NumPy float64s are registered as subclasses of float. Need to check if that is a problem or not.
Should the early-exit check use isinstance(x, float) or type(x) == float? Even though isinstance is typically preferred, it might not be the right choice here, depending on the NumPy situation previously mentioned and/or performance considerations.

…ppropriate.

codecov · 2020-12-15T01:46:24Z

Codecov Report

Merging #429 (43605a9) into master (7db7d57) will increase coverage by 0.00%.
The diff coverage is 90.00%.

@@           Coverage Diff           @@
##           master     #429   +/-   ##
=======================================
  Coverage   76.59%   76.59%           
=======================================
  Files          45       45           
  Lines        7079     7081    +2     
=======================================
+ Hits         5422     5424    +2     
  Misses       1657     1657

Impacted Files	Coverage Δ
signac/core/synceddict.py	`96.01% <90.00%> (+0.03%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7db7d57...43605a9. Read the comment docs.

vyasr · 2020-12-16T16:21:33Z

Note to self: this PR needs a few updates, along with tests/changelog. Tests are failing because of #431, which should be solved in #433.

* NumPy float64s are registered as subclasses of `float`. Need to check if that is a problem or not.

* Should the early-exit check use `isinstance(x, float)` or `type(x) == float`? Even though `isinstance` is typically preferred, it might not be the right choice here, depending on the NumPy situation previously mentioned and/or performance considerations.

The issue of numpy float64s also being floats is definitely something to look into. If that's a problem, then you should use type(x) is float (rather than ==). If it's not, then you'll probably want to stick with isinstance, you probably won't get performance benefits from directly checking types and if anything it might slow the code down. isinstance being slow is generally only a problem for abcs, which implement their instance checks in pure Python.

bdice · 2020-12-16T18:38:46Z

@vyasr I agree with your comments in general. A review of the specific implementation in this PR would be helpful. The short answer on NumPy float64 was this:

Reading data from the disk will always be type float (because that's how the JSON is parsed)
Using a method like "reset state point" will use whatever types are in the dict provided by the user and it's the job of _dfs_convert to convert them to desired (synced) types. I used type(x) in (bool, float, int, type(None), str) because I didn't want to capture the numpy.float64 class (I wanted to pass it through the "normal" path that will call x.item() to extract the Python float)

vyasr · 2020-12-16T21:28:15Z

@bdice makes sense, thanks. I didn't look at the code at all, I just responded to your previous comment without realizing that you had already addressed it. I'll aim to review in the next couple of days (before the weekend).

vyasr

LGTM

Optimize _dfs_convert by exiting early for common types.

167bfed

bdice requested review from a team as code owners December 13, 2020 21:56

bdice requested review from kidrahahjo and vishav1771 and removed request for a team December 13, 2020 21:56

bdice marked this pull request as draft December 14, 2020 07:24

bdice added 3 commits December 14, 2020 19:21

Use explicit type checks.

5bce008

Be more verbose with keyword arguments.

d33b4be

Also exit early in _convert_to_dict, use "is" instead of "==" where a…

8f142e9

…ppropriate.

Merge branch 'master' into feature/optimize-dfs-convert

d74992e

bdice marked this pull request as ready for review December 15, 2020 02:39

bdice self-assigned this Dec 15, 2020

bdice added the enhancement New feature or request label Dec 15, 2020

Update changelog.

e722eac

Merge branch 'master' into feature/optimize-dfs-convert

8b23240

atravitz mentioned this pull request Dec 16, 2020

Raise DestinationExistsError from EEXIST. #433

Merged

12 tasks

klywang self-requested a review December 16, 2020 19:56

Merge branch 'master' into feature/optimize-dfs-convert

43605a9

bdice mentioned this pull request Dec 17, 2020

Discussion: Performance improvements and aggregation glotzerlab/signac-flow#401

Closed

csadorf approved these changes Dec 17, 2020

View reviewed changes

vyasr approved these changes Dec 17, 2020

View reviewed changes

vyasr merged commit 313e5e2 into master Dec 17, 2020

vyasr deleted the feature/optimize-dfs-convert branch December 17, 2020 23:19

bdice added this to the v1.5.1 milestone Dec 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize _dfs_convert by exiting early for common types. #429

Optimize _dfs_convert by exiting early for common types. #429

bdice commented Dec 13, 2020 •

edited

Loading

bdice commented Dec 14, 2020

codecov bot commented Dec 15, 2020 •

edited

Loading

vyasr commented Dec 16, 2020 •

edited

Loading

bdice commented Dec 16, 2020 •

edited

Loading

vyasr commented Dec 16, 2020

vyasr left a comment

Optimize _dfs_convert by exiting early for common types. #429

Optimize _dfs_convert by exiting early for common types. #429

Conversation

bdice commented Dec 13, 2020 • edited Loading

Description

Motivation and Context

Benchmark:

Types of Changes

Checklist:

bdice commented Dec 14, 2020

codecov bot commented Dec 15, 2020 • edited Loading

Codecov Report

vyasr commented Dec 16, 2020 • edited Loading

bdice commented Dec 16, 2020 • edited Loading

vyasr commented Dec 16, 2020

vyasr left a comment

Choose a reason for hiding this comment

bdice commented Dec 13, 2020 •

edited

Loading

codecov bot commented Dec 15, 2020 •

edited

Loading

vyasr commented Dec 16, 2020 •

edited

Loading

bdice commented Dec 16, 2020 •

edited

Loading