Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize _dfs_convert by exiting early for common types. #429

Merged
merged 8 commits into from
Dec 17, 2020

Conversation

bdice
Copy link
Member

@bdice bdice commented Dec 13, 2020

Description

The internal function _dfs_convert recursively converts containers into their synced equivalents. That is, when reading a state point or document, the _dfs_convert function will change a dict into a synced dictionary.

Observation: Most signac jobs have state points and documents that are relatively flat. That is, most values in a state point or document are common non-container types (string, float, int, bool) and only a few are (dict, list).

I noticed that the code path taken for non-container types must exhaustively test whether the value is a Mapping, tuple, list, or NumPy type. Only at the very end it returns the raw value.

Matching the "fast path" to the expected distribution of data is a micro-optimization, but it pays off. Benchmarks are below.

Motivation and Context

Speeds up all operations in signac that involved synced structures (e.g. reading data from a state point or document).

Benchmark:

I measured the total time for checking status, running, and submitting a sample workflow of 1000 jobs in signac flow with 3 operations with pre/post conditions. The state points and documents had no nested dicts/lists.

Before: 86.0 seconds total, 7.749 seconds spent in _dfs_convert.
After: 78.3 seconds total, 1.378 seconds spent in _dfs_convert.

This suggests a speedup of ~5x in that function, but it will depend heavily on the data sizes and types used.

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change1

1The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

@bdice bdice requested review from a team as code owners December 13, 2020 21:56
@bdice bdice requested review from kidrahahjo and vishav1771 and removed request for a team December 13, 2020 21:56
@bdice bdice marked this pull request as draft December 14, 2020 07:24
@bdice
Copy link
Member Author

bdice commented Dec 14, 2020

Note to self: this PR needs a few updates, along with tests/changelog. Tests are failing because of #431, which should be solved in #433.

  • NumPy float64s are registered as subclasses of float. Need to check if that is a problem or not.
  • Should the early-exit check use isinstance(x, float) or type(x) == float? Even though isinstance is typically preferred, it might not be the right choice here, depending on the NumPy situation previously mentioned and/or performance considerations.

@codecov
Copy link

codecov bot commented Dec 15, 2020

Codecov Report

Merging #429 (43605a9) into master (7db7d57) will increase coverage by 0.00%.
The diff coverage is 90.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #429   +/-   ##
=======================================
  Coverage   76.59%   76.59%           
=======================================
  Files          45       45           
  Lines        7079     7081    +2     
=======================================
+ Hits         5422     5424    +2     
  Misses       1657     1657           
Impacted Files Coverage Δ
signac/core/synceddict.py 96.01% <90.00%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7db7d57...43605a9. Read the comment docs.

@bdice bdice marked this pull request as ready for review December 15, 2020 02:39
@bdice bdice self-assigned this Dec 15, 2020
@bdice bdice added the enhancement New feature or request label Dec 15, 2020
@vyasr
Copy link
Contributor

vyasr commented Dec 16, 2020

Note to self: this PR needs a few updates, along with tests/changelog. Tests are failing because of #431, which should be solved in #433.

* NumPy float64s are registered as subclasses of `float`. Need to check if that is a problem or not.

* Should the early-exit check use `isinstance(x, float)` or `type(x) == float`? Even though `isinstance` is typically preferred, it might not be the right choice here, depending on the NumPy situation previously mentioned and/or performance considerations.

The issue of numpy float64s also being floats is definitely something to look into. If that's a problem, then you should use type(x) is float (rather than ==). If it's not, then you'll probably want to stick with isinstance, you probably won't get performance benefits from directly checking types and if anything it might slow the code down. isinstance being slow is generally only a problem for abcs, which implement their instance checks in pure Python.

@bdice
Copy link
Member Author

bdice commented Dec 16, 2020

@vyasr I agree with your comments in general. A review of the specific implementation in this PR would be helpful. The short answer on NumPy float64 was this:

  • Reading data from the disk will always be type float (because that's how the JSON is parsed)
  • Using a method like "reset state point" will use whatever types are in the dict provided by the user and it's the job of _dfs_convert to convert them to desired (synced) types. I used type(x) in (bool, float, int, type(None), str) because I didn't want to capture the numpy.float64 class (I wanted to pass it through the "normal" path that will call x.item() to extract the Python float)

@klywang klywang self-requested a review December 16, 2020 19:56
@vyasr
Copy link
Contributor

vyasr commented Dec 16, 2020

@bdice makes sense, thanks. I didn't look at the code at all, I just responded to your previous comment without realizing that you had already addressed it. I'll aim to review in the next couple of days (before the weekend).

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vyasr vyasr merged commit 313e5e2 into master Dec 17, 2020
@vyasr vyasr deleted the feature/optimize-dfs-convert branch December 17, 2020 23:19
@bdice bdice added this to the v1.5.1 milestone Dec 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants