Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify internal indexing #652

Merged
merged 7 commits into from
Jan 4, 2022
Merged

Conversation

bdice
Copy link
Member

@bdice bdice commented Dec 5, 2021

Description

This PR simplifies the internal indexing behavior by eliminating some functions (Project._index, Project._sp_index) and internal caches (Project._index_cache) with behavior that is nearly-equivalent to other functions (Project._build_index) and internal caches (Project._sp_cache). See comments below. I'm opening this for review but need to make a final pass for changelog, etc.

Motivation and Context

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change1

1The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

@bdice bdice added this to the v2.0.0 milestone Dec 5, 2021
@codecov
Copy link

codecov bot commented Dec 5, 2021

Codecov Report

Merging #652 (9457215) into next (b29e048) will decrease coverage by 0.06%.
The diff coverage is 81.81%.

Impacted file tree graph

@@            Coverage Diff             @@
##             next     #652      +/-   ##
==========================================
- Coverage   85.62%   85.55%   -0.07%     
==========================================
  Files          51       51              
  Lines        5023     4999      -24     
  Branches     1005     1000       -5     
==========================================
- Hits         4301     4277      -24     
  Misses        519      519              
  Partials      203      203              
Impacted Files Coverage Δ
signac/contrib/project.py 89.00% <81.81%> (-0.43%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b29e048...9457215. Read the comment docs.

@bdice bdice force-pushed the feature/simplify-internal-indexing branch from c6edf74 to f8952db Compare December 5, 2021 21:51
@bdice
Copy link
Member Author

bdice commented Dec 11, 2021

Interesting observation while I was experimenting with this PR. Compare the following snippets:

  • list(project.find_jobs(...)) (3.24 seconds for commit 7f448c5 with 100k jobs)
  • list((job for job in project.find_jobs(...))) (2.09 seconds for commit 7f448c5 with 100k jobs)

The first one is slower because the JobsCursor.__len__ method is called as an optimization for pre-allocating lists. However, in our case, calling JobsCursor.__len__ and then calling JobsCursor.__iter__ is expensive compared to calling JobsCursor.__iter__ alone. Both __len__ and __iter__ methods call _find_job_ids. Wrapping the JobsCursor returned by project.find_jobs(...) in the generator expression (job for job in cursor) hides the __len__ method while having a pass-through __iter__ implementation.

While there is a way to provide a length hint to CPython for iterables without a __len__, there is no corresponding way to disable this optimization when calling __len__ is more expensive than the benefit of pre-allocation. It is sensible for JobsCursor to have a length defined, so I don't recommend any changes here, but wanted to note this peculiarity.

@bdice
Copy link
Member Author

bdice commented Dec 11, 2021

Benchmarks look fine between fbb86df (current head of next) and 7f448c5. I saw a tiny bit of regression (slower by maybe 10%) but I couldn't definitively identify what piece got slower. However, I found an optimization for Collection in 7f448c5 while profiling that made up the difference in performance.

   before           after         ratio
 [fbb86df5]       [7f448c5d]
 <next>           <feature/simplify-internal-indexing>
     83.9±3μs       82.6±0.8μs     0.98  benchmarks.ProjectBench.time_determine_len(100, 10, 0, 100, 0)
      861±8μs         855±10μs     0.99  benchmarks.ProjectBench.time_determine_len(1000, 10, 0, 100, 0)
  5.82±0.09ms      5.73±0.05ms     0.99  benchmarks.ProjectBench.time_iterate(100, 10, 0, 100, 0)
   56.7±0.7ms       57.5±0.9ms     1.01  benchmarks.ProjectBench.time_iterate(1000, 10, 0, 100, 0)
   36.9±0.4ms       36.9±0.6ms     1.00  benchmarks.ProjectBench.time_iterate_load_sp(100, 10, 0, 100, 0)
      365±9ms          361±6ms     0.99  benchmarks.ProjectBench.time_iterate_load_sp(1000, 10, 0, 100, 0)
     587±20μs          576±4μs     0.98  benchmarks.ProjectBench.time_iterate_single_pass(100, 10, 0, 100, 0)
  6.16±0.08ms      6.14±0.06ms     1.00  benchmarks.ProjectBench.time_iterate_single_pass(1000, 10, 0, 100, 0)
     373±10μs          385±6μs     1.03  benchmarks.ProjectRandomJobBench.time_search_lean_filter(100, 10, 0, 100, 0)
   6.33±0.2ms       6.37±0.3ms     1.01  benchmarks.ProjectRandomJobBench.time_search_lean_filter(1000, 10, 0, 100, 0)
  1.60±0.03ms      1.62±0.02ms     1.01  benchmarks.ProjectRandomJobBench.time_search_rich_filter(100, 10, 0, 100, 0)
   27.1±0.6ms         26.3±1ms     0.97  benchmarks.ProjectRandomJobBench.time_search_rich_filter(1000, 10, 0, 100, 0)
   27.2±0.4μs       26.5±0.2μs     0.97  benchmarks.ProjectRandomJobBench.time_select_by_id(100, 10, 0, 100, 0)
     25.9±1μs       26.2±0.5μs     1.01  benchmarks.ProjectRandomJobBench.time_select_by_id(1000, 10, 0, 100, 0)

@bdice bdice marked this pull request as ready for review December 11, 2021 23:17
@bdice bdice requested review from a team as code owners December 11, 2021 23:17
@bdice bdice requested review from b-butler and klywang and removed request for a team December 11, 2021 23:17
@bdice bdice requested a review from a team as a code owner December 25, 2021 22:43
@vyasr
Copy link
Contributor

vyasr commented Dec 27, 2021

@bdice can you rebase this on next/merge in next again? I'm not sure if it's just because of my most recent rebase of next a couple of days ago or if you originally branched this off earlier, but the diff is very noisy and it's hard to find the relevant changes. I can review once you do. Apologies if the conflicts are caused by my last rebase.

@bdice bdice force-pushed the feature/simplify-internal-indexing branch from 7f448c5 to 13f2a8f Compare December 28, 2021 22:09
@bdice
Copy link
Member Author

bdice commented Dec 28, 2021

@vyasr Just rebased. Take a peek now and see what you think.

@bdice
Copy link
Member Author

bdice commented Dec 28, 2021

For posterity: another broad push for optimization would be to reduce the amount of code we use in the Collection class when calling Project.find_jobs See:

return list(Collection(index, _trust=True)._find(filter))

We copy a large amount of data used from the Project's internal caches during the construction of a Collection and calling its find method. At the very least, we never use the Collection's ability to interact with data on disk or its ability to automatically generate ids (primary keys) for new records in the context of a signac Project, so we could eliminate some logic there in a cut-down class.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall these changes look good, the removals in the Project class all seem sensible to me. One important question about the Collection change, let me know how confident you feel about it and I can re-review in more depth accordingly.

signac/contrib/project.py Outdated Show resolved Hide resolved
signac/contrib/collection.py Outdated Show resolved Hide resolved
Copy link
Member

@b-butler b-butler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look clean and good.

@bdice bdice added the refactor Code refactoring label Jan 3, 2022
@bdice bdice mentioned this pull request Jan 4, 2022
5 tasks
@bdice bdice requested a review from vyasr January 4, 2022 02:40
@vyasr vyasr merged commit 9f82e33 into next Jan 4, 2022
@vyasr vyasr deleted the feature/simplify-internal-indexing branch January 4, 2022 18:26
vyasr pushed a commit that referenced this pull request Jan 4, 2022
* Simplify _index function.

* Refactor _build_index for clarity/speed.

* Remove Project._sp_index and Project._index_cache.

* Remove Project._index.

* Avoid unnecessary data copying when updating Collection indexes.

* Revert optimizations to Collection.

* Inline include_job_document.
vyasr pushed a commit that referenced this pull request Jan 27, 2022
* Simplify _index function.

* Refactor _build_index for clarity/speed.

* Remove Project._sp_index and Project._index_cache.

* Remove Project._index.

* Avoid unnecessary data copying when updating Collection indexes.

* Revert optimizations to Collection.

* Inline include_job_document.
vyasr pushed a commit that referenced this pull request Feb 4, 2022
* Simplify _index function.

* Refactor _build_index for clarity/speed.

* Remove Project._sp_index and Project._index_cache.

* Remove Project._index.

* Avoid unnecessary data copying when updating Collection indexes.

* Revert optimizations to Collection.

* Inline include_job_document.
vyasr pushed a commit that referenced this pull request Feb 21, 2022
* Simplify _index function.

* Refactor _build_index for clarity/speed.

* Remove Project._sp_index and Project._index_cache.

* Remove Project._index.

* Avoid unnecessary data copying when updating Collection indexes.

* Revert optimizations to Collection.

* Inline include_job_document.
vyasr pushed a commit that referenced this pull request Feb 21, 2022
* Simplify _index function.

* Refactor _build_index for clarity/speed.

* Remove Project._sp_index and Project._index_cache.

* Remove Project._index.

* Avoid unnecessary data copying when updating Collection indexes.

* Revert optimizations to Collection.

* Inline include_job_document.
vyasr pushed a commit that referenced this pull request Mar 14, 2022
* Simplify _index function.

* Refactor _build_index for clarity/speed.

* Remove Project._sp_index and Project._index_cache.

* Remove Project._index.

* Avoid unnecessary data copying when updating Collection indexes.

* Revert optimizations to Collection.

* Inline include_job_document.
vyasr pushed a commit that referenced this pull request Apr 19, 2022
* Simplify _index function.

* Refactor _build_index for clarity/speed.

* Remove Project._sp_index and Project._index_cache.

* Remove Project._index.

* Avoid unnecessary data copying when updating Collection indexes.

* Revert optimizations to Collection.

* Inline include_job_document.
vyasr pushed a commit that referenced this pull request Apr 21, 2022
* Simplify _index function.

* Refactor _build_index for clarity/speed.

* Remove Project._sp_index and Project._index_cache.

* Remove Project._index.

* Avoid unnecessary data copying when updating Collection indexes.

* Revert optimizations to Collection.

* Inline include_job_document.
vyasr pushed a commit that referenced this pull request May 2, 2022
* Simplify _index function.

* Refactor _build_index for clarity/speed.

* Remove Project._sp_index and Project._index_cache.

* Remove Project._index.

* Avoid unnecessary data copying when updating Collection indexes.

* Revert optimizations to Collection.

* Inline include_job_document.
bdice added a commit that referenced this pull request Jun 14, 2022
* Simplify _index function.

* Refactor _build_index for clarity/speed.

* Remove Project._sp_index and Project._index_cache.

* Remove Project._index.

* Avoid unnecessary data copying when updating Collection indexes.

* Revert optimizations to Collection.

* Inline include_job_document.
bdice added a commit that referenced this pull request Aug 1, 2022
* Simplify _index function.

* Refactor _build_index for clarity/speed.

* Remove Project._sp_index and Project._index_cache.

* Remove Project._index.

* Avoid unnecessary data copying when updating Collection indexes.

* Revert optimizations to Collection.

* Inline include_job_document.
bdice added a commit that referenced this pull request Oct 7, 2022
* Simplify _index function.

* Refactor _build_index for clarity/speed.

* Remove Project._sp_index and Project._index_cache.

* Remove Project._index.

* Avoid unnecessary data copying when updating Collection indexes.

* Revert optimizations to Collection.

* Inline include_job_document.
bdice added a commit that referenced this pull request Oct 27, 2022
* Simplify _index function.

* Refactor _build_index for clarity/speed.

* Remove Project._sp_index and Project._index_cache.

* Remove Project._index.

* Avoid unnecessary data copying when updating Collection indexes.

* Revert optimizations to Collection.

* Inline include_job_document.
vyasr pushed a commit that referenced this pull request Oct 30, 2022
* Simplify _index function.

* Refactor _build_index for clarity/speed.

* Remove Project._sp_index and Project._index_cache.

* Remove Project._index.

* Avoid unnecessary data copying when updating Collection indexes.

* Revert optimizations to Collection.

* Inline include_job_document.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactor Code refactoring
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants