Dense reader: read next batch of tiles as others get processed. #3965
Conversation
This pull request has been linked to Shortcut Story #25396: Read tiles as others get unfiltered.
```cpp
clear_tiles(name, result_tiles);
compute_task = storage_manager_->compute_tp()->execute(
    [&,
     filtered_data = std::move(filtered_data),
```
filtered_data doesn't appear to actually be used in here, maybe irrelevant.
It is used indirectly. The memory that the filtered_data object contains needs to be kept alive until unfiltering is completed, which happens at the end of this task.
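For illustration, a minimal, hypothetical sketch of that lifetime point (not the actual TileDB code; `std::async` stands in for the compute thread pool, and `FilteredData` is a placeholder type):

```cpp
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for the filtered, not-yet-unfiltered tile bytes.
using FilteredData = std::vector<char>;

std::future<void> launch_unfilter(FilteredData filtered_data) {
  // Moving the buffer into the capture ties its lifetime to the task:
  // even if the lambda body never names filtered_data, the allocation
  // stays valid until the task finishes and the closure is destroyed.
  return std::async(
      std::launch::async, [filtered_data = std::move(filtered_data)]() {
        // ... unfilter tiles whose payloads point into filtered_data ...
      });
}
```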
Editorializing, there is an important general principle here that can be recognized independent of this (or any) particular application, namely overlapping I/O with computation. As long as the same processor (or a single thread) isn't doing both I/O and compute, it is generally a win and is especially useful for hiding latency. The general pattern is:
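A minimal sketch, with `read_next_batch` and `process` as hypothetical helpers and `std::async` standing in for a thread-pool task:

```cpp
#include <future>
#include <vector>

using Batch = std::vector<char>;

Batch read_next_batch();       // hypothetical I/O step
void process(const Batch& b);  // hypothetical compute step

void overlap_once(const Batch& current) {
  // Start the next read asynchronously...
  auto io = std::async(std::launch::async, read_next_batch);
  // ...compute on data already in memory while the read is in flight...
  process(current);
  // ...and only join once the compute is done, hiding the I/O latency.
  Batch next = io.get();
  process(next);
}
```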
If multiple I/O calls are required, one can do:
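Again as a hedged sketch with the same kind of hypothetical helpers: prime the pipeline with the first read, then alternate between joining batch i and issuing the read for batch i+1:

```cpp
#include <cstddef>
#include <future>
#include <vector>

using Batch = std::vector<char>;

Batch read_batch(size_t i);    // hypothetical I/O step
void process(const Batch& b);  // hypothetical compute step

void pipeline(size_t num_batches) {
  if (num_batches == 0)
    return;
  // Prime the pipeline with the first read.
  auto io = std::async(std::launch::async, read_batch, size_t{0});
  for (size_t i = 0; i < num_batches; i++) {
    Batch batch = io.get();  // wait for batch i to land
    if (i + 1 < num_batches) {
      // Issue the read for batch i+1 before computing on batch i, so the
      // next read overlaps with this batch's processing.
      io = std::async(std::launch::async, read_batch, i + 1);
    }
    process(batch);
  }
}
```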
The advantage of following this kind of pattern is that the concurrency between I/O and computation is localized and fairly easy to follow -- structured, in other words. In the PR, it seems like the compute is being made asynchronous and also being passed around, so it isn't clear where it is completing or where it is getting launched again. So we should try to figure out a way of making it more structured. (As an aside, the I/O is usually the thing that is made asynchronous, because most operating systems have I/O subsystems that support asynchronous operations. In our situation, the I/O is much more complicated than just doing an OS call.) (As an aside to the aside -- we should audit the code to find other overlap opportunities like this, but also develop a formula to realize the overlap in a more structured way.)
This change moves tile processing to another task so that the read can continue until another read operation is encountered. The reader then does the read, after which it waits for the running process operation to complete before kicking off the new one. This makes it so that most reads after the first one come for free. For large queries, rough benchmarking shows that we reduce query time by 30%.

TYPE: IMPROVEMENT
DESC: Dense reader: read next batch of tiles as others get processed.
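Roughly, the loop the description implies -- a hedged sketch in which only `compute_task` mirrors the member seen in the diff above; `more_batches`, `read_tiles`, and `unfilter_and_process` are placeholders, and `std::async` stands in for `compute_tp()->execute(...)`:

```cpp
#include <future>
#include <utility>
#include <vector>

using FilteredData = std::vector<char>;

bool more_batches();                       // placeholder loop condition
FilteredData read_tiles();                 // placeholder read step
void unfilter_and_process(FilteredData&);  // placeholder compute step

void read_loop() {
  std::future<void> compute_task;
  while (more_batches()) {
    // Do the next read while the previous batch is still being processed.
    FilteredData filtered_data = read_tiles();
    // Wait for the running process operation before kicking off a new one.
    if (compute_task.valid())
      compute_task.wait();
    compute_task = std::async(
        std::launch::async,
        [filtered_data = std::move(filtered_data)]() mutable {
          unfilter_and_process(filtered_data);
        });
  }
  if (compute_task.valid())
    compute_task.wait();  // drain the last in-flight batch
}
```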
force-pushed from f346b4b to 76ff3da
All of my immediate concerns were addressed. There are a few things (the pattern for creating and passing around compute_task) that need to be planned and executed more carefully, but they will also need to be part of a larger restructuring. I don't think we need to implement those things for this PR, as that restructuring will be a large task on its own.
```cpp
}

// Process all tiles in parallel.
auto status = parallel_for_2d(
```
Do we need to use `num_range_threads - 1` here to account for the extra compute thread?
The range threads are not actually the number of threads in the thread pool. They are only used when a read consists of a few large tiles; at that point, we will split the work for a tile across threads. Also, by the time the work of the parallel for in the compute task gets processed, the compute task should already mostly be in a waiting/yielding state.
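For illustration only, one hypothetical way to split a single tile's cells across `num_range_threads` -- the kind of partitioning described above, not the actual TileDB code:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

// Thread t of num_range_threads handles the half-open cell range
// [start, end) of a tile with cell_num cells; chunk sizes differ by at
// most one cell because the remainder is spread over the first threads.
std::pair<uint64_t, uint64_t> cells_for_thread(
    uint64_t cell_num, uint64_t num_range_threads, uint64_t t) {
  const uint64_t chunk = cell_num / num_range_threads;
  const uint64_t rem = cell_num % num_range_threads;
  const uint64_t start = t * chunk + std::min(t, rem);
  const uint64_t end = start + chunk + (t < rem ? 1 : 0);
  return {start, end};
}
```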