-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve readers by parallelizing I/O and compute operations #5401
base: main
Are you sure you want to change the base?
Conversation
This removes the read from waiting on all I/O operations and instead moves the I/O task to be owned by the datablock itself. If the I/O threadpool task is valid, we block on data access. This lets I/O and compute be interleaved by only blocking on data when its ready to be processed and allows for better background data loading.
This allows for copying the task/future an enabled multiple threads to check the status of the task in a thread-safe manner.
…checking. While the ThreadPool::SharedTask is designed to be used by multiple threads, its designed for copying. The data structure itself is not thread safe. A recursive mutext is needed because some functions like load_chunk_data call back into filtered_data() and would deadlock. This could be handled by also release the locking in load_chunk_data(), but a recursive_mutex is used for better safety against deadlocks.
… forward declaration issues currently
This is needed because we need to access the data buffer from inside the unfiltering task to unfilter into. We can't block on unfiltering being done from inside the unfiltering task so we need different accessors which let us bypass the check on if the unfiltering task is completed.
This is needed because zip_coordinates is called from the unfilter task itself.
73a2245
to
88c0ecb
Compare
b1d8be7
to
d391375
Compare
ed9b334
to
446700a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a couple comments/questions, but overall this looks good to me!
@@ -653,7 +653,7 @@ TEST_CASE_METHOD( | |||
|
|||
// specific relationship for failure not known, but these values | |||
// will result in failure with data being written. | |||
total_budget_ = "15000"; | |||
total_budget_ = "40000"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty big bump. It's just due to the changes in the various sizeof
s right? Not much that can be done about it I suppose. Can you quantify the change in memory usage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not expecting this to affect memory usage overall as the memory budget we put constraints in theory the peak usage to that budget at any time. However It's a 3x in the size of the result tile, so it's very much affecting the number of result tiles we can load at the same time in memory given that constrained budget, so the number of iterations we'll need to do to load everything we need from disk, so performance :(
I need to understand why this bump is so big a bit better, come back with some good explanation and see if we can reduce.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, I reverted changes in a couple of tests in this file that were marked as [.regression]
so not run, as they were meant to fail even before my changes. This particular one was one of them.
In general, though, I still needed to adapt the budgets in a couple of tests. My analysis showed that the different new member variables we needed to add as part of this work into TileBase
, Tile
, and ResultTile
classes made the ResultTile
class to grow almost double in size. In particular, sizeof(GlobalOrderResultTile<BitmapType>)
, a derived class of ResultTile
, grew from 1576
to 3216
. A big part of this increase was due to the recursive_mutex
we added in Tile
class (64 bytes in my Mac).. ResultTile
stores vectors of TileTuples
of Tiles
so that easily adds up.
Another side-effect of the members we added is that they might have a different size on different architectures and this lead to an inconsistent number of internal loops in our test code that was meant to count them, so I had to disable that test. That test though showed something interesting on my Mac, which is that loop_num
increased to 20 from 9 in that case. This hints that we might end up looping more internally in environments with restricted memory budget (like the TileDB Server) because of our increase in size of ResultTile
.
I couldn't think of an obvious way to reduce the size of ResultTile
in the current design of this PR, so we definitely need to be vigilant for the performance impact on different readers and queries of that increase.
@rroelke @Shelnutt2 for awareness
// taking the lock on tile_queue_mutex_. This is to avoid a deadlock where a | ||
// lock is held forever while waiting for a result to be available, while | ||
// the next scheduled task is deadlocking on that lock | ||
rc.tile_->wait_all_coords(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel a bit like this is the wrong place to call this, it will call it once per cell but as I understand the changes it should be fine to call it just once per tile.
Hence you can add this after // Find a cell in the current result tile
and then also once after GlobalOrderResultCoords rc(&*(rt_it[f]), cell_idx);
in the // For all fragments
loop of merge_result_cell_slabs
.
I don't quite follow you about the deadlock situation - this seems to me like a correctness question. We can't put result tiles into the priority queue until the data is loaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't put result tiles into the priority queue until the data is loaded.
Why not? I thought the whole point of this work is to wait for a piece of data to be ready (either loaded or unfiltered) exactly at the point when we need to use it so that we are not busy waiting as much as possible and we can make progress while reading/unfiltering asynchronously.
I don't quite follow you about the deadlock situation
I am sorry the comment wasn't clear enough. Let me try to explain the deadlock scenario I've hit spuriously in testing which is due to the current design of our ThreadPool:
- Overall this PR introduces 2 kinds of async tasks: I/O tasks and unfiltering tasks. An unfiltering task will block waiting on the I/O task to be ready, so that it can start unfiltering the
result_tile
that was read. Any processing that follows, such asmerge_result_cell_slabs
will then wait for the unfiltering task to be ready at the point in time that it'll need to use the data. - Our current rather naive
ThreadPool
uses a sort of stacking of tasks algorithm, in the sense that when a task is scheduled on a thread but not yet ready, it will try to execute another one from the pool of available tasks on the same thread, so that we don't end up in a deadlock situation where all threads are blocked waiting. When that other task is done, it will check if the previous task is now ready to continue executing.
Now the deadlock scenario I was experiencing with this mutex is the following:
- Say we have 3 threads in our pool.
Thread 1
is executing the main reader code.Thread 2
is running the async I/O task for ResultTile N.Thread 3
is running the async unfiltering task for ResultTile N, which is blocked waiting for the corresponding I/O task to be ready so that it can get the data. In the meantimeThread 3
execution reaches theparallel_for
inmerge_result_cell_slabs
. There, a few async tasks are created that need to acquire the lock ontile_queue_mutex_
and then access the unfiltered data of -among others - ResultTile N. - Now it's possible that the async task that acquires the lock, wants to access ResultTile N and gets scheduled on
Thread 3
that is blocked. This is our deadlock. The reason is that whenThread 1
will be done, unfiltering onThread 3
shoud be unblocked, but this will not be possible because it will be stacked under that async task that took the mutex, which will be actually blocked waiting for that unfiltering task to be done :(
IMO this is a design flow of our ThreadPool and it's not the only deadlock situation I've had to cope with in this work. I had to make compromises in terms of performance elsewhere as well, for example when unfiltering a tile we used to do that in parallel chunks, but a similar deadlock situation would arise in that case to, exclusively due to that stacking of tasks in the threadpool that has unexpected side-effects as this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To destroy any ambiguity in my initial comment - I definitely agree that it is correct to call wait_all_coords
somewhere in the vicinity of this function. My comment is really about the granularity. The task/future based approach does not give us feedback about when the tile is partially filled in, it either is not safely filled in or it is 100% filled in. Hence we only actually need to wait once per tile, not once per cell per tile, and I think the code should reflect that.
We can't put result tiles into the priority queue until the data is loaded.
Why not? I thought the whole point of this work is to wait for a piece of data to be ready (either loaded or unfiltered) exactly at the point when we need to use it so that we are not busy waiting as much as possible and we can make progress while reading/unfiltering asynchronously.
What I should have written instead of "We can't we result tiles into the priority queue until the data is loaded" was "This change strikes me as one for correctness, not one to do with deadlock".
The change in call site for this should be aligned with the goal of this PR. Both forms will wait for the first result tile from each fragment to be loaded before starting anything, and then subsequent tiles in each fragment will be awaited only once we reach them in the queue.
As for the deadlock itself -
it will try to execute another one from the pool of available tasks on the same thread
Oh yeah, I did see that in a stack trace I ran into earlier today. It just calls the function pointer on the same stack, doesn't it.
The scenario you describe looks like it would be pretty commonplace.
Proper coroutines would be great, wouldn't they!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, with coroutines we could suspend and resume tasks so problem solved 😍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for the review, it was very useful. I think I addressed most of the comments but I'll have to follow up on one of them after taking a deeper look.
@@ -653,7 +653,7 @@ TEST_CASE_METHOD( | |||
|
|||
// specific relationship for failure not known, but these values | |||
// will result in failure with data being written. | |||
total_budget_ = "15000"; | |||
total_budget_ = "40000"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not expecting this to affect memory usage overall as the memory budget we put constraints in theory the peak usage to that budget at any time. However It's a 3x in the size of the result tile, so it's very much affecting the number of result tiles we can load at the same time in memory given that constrained budget, so the number of iterations we'll need to do to load everything we need from disk, so performance :(
I need to understand why this bump is so big a bit better, come back with some good explanation and see if we can reduce.
// taking the lock on tile_queue_mutex_. This is to avoid a deadlock where a | ||
// lock is held forever while waiting for a result to be available, while | ||
// the next scheduled task is deadlocking on that lock | ||
rc.tile_->wait_all_coords(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't put result tiles into the priority queue until the data is loaded.
Why not? I thought the whole point of this work is to wait for a piece of data to be ready (either loaded or unfiltered) exactly at the point when we need to use it so that we are not busy waiting as much as possible and we can make progress while reading/unfiltering asynchronously.
I don't quite follow you about the deadlock situation
I am sorry the comment wasn't clear enough. Let me try to explain the deadlock scenario I've hit spuriously in testing which is due to the current design of our ThreadPool:
- Overall this PR introduces 2 kinds of async tasks: I/O tasks and unfiltering tasks. An unfiltering task will block waiting on the I/O task to be ready, so that it can start unfiltering the
result_tile
that was read. Any processing that follows, such asmerge_result_cell_slabs
will then wait for the unfiltering task to be ready at the point in time that it'll need to use the data. - Our current rather naive
ThreadPool
uses a sort of stacking of tasks algorithm, in the sense that when a task is scheduled on a thread but not yet ready, it will try to execute another one from the pool of available tasks on the same thread, so that we don't end up in a deadlock situation where all threads are blocked waiting. When that other task is done, it will check if the previous task is now ready to continue executing.
Now the deadlock scenario I was experiencing with this mutex is the following:
- Say we have 3 threads in our pool.
Thread 1
is executing the main reader code.Thread 2
is running the async I/O task for ResultTile N.Thread 3
is running the async unfiltering task for ResultTile N, which is blocked waiting for the corresponding I/O task to be ready so that it can get the data. In the meantimeThread 3
execution reaches theparallel_for
inmerge_result_cell_slabs
. There, a few async tasks are created that need to acquire the lock ontile_queue_mutex_
and then access the unfiltered data of -among others - ResultTile N. - Now it's possible that the async task that acquires the lock, wants to access ResultTile N and gets scheduled on
Thread 3
that is blocked. This is our deadlock. The reason is that whenThread 1
will be done, unfiltering onThread 3
shoud be unblocked, but this will not be possible because it will be stacked under that async task that took the mutex, which will be actually blocked waiting for that unfiltering task to be done :(
IMO this is a design flow of our ThreadPool and it's not the only deadlock situation I've had to cope with in this work. I had to make compromises in terms of performance elsewhere as well, for example when unfiltering a tile we used to do that in parallel chunks, but a similar deadlock situation would arise in that case to, exclusively due to that stacking of tasks in the threadpool that has unexpected side-effects as this one.
c5899b7
to
260fc22
Compare
Thanks for the responses, I will approve once we have a little more info on the memory budget change. |
b98c536
to
eacdc79
Compare
a81fa30
to
a2fb4e1
Compare
c07ec20
to
c8b6de6
Compare
This long standing PR has suffered from 2 pain points:
I have therefore decided to finally split this work in 2 parts:
This PR implements part1, CI is passing reliably and benchmarks show no regression, but no significant speedup either. I think it can be safely merged. |
Today when a reader issues the I/O request to VFS, we block waiting for all I/O to finish before moving to unfiltering. We then block again waiting for unfiltering to be done for all tiles and then continue to processing the results.
This PR is part1 of the effort to minimize wait all points in reader code : it removes the need to wait for all I/O to be done, and uses async tasks to signal when a tile is done reading so that it can proceed to unfiltering.
Part2 will come in a future PR for using async tasks for unfiltering as well in order to remove then need to wait for a tile is done unfiltering so that it can proceed to result processing before copying to the user buffers.
TYPE: IMPROVEMENT
DESC: Improve readers by parallelizing I/O and compute operations