Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve readers by parallelizing I/O and compute operations #5401

Open
wants to merge 51 commits into
base: main
Choose a base branch
from

Conversation

ypatia
Copy link
Member

@ypatia ypatia commented Dec 6, 2024

Today when a reader issues the I/O request to VFS, we block waiting for all I/O to finish before moving to unfiltering. We then block again waiting for unfiltering to be done for all tiles and then continue to processing the results.

This PR is part1 of the effort to minimize wait all points in reader code : it removes the need to wait for all I/O to be done, and uses async tasks to signal when a tile is done reading so that it can proceed to unfiltering.

Part2 will come in a future PR for using async tasks for unfiltering as well in order to remove then need to wait for a tile is done unfiltering so that it can proceed to result processing before copying to the user buffers.


TYPE: IMPROVEMENT
DESC: Improve readers by parallelizing I/O and compute operations

ypatia and others added 30 commits November 29, 2024 15:28
This removes the read from waiting on all I/O operations and instead
moves the I/O task to be owned by the datablock itself. If the I/O
threadpool task is valid, we block on data access. This lets I/O and
compute be interleaved by only blocking on data when its ready to be
processed and allows for better background data loading.
This allows for copying the task/future an enabled multiple threads to
check the status of the task in a thread-safe manner.
…checking.

While the ThreadPool::SharedTask is designed to be used by multiple
threads, its designed for copying. The data structure itself is not
thread safe.

A recursive mutext is needed because some functions like load_chunk_data
call back into filtered_data() and would deadlock. This could be handled by
also release the locking in load_chunk_data(), but a recursive_mutex is
used for better safety against deadlocks.
This is needed because we need to access the data buffer from inside the
unfiltering task to unfilter into. We can't block on unfiltering being
done from inside the unfiltering task so we need different accessors
which let us bypass the check on if the unfiltering task is completed.
This is needed because zip_coordinates is called from the unfilter task
itself.
@ypatia ypatia marked this pull request as draft December 9, 2024 10:06
@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch from 73a2245 to 88c0ecb Compare December 18, 2024 09:53
@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch from b1d8be7 to d391375 Compare December 18, 2024 15:30
@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch from ed9b334 to 446700a Compare December 19, 2024 13:11
@ypatia ypatia changed the title WIP: Improve readers by parallelizing I/O and compute operations Improve readers by parallelizing I/O and compute operations Dec 20, 2024
@ypatia ypatia marked this pull request as ready for review December 20, 2024 07:59
@ypatia ypatia requested review from Shelnutt2 and rroelke December 20, 2024 08:04
Copy link
Contributor

@rroelke rroelke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a couple comments/questions, but overall this looks good to me!

tiledb/sm/query/readers/filtered_data.h Outdated Show resolved Hide resolved
tiledb/sm/query/readers/filtered_data.h Show resolved Hide resolved
tiledb/sm/query/readers/filtered_data.h Show resolved Hide resolved
tiledb/sm/query/readers/reader_base.h Outdated Show resolved Hide resolved
tiledb/sm/query/readers/result_tile.h Outdated Show resolved Hide resolved
tiledb/sm/query/readers/result_tile.h Outdated Show resolved Hide resolved
tiledb/sm/query/readers/result_tile.h Outdated Show resolved Hide resolved
@@ -653,7 +653,7 @@ TEST_CASE_METHOD(

// specific relationship for failure not known, but these values
// will result in failure with data being written.
total_budget_ = "15000";
total_budget_ = "40000";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty big bump. It's just due to the changes in the various sizeofs right? Not much that can be done about it I suppose. Can you quantify the change in memory usage?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not expecting this to affect memory usage overall as the memory budget we put constraints in theory the peak usage to that budget at any time. However It's a 3x in the size of the result tile, so it's very much affecting the number of result tiles we can load at the same time in memory given that constrained budget, so the number of iterations we'll need to do to load everything we need from disk, so performance :(

I need to understand why this bump is so big a bit better, come back with some good explanation and see if we can reduce.

Copy link
Member Author

@ypatia ypatia Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, I reverted changes in a couple of tests in this file that were marked as [.regression] so not run, as they were meant to fail even before my changes. This particular one was one of them.

In general, though, I still needed to adapt the budgets in a couple of tests. My analysis showed that the different new member variables we needed to add as part of this work into TileBase, Tile, and ResultTile classes made the ResultTile class to grow almost double in size. In particular, sizeof(GlobalOrderResultTile<BitmapType>) , a derived class of ResultTile, grew from 1576 to 3216. A big part of this increase was due to the recursive_mutex we added in Tile class (64 bytes in my Mac).. ResultTile stores vectors of TileTuples of Tiles so that easily adds up.

Another side-effect of the members we added is that they might have a different size on different architectures and this lead to an inconsistent number of internal loops in our test code that was meant to count them, so I had to disable that test. That test though showed something interesting on my Mac, which is that loop_num increased to 20 from 9 in that case. This hints that we might end up looping more internally in environments with restricted memory budget (like the TileDB Server) because of our increase in size of ResultTile.

I couldn't think of an obvious way to reduce the size of ResultTile in the current design of this PR, so we definitely need to be vigilant for the performance impact on different readers and queries of that increase.

@rroelke @Shelnutt2 for awareness

// taking the lock on tile_queue_mutex_. This is to avoid a deadlock where a
// lock is held forever while waiting for a result to be available, while
// the next scheduled task is deadlocking on that lock
rc.tile_->wait_all_coords();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel a bit like this is the wrong place to call this, it will call it once per cell but as I understand the changes it should be fine to call it just once per tile.
Hence you can add this after // Find a cell in the current result tile and then also once after GlobalOrderResultCoords rc(&*(rt_it[f]), cell_idx); in the // For all fragments loop of merge_result_cell_slabs.

I don't quite follow you about the deadlock situation - this seems to me like a correctness question. We can't put result tiles into the priority queue until the data is loaded.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't put result tiles into the priority queue until the data is loaded.

Why not? I thought the whole point of this work is to wait for a piece of data to be ready (either loaded or unfiltered) exactly at the point when we need to use it so that we are not busy waiting as much as possible and we can make progress while reading/unfiltering asynchronously.

I don't quite follow you about the deadlock situation

I am sorry the comment wasn't clear enough. Let me try to explain the deadlock scenario I've hit spuriously in testing which is due to the current design of our ThreadPool:

  • Overall this PR introduces 2 kinds of async tasks: I/O tasks and unfiltering tasks. An unfiltering task will block waiting on the I/O task to be ready, so that it can start unfiltering the result_tile that was read. Any processing that follows, such as merge_result_cell_slabs will then wait for the unfiltering task to be ready at the point in time that it'll need to use the data.
  • Our current rather naive ThreadPool uses a sort of stacking of tasks algorithm, in the sense that when a task is scheduled on a thread but not yet ready, it will try to execute another one from the pool of available tasks on the same thread, so that we don't end up in a deadlock situation where all threads are blocked waiting. When that other task is done, it will check if the previous task is now ready to continue executing.

Now the deadlock scenario I was experiencing with this mutex is the following:

  • Say we have 3 threads in our pool. Thread 1 is executing the main reader code. Thread 2 is running the async I/O task for ResultTile N. Thread 3 is running the async unfiltering task for ResultTile N, which is blocked waiting for the corresponding I/O task to be ready so that it can get the data. In the meantime Thread 3 execution reaches the parallel_for in merge_result_cell_slabs. There, a few async tasks are created that need to acquire the lock on tile_queue_mutex_ and then access the unfiltered data of -among others - ResultTile N.
  • Now it's possible that the async task that acquires the lock, wants to access ResultTile N and gets scheduled on Thread 3 that is blocked. This is our deadlock. The reason is that when Thread 1 will be done, unfiltering on Thread 3 shoud be unblocked, but this will not be possible because it will be stacked under that async task that took the mutex, which will be actually blocked waiting for that unfiltering task to be done :(

IMO this is a design flow of our ThreadPool and it's not the only deadlock situation I've had to cope with in this work. I had to make compromises in terms of performance elsewhere as well, for example when unfiltering a tile we used to do that in parallel chunks, but a similar deadlock situation would arise in that case to, exclusively due to that stacking of tasks in the threadpool that has unexpected side-effects as this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To destroy any ambiguity in my initial comment - I definitely agree that it is correct to call wait_all_coords somewhere in the vicinity of this function. My comment is really about the granularity. The task/future based approach does not give us feedback about when the tile is partially filled in, it either is not safely filled in or it is 100% filled in. Hence we only actually need to wait once per tile, not once per cell per tile, and I think the code should reflect that.

We can't put result tiles into the priority queue until the data is loaded.

Why not? I thought the whole point of this work is to wait for a piece of data to be ready (either loaded or unfiltered) exactly at the point when we need to use it so that we are not busy waiting as much as possible and we can make progress while reading/unfiltering asynchronously.

What I should have written instead of "We can't we result tiles into the priority queue until the data is loaded" was "This change strikes me as one for correctness, not one to do with deadlock".

The change in call site for this should be aligned with the goal of this PR. Both forms will wait for the first result tile from each fragment to be loaded before starting anything, and then subsequent tiles in each fragment will be awaited only once we reach them in the queue.

As for the deadlock itself -

it will try to execute another one from the pool of available tasks on the same thread

Oh yeah, I did see that in a stack trace I ran into earlier today. It just calls the function pointer on the same stack, doesn't it.

The scenario you describe looks like it would be pretty commonplace.

Proper coroutines would be great, wouldn't they!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, with coroutines we could suspend and resume tasks so problem solved 😍

Copy link
Member Author

@ypatia ypatia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the review, it was very useful. I think I addressed most of the comments but I'll have to follow up on one of them after taking a deeper look.

@@ -653,7 +653,7 @@ TEST_CASE_METHOD(

// specific relationship for failure not known, but these values
// will result in failure with data being written.
total_budget_ = "15000";
total_budget_ = "40000";
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not expecting this to affect memory usage overall as the memory budget we put constraints in theory the peak usage to that budget at any time. However It's a 3x in the size of the result tile, so it's very much affecting the number of result tiles we can load at the same time in memory given that constrained budget, so the number of iterations we'll need to do to load everything we need from disk, so performance :(

I need to understand why this bump is so big a bit better, come back with some good explanation and see if we can reduce.

tiledb/sm/query/readers/filtered_data.h Outdated Show resolved Hide resolved
tiledb/sm/query/readers/filtered_data.h Show resolved Hide resolved
tiledb/sm/query/readers/filtered_data.h Show resolved Hide resolved
tiledb/sm/query/readers/reader_base.h Outdated Show resolved Hide resolved
tiledb/sm/query/readers/result_tile.h Outdated Show resolved Hide resolved
tiledb/sm/query/readers/result_tile.h Outdated Show resolved Hide resolved
// taking the lock on tile_queue_mutex_. This is to avoid a deadlock where a
// lock is held forever while waiting for a result to be available, while
// the next scheduled task is deadlocking on that lock
rc.tile_->wait_all_coords();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't put result tiles into the priority queue until the data is loaded.

Why not? I thought the whole point of this work is to wait for a piece of data to be ready (either loaded or unfiltered) exactly at the point when we need to use it so that we are not busy waiting as much as possible and we can make progress while reading/unfiltering asynchronously.

I don't quite follow you about the deadlock situation

I am sorry the comment wasn't clear enough. Let me try to explain the deadlock scenario I've hit spuriously in testing which is due to the current design of our ThreadPool:

  • Overall this PR introduces 2 kinds of async tasks: I/O tasks and unfiltering tasks. An unfiltering task will block waiting on the I/O task to be ready, so that it can start unfiltering the result_tile that was read. Any processing that follows, such as merge_result_cell_slabs will then wait for the unfiltering task to be ready at the point in time that it'll need to use the data.
  • Our current rather naive ThreadPool uses a sort of stacking of tasks algorithm, in the sense that when a task is scheduled on a thread but not yet ready, it will try to execute another one from the pool of available tasks on the same thread, so that we don't end up in a deadlock situation where all threads are blocked waiting. When that other task is done, it will check if the previous task is now ready to continue executing.

Now the deadlock scenario I was experiencing with this mutex is the following:

  • Say we have 3 threads in our pool. Thread 1 is executing the main reader code. Thread 2 is running the async I/O task for ResultTile N. Thread 3 is running the async unfiltering task for ResultTile N, which is blocked waiting for the corresponding I/O task to be ready so that it can get the data. In the meantime Thread 3 execution reaches the parallel_for in merge_result_cell_slabs. There, a few async tasks are created that need to acquire the lock on tile_queue_mutex_ and then access the unfiltered data of -among others - ResultTile N.
  • Now it's possible that the async task that acquires the lock, wants to access ResultTile N and gets scheduled on Thread 3 that is blocked. This is our deadlock. The reason is that when Thread 1 will be done, unfiltering on Thread 3 shoud be unblocked, but this will not be possible because it will be stacked under that async task that took the mutex, which will be actually blocked waiting for that unfiltering task to be done :(

IMO this is a design flow of our ThreadPool and it's not the only deadlock situation I've had to cope with in this work. I had to make compromises in terms of performance elsewhere as well, for example when unfiltering a tile we used to do that in parallel chunks, but a similar deadlock situation would arise in that case to, exclusively due to that stacking of tasks in the threadpool that has unexpected side-effects as this one.

@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch from c5899b7 to 260fc22 Compare January 7, 2025 14:23
@rroelke
Copy link
Contributor

rroelke commented Jan 7, 2025

Thanks for the responses, I will approve once we have a little more info on the memory budget change.

@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch from b98c536 to eacdc79 Compare January 21, 2025 12:53
@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch from a81fa30 to a2fb4e1 Compare January 23, 2025 10:42
@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch from c07ec20 to c8b6de6 Compare January 23, 2025 14:57
@ypatia
Copy link
Member Author

ypatia commented Jan 24, 2025

This long standing PR has suffered from 2 pain points:

  • Instability in CI: ASAN failures and hangs in windows tests would come and go, which means there were lifetime and deadlock issues still to be addressed.
  • Performance regression in legacy readers and overall no improvement in the overall picture of SOMA benchmarks.

I have therefore decided to finally split this work in 2 parts:

  • Part1 (this PR): removes the need to wait for all I/O to be done, and uses async tasks to signal when a tile is done reading so that it can proceed to unfiltering.
  • Part2 (future PR): removes then need to wait for a tile is done unfiltering. by using async tasks for unfiltering as well in order to proceed to result processing and then copying to the user buffers.

This PR implements part1, CI is passing reliably and benchmarks show no regression, but no significant speedup either. I think it can be safely merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants