Currently, reading/unfiltering/copying tiles are performed serially among attribute names within `Reader::copy_attribute_values()`:

```
for (const auto& name : names) {
  read_tiles(name, result_tiles);
  unfilter_tiles(name, result_tiles, &cs_ranges);
  if (!array_schema_->var_size(name))
    copy_fixed_cells(name, stride, result_cell_slabs, &fixed_ctx_cache);
  else
    copy_var_cells(name, stride, result_cell_slabs, &var_ctx_cache);
  clear_tiles(name, result_tiles);
}
```

Parallelizing the entire for-loop has a few problems:

1. Each iteration requires a large amount of temporary memory to buffer reads into before they are copied into `buffers_` and cleared within `clear_tiles()`. This memory cannot be shared among parallel iterations, so memory bloat scales linearly with the number of parallel iterations.
2. `read_tiles`, `unfilter_tiles`, `copy_fixed_cells`, `copy_var_cells`, and `clear_tiles` are thread-unsafe because of unprotected access to the individual `ResultTile` elements within `result_tiles`.
3. `copy_fixed_cells` and `copy_var_cells` are thread-unsafe because the `fixed_ctx_cache` and `var_ctx_cache` are thread-unsafe.

Problem #1: This patch restricts the number of parallel iterations to 2. This is a hard-coded value that I will justify later in this commit message.

Problem #2: We could achieve thread safety with a mutex for each `ResultTile` element. However, even if we tolerate the memory overhead of allocating a mutex per `ResultTile`, the CPU time spent locking at this granularity is unacceptably high within the `copy_fixed_cells` and `copy_var_cells` paths, although it is experimentally negligible for the other paths. This is enough of a penalty that I have decided to use a single lock to protect all elements. This effectively serializes all routines except the IO tasks within `read_tiles`.

Problem #3: I have made the context caches thread-safe, although they are still accessed serially. This is mostly because I had to write that code before I was able to test that parallelizing `copy_*_cells` was too expensive with per-`ResultTile` mutexes. I have left it in this patch because it is zero-harm and required for future improvements.
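For illustration, the overall shape of the parallelized loop is sketched below. This is a simplified sketch rather than the actual patch: it uses `std::async` in place of TileDB's internal `ThreadPool`, and `process_attribute` and `kParallelIters` are hypothetical names standing in for the real per-attribute routines and the hard-coded limit of 2.

```
// Simplified sketch: process attribute names in pairs. The IO in
// read_tiles() may overlap across the pair, while a single mutex
// serializes all access to the shared ResultTile elements.
#include <algorithm>
#include <future>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for the real per-attribute work.
void process_attribute(const std::string& name, std::mutex& result_tile_mtx) {
  (void)name;  // placeholder body
  // read_tiles(name, result_tiles);              // IO may overlap across iterations
  std::lock_guard<std::mutex> lock(result_tile_mtx);
  // unfilter_tiles(name, result_tiles, &cs_ranges);
  // copy_fixed_cells(...) or copy_var_cells(...) // serialized by the mutex
  // clear_tiles(name, result_tiles);
}

void copy_attribute_values(const std::vector<std::string>& names) {
  constexpr size_t kParallelIters = 2;  // hard-coded, justified by the numbers below
  std::mutex result_tile_mtx;
  for (size_t i = 0; i < names.size(); i += kParallelIters) {
    const size_t end = std::min(i + kParallelIters, names.size());
    std::vector<std::future<void>> tasks;
    for (size_t j = i; j < end; ++j)
      tasks.emplace_back(std::async(std::launch::async, process_attribute,
                                    std::cref(names[j]),
                                    std::ref(result_tile_mtx)));
    // Step-wise execution: the next pair starts only after both finish.
    for (auto& t : tasks)
      t.get();
  }
}
```

The single `result_tile_mtx` is what serializes everything except the IO, as described under Problem #2.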
I have been testing this in an IO-bound workload (using an S3 backend). Due to the large variance in S3 request times, I ran my test many times and collected the best-case numbers.

Execution times:
- Without this patch: 4.5s
- With this patch, 1 parallel iteration: 4.5s
- With this patch, 2 parallel iterations: 3.4s
- With this patch, 3 parallel iterations: 3.4s
- With this patch, 4 parallel iterations: 3.2s
- With this patch, 6 parallel iterations: 3.2s
- With this patch, 8 parallel iterations: 3.3s
- With this patch, 12 parallel iterations: 3.7s

Given the above, it does not seem that the number of parallel iterations should be configurable. For more than 2 parallel iterations, memory bloat scales linearly with little to gain in execution time. However, with just 2 parallel iterations, we receive a 32% speed-up.

Future Improvements:
- Modify `ThreadPool::execute()` to allow limiting the number of parallel iterations for an arbitrary number of tasks. This patch uses a step-wise execution of parallel reads: both parallel iterations must complete before the next pair of `read_tiles()` calls can execute. Ideally, the 3rd iteration should start immediately after one of the first two completes (a sketch of this idea follows the list).
- Increase the granularity of protection on `ResultTile`s. I have a couple of ideas, but they would be large and out of scope for this patch.
- Revisit the read path for dimensions. Right now, there is no limit to the number of dimensions read at one time. If we have a large number of dimensions, we will have a memory bottleneck.
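For reference, the first bullet could look roughly like the following. This is only a sketch of the idea in plain C++, not an existing TileDB API; `execute_bounded` and `max_parallel` are hypothetical names.

```
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Sketch: run `tasks` with at most `max_parallel` running at once. Each
// worker pulls the next pending task as soon as its current one finishes,
// so iteration i+1 is not blocked waiting for its entire "pair" to complete.
void execute_bounded(const std::vector<std::function<void()>>& tasks,
                     size_t max_parallel) {
  std::atomic<size_t> next{0};
  std::vector<std::thread> workers;
  const size_t num_workers = std::min(max_parallel, tasks.size());
  for (size_t w = 0; w < num_workers; ++w) {
    workers.emplace_back([&] {
      for (size_t i = next.fetch_add(1); i < tasks.size();
           i = next.fetch_add(1))
        tasks[i]();
    });
  }
  for (auto& t : workers)
    t.join();
}
```

With `max_parallel` set to 2, memory stays bounded to two in-flight iterations, but the 3rd `read_tiles()` can begin as soon as either of the first two iterations completes.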