Tile id bug #1

Merged
merged 4 commits into dev from tile_id_bug
Apr 27, 2017
Conversation

TileDB-Inc
Collaborator

Somewhere in the code, the tile_id variable was defined as int when it should have been int64_t. This caused problems for very small tile extents in a large domain, where the sheer number of space tiles led to tile id overflows.
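For illustration, a minimal sketch of the overflow; the 1-D domain, extent, and coordinate below are hypothetical, not taken from a real array:

```
#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical 1-D domain of 2^40 cells with a tile extent of 4: the
  // resulting tile id (~2^38) no longer fits in a 32-bit int.
  const int64_t domain_size = int64_t(1) << 40;
  const int64_t tile_extent = 4;
  const int64_t coord = domain_size - 1;

  const int tile_id_32 = static_cast<int>(coord / tile_extent);  // overflows
  const int64_t tile_id_64 = coord / tile_extent;                // correct

  std::cout << "32-bit tile id: " << tile_id_32 << "\n";  // wrapped garbage
  std::cout << "64-bit tile id: " << tile_id_64 << "\n";  // 274877906943
  return 0;
}
```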

@TileDB-Inc TileDB-Inc merged commit b857d6c into dev Apr 27, 2017
@jakebolewski jakebolewski deleted the tile_id_bug branch July 11, 2018 15:26
joe-maley pushed a commit that referenced this pull request Mar 27, 2019
Closes #1068

1. Consolidates all state for a single multipart upload request into a single
   struct 'MultiPartUploadState'.
2. Refactored to support #1.
3. Introduced a shared Status that will be set to the last non-OK status
   from an individual part upload request. Defaults to OK.
4. When disconnecting or flushing an object, we'll invoke an 'Abort' request
   instead of a 'Complete' request if the shared state from #3 is non-OK.

Tests run: ./tiledb_unit *S3*
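As a rough sketch of what this consolidation might look like (the `Status` stand-in and all field names are assumptions, not the actual TileDB definitions):

```
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

// Minimal stand-in for TileDB's Status class (the real one is richer).
struct Status {
  bool ok = true;
  std::string msg;
};

// Hypothetical consolidation of per-object multipart upload state.
struct MultiPartUploadState {
  uint64_t part_number = 0;        // next part id to upload
  std::string upload_id;           // backend's multipart upload id
  std::vector<std::string> etags;  // etags of completed parts
  std::mutex mtx;                  // guards concurrent part completions
  Status st;                       // last non-OK status; defaults to OK

  // Record a part result, keeping the last non-OK status (item 3).
  void on_part_done(const Status& part_st) {
    std::lock_guard<std::mutex> lock(mtx);
    if (!part_st.ok)
      st = part_st;
  }

  // On disconnect/flush, abort rather than complete if any part failed
  // (item 4). The Abort/Complete requests themselves are elided.
  bool should_abort() {
    std::lock_guard<std::mutex> lock(mtx);
    return !st.ok;
  }
};
```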
joe-maley pushed a commit that referenced this pull request Mar 29, 2019
Closes #1068

1. Consolidates all state for a single multipart upload request into a single
   struct 'MultiPartUploadState'.
2. Refactored to support #1.
3. Introduced a shared Status that will be set to the last non-OK status
   from an individual part upload request. Defaults to OK.
4. When disconnecting or flushing an object, we'll invoke an 'Abort' request
   instead of a 'Complete' request if the shared state from #3 is non-OK.
5. Added a unit test for this failure path.
6. Introduced the 'UnitTestConfig' singleton for mutating the state of
   classes under test.
7. Introduced 'macros.h' for common preprocessor macros.

Tests run: ./tiledb_unit *S3*
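Item 6 suggests a pattern like the following sketch; the class shape and the flag name are guesses, not the actual TileDB code:

```
// Hypothetical sketch: a process-wide singleton that tests mutate and
// production code consults to inject failures (flag name is invented).
class UnitTestConfig {
 public:
  static UnitTestConfig& instance() {
    static UnitTestConfig cfg;  // Meyers singleton
    return cfg;
  }

  // Example knob: force the next S3 part upload to fail.
  bool s3_fail_part_upload = false;

 private:
  UnitTestConfig() = default;
};
```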
joe-maley pushed a commit that referenced this pull request Nov 26, 2019
This fixes two issues:

Issue #1:
"Cannot deserialize; kj::Exception: expected segment != nullptr &&
 segment->checkObject(segment->getStartPtr(), ONE * WORDS); Message
 did not contain a root pointer."}

This a one-line fix where we must reset the original buffers offset
after writing to it.

Issue #2:
Before attempting to deserialize a query, we serialize the current
state as a backup. If the deserialization of the new query fails,
we deserialize the backup to ensure that the query object leaves
'query_deserialize()' in its original state. Deserializing
the backup is failing on a 'memcpy' when we attempt to copy the
first incoming attribute buffer into the backup query. This fails
because we are only deserializing the query object and do not
provide any concatenated data buffers.

The easiest fix is to add a path to ignore attribute buffer processing
entirely in the 'do_query_deserialize' path. This allows us to revert
the Query object to its original state, leaving the data buffers untouched.
This is OK because we also revert the 'copy_state', so it is acceptable
to have extra, incomplete data at the end of the data buffers.
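A self-contained sketch of the Issue #1 fix, using a simplified buffer stand-in rather than the TileDB `Buffer` class:

```
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified buffer with a shared read/write cursor.
struct SimpleBuffer {
  std::vector<uint8_t> data;
  size_t offset = 0;  // advances on both reads and writes

  void write(const void* p, size_t n) {
    const uint8_t* b = static_cast<const uint8_t*>(p);
    data.insert(data.end(), b, b + n);
    offset += n;
  }
  void reset_offset() { offset = 0; }
};

// The bug: serializing into the buffer leaves 'offset' at the end, so a
// later deserialization from the same buffer reads past the message and
// finds no root pointer. The fix is a single rewind after writing.
void serialize_backup(SimpleBuffer& b, int64_t state) {
  b.write(&state, sizeof(state));  // offset now at end of data
  b.reset_offset();                // the one-line fix: rewind for reading
}
```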
ihnorton pushed a commit that referenced this pull request Dec 4, 2019

(cherry picked from commit 8fe1427
  #1444)
joe-maley pushed a commit that referenced this pull request Dec 20, 2019
For early review: ChunkBuffers and associated unit tests.

This patch contains chunk_buffers.cc/h for use in selective
decompression. This also contains a passing unit-ChunkBuffers.cc test
and a modified unit-Tile.cc test that passes with the associated
changes to Tile that are NOT included in this review.

I have included a detailed explanation of the new class in its
header file (chunk_buffers.h), but for easy reference, I'll paste it here:

 * The ChunkBuffers class is used to represent a logically contiguous buffer
 * as a vector of individual buffers. These individual buffers are referred to
 * as "chunk buffers". Each chunk buffer may be allocated individually, which
 * will save memory in scenarios where the logically contiguous buffer is
 * sparsely allocated.
 *
 * After construction, the class must be initialized before performing IO. The
 * initialization determines the following, independent usage paradigms:
 *
 * #1: Chunk Sizes: Fixed/Variable
 * The chunk sizes must be either fixed or variable. An instance with fixed
 * chunk sizes ensures that all chunk buffers are of equal size. The size of the
 * last chunk buffer may be equal-to or less-than the other chunk sizes.
 * Instances with fixed size chunks have a smaller memory footprint and have a
 * smaller algorithmic complexity when performing IO. For variable sized chunks,
 * each chunk size is independent from the others.
 *
 * #2: Chunk Buffer Addressing: Discrete/Contiguous
 * The addresses of the individual chunk buffers may or may not be virtually
 * contiguous. For example, the chunk addresses within a virtually contiguous
 * instance may be allocated at addresses 1024 and 1028, where the first chunk
 * is of size 4. Non-contiguous chunks (referred to as "discrete") may be
 * allocated at any address. The trade-off is that the memory of each discrete
 * chunk is managed individually, whereas contiguous chunk buffers can be
 * managed by the first chunk alone.
 *
 * #3: Memory Management: Internal/External
 * The chunk buffers may be allocated and freed internally or externally.
 * Internal memory management is exposed through the alloc_*() and free_*()
 * routines. External memory management is exposed through the set_*() routines.
 * Currently, this only supports external memory management for contiguously
 * addressed buffers and internal memory management for discretely addressed
 * buffers.
 *
 * Note that the ChunkBuffers class does NOT support any concept of ownership.
 * It is up to the caller to free the instance before destruction.
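For instance, the smaller algorithmic complexity noted for fixed-size chunks (paradigm #1) comes from constant-time address translation, roughly as in this illustration (not the ChunkBuffers API):

```
#include <cstddef>
#include <utility>

// Fixed-size chunks: a logical offset maps to (chunk index, offset
// within chunk) with one division; variable-size chunks would need a
// scan or a prefix-sum lookup over the chunk sizes instead.
std::pair<size_t, size_t> locate(size_t logical_offset, size_t chunk_size) {
  return {logical_offset / chunk_size, logical_offset % chunk_size};
}
```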
@joe-maley joe-maley mentioned this pull request Dec 20, 2019
ihnorton referenced this pull request in ihnorton/TileDB Mar 17, 2020
Initialize VFS with context config
joe-maley pushed a commit that referenced this pull request Mar 17, 2020
Change #1: Fix non-multipart uploads

  This fixes an issue where we free the buffer before writing it in the
  non-multipart upload path, which can be used for quicker small writes.

  Adds an associated unit test.

Change #2: Missing error return path in blob flush

  In the blob flush path for multipart uploads, the last flush of the write
  cache may silently fail. This ensures the user receives a non-OK status
  from flush_blob() in this scenario.

Change #3: Multipart uploads fail when they have > 10 chunks

  In the multipart upload (aka blocklist upload), each chunk is uploaded as
  a block with a string-id unique to the overall object (aka blob). There is
  an undocumented requirement that all string-ids must be of equal length.

  Currently, the string-ids are generated from an incrementing integer. For
  example, the id of the first chunk is "0", the second chunk is "1", the
  tenth chunk is "9", and the eleventh chunk is "10". The eleventh chunk
  has a string-id size of 2 while all the others have a size of 1. The
  eleventh chunk (and all other future chunks) fail with the following
  error message:

    "The specified blob or block content is invalid"

  This patch pads the string-ids to ensure they are all 5 characters long.
  The maximum number of chunks is 50,000, so 5 characters are sufficient to
  represent all chunk ids.

  More info here:
  https://gauravmantri.com/2013/05/18/windows-azure-blob-storage-dealing-with-the-specified-blob-or-block-content-is-invalid-error/
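A sketch of the padding alone; Azure block ids must also be base64-encoded, which is elided here:

```
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Zero-pad block ids to a fixed width of 5 so every string-id has equal
// length; 5 digits covers the 50,000-block maximum ("00000".."49999").
std::string block_id_str(uint64_t chunk_num) {
  std::ostringstream ss;
  ss << std::setw(5) << std::setfill('0') << chunk_num;
  return ss.str();  // e.g. 10 -> "00010"
}
```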
@joe-maley joe-maley mentioned this pull request Mar 17, 2020
joe-maley pushed a commit that referenced this pull request Jun 16, 2020
This fixes the following within StorageManager::load_array_metadata():
1. A direct memory leak of a `Buffer` object introduced with the chunked buffers
  patch.
2. A long-standing indirect memory leak of the metadata buffers.

For #1, an auxiliary `Buffer` object is created to store the contents of a
read from a chunked buffer. The `Buffer` object is never deleted.

For #2, the raw buffers inside of the `Buffer` object are passed into a
`ConstBuffer`. However, the `ConstBuffer` does not take ownership or free them.
The buffers are never freed while the `ConstBuffer` references them because of
leak #1. For this patch, I've replaced the `ConstBuffer` objects with `Buffer`
objects so that the `OpenArray` instance can take ownership of the metadata
buffers and free them appropriately.

Note that before the chunked buffer patch, we had a similar pattern where the
`Tile` buffer was disowned, but never freed by the `ConstBuffer`. This may be
affecting older core builds.
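The ownership distinction, in a simplified sketch; these are stand-ins, not the TileDB classes:

```
#include <cstddef>
#include <cstdlib>

// Non-owning view: referencing malloc'd memory here without another
// owner leaks it -- the pattern behind leak #2.
struct ConstBufferView {
  const void* data = nullptr;
  size_t size = 0;
};

// Owning wrapper: frees its region on destruction, so handing metadata
// buffers to the OpenArray as owning Buffers lets them be freed.
struct OwningBuffer {
  void* data = nullptr;
  size_t size = 0;
  OwningBuffer() = default;
  OwningBuffer(const OwningBuffer&) = delete;             // no double-free
  OwningBuffer& operator=(const OwningBuffer&) = delete;
  ~OwningBuffer() { std::free(data); }
};
```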
joe-maley pushed a commit that referenced this pull request Aug 7, 2020
This introduces a global thread pool for use when TBB is disabled. The
performance has not been exhaustively benchmarked against TBB. However, I did
test this on two readily available scenarios that I had recently been
benchmarking for other reasons. One scenario makes heavy use of the
parallel_sort path while the other does not. Surprisingly, disabling TBB
performs about 10% faster with this patch.

// Scenario #1
TBB: 3.4s
TBB disabled, this patch: 3.0s
TBB disabled, on dev: 10.0s

// Scenario #2
TBB: 3.1s
TBB disabled, this patch: 2.7s
TBB disabled, on dev: 9.1s

For now, this patch uses the thread pool at the same scope as the TBB scheduler.
It is a global thread pool, shared within a single process, and conditionally
compiled. Its concurrency level is determined from the "sm.num_tbb_threads"
config.

This patch does not disable TBB by default.
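A sketch of the process-wide accessor; the `ThreadPool` stand-in, the `HAVE_TBB` macro name, and sizing from hardware_concurrency() are assumptions standing in for the "sm.num_tbb_threads" lookup:

```
#include <cstddef>
#include <thread>

// Minimal stand-in; TileDB's ThreadPool is richer.
struct ThreadPool {
  explicit ThreadPool(size_t n) : concurrency(n) {}
  size_t concurrency;
};

#ifndef HAVE_TBB  // conditionally compiled, as described (macro name assumed)
ThreadPool& global_thread_pool() {
  // One pool shared across the process; constructed on first use.
  static ThreadPool pool(std::thread::hardware_concurrency());
  return pool;
}
#endif
```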
joe-maley pushed a commit that referenced this pull request Sep 17, 2020
Currently, reading, unfiltering, and copying tiles are performed serially
across attribute names within Reader::copy_attribute_values():
```
for (const auto& name : names) {
  read_tiles(name, result_tiles);
  unfilter_tiles(name, result_tiles, &cs_ranges);
  if (!array_schema_->var_size(name))
    copy_fixed_cells(name, stride, result_cell_slabs, &fixed_ctx_cache);
  else
    copy_var_cells(name, stride, result_cell_slabs, &var_ctx_cache);
  clear_tiles(name, result_tiles);
}
```

Parallelizing the entire for-loop has a few problems:
1. Each iteration requires a large amount of temporary memory to buffer reads
   into before they are copied into `buffers_` and cleared within clear_tiles().
   This memory cannot be shared among parallel iterations, so memory bloat
   scales linearly with the number of parallel iterations.
2. `read_tiles`, `unfilter_tiles`, `copy_fixed_cells`, `copy_var_cells`, and
   `clear_tiles` are thread-unsafe because of unprotected access to the
   individual `ResultTile` elements within `result_tiles`.
3. `copy_fixed_cells` and `copy_var_cells` are thread-unsafe because the
   `fixed_ctx_cache` and `var_ctx_cache` are thread-unsafe.

Problem #1: This patch restricts the number of parallel iterations to 2. This
is a hard-coded value that I will justify later in this commit message.

Problem #2: We can achieve thread-safety with a mutex for each ResultTile
element. Even if we tolerate the memory overhead associated with allocating a
mutex for each ResultTile, the CPU time spent locking at this granularity
is unacceptably high within the `copy_fixed_cells` and `copy_var_cells` paths.
However, it is experimentally negligible for the other paths. This is enough
of a penalty that I have decided to use a single lock to protect all elements.
This effectively serializes all routines, except for the IO tasks within
`read_tiles`.

Problem #3: I have made the context caches thread-safe, although they are still
accessed serially. This is mostly because I had to write this code before I
was able to confirm that parallelizing `copy_*_cells` was too expensive with
per-ResultTile mutexes. I have decided to leave it in this patch because it is
zero-harm and required for future improvements.

I have been testing this in an IO-bound workload (using an S3 backend). Due to
the large variance in S3 request times, I ran my test many times and collected
the best-case numbers.

// Execution times:
Without this patch: 4.5s
With this patch, 1 parallel iteration: 4.5s
With this patch, 2 parallel iterations: 3.4s
With this patch, 3 parallel iterations: 3.4s
With this patch, 4 parallel iterations: 3.2s
With this patch, 6 parallel iterations: 3.2s
With this patch, 8 parallel iterations: 3.3s
With this patch, 12 parallel iterations: 3.7s

Given the above, it does not seem that the number of parallel iterations needs
to be configurable. Beyond 2 parallel iterations, memory bloat scales linearly
with little gain in execution time, while just 2 parallel iterations yield a
32% speed-up.

Future Improvements:
- Modify ThreadPool::execute() to allow limiting the number of parallel
iterations for an arbitrary number of tasks. This patch uses a step-wise
execution of parallel reads. In other words: both parallel iterations must
complete before the next pair of `read_tiles()` can execute. Ideally, the 3rd
iteration should start immediately after one of the first two completes.

- Increase the granularity of protection on ResultTiles. I have a couple of
ideas, but they would be large and out of scope for this patch.

- Revisit the read path for dimensions. Right now, there is no limit to the
number of dimensions read at one time. If we have a large number of dimensions,
we will have a memory bottleneck.
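The step-wise pairing described above might look like the following sketch, where `std::async` stands in for the ThreadPool and `process_name` stands in for the read/unfilter/copy/clear sequence:

```
#include <algorithm>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

void process_in_pairs(const std::vector<std::string>& names,
                      void (*process_name)(const std::string&)) {
  constexpr size_t kParallelIters = 2;  // hard-coded, per the numbers above
  for (size_t i = 0; i < names.size(); i += kParallelIters) {
    std::vector<std::future<void>> batch;
    const size_t end = std::min(i + kParallelIters, names.size());
    for (size_t j = i; j < end; ++j)
      batch.emplace_back(std::async(std::launch::async, process_name, names[j]));
    for (auto& f : batch)
      f.wait();  // both iterations finish before the next pair starts
  }
}
```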
joe-maley pushed a commit that referenced this pull request Nov 13, 2020
The offsets in each buffer correspond to the values in its
data buffer. To build a single contiguous buffer, we must
ensure the offset values continue in ascending order from the
previous buffer. For example, consider the following two buffers
storing int32 values.

Buffer #1:
  offsets: [0, 8, 16]
  values:  [1, 2, 3, 4, 5, 6, 7]

Buffer #2:
  offsets: [0, 12]
  values:  [100, 200, 300, 400, 500]

The final, contiguous buffer will be:
  offsets: [0, 8, 16, 28, 40]
  values:  [1, 2, 3, 4, 5, 6, 7, 100, 200, 300, 400, 500]

The last two offsets, `28` and `40`, were calculated by adding `28`
to the offsets of Buffer #2 (`[0, 12]`). The `28` is the byte size of
the values in Buffer #1 (seven int32 values × 4 bytes = 28 bytes).
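
As a small, illustrative sketch of that rebasing arithmetic (the buffer
layout and names below are assumptions, not the actual TileDB types):
```
#include <cstdint>
#include <vector>

struct VarBuffer {
  std::vector<uint64_t> offsets;  // byte offsets into `values`
  std::vector<int32_t> values;
};

// Concatenate buffers, shifting each buffer's offsets by the total byte
// size of all value buffers that precede it.
VarBuffer concat(const std::vector<VarBuffer>& buffers) {
  VarBuffer out;
  uint64_t value_bytes = 0;
  for (const auto& b : buffers) {
    for (uint64_t off : b.offsets)
      out.offsets.push_back(off + value_bytes);
    out.values.insert(out.values.end(), b.values.begin(), b.values.end());
    value_bytes += b.values.size() * sizeof(int32_t);
  }
  return out;
}
```
Running this over the two buffers above yields offsets
`[0, 8, 16, 28, 40]` and the concatenated values, matching the example.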
joe-maley pushed a commit that referenced this pull request Nov 16, 2020
github-actions bot pushed a commit that referenced this pull request Nov 16, 2020
joe-maley pushed a commit that referenced this pull request Nov 17, 2020
joe-maley pushed a commit that referenced this pull request May 3, 2021
Fixes #2201

This patch introduces a global lock on the `S3Client` constructor to prevent
the following race reported from TSAN:
```
WARNING: ThreadSanitizer: data race (pid=12)
  Write of size 8 at 0x7fe93658a1b0 by thread T16 (mutexes: write M144250874682678808, write M150443169551677688):
    #0 cJSON_ParseWithOpts external/aws/aws-cpp-sdk-core/source/external/cjson/cJSON.cpp:1006 (libexternal_Saws_Slibaws.so+0x299bef)
    #1 Aws::Utils::Json::JsonValue::JsonValue(std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Allocator<char> > const&) external/aws/aws-cpp-sdk-core/source/utils/json/JsonSerializer.cpp:39 (libexternal_Saws_Slibaws.so+0x32d09c)
    #2 Aws::Config::EC2InstanceProfileConfigLoader::LoadInternal() external/aws/aws-cpp-sdk-core/source/config/AWSProfileConfigLoader.cpp:341 (libexternal_Saws_Slibaws.so+0x286d88)
    #3 Aws::Config::AWSProfileConfigLoader::Load() external/aws/aws-cpp-sdk-core/source/config/AWSProfileConfigLoader.cpp:47 (libexternal_Saws_Slibaws.so+0x282b4f)
    #4 Aws::Auth::InstanceProfileCredentialsProvider::Reload() external/aws/aws-cpp-sdk-core/source/auth/AWSCredentialsProvider.cpp:265 (libexternal_Saws_Slibaws.so+0x235987)
    #5 Aws::Auth::InstanceProfileCredentialsProvider::RefreshIfExpired() external/aws/aws-cpp-sdk-core/source/auth/AWSCredentialsProvider.cpp:283 (libexternal_Saws_Slibaws.so+0x23613b)
    #6 Aws::Auth::InstanceProfileCredentialsProvider::GetAWSCredentials() external/aws/aws-cpp-sdk-core/source/auth/AWSCredentialsProvider.cpp:250 (libexternal_Saws_Slibaws.so+0x23be9a)
    #7 Aws::Auth::AWSCredentialsProviderChain::GetAWSCredentials() external/aws/aws-cpp-sdk-core/source/auth/AWSCredentialsProviderChain.cpp:35 (libexternal_Saws_Slibaws.so+0x244550)
    #8 Aws::Client::AWSAuthV4Signer::AWSAuthV4Signer(std::shared_ptr<Aws::Auth::AWSCredentialsProvider> const&, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Allocator<char> > const&, Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy, bool) external/aws/aws-cpp-sdk-core/source/auth/AWSAuthSigner.cpp:170 (libexternal_Saws_Slibaws.so+0x2262ed)
    #9 std::shared_ptr<Aws::Client::AWSAuthV4Signer> Aws::MakeShared<Aws::Client::AWSAuthV4Signer, std::shared_ptr<Aws::Auth::DefaultAWSCredentialsProviderChain>, char const*&, std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Allocator<char> > const&, Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy&, bool>(char const*, std::shared_ptr<Aws::Auth::DefaultAWSCredentialsProviderChain>&&, char const*&, std::__cxx11::basic_string<char, std::char_traits<char>, Aws::Allocator<char> > const&, Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy&, bool&&) /usr/include/c++/9/ext/new_allocator.h:147 (libexternal_Saws_Slibaws.so+0x39a78b)
    #10 Aws::S3::S3Client::S3Client(Aws::Client::ClientConfiguration const&, Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy, bool, Aws::S3::US_EAST_1_REGIONAL_ENDPOINT_OPTION) external/aws/aws-cpp-sdk-s3/source/S3Client.cpp:134 (libexternal_Saws_Slibaws.so+0x39a78b)
    #11 std::shared_ptr<Aws::S3::S3Client>::shared_ptr<Aws::Allocator<Aws::S3::S3Client>, Aws::Client::ClientConfiguration&, Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy, bool const&>(std::_Sp_alloc_shared_tag<Aws::Allocator<Aws::S3::S3Client> >, Aws::Client::ClientConfiguration&, Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy&&, bool const&) /usr/include/c++/9/ext/new_allocator.h:147 (libexternal_Scom_Ugithub_Utiledb_Utiledb_Slibtiledb.so+0x2879bd)
    #12 std::shared_ptr<Aws::S3::S3Client> std::allocate_shared<Aws::S3::S3Client, Aws::Allocator<Aws::S3::S3Client>, Aws::Client::ClientConfiguration&, Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy, bool const&>(Aws::Allocator<Aws::S3::S3Client> const&, Aws::Client::ClientConfiguration&, Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy&&, bool const&) /usr/include/c++/9/bits/shared_ptr.h:702 (libexternal_Scom_Ugithub_Utiledb_Utiledb_Slibtiledb.so+0x2879bd)
    #13 std::shared_ptr<Aws::S3::S3Client> Aws::MakeShared<Aws::S3::S3Client, Aws::Client::ClientConfiguration&, Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy, bool const&>(char const*, Aws::Client::ClientConfiguration&, Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy&&, bool const&) external/aws/aws-cpp-sdk-core/include/aws/core/utils/memory/stl/AWSAllocator.h:106 (libexternal_Scom_Ugithub_Utiledb_Utiledb_Slibtiledb.so+0x2879bd)
    #14 tiledb::sm::S3::init_client() const external/com_github_tiledb_tiledb/tiledb/sm/filesystem/s3.cc:1104 (libexternal_Scom_Ugithub_Utiledb_Utiledb_Slibtiledb.so+0x2879bd)
```
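
A minimal sketch of the fix, with illustrative names (the actual change
serializes construction inside `S3::init_client`):
```
#include <mutex>

// Sketch: the AWS SDK's credential/config loading is not thread-safe, so
// S3Client construction is serialized behind a single process-wide mutex.
static std::mutex s3client_ctor_mtx;

void init_client_sketch() {
  std::lock_guard<std::mutex> lock(s3client_ctor_mtx);
  // client_ = Aws::MakeShared<Aws::S3::S3Client>(...);  // constructed here
}
```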

---

TYPE: BUG
DESC: Fix race within `S3::init_client`
joe-maley pushed a commit that referenced this pull request May 3, 2021
ihnorton pushed a commit that referenced this pull request Aug 19, 2022
ihnorton pushed a commit that referenced this pull request Aug 24, 2022
davisp added a commit that referenced this pull request Mar 8, 2023