Nullable Attribute Support (#1895)
Attributes can be defined as nullable.  Nullable attributes require a "validity
vector" buffer for both read and write queries, similar to how var-sized
attributes require an additional "offsets" buffer. Both fixed and var-sized
attributes may be nullable.

Using the C API, attributes must be set nullable before adding them to the
schema, e.g.:
```
tiledb_attribute_t* attr;
tiledb_attribute_alloc(ctx, "my_attr", TILEDB_INT32, &attr);
tiledb_attribute_set_nullable(ctx, attr, 1 /* nullable */);

tiledb_array_schema_t* array_schema;
tiledb_array_schema_alloc(ctx, TILEDB_DENSE, &array_schema);
tiledb_array_schema_add_attribute(ctx, array_schema, attr);
```

Write queries require a validity vector (bytemap) for nullable attributes. In
the example below, the values 200 and 300 are null. Null values may or may not
be written to disk, and TileDB may treat them as garbage.
```
int32_t buffer[] = {100, 200, 300, 400};
uint64_t buffer_size = sizeof(buffer);
uint8_t buffer_validity[] = {1, 0, 0, 1};
uint64_t buffer_validity_size = sizeof(buffer_validity);
tiledb_query_set_buffer_nullable(
  ctx,
  query,
  "my_attr",
  buffer,
  &buffer_size,
  buffer_validity,
  &buffer_validity_size);
```
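
Read queries hand back the same pair of buffers, so a caller has to consult the
bytemap before trusting a value. A minimal sketch of that check, using plain C
arrays in place of real query results; `sum_valid_int32` is a hypothetical
helper for illustration and not part of the TileDB API:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a TileDB function): sums the cells whose
 * validity byte is non-zero and reports how many cells were null. */
static int64_t sum_valid_int32(
    const int32_t* values,
    const uint8_t* validity,
    size_t num_cells,
    size_t* num_null) {
  int64_t sum = 0;
  size_t nulls = 0;
  for (size_t i = 0; i < num_cells; ++i) {
    if (validity[i] != 0) {
      sum += values[i];
    } else {
      ++nulls; /* the value of a null cell is unspecified; skip it */
    }
  }
  if (num_null != NULL)
    *num_null = nulls;
  return sum;
}
```

With the buffers from the write example above, this would add up only 100 and
400 and count two null cells.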

Overview:
- Format version bumped from 6 to 7.

- Validity vector buffers are written to their own tile, similar to how offset
  buffers are written to their own tile, separate from the value tile.

- Currently, the "validity vector" is a bytemap in all usage (APIs, in-memory,
  and on-disk). In the future, we would like to store the validity vector as
  a bitmap in-memory and on-disk, while allowing the user to choose an API that
  uses either a bitmap or a bytemap.

- A new, internal `ValidityVector` class has been introduced to store the
  validity vector in-memory. This may seem extraneous because it wraps a simple
  buffer, but this will change in the future when we support bitmaps.

- Similar to the existing "sm.memory_budget" and "sm.memory_budget_var" config
  parameters, there is now a "sm.memory_budget_validity" for budgeting the
  validity vector buffers.

- Similar to offset tiles, validity tiles have their own compressor that is
  independent of the user-defined attribute filter. I have tentatively chosen
  RLE compression.

- C/C++ APIs have been added.

- The `QueryBuffer` class has been moved from `misc/query_buffer.h` to
  `query/query_buffer.h` because it now depends on `query/validity_vector`,
  which is outside of the `misc` directory.

- Many of the internal classes are now nullable-aware (`Reader`, `Writer`,
  `Query`, `FilterPipeline`, `Subarray`, `SubarrayPartitioner`).
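
To make the bytemap versus bitmap distinction from the overview concrete, here
is a rough sketch of how one could be converted into the other. This is not
code from this commit: the function names and the least-significant-bit-first
ordering are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Packs a bytemap (one byte per cell, non-zero = valid) into a bitmap
 * (one bit per cell). Bit order is LSB-first, an assumption of this
 * sketch. `bitmap` must hold at least (num_cells + 7) / 8 bytes. */
static void bytemap_to_bitmap(
    const uint8_t* bytemap, size_t num_cells, uint8_t* bitmap) {
  memset(bitmap, 0, (num_cells + 7) / 8);
  for (size_t i = 0; i < num_cells; ++i) {
    if (bytemap[i] != 0)
      bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
  }
}

/* Expands a bitmap back into a bytemap of 0/1 bytes. */
static void bitmap_to_bytemap(
    const uint8_t* bitmap, size_t num_cells, uint8_t* bytemap) {
  for (size_t i = 0; i < num_cells; ++i)
    bytemap[i] = (uint8_t)((bitmap[i / 8] >> (i % 8)) & 1u);
}
```

A bitmap needs one eighth of the space, which is the motivation for the
planned change; the bytemap is simpler for callers to fill in and inspect.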

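As a rough illustration of why run-length encoding suits validity data, the
sketch below encodes a bytemap as (value, run length) byte pairs. It is not
TileDB's RLE filter and does not reflect its on-disk layout; the function name
and the pair format are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Run-length encodes `in` as (value, run_length) byte pairs, capping each
 * run at 255 cells. Returns the number of bytes written to `out`, or 0 if
 * the input is empty or `out_cap` is too small. */
static size_t rle_encode_bytemap(
    const uint8_t* in, size_t n, uint8_t* out, size_t out_cap) {
  size_t written = 0;
  size_t i = 0;
  while (i < n) {
    uint8_t value = in[i];
    size_t run = 1;
    /* Extend the run while the next cell repeats the same value. */
    while (i + run < n && in[i + run] == value && run < 255)
      ++run;
    if (written + 2 > out_cap)
      return 0;
    out[written++] = value;
    out[written++] = (uint8_t)run;
    i += run;
  }
  return written;
}
```

Validity vectors tend to contain long runs of identical bytes (mostly valid or
mostly null), so a scheme along these lines can shrink them considerably.
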
Co-authored-by: Joe Maley <[email protected]>
joe maley and Joe Maley authored Nov 10, 2020
1 parent 0319f02 commit a7fd8d6
Showing 53 changed files with 6,291 additions and 500 deletions.
5 changes: 3 additions & 2 deletions HISTORY.md
Original file line number Diff line number Diff line change
@@ -10,9 +10,10 @@

## New features

* Support for nullable attributes. [#1895](https://github.com/TileDB-Inc/TileDB/pull/1895)
* Support for Hilbert order sorting for sparse arrays. [#1880](https://github.com/TileDB-Inc/TileDB/pull/1880)
* Support for AWS S3 "AssumeRole" temporary credentials [#1882](https://github.com/TileDB-Inc/TileDB/pull/1882)
* Added support for Hilbert order sorting for sparse arrays. [#1880](https://github.com/TileDB-Inc/TileDB/pull/1880)
* Added experimental support for an in-memory backend used with bootstrap option "--enable-memfs" [#1873](https://github.com/TileDB-Inc/TileDB/pull/1873)
* Experimental support for an in-memory backend used with bootstrap option "--enable-memfs" [#1873](https://github.com/TileDB-Inc/TileDB/pull/1873)

## Improvements

12 changes: 12 additions & 0 deletions doc/source/c-api.rst
@@ -319,6 +319,8 @@ Attribute
:project: TileDB-C
.. doxygenfunction:: tiledb_attribute_free
:project: TileDB-C
.. doxygenfunction:: tiledb_attribute_set_nullable
:project: TileDB-C
.. doxygenfunction:: tiledb_attribute_set_filter_list
:project: TileDB-C
.. doxygenfunction:: tiledb_attribute_set_cell_val_num
@@ -327,6 +329,8 @@ Attribute
:project: TileDB-C
.. doxygenfunction:: tiledb_attribute_get_type
:project: TileDB-C
.. doxygenfunction:: tiledb_attribute_get_nullable
:project: TileDB-C
.. doxygenfunction:: tiledb_attribute_get_filter_list
:project: TileDB-C
.. doxygenfunction:: tiledb_attribute_get_cell_val_num
@@ -396,10 +400,18 @@ Query
:project: TileDB-C
.. doxygenfunction:: tiledb_query_set_buffer_var
:project: TileDB-C
.. doxygenfunction:: tiledb_query_set_buffer_nullable
:project: TileDB-C
.. doxygenfunction:: tiledb_query_set_buffer_var_nullable
:project: TileDB-C
.. doxygenfunction:: tiledb_query_get_buffer
:project: TileDB-C
.. doxygenfunction:: tiledb_query_get_buffer_var
:project: TileDB-C
.. doxygenfunction:: tiledb_query_get_buffer_nullable
:project: TileDB-C
.. doxygenfunction:: tiledb_query_get_buffer_var_nullable
:project: TileDB-C
.. doxygenfunction:: tiledb_query_set_layout
:project: TileDB-C
.. doxygenfunction:: tiledb_query_free
4 changes: 2 additions & 2 deletions format_spec/FORMAT_SPEC.md
@@ -2,7 +2,7 @@


:information_source: **Notes:**
- The current TileDB format version number is **5** (`uint32_t`).
- The current TileDB format version number is **7** (`uint32_t`).
- All data written by TileDB and referenced in this document is **little-endian**.

## Table of Contents
@@ -15,4 +15,4 @@
* [Tile](./tile.md)
* [Generic Tile](./generic_tile.md)
* **Group**
* [File hierarchy](./group_file_hierarchy.md)
* [File hierarchy](./group_file_hierarchy.md)
5 changes: 3 additions & 2 deletions format_spec/array_schema.md
@@ -1,4 +1,4 @@
# Array Schema File
#Array Schema File

The array schema file has name `__array_schema.tdb` and is located here:

@@ -23,6 +23,7 @@ The array schema file consists of a single [generic tile](./generic_tile.md), wi
| Capacity | `uint64_t` | For sparse fragments, the data tile capacity |
| Coords filters | [Filter Pipeline](./filter_pipeline.md) | The filter pipeline used as default for coordinate tiles |
| Offsets filters | [Filter Pipeline](./filter_pipeline.md) | The filter pipeline used for cell var-len offset tiles |
| Validity filters | [Filter Pipeline](./filter_pipeline.md) | The filter pipeline used for cell validity tiles |
| Domain | [Domain](#domain) | The array domain |
| Num attributes | `uint32_t` | Number of attributes in the array |
| Attribute 1 | [Attribute](#attribute) | First attribute |
@@ -69,4 +70,4 @@ The attribute has internal format:
| Filters | [Filter Pipeline](./filter_pipeline.md) | The filter pipeline used on attribute value tiles |
| Fill value size | `uint64_t` | The size in bytes of the fill value |
| Fill value | `uint8_t[]` | The fill value |

| Nullable | `bool` | Whether or not the attribute can be null |
6 changes: 5 additions & 1 deletion format_spec/fragment.md
@@ -27,7 +27,7 @@ my_array # array folder

There can be any number of fragments in an array. The fragment folder contains:
* A single [fragment metadata file](#fragment-metadata-file) named `__fragment_metadata.tdb`.
* Any number of [data files](#data-file). For each fixed-sized attribute `a1` (or dimension `d1`), there is a single data file `a1.tdb` (`d1.tdb`) containing the values along this attribute (dimension). For every var-sized attribute `a2` (or dimensions `d2`), there are two data files; `a2_var.tdb` (`d2_var.tdb`) containing the var-sized values of the attribute (dimension) and `a2.tdb` (`d2.tdb`) containing the starting offsets of each value in `a2_var.tdb` (`d2_var.rdb`).
* Any number of [data files](#data-file). For each fixed-sized attribute `a1` (or dimension `d1`), there is a single data file `a1.tdb` (`d1.tdb`) containing the values along this attribute (dimension). For every var-sized attribute `a2` (or dimensions `d2`), there are two data files; `a2_var.tdb` (`d2_var.tdb`) containing the var-sized values of the attribute (dimension) and `a2.tdb` (`d2.tdb`) containing the starting offsets of each value in `a2_var.tdb` (`d2_var.rdb`). Both fixed-sized and var-sized attributes can be nullable. A nullable attribute, `a3`, will have an additional file `a3_validity.tdb` that contains its validity vector.

## Fragment Metadata File

@@ -127,6 +127,7 @@ The footer is a simple blob \(i.e., _not a generic tile_\) with the following in
| Last tile cell num | `uint64_t` | For sparse arrays, the number of cells in the last tile in the fragment |
| File sizes | `uint64_t[]` | The size in bytes of each attribute/dimension file in the fragment. For var-length attributes/dimensions, this is the size of the offsets file. |
| File var sizes | `uint64_t[]` | The size in bytes of each var-length attribute/dimension file in the fragment. |
| File validity sizes | `uint64_t[]` | The size in bytes of each attribute/dimension validity vector file in the fragment. |
| R-Tree offset | `uint64_t` | The offset to the generic tile storing the R-Tree in the metadata file. |
| Tile offset for attribute/dimension 1 | `uint64_t` | The offset to the generic tile storing the tile offsets for attribute/dimension 1. |
||||
@@ -137,6 +138,9 @@ The footer is a simple blob \(i.e., _not a generic tile_\) with the following in
| Tile var sizes offset for attribute/dimension 1 | `uint64_t` | The offset to the generic tile storing the variable tile sizes for attribute/dimension 1. |
||||
| Tile var sizes offset for attribute/dimension N | `uint64_t` | The offset to the generic tile storing the variable tile sizes for attribute/dimension N. |
| Tile validity offset for attribute/dimension 1 | `uint64_t` | The offset to the generic tile storing the tile validity offsets for attribute/dimension 1. |
||||
| Tile validity offset for attribute/dimension N | `uint64_t` | The offset to the generic tile storing the tile validity offsets for attribute/dimension N |
| Footer length | `uint64_t` | Sum of bytes of the above fields. Only present when there is at least one var-sized dimension. |

## Data File
13 changes: 8 additions & 5 deletions test/CMakeLists.txt
@@ -93,25 +93,25 @@ set(TILEDB_TEST_SOURCES
src/unit-buffer.cc
src/unit-bufferlist.cc
src/unit-capi-any.cc
src/unit-capi-array.cc
src/unit-capi-array_schema.cc
src/unit-capi-async.cc
src/unit-capi-array.cc
src/unit-capi-buffer.cc
src/unit-capi-config.cc
src/unit-capi-consolidation.cc
src/unit-capi-dense_array.cc
src/unit-capi-dense_array_2.cc
src/unit-capi-dense_neg.cc
src/unit-capi-dense_vector.cc
src/unit-duplicates.cc
src/unit-capi-empty-var-length.cc
src/unit-capi-enum_values.cc
src/unit-capi-error.cc
src/unit-capi-fill_values.cc
src/unit-capi-filter.cc
src/unit-capi-incomplete.cc
src/unit-capi-incomplete-2.cc
src/unit-capi-incomplete.cc
src/unit-capi-metadata.cc
src/unit-capi-nullable.cc
src/unit-capi-object_mgmt.cc
src/unit-capi-query.cc
src/unit-capi-query_2.cc
@@ -120,12 +120,13 @@ set(TILEDB_TEST_SOURCES
src/unit-capi-sparse_neg.cc
src/unit-capi-sparse_neg_2.cc
src/unit-capi-sparse_real.cc
src/unit-capi-string_dims.cc
src/unit-capi-sparse_real_2.cc
src/unit-capi-string.cc
src/unit-capi-string_dims.cc
src/unit-capi-uri.cc
src/unit-capi-version.cc
src/unit-capi-vfs.cc
src/unit-duplicates.cc
src/unit-CellSlabIter.cc
src/unit-compression-dd.cc
src/unit-compression-rle.cc
@@ -156,6 +157,7 @@ set(TILEDB_TEST_SOURCES
src/unit-uri.cc
src/unit-utils.cc
src/unit-uuid.cc
src/unit-ValidityVector.cc
src/unit-vfs.cc
src/unit-win-filesystem.cc
src/unit.cc
@@ -166,13 +168,14 @@ if (TILEDB_CPP_API)
src/unit-cppapi-array.cc
src/unit-cppapi-checksum.cc
src/unit-cppapi-config.cc
src/unit-cppapi-consolidation.cc
src/unit-cppapi-consolidation-sparse.cc
src/unit-cppapi-consolidation.cc
src/unit-cppapi-datetimes.cc
src/unit-cppapi-fill_values.cc
src/unit-cppapi-filter.cc
src/unit-cppapi-hilbert.cc
src/unit-cppapi-metadata.cc
src/unit-cppapi-nullable.cc
src/unit-cppapi-query.cc
src/unit-cppapi-schema.cc
src/unit-cppapi-subarray.cc
42 changes: 21 additions & 21 deletions test/src/unit-ReadCellSlabIter.cc
@@ -445,9 +445,9 @@ TEST_CASE_METHOD(
.ok());
Tile tile_2_0(
Datatype::UINT64, sizeof(uint64_t), 0, &chunked_buffer_2_0, false);
auto tile_pair = result_tile_2_0.tile_pair("d");
REQUIRE(tile_pair != nullptr);
tile_pair->first = tile_2_0;
auto tile_tuple = result_tile_2_0.tile_tuple("d");
REQUIRE(tile_tuple != nullptr);
std::get<0>(*tile_tuple) = tile_2_0;

std::vector<uint64_t> vec_3_0 = {1000, 1000, 8, 9};
Buffer buff_3_0(&vec_3_0[0], vec_3_0.size() * sizeof(uint64_t));
@@ -457,9 +457,9 @@ TEST_CASE_METHOD(
.ok());
Tile tile_3_0(
Datatype::UINT64, sizeof(uint64_t), 0, &chunked_buffer_3_0, false);
tile_pair = result_tile_3_0.tile_pair("d");
REQUIRE(tile_pair != nullptr);
tile_pair->first = tile_3_0;
tile_tuple = result_tile_3_0.tile_tuple("d");
REQUIRE(tile_tuple != nullptr);
std::get<0>(*tile_tuple) = tile_3_0;

std::vector<uint64_t> vec_3_1 = {1000, 12, 19, 1000};
Buffer buff_3_1(&vec_3_1[0], vec_3_1.size() * sizeof(uint64_t));
@@ -469,9 +469,9 @@ TEST_CASE_METHOD(
.ok());
Tile tile_3_1(
Datatype::UINT64, sizeof(uint64_t), 0, &chunked_buffer_3_1, false);
tile_pair = result_tile_3_1.tile_pair("d");
REQUIRE(tile_pair != nullptr);
tile_pair->first = tile_3_1;
tile_tuple = result_tile_3_1.tile_tuple("d");
REQUIRE(tile_tuple != nullptr);
std::get<0>(*tile_tuple) = tile_3_1;

result_coords.emplace_back(&result_tile_2_0, 1);
result_coords.emplace_back(&result_tile_2_0, 3);
@@ -1271,9 +1271,9 @@ TEST_CASE_METHOD(
.ok());
Tile tile_3_0_d1(
Datatype::UINT64, sizeof(uint64_t), 0, &chunked_buffer_3_0_d1, false);
auto tile_pair = result_tile_3_0.tile_pair("d1");
REQUIRE(tile_pair != nullptr);
tile_pair->first = tile_3_0_d1;
auto tile_tuple = result_tile_3_0.tile_tuple("d1");
REQUIRE(tile_tuple != nullptr);
std::get<0>(*tile_tuple) = tile_3_0_d1;

std::vector<uint64_t> vec_3_0_d2 = {1000, 3, 1000, 1000};
Buffer buff_3_0_d2(&vec_3_0_d2[0], vec_3_0_d2.size() * sizeof(uint64_t));
@@ -1283,9 +1283,9 @@ TEST_CASE_METHOD(
.ok());
Tile tile_3_0_d2(
Datatype::UINT64, sizeof(uint64_t), 0, &chunked_buffer_3_0_d2, false);
tile_pair = result_tile_3_0.tile_pair("d2");
REQUIRE(tile_pair != nullptr);
tile_pair->first = tile_3_0_d2;
tile_tuple = result_tile_3_0.tile_tuple("d2");
REQUIRE(tile_tuple != nullptr);
std::get<0>(*tile_tuple) = tile_3_0_d2;

std::vector<uint64_t> vec_3_1_d1 = {5, 1000, 5, 1000};
Buffer buff_3_1_d1(&vec_3_1_d1[0], vec_3_1_d1.size() * sizeof(uint64_t));
@@ -1295,9 +1295,9 @@ TEST_CASE_METHOD(
.ok());
Tile tile_3_1_d1(
Datatype::UINT64, sizeof(uint64_t), 0, &chunked_buffer_3_1_d1, false);
tile_pair = result_tile_3_1.tile_pair("d1");
REQUIRE(tile_pair != nullptr);
tile_pair->first = tile_3_1_d1;
tile_tuple = result_tile_3_1.tile_tuple("d1");
REQUIRE(tile_tuple != nullptr);
std::get<0>(*tile_tuple) = tile_3_1_d1;

std::vector<uint64_t> vec_3_1_d2 = {5, 1000, 6, 1000};
Buffer buff_3_1_d2(&vec_3_1_d2[0], vec_3_1_d2.size() * sizeof(uint64_t));
@@ -1307,9 +1307,9 @@ TEST_CASE_METHOD(
.ok());
Tile tile_3_1_d2(
Datatype::UINT64, sizeof(uint64_t), 0, &chunked_buffer_3_1_d2, false);
tile_pair = result_tile_3_1.tile_pair("d2");
REQUIRE(tile_pair != nullptr);
tile_pair->first = tile_3_1_d2;
tile_tuple = result_tile_3_1.tile_tuple("d2");
REQUIRE(tile_tuple != nullptr);
std::get<0>(*tile_tuple) = tile_3_1_d2;

result_coords.emplace_back(&result_tile_3_0, 1);
result_coords.emplace_back(&result_tile_3_1, 0);
8 changes: 4 additions & 4 deletions test/src/unit-SubarrayPartitioner-dense.cc
@@ -268,7 +268,7 @@ void SubarrayPartitionerDenseFx::test_subarray_partitioner(
ThreadPool tp;
CHECK(tp.init(4).ok());
SubarrayPartitioner subarray_partitioner(
subarray, memory_budget_, memory_budget_var_, &tp);
subarray, memory_budget_, memory_budget_var_, 0, &tp);
auto st = subarray_partitioner.set_result_budget(attr.c_str(), budget);
CHECK(st.ok());

@@ -289,7 +289,7 @@ void SubarrayPartitionerDenseFx::test_subarray_partitioner(
ThreadPool tp;
CHECK(tp.init(4).ok());
SubarrayPartitioner subarray_partitioner(
subarray, memory_budget_, memory_budget_var_, &tp);
subarray, memory_budget_, memory_budget_var_, 0, &tp);

// Note: this is necessary, otherwise the subarray partitioner does
// not check if the memory budget is exceeded for attributes whose
@@ -301,7 +301,7 @@ void SubarrayPartitionerDenseFx::test_subarray_partitioner(
st = subarray_partitioner.set_result_budget("b", 1000000, 1000000);
CHECK(st.ok());

st = subarray_partitioner.set_memory_budget(budget, budget_var);
st = subarray_partitioner.set_memory_budget(budget, budget_var, 0);
CHECK(st.ok());

check_partitions(subarray_partitioner, partitions, unsplittable);
@@ -563,7 +563,7 @@ TEST_CASE_METHOD(
ThreadPool tp;
CHECK(tp.init(4).ok());
SubarrayPartitioner subarray_partitioner(
subarray, memory_budget_, memory_budget_var_, &tp);
subarray, memory_budget_, memory_budget_var_, 0, &tp);
auto st = subarray_partitioner.set_result_budget("a", 100 * sizeof(int));
CHECK(st.ok());
st = subarray_partitioner.set_result_budget("b", 1, 1);
10 changes: 5 additions & 5 deletions test/src/unit-SubarrayPartitioner-error.cc
@@ -144,7 +144,7 @@ TEST_CASE_METHOD(
ThreadPool tp;
CHECK(tp.init(4).ok());
SubarrayPartitioner subarray_partitioner(
subarray, memory_budget_, memory_budget_var_, &tp);
subarray, memory_budget_, memory_budget_var_, 0, &tp);
uint64_t budget, budget_off, budget_val;

auto st = subarray_partitioner.get_result_budget("a", &budget);
@@ -207,16 +207,16 @@ TEST_CASE_METHOD(
CHECK(st.ok());
CHECK(budget == 1000);

uint64_t memory_budget, memory_budget_var;
uint64_t memory_budget, memory_budget_var, memory_budget_validity;
st = subarray_partitioner.get_memory_budget(
&memory_budget, &memory_budget_var);
&memory_budget, &memory_budget_var, &memory_budget_validity);
CHECK(st.ok());
CHECK(memory_budget == memory_budget_);
CHECK(memory_budget_var == memory_budget_var_);
st = subarray_partitioner.set_memory_budget(16, 16);
st = subarray_partitioner.set_memory_budget(16, 16, 0);
CHECK(st.ok());
st = subarray_partitioner.get_memory_budget(
&memory_budget, &memory_budget_var);
&memory_budget, &memory_budget_var, &memory_budget_validity);
CHECK(st.ok());
CHECK(memory_budget == 16);
CHECK(memory_budget_var == 16);