-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Storage format specification improvements 2/N #5329
Conversation
21b77df
to
856bcd0
Compare
…o changes) and 2.
Specify which filters are used for each file, and break up a long bullet point.
90a08e5
to
b45e03f
Compare
@@ -154,7 +153,7 @@ Introduced in TileDB 1.6 | |||
* The [footer](./fragment.md#footer) and [R-Tree](./fragment.md#r-tree) structures were added. | |||
* The _Bounding coords_ field was removed. | |||
* The _MBRs_ field was removed. MBRs are now stored in the R-Tree. | |||
* Structures other than the footer like tile offsets, sizes and metadata are wrapped in their own generic tiles. This allows loading them lazily and in parallel. | |||
* Tile offsets and sizes are wrapped in their own generic tiles. This allows loading them lazily and in parallel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tile metadata did not exist back then.
format_spec/array_file_hierarchy.md
Outdated
* Inside the same commits folder, any number of [consolidated commits files](./consolidated_commits_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.con`. | ||
* Inside the same commits folder, any number of [ignore files](./ignore_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.ign`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consolidated commit files should have been gated behind a format version but they aren't.
@@ -35,6 +35,8 @@ The “chunk filtered data” bytes contain the final bytes of the chunk after b | |||
|
|||
Internally, any filter in a filter pipeline produces two arrays of data as output: a metadata byte array and a filtered data byte array. Additionally, these output byte arrays can be arbitrarily separated into “parts” by any filter. Typically, when a next filter receives the output of the previous filter as its input, it will filter each “part” independently. | |||
|
|||
_New in version 11_ Cells of arrays are not split across chunks within a tile. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure if this is format-breaking or not. There was some discussion in #5205 (comment).
Supporting splitting cells across chunks would enable some interesting optimizations. Consider for example storing two strings larger than the chunk size and one tiny string in between. They will be forced to be stored in three chunks. Maybe that's what we want for the RLE/dictionary filters, but then we can amend it to say "Cells of arrays are not split across chunks within a tile when using the RLE or dictionary encoding filter".
@@ -72,7 +72,7 @@ Introduced in TileDB 2.10 | |||
|
|||
Introduced in TileDB 2.9 | |||
|
|||
* The [dictionary filter](./filters/dictionary_encoding.md) was added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filters are not "added" in a version so we have to reword this.
@@ -80,6 +80,9 @@ For the `TILEDB_FILTER_DELTA` and `TILEDB_FILTER_DOUBLE_DELTA` compression filte | |||
| Compression level | `int32_t` | Ignored | | |||
| Reinterpret datatype | `uint8_t` | Type to reinterpret data prior to compression. | | |||
|
|||
> [!NOTE] | |||
> Prior to version 20, the _Reinterpret datatype_ field was not present for the double delta filter. Also prior to version 19, the same field was not present for the delta filter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The delta filter did not "exist" prior to version 19, but as said above filters are not introduced in specific format versions so we are specific that the filter option was introduced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it maybe be better to phrase this like this instead:
The Reinterpret datatype field was added to the delta filter in version 19 and the double delta filter in version 20.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference is to document the latest version in the main specification1 and mention behavior in prior versions in notes like this. In other words the spec has been consistently using "Prior to version X, Y did not exist", instead of "Y was added in version X".
Footnotes
-
An exception is simple additions and removals where I used the "New/Removed in version **" phrasing. Adding a "New in version 19 for the delta filter and new in version 20 for the double delta filter" would be too much for an inline comment. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall. Great work! Just a few suggestions.
* Inside the same commit folder, any number of [consolidated commits files](./consolidated_commits_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.con`. | ||
* Inside the same commit folder, any number of [ignore files](./ignore_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.ign`. | ||
* Inside of a fragment metadata folder, any number of [consolidated fragment metadata files](./consolidated_fragment_metadata_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.meta`. | ||
* Inside of a `__schema` folder, any number of [array schema files](./array_schema.md) [`<timestamped_name>`](./timestamped_name.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Inside of a `__schema` folder, any number of [array schema files](./array_schema.md) [`<timestamped_name>`](./timestamped_name.md). | |
* Inside the `__schema` folder, any number of [array schema files](./array_schema.md) [`<timestamped_name>`](./timestamped_name.md). |
There's only one __schema
folder per array, as far as I know.
* Inside of a `__schema` folder, any number of [array schema files](./array_schema.md) [`<timestamped_name>`](./timestamped_name.md). | ||
* Note: the name does _not_ include the format version. | ||
* _New in version 20_ Inside of the schema folder, an enumerations folder `__enumerations`. | ||
* Inside of a `__fragments` folder, any number of [fragment folders](./fragment.md) [`<timestamped_name>`](./timestamped_name.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Inside of a `__fragments` folder, any number of [fragment folders](./fragment.md) [`<timestamped_name>`](./timestamped_name.md). | |
* Inside the `__fragments` folder, any number of [fragment folders](./fragment.md) [`<timestamped_name>`](./timestamped_name.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using "the" might imply that the existence of the folder is required. There cannot be empty folders in cloud object storage which means that an array with no fragments yet written will not have a __fragments
folder.
format_spec/array_file_hierarchy.md
Outdated
* Note: the name does _not_ include the format version. | ||
* _New in version 20_ Inside of the schema folder, an enumerations folder `__enumerations`. | ||
* Inside of a `__fragments` folder, any number of [fragment folders](./fragment.md) [`<timestamped_name>`](./timestamped_name.md). | ||
* _New in version 12_ Inside of a `__commits` folder, an empty file [`<timestamped_name>`](./timestamped_name.md)`.wrt` associated with every fragment folder [`<timestamped_name>`](./timestamped_name.md), where [`<timestamped_name>`](./timestamped_name.md) is common for the folder and the WRT file. This is used to indicate that fragment [`<timestamped_name>`](./timestamped_name.md) has been *committed* (i.e., its write process finished successfully) and it is ready for use by TileDB. If the WRT file does not exist, the corresponding fragment folder is ignored by TileDB during the reads. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* _New in version 12_ Inside of a `__commits` folder, an empty file [`<timestamped_name>`](./timestamped_name.md)`.wrt` associated with every fragment folder [`<timestamped_name>`](./timestamped_name.md), where [`<timestamped_name>`](./timestamped_name.md) is common for the folder and the WRT file. This is used to indicate that fragment [`<timestamped_name>`](./timestamped_name.md) has been *committed* (i.e., its write process finished successfully) and it is ready for use by TileDB. If the WRT file does not exist, the corresponding fragment folder is ignored by TileDB during the reads. | |
* _New in version 12_ Inside the `__commits` folder lives an empty file [`<timestamped_name>`](./timestamped_name.md)`.wrt` associated with every fragment folder [`<timestamped_name>`](./timestamped_name.md), where [`<timestamped_name>`](./timestamped_name.md) is common for the folder and the WRT file. This is used to indicate that fragment [`<timestamped_name>`](./timestamped_name.md) has been *committed* (i.e., its write process finished successfully) and it is ready for use by TileDB. If the WRT file does not exist, the corresponding fragment folder is ignored by TileDB during the reads. |
* _New in version 18_ Inside of a `__labels` folder, additional TileDB arrays storing dimension label data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should sort the file hierarchy updates by version (e.g., 20, followed by 18, followed by 16, followed by 12), similar to how format_spec/array_format_history.md
is constructed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorted by version (with unversioned items being put first because they are the most important) and grouped by directory hierarchy. I also reworded and simplified some items.
@@ -80,6 +80,9 @@ For the `TILEDB_FILTER_DELTA` and `TILEDB_FILTER_DOUBLE_DELTA` compression filte | |||
| Compression level | `int32_t` | Ignored | | |||
| Reinterpret datatype | `uint8_t` | Type to reinterpret data prior to compression. | | |||
|
|||
> [!NOTE] | |||
> Prior to version 20, the _Reinterpret datatype_ field was not present for the double delta filter. Also prior to version 19, the same field was not present for the delta filter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it maybe be better to phrase this like this instead:
The Reinterpret datatype field was added to the delta filter in version 19 and the double delta filter in version 20.
Co-authored-by: Nick Vigilante <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
SC-54621
Continuing #5205, this PR goes through each format version and mentions the changes to the main specification document. It also documents structures like
__coords.tdb
and legacy fragment metadata, and fixes any defects encountered in the meantime.TYPE: FORMAT
DESC: The storage format specification was updated to document format changes of previous versions throughout the main document.