Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storage format specification improvements 2/N #5329

Merged
merged 15 commits into from
Oct 25, 2024

Conversation

teo-tsirpanis
Copy link
Member

@teo-tsirpanis teo-tsirpanis commented Sep 30, 2024

SC-54621

Continuing #5205, this PR goes through each format version and mentions the changes to the main specification document. It also documents structures like __coords.tdb and legacy fragment metadata, and fixes any defects encountered in the meantime.


TYPE: FORMAT
DESC: The storage format specification was updated to document format changes of previous versions throughout the main document.

@teo-tsirpanis teo-tsirpanis force-pushed the teo/format-spec-updates branch from 21b77df to 856bcd0 Compare October 4, 2024 16:56
@teo-tsirpanis teo-tsirpanis force-pushed the teo/format-spec-updates branch from 90a08e5 to b45e03f Compare October 9, 2024 11:39
format_spec/current_domain.md Outdated Show resolved Hide resolved
format_spec/array_schema.md Show resolved Hide resolved
@@ -154,7 +153,7 @@ Introduced in TileDB 1.6
* The [footer](./fragment.md#footer) and [R-Tree](./fragment.md#r-tree) structures were added.
* The _Bounding coords_ field was removed.
* The _MBRs_ field was removed. MBRs are now stored in the R-Tree.
* Structures other than the footer like tile offsets, sizes and metadata are wrapped in their own generic tiles. This allows loading them lazily and in parallel.
* Tile offsets and sizes are wrapped in their own generic tiles. This allows loading them lazily and in parallel.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tile metadata did not exist back then.

Comment on lines 49 to 50
* Inside the same commits folder, any number of [consolidated commits files](./consolidated_commits_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.con`.
* Inside the same commits folder, any number of [ignore files](./ignore_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.ign`.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consolidated commit files should have been gated behind a format version but they aren't.

@@ -35,6 +35,8 @@ The “chunk filtered data” bytes contain the final bytes of the chunk after b

Internally, any filter in a filter pipeline produces two arrays of data as output: a metadata byte array and a filtered data byte array. Additionally, these output byte arrays can be arbitrarily separated into “parts” by any filter. Typically, when a next filter receives the output of the previous filter as its input, it will filter each “part” independently.

_New in version 11_ Cells of arrays are not split across chunks within a tile.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure if this is format-breaking or not. There was some discussion in #5205 (comment).

Supporting splitting cells across chunks would enable some interesting optimizations. Consider for example storing two strings larger than the chunk size and one tiny string in between. They will be forced to be stored in three chunks. Maybe that's what we want for the RLE/dictionary filters, but then we can amend it to say "Cells of arrays are not split across chunks within a tile when using the RLE or dictionary encoding filter".

@@ -72,7 +72,7 @@ Introduced in TileDB 2.10

Introduced in TileDB 2.9

* The [dictionary filter](./filters/dictionary_encoding.md) was added.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filters are not "added" in a version so we have to reword this.

@@ -80,6 +80,9 @@ For the `TILEDB_FILTER_DELTA` and `TILEDB_FILTER_DOUBLE_DELTA` compression filte
| Compression level | `int32_t` | Ignored |
| Reinterpret datatype | `uint8_t` | Type to reinterpret data prior to compression. |

> [!NOTE]
> Prior to version 20, the _Reinterpret datatype_ field was not present for the double delta filter. Also prior to version 19, the same field was not present for the delta filter.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The delta filter did not "exist" prior to version 19, but as said above filters are not introduced in specific format versions so we are specific that the filter option was introduced.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it maybe be better to phrase this like this instead:

The Reinterpret datatype field was added to the delta filter in version 19 and the double delta filter in version 20.

Copy link
Member Author

@teo-tsirpanis teo-tsirpanis Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference is to document the latest version in the main specification1 and mention behavior in prior versions in notes like this. In other words the spec has been consistently using "Prior to version X, Y did not exist", instead of "Y was added in version X".

Footnotes

  1. An exception is simple additions and removals where I used the "New/Removed in version **" phrasing. Adding a "New in version 19 for the delta filter and new in version 20 for the double delta filter" would be too much for an inline comment.

@teo-tsirpanis teo-tsirpanis marked this pull request as ready for review October 9, 2024 12:02
@teo-tsirpanis teo-tsirpanis requested review from ihnorton and a team October 9, 2024 12:02
Copy link
Contributor

@nickvigilante nickvigilante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. Great work! Just a few suggestions.

* Inside the same commit folder, any number of [consolidated commits files](./consolidated_commits_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.con`.
* Inside the same commit folder, any number of [ignore files](./ignore_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.ign`.
* Inside of a fragment metadata folder, any number of [consolidated fragment metadata files](./consolidated_fragment_metadata_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.meta`.
* Inside of a `__schema` folder, any number of [array schema files](./array_schema.md) [`<timestamped_name>`](./timestamped_name.md).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Inside of a `__schema` folder, any number of [array schema files](./array_schema.md) [`<timestamped_name>`](./timestamped_name.md).
* Inside the `__schema` folder, any number of [array schema files](./array_schema.md) [`<timestamped_name>`](./timestamped_name.md).

There's only one __schema folder per array, as far as I know.

format_spec/array_file_hierarchy.md Outdated Show resolved Hide resolved
* Inside of a `__schema` folder, any number of [array schema files](./array_schema.md) [`<timestamped_name>`](./timestamped_name.md).
* Note: the name does _not_ include the format version.
* _New in version 20_ Inside of the schema folder, an enumerations folder `__enumerations`.
* Inside of a `__fragments` folder, any number of [fragment folders](./fragment.md) [`<timestamped_name>`](./timestamped_name.md).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Inside of a `__fragments` folder, any number of [fragment folders](./fragment.md) [`<timestamped_name>`](./timestamped_name.md).
* Inside the `__fragments` folder, any number of [fragment folders](./fragment.md) [`<timestamped_name>`](./timestamped_name.md).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using "the" might imply that the existence of the folder is required. There cannot be empty folders in cloud object storage which means that an array with no fragments yet written will not have a __fragments folder.

* Note: the name does _not_ include the format version.
* _New in version 20_ Inside of the schema folder, an enumerations folder `__enumerations`.
* Inside of a `__fragments` folder, any number of [fragment folders](./fragment.md) [`<timestamped_name>`](./timestamped_name.md).
* _New in version 12_ Inside of a `__commits` folder, an empty file [`<timestamped_name>`](./timestamped_name.md)`.wrt` associated with every fragment folder [`<timestamped_name>`](./timestamped_name.md), where [`<timestamped_name>`](./timestamped_name.md) is common for the folder and the WRT file. This is used to indicate that fragment [`<timestamped_name>`](./timestamped_name.md) has been *committed* (i.e., its write process finished successfully) and it is ready for use by TileDB. If the WRT file does not exist, the corresponding fragment folder is ignored by TileDB during the reads.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* _New in version 12_ Inside of a `__commits` folder, an empty file [`<timestamped_name>`](./timestamped_name.md)`.wrt` associated with every fragment folder [`<timestamped_name>`](./timestamped_name.md), where [`<timestamped_name>`](./timestamped_name.md) is common for the folder and the WRT file. This is used to indicate that fragment [`<timestamped_name>`](./timestamped_name.md) has been *committed* (i.e., its write process finished successfully) and it is ready for use by TileDB. If the WRT file does not exist, the corresponding fragment folder is ignored by TileDB during the reads.
* _New in version 12_ Inside the `__commits` folder lives an empty file [`<timestamped_name>`](./timestamped_name.md)`.wrt` associated with every fragment folder [`<timestamped_name>`](./timestamped_name.md), where [`<timestamped_name>`](./timestamped_name.md) is common for the folder and the WRT file. This is used to indicate that fragment [`<timestamped_name>`](./timestamped_name.md) has been *committed* (i.e., its write process finished successfully) and it is ready for use by TileDB. If the WRT file does not exist, the corresponding fragment folder is ignored by TileDB during the reads.

* _New in version 18_ Inside of a `__labels` folder, additional TileDB arrays storing dimension label data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should sort the file hierarchy updates by version (e.g., 20, followed by 18, followed by 16, followed by 12), similar to how format_spec/array_format_history.md is constructed.

Copy link
Member Author

@teo-tsirpanis teo-tsirpanis Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorted by version (with unversioned items being put first because they are the most important) and grouped by directory hierarchy. I also reworded and simplified some items.

format_spec/array_schema.md Outdated Show resolved Hide resolved
format_spec/current_domain.md Outdated Show resolved Hide resolved
@@ -80,6 +80,9 @@ For the `TILEDB_FILTER_DELTA` and `TILEDB_FILTER_DOUBLE_DELTA` compression filte
| Compression level | `int32_t` | Ignored |
| Reinterpret datatype | `uint8_t` | Type to reinterpret data prior to compression. |

> [!NOTE]
> Prior to version 20, the _Reinterpret datatype_ field was not present for the double delta filter. Also prior to version 19, the same field was not present for the delta filter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it maybe be better to phrase this like this instead:

The Reinterpret datatype field was added to the delta filter in version 19 and the double delta filter in version 20.

format_spec/fragment.md Outdated Show resolved Hide resolved
format_spec/fragment.md Outdated Show resolved Hide resolved
Co-authored-by: Nick Vigilante <[email protected]>
@ihnorton ihnorton requested a review from davisp October 17, 2024 14:29
Copy link
Contributor

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@teo-tsirpanis teo-tsirpanis merged commit 6564bd7 into dev Oct 25, 2024
4 checks passed
@teo-tsirpanis teo-tsirpanis deleted the teo/format-spec-updates branch October 25, 2024 13:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants