Storage format specification improvements 2/N #5329
Changes from all commits
d01e956
574aee5
9927daf
8d10390
b439208
fd8002b
477ecd6
7463d3f
ff806c5
bedf0e9
d1788c7
0ee6375
ae4fd0c
b45e03f
8db322c
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,6 +7,9 @@ An array is a folder with the following structure: | |
``` | ||
my_array # array folder | ||
|_ __schema # array schema folder | ||
|_ <timestamp_name> # array schema files | ||
|_ ... | ||
|_ __enumerations # array enumerations folder | ||
|_ __fragments # array fragments folder | ||
|_ <timestamped_name> # fragment folder | ||
|_ ... | ||
|
@@ -22,23 +25,39 @@ my_array # array folder | |
|_ <timestamped_name>.con # consolidated commits file | ||
|_ ... | ||
|_ <timestamped_name>.ign # ignore file for consolidated commits file | ||
|_ __fragment_meta | ||
|_ <timestamped_name>.meta # consol. fragment meta file | ||
|_ ... | ||
|_ __fragment_meta # consolidated fragment metadata folder | ||
|_ <timestamped_name>.meta # consolidated fragment meta file | ||
|_ ... | ||
|_ __meta # array metadata folder | ||
|_ __labels # dimension label folder | ||
|
||
|_ <timestamped_name> # legacy fragment folder | ||
|_ ... | ||
|_ <timestamped_name>.ok # legacy fragment write file | ||
|_ <timestamped_name>.meta # legacy consolidated fragment meta file | ||
|_ __array_schema.tdb # legacy array schema file | ||
``` | ||
|
||
Inside the array folder, you can find the following: | ||
|
||
* [Array schema](./array_schema.md) folder `__schema`. | ||
* Inside of a fragments folder, any number of [fragment folders](./fragment.md) [`<timestamped_name>`](./timestamped_name.md). | ||
* Inside of a commit folder, an empty file [`<timestamped_name>`](./timestamped_name.md)`.wrt` associated with every fragment folder [`<timestamped_name>`](./timestamped_name.md), where [`<timestamped_name>`](./timestamped_name.md) is common for the folder and the WRT file. This is used to indicate that fragment [`<timestamped_name>`](./timestamped_name.md) has been *committed* (i.e., its write process finished successfully) and it is ready for use by TileDB. If the WRT file does not exist, the corresponding fragment folder is ignored by TileDB during the reads. | ||
* Inside the same commit folder, any number of [delete commit files](./delete_commit_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.del`. | ||
* Inside the same commit folder, any number of [update commit files](./update_commit_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.upd`. | ||
* Inside the same commit folder, any number of [consolidated commits files](./consolidated_commits_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.con`. | ||
* Inside the same commit folder, any number of [ignore files](./ignore_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.ign`. | ||
* Inside of a fragment metadata folder, any number of [consolidated fragment metadata files](./consolidated_fragment_metadata_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.meta`. | ||
* [Array metadata](./metadata.md) folder `__meta`. | ||
* Inside of a labels folder, additional TileDB arrays storing dimension label data. | ||
* Inside of a `__schema` folder, any number of [array schema files](./array_schema.md) [`<timestamped_name>`](./timestamped_name.md). | ||
* **Note**: the name does _not_ include the format version. | ||
* _New in version 20_ Inside of the schema folder, an enumerations folder `__enumerations`. | ||
* Inside of a `__meta` folder, any number of [array metadata files](./metadata.md) [`<timestamped_name>`](./timestamped_name.md). | ||
* Inside of a `__fragments` folder, any number of [fragment folders](./fragment.md) [`<timestamped_name>`](./timestamped_name.md). | ||
Review comment: We should sort the file hierarchy updates by version (e.g., 20, followed by 18, followed by 16, followed by 12), similar to how
Review comment: Sorted by version (with unversioned items being put first because they are the most important) and grouped by directory hierarchy. I also reworded and simplified some items.
||
* _New in version 18_ Inside of a `__labels` folder, additional TileDB arrays storing dimension label data. | ||
* _New in version 12_ Inside of a `__commits` folder: | ||
* Any number of empty files [`<timestamped_name>`](./timestamped_name.md)`.wrt`, each associated with fragment folder [`<timestamped_name>`](./timestamped_name.md), indicating that the fragment has been *committed* (i.e., its write process finished successfully). If the WRT file does not exist, the corresponding fragment must be ignored when reading the array. | ||
* Any number of [consolidated commits files](./consolidated_commits_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.con`. | ||
* Any number of [ignore files](./ignore_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.ign`. | ||
* _New in version 16_ Any number of [delete commit files](./delete_commit_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.del`. | ||
* _New in version 16_ Any number of [update commit files](./update_commit_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.upd`. | ||
* _New in version 12_ Inside of a `__fragment_meta` folder, any number of [consolidated fragment metadata files](./consolidated_fragment_metadata_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.meta`. | ||
|
||
> [!NOTE] | ||
> Prior to version 12, fragments, commit files, and consolidated fragment metadata were stored directly in the array folder and the extension of commit files was `.ok` instead of `.wrt`. Implementations must support arrays that contain data in both the old and the new hierarchy at the same time. | ||
|
||
> [!NOTE] | ||
> Prior to version 10, the array schema was stored in a single `__array_schema.tdb` file in the array folder. Implementations must support arrays that contain both `__array_schema.tdb` and schemas in the `__schema` folder at the same time. For the purpose of array schema evolution, the timestamp of `__array_schema.tdb` must be considered to be earlier than any schema in the `__schema` folder. | ||
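The ordering rule in the note above can be sketched as follows. This is a minimal illustration, not the TileDB implementation: the helper names `start_timestamp` and `list_schemas_in_order` are hypothetical, and the parser assumes a timestamped name of the shape `__<t1>_<t2>_<uuid>`, which this page does not define (see `timestamped_name.md` for the actual format).

```python
from pathlib import Path

LEGACY_SCHEMA = "__array_schema.tdb"


def start_timestamp(name: str) -> int:
    # Assumption: the name looks like "__<t1>_<t2>_<uuid>" and <t1> is the
    # timestamp to order by. Adjust to the real timestamped-name format.
    return int(name.lstrip("_").split("_")[0])


def list_schemas_in_order(array_dir):
    """Return schema files from oldest to newest (hypothetical helper)."""
    array_dir = Path(array_dir)
    ordered = []

    # The legacy single-file schema counts as earlier than anything
    # stored in the __schema folder, so it always comes first.
    legacy = array_dir / LEGACY_SCHEMA
    if legacy.is_file():
        ordered.append(legacy)

    schema_dir = array_dir / "__schema"
    if schema_dir.is_dir():
        # Skip sub-folders such as __enumerations; only files are schemas.
        files = [p for p in schema_dir.iterdir() if p.is_file()]
        ordered.extend(
            sorted(files, key=lambda p: (start_timestamp(p.name), p.name))
        )
    return ordered
```

An array that holds both a legacy schema and evolved schemas would then yield `__array_schema.tdb` first, followed by the `__schema` entries in timestamp order.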
|
||
> [!NOTE] | ||
> Prior to version 5, commit files were not written. Fragments of these versions are considered to be committed if their corresponding fragment metadata file exists. |
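Taken together, the three notes above imply a version-dependent check for whether a fragment is committed. The sketch below is one way to express it under stated assumptions: `is_committed` is a hypothetical helper, the caller supplies the fragment's format version, and the fragment metadata file name `__fragment_metadata.tdb` is an assumption not stated on this page. It also ignores consolidated commits files (`.con`) and ignore files (`.ign`), which a full reader would have to consult as well.

```python
from pathlib import Path

# Assumption: name of the per-fragment metadata file; not given in this page.
FRAGMENT_METADATA = "__fragment_metadata.tdb"


def is_committed(array_dir, fragment_name: str, format_version: int) -> bool:
    """Simplified commit check for one fragment (hypothetical helper)."""
    array_dir = Path(array_dir)

    if format_version >= 12:
        # New hierarchy: an empty <name>.wrt file inside __commits.
        return (array_dir / "__commits" / f"{fragment_name}.wrt").is_file()

    if format_version >= 5:
        # Legacy hierarchy: an empty <name>.ok file directly in the array folder.
        return (array_dir / f"{fragment_name}.ok").is_file()

    # Before version 5 no commit files were written; the fragment counts as
    # committed if its fragment metadata file exists.
    return (array_dir / fragment_name / FRAGMENT_METADATA).is_file()
```

Because an array may contain data in both hierarchies at once, a reader would run this check per fragment rather than once per array.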
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,8 +1,8 @@ | ||
--- | ||
title: Format version history | ||
title: Array format version history | ||
--- | ||
|
||
# Format Version History | ||
# Array Format Version History | ||
|
||
## Version 22 | ||
|
||
|
@@ -24,7 +24,7 @@ Introduced in TileDB 2.19 | |
Introduced in TileDB 2.17 | ||
|
||
* Arrays can have [enumerations](./enumeration.md). | ||
* The bit-width reduction and positive delta filters are supported on data of date or time types. | ||
* The bit-width reduction and positive delta encoding filters are supported on data of date or time types. | ||
* The [filter pipeline options](./filter_pipeline.md#filter-options) for the double-delta filter contain the _Reinterpret datatype_ field. | ||
|
||
## Version 19 | ||
|
@@ -45,7 +45,7 @@ Introduced in TileDB 2.15 | |
Introduced in TileDB 2.14 | ||
|
||
* The _Order_ field was added to [attributes](./array_schema.md#attribute). | ||
* Cell offsets in dimensions or attributes of UTF-8 string type are not written in the offset tiles, if the RLE or dictionary filter exists in the filter pipeline. They are instead encoded as part of the data tile. | ||
* Cell offsets in dimensions or attributes of UTF-8 string type are not written in the offset tiles, if the RLE or dictionary encoding filter exists in the filter pipeline. They are instead encoded as part of the data tile. | ||
|
||
## Version 16 | ||
|
||
|
@@ -72,7 +72,7 @@ Introduced in TileDB 2.10 | |
|
||
Introduced in TileDB 2.9 | ||
|
||
* The [dictionary filter](./filters/dictionary_encoding.md) was added. | ||
Review comment: Filters are not "added" in a version so we have to reword this.
||
* Cell offsets in dimensions or attributes of ASCII string type are not written in the offset tiles, if the dictionary encoding filter exists in the filter pipeline. They are instead encoded as part of the data tile. | ||
|
||
## Version 12 | ||
|
||
|
@@ -86,7 +86,7 @@ Introduced in TileDB 2.8 | |
|
||
Introduced in TileDB 2.7 | ||
|
||
* Fragment metadata contain [metadata](./fragment.md#tile-mins-maxes) (min/max value, sum, null count) for each tile. | ||
* Fragment metadata contain [metadata](./fragment.md#tile-mins-maxes) (min/max value, sum, null count) for data in the whole fragment and each tile. | ||
* The TileDB implementation has been updated to never split cells when storing them in chunks. | ||
|
||
## Version 10 | ||
|
@@ -154,7 +154,7 @@ Introduced in TileDB 1.6 | |
* The [footer](./fragment.md#footer) and [R-Tree](./fragment.md#r-tree) structures were added. | ||
* The _Bounding coords_ field was removed. | ||
* The _MBRs_ field was removed. MBRs are now stored in the R-Tree. | ||
* Structures other than the footer like tile offsets, sizes and metadata are wrapped in their own generic tiles. This allows loading them lazily and in parallel. | ||
* Tile offsets and sizes are wrapped in their own generic tiles. This allows loading them lazily and in parallel. | ||
Review comment: Tile metadata did not exist back then.
||
|
||
## Version 2 | ||
|
||
|
Review comment: Using "the" might imply that the existence of the folder is required. There cannot be empty folders in cloud object storage, which means that an array with no fragments yet written will not have a `__fragments` folder.