docs: add BucketFile proposal #1567
Conversation
Nice, I am back from holidays, so will take a proper look today/this week @zhulongcheng ! Thanks!
Lots of things to read and understand but some initial comments/questions (that might not make sense)
`docs/proposals/201909_bucketfile.md` (outdated):

> Each BucketFile has a `pages` bitmap, but the bitmap uses very little memory.
>
> For inactive IndexReaders, if the inactive duration (`now.Sub(lastQueryTime)`) is greater than a configured duration (e.g. 15 mins), then the IndexReader will remove in-memory objects and close the file descriptor and mmap in IndexFile. This will reduce memory usage.
Who will actually track this time?
IndexReaders and ChunksReader.
there will be separate tracking for those two?
Update: BucketFile will track the time.

IndexReaders hold decoded postings offsets and label offsets in memory, so IndexReaders also need to track the time. If an IndexReader is inactive, it will remove its in-memory postings offsets and label offsets.

ChunksReader does not need to track it.
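To make the tracking concrete, here is a minimal Go sketch of the idea; the type, field and method names below are illustrative assumptions, not the actual Thanos API:

```go
package store

import (
	"sync"
	"time"
)

// postingOffset is a hypothetical decoded entry from the postings offset table.
type postingOffset struct {
	value string
	off   uint64
}

// IndexReader holds decoded offsets and the last query time.
type IndexReader struct {
	mtx             sync.Mutex
	lastQueryTime   time.Time
	postingsOffsets map[string][]postingOffset // decoded, in-memory
	labelOffsets    map[string]uint64          // decoded, in-memory
}

// touch records a query so the reader counts as active.
func (r *IndexReader) touch() {
	r.mtx.Lock()
	r.lastQueryTime = time.Now()
	r.mtx.Unlock()
}

// releaseIfInactive drops the decoded offsets when the reader has not been
// queried for longer than maxInactive (e.g. 15 minutes). The underlying
// BucketFile would close its file descriptor and munmap separately.
func (r *IndexReader) releaseIfInactive(maxInactive time.Duration) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	if time.Since(r.lastQueryTime) > maxInactive {
		r.postingsOffsets = nil
		r.labelOffsets = nil
	}
}
```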
`docs/proposals/201909_bucketfile.md` (outdated):

> For inactive IndexReaders, if the inactive duration (`now.Sub(lastQueryTime)`) is greater than a configured duration (e.g. 15 mins), then the IndexReader will remove in-memory objects and close the file descriptor and mmap in IndexFile. This will reduce memory usage.
>
> **NOTE**: The `pages` bitmap in BucketFile should always be kept in memory until the server is closed.
Could you elaborate more on this? Why do we want this if files might disappear?
Inactive IndexReaders just close file descriptors; they do not remove local files. Inactive IndexReaders may be queried again. Keeping the `pages` bitmap in memory lets us reuse the content that has already been fetched.
What we miss from the design is a human description of what the `pages` bitmap is, how large it is, and what the use case is. Can't see any description now. ):
Use the Roaring bitmaps implementation. Compressed size: ~2.6 bits per int.

1 TiB = 1 * 1024 * 1024 * 1024 KiB = 1073741824 KiB
4 KiB pages: 1073741824 / 4 = 268435456 pages
bitmap size: 268435456 * 3 bits = 805306368 bits = 805306368 / 8 / 1024 / 1024 = 96 MiB

So if we write 1 TiB to local files, the total size of the bitmaps is about 96 MiB.
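For illustration, the same back-of-the-envelope estimate as a tiny Go program; the 4 KiB page size and the 3 bits per entry (rounded up from ~2.6) are the assumptions stated above:

```go
package main

import "fmt"

func main() {
	const (
		dataBytes    = int64(1) << 40 // 1 TiB written to local files
		pageSize     = int64(4) << 10 // 4 KiB pages
		bitsPerEntry = int64(3)       // rounded up from ~2.6 bits/int (Roaring)
	)
	pages := dataBytes / pageSize           // 268435456 pages
	bitmapBytes := pages * bitsPerEntry / 8 // 100663296 bytes
	fmt.Printf("pages=%d bitmap=%d MiB\n", pages, bitmapBytes/(1<<20)) // 96 MiB
}
```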
If we write many bytes (e.g. 100 TiB) to local files, the bitmaps will use ~10 GiB of memory, so I think we also need to drop inactive bitmaps. ):
> Currently, the store gateway loads all [index.cache.json](https://github.com/thanos-io/thanos/blob/865d5ec710b5651ead08a7dc73864a6f5dbeeaa9/pkg/block/index.go#L40) files into memory on startup, which easily hits OOM situations. On the other hand, some indexes are rarely or never queried, so we don't have to keep these indexes in memory.
>
> The store gateway fetches chunks from object storage for each query request, but fetched chunks are not reused for subsequent requests.
If we mmap everything and create sparse files, then won't this technically mean that there is kind of an assumption that we have "infinite" RAM and address space? How will BucketFile ensure that we won't pass a certain threshold? What will we do in situations under memory pressure? I think that ATM having a common pool of RAM for responses serves us well as a capacity planning mechanism, but this could potentially mean that someone will just ask for random data from a bucket that could be terabytes in size, and we will hold all of that in memory?
> ask for random data from a bucket that could be terabytes in size, and we will hold all of that in memory?

We will not hold all of that in memory. BucketFile will read a small amount of data (e.g. 1 MiB) from the io.Reader and write it to a local file each time, until the io.Reader returns EOF.

> there is kind of an assumption that we have "infinite" RAM and address space?

If we fetch many blocks (> 1000 blocks) from object storage, it is possible that the size of all mmaps exceeds the virtual address space (e.g. 128 TiB) of the store gateway process. I think Thanos Sharding is a good way to resolve this.
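As a rough illustration of that streaming write (a sketch only; the function, its parameters, and the sparse-file layout are assumptions, not the proposed API):

```go
package store

import (
	"io"
	"os"
)

// writeRangeToFile is a sketch of how BucketFile could spool a fetched object
// range into a local (sparse) file in 1 MiB steps, so only the copy buffer is
// held in memory at any time.
func writeRangeToFile(src io.Reader, dst *os.File, offset int64) error {
	buf := make([]byte, 1<<20) // 1 MiB copy buffer
	for {
		n, err := src.Read(buf)
		if n > 0 {
			if _, werr := dst.WriteAt(buf[:n], offset); werr != nil {
				return werr
			}
			offset += int64(n)
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}
```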
Yeah but typically resilient systems gracefully shed their load, no? I would rather have Thanos Store unmmap some files instead of crashing in the middle of the night when someone queries a bunch of disparate data. I am fairly confident that we should probably have some kind of strategy on how to deal with this kind of situation instead of crashing and telling our users that maybe they should split up the Thanos Store nodes.
Fully agree with @GiedriusS we need to think about that case.
Add a helper struct MmapTracker to manage the total size of all mmaps. The MmapTracker is a global object for the store gateway.

For each BucketFile:

- call `MmapTracker.Allocate(size)` before mmapping a file
- call `MmapTracker.Free(size)` after unmapping a file

`MmapTracker.Allocate` returns an error if the current size is greater than the configurable `MaxMmapSize`.

How about this way? @bwplotka @GiedriusS
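A minimal sketch of such a tracker, assuming the names mentioned above (metrics, eviction, and per-request accounting are deliberately left out):

```go
package store

import (
	"fmt"
	"sync"
)

// MmapTracker is a sketch of the global mmap accounting described above.
// Method names follow the comment; the real design may differ.
type MmapTracker struct {
	mtx     sync.Mutex
	current int64
	max     int64 // MaxMmapSize, configurable
}

func NewMmapTracker(max int64) *MmapTracker { return &MmapTracker{max: max} }

// Allocate reserves size bytes before a file is mmapped. It fails when the
// global limit would be exceeded, so the caller can fall back (or shed load)
// instead of growing without bound.
func (t *MmapTracker) Allocate(size int64) error {
	t.mtx.Lock()
	defer t.mtx.Unlock()
	if t.current+size > t.max {
		return fmt.Errorf("mmap limit exceeded: %d + %d > %d", t.current, size, t.max)
	}
	t.current += size
	return nil
}

// Free returns size bytes after a file is munmapped.
func (t *MmapTracker) Free(size int64) {
	t.mtx.Lock()
	t.current -= size
	t.mtx.Unlock()
}
```

Whether `Allocate` should fail fast, block, or trigger eviction ties directly into the load-shedding question raised above.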
Again, for the design it's really enough to say "We will track mmapped files globally per store." And then, knowing that, there are important design questions to ask, e.g.:

- What if one request takes all of MaxMmapSize?
- Who unmaps a file? Does it get unmapped after each request?
- We fetch index and chunk pieces on demand, but once we fetch all of them, are we mmapping the whole file? Potentially a large one?
`docs/proposals/201909_bucketfile.md` (outdated):

> (BucketIndexFile and BucketChunksFile are represented by `BucketFile`)
>
> 1. BucketStore calls `blockSeries` method to retrieve series from a block.
It might be me, but I feel like all of this description might be a bit too verbose. We should focus on how BucketFile works.
This helps to understand how BucketFile works with other components. Will remove it if it's not needed.
Nice work with this! But yes, this verbosity might actually make this proposal more complex to understand and review. I think general pseudo code is fine, but every detail is not needed (: Probably we should have like 10 steps max, not 60 (: Wonder if we can do this from a high level without tiny details like "BucketStore calls `blockSeries` method to retrieve series from a block." or "BucketBlockIndexFile returns postings offsets block"? That's all tiny implementation detail - we can simplify stuff here in the design and focus on the really important questions, like how to make sure this works given limited memory for the Store GW, which is the main goal right now. (:
Essentially what is here is just coding it straight away, but in human language. That's hard to review (: It's similar to just starting to code without a design. We might need to agree on high-level things first.
Ok, will simplify this.
updated.
Thanks! It is a good start, but we need to address a few gaps and maybe reduce the complexity added by the very detailed pseudo code. Let me know what you think!
> The latest blocks will be queried mostly.
>
> Preload `the last N hours` (configurable) blocks on startup.
Interesting idea.
Preload what? Full block to memory? last N hours of it? That sounds scary for large objects. Even downloading to disk will take ages. 🤔
Preload the block index: the `postings offset table` and `label offset table`. The latest blocks are usually small. For large blocks, users can disable this config option.
`docs/proposals/201909_bucketfile.md` (outdated):

> ## Risks
>
> Ensure all providers support `fetch last N bytes` and `fetch everything starting from offset`. Otherwise, we need to know the size of the remote index file before fetching the TOC block. We may get the size by fetching metadata from object storage.
Nice! Can we actually check... now? The design phase is exactly the perfect moment for this.
Providers: Azure, S3, COS, GCS, Swift.

All of these providers support `fetch last N bytes` and `fetch everything starting from offset`.

Some changes are needed on the client side:

- Upgrade the GCS client pkg to version v0.45.0+ (see https://github.com/googleapis/google-cloud-go/blob/master/CHANGES.md)
- Update the Swift GetRange implementation.
Related links:

- Azure: https://docs.microsoft.com/en-us/rest/api/storageservices/specifying-the-range-header-for-blob-service-operations
- COS: https://cloud.tencent.com/developer/section/1189987
- S3: https://docs.aws.amazon.com/zh_cn/AmazonS3/latest/API/API_GetObject.html
- GCS: https://cloud.google.com/storage/docs/xml-api/reference-headers#range

COS and Swift support multiple ranges (`Range: bytes=A-B,M-N`).
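For reference, the fetch modes above map onto standard HTTP/1.1 `Range` headers. A small Go illustration (the object URL is a placeholder; the snippet only builds the requests, it does not send them):

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	const objectURL = "https://example.com/bucket/block/index" // placeholder

	for _, rng := range []string{
		"bytes=-1024",     // fetch last N bytes (N=1024)
		"bytes=4096-",     // fetch everything starting from offset 4096
		"bytes=4096-8191", // fetch bytes [4096, 8191] inclusive
	} {
		req, err := http.NewRequest(http.MethodGet, objectURL, nil)
		if err != nil {
			panic(err)
		}
		req.Header.Set("Range", rng)
		fmt.Println(req.Header.Get("Range"))
	}
}
```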
The `Risks` section has been removed, see 3376912.
`docs/proposals/201909_bucketfile.md` (outdated):

> ### Future work
>
> Design a new index file for [postings offset entries](https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/index.md#postings-offset-table) and [label offset entries](https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/index.md#label-offset-table) for sequential and random accesses.
Future work doesn't mean the next steps to implement this designed idea - it's rather: what's next if we implement this design?
This requires designing a new search algorithm and on-disk file format. Although its goal is to reduce memory, it is independent and can be described in the next design. Should I describe the details in this design?
Sorry, I thought you meant something else. No, it should not be part of this design. I would state it as "Consider changing the index file design for (...)" as we first need to agree on this. I am quite sure we would like to stay compliant with native TSDB, so we might start this discussion there at some point (:
Ok, I will make it clear. (:

It is compliant with native TSDB. We do not change the native TSDB index file; we add a new `thanos-index` file in blocks:

    ├── b-000001
    │   ├── index
    │   ├── thanos-index
    . . .

A draft: https://gist.github.com/zhulongcheng/236e45728d70554f26fdb75809120928
IndexReaders and BucketFiles will track last query time. Signed-off-by: zhulongcheng <[email protected]>
Some comments (: Good work so far, looking forward to slowly starting some prototyping.
> Each BucketFile has a `pages` bitmap, but the bitmap uses little memory.
>
> For inactive IndexReaders and BucketFiles, if the inactive duration (`now.Sub(lastQueryTime)`) is greater than a configured duration (e.g. 15 mins), then IndexReaders will remove in-memory objects, and BucketFiles will remove in-memory bitmaps, close file descriptors and unmap local files. This will reduce memory usage.
Nice, what about disk space? I would say we should try to have limited space as well. BTW, can we remove pieces of e.g. the index from memory, or only the full index? Or from disk? E.g. the question is: can we make it sparse for things that are not used.. (Probably too much effort)
BucketFiles will also remove local files. (will update the doc)
Maybe we can track disk space, just like tracking mmap size.
IndexReaders decode the `postings offset block` and `label offset block` into in-memory objects; we remove those in-memory objects. The disk index is represented by BucketFile. If a BucketFile is inactive, it will remove the disk index.
> Because the max size of an index file is 64 GiB, and the default size of a chunks file is 512 MiB.
>
> So we can set the size of a local index file to 64 GiB, and set the size of a local chunks file to 768 (512*1.5) MiB.
Do we really need to know/set max size of files here? Why?
`mmap` needs a size to allocate virtual address space. If we give it a small size, then when the offset of fetched data is greater than the mmap size, we need to re-mmap.
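That is the motivation for a fixed maximum: the local file can be created sparse and mapped once at its maximum expected size (e.g. 64 GiB for an index file), so later fetches can be written at any offset without re-mmapping. A Unix-only sketch using `golang.org/x/sys/unix`; names and error handling are simplified assumptions, not the proposed implementation:

```go
package store

import (
	"os"

	"golang.org/x/sys/unix"
)

// openSparseMmap pre-sizes a sparse local file and maps it once at maxSize.
// Unwritten regions remain holes on disk until data is actually written.
func openSparseMmap(path string, maxSize int64) ([]byte, *os.File, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return nil, nil, err
	}
	// Extends the file logically; unwritten regions stay sparse (holes).
	if err := f.Truncate(maxSize); err != nil {
		f.Close()
		return nil, nil, err
	}
	data, err := unix.Mmap(int(f.Fd()), 0, int(maxSize), unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		f.Close()
		return nil, nil, err
	}
	return data, f, nil
}
```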
> 3. Fetch everything starting at `start` till `end`:
>    `Range{start: N, end: M}` -> Request header `Range: bytes=N-M`
>
> Now all providers (Azure, COS, GCS, S3, Swift) support `Fetch last N bytes`, `Fetch everything starting from offset` and `Fetch everything starting at start till end`.
Thanks for the research! Let's first see if we really need this to work - maybe we can do it without those 3 special features of fetching ranges? Can you provide the motivation for them here?
Essentially, the index TOC knows exactly where things are in the first place, so we just need that 🤔
We do not know the size of the remote index file, so we need Feature 1 (`fetch last N bytes`) to fetch the TOC.

We know the start of the `postings offset table` block through the TOC, but we do not know the end of this block, so we need Feature 2 (`fetch everything starting from offset`) to fetch this block.
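For example, with an HTTP-backed bucket the TOC fetch could look like the sketch below. The suffix `Range` form is standard HTTP; the 52-byte TOC size and the helper itself are assumptions for illustration, not the proposed API:

```go
package store

import (
	"fmt"
	"io"
	"net/http"
)

// fetchIndexTOC reads the fixed-size TOC trailer from the very end of a
// remote index file without knowing the file size up front.
func fetchIndexTOC(indexURL string) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, indexURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", "bytes=-52") // Feature 1: fetch last N bytes
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("expected 206 Partial Content, got %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```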
TODO: check overfetch.
> we know the start of the postings offset table block through the TOC, but we do not know the end of this block, so we need Feature 2 (fetch everything starting from offset) to fetch this block.

Yes, you are right here.
How did this end? Could someone give a summary? I think this is quite important.
Yes! Let's get back to this. Sorry for the delay @zhulongcheng and thanks for the good work (: I think we need to improve this proposal and split it into smaller ones, as we touch many improvements that we should add separately! For example, the optimized fetching of the index header can be beneficial for the current flow as well! (: This allows iterating on those quicker. Also, are you fine @zhulongcheng with us maintainers maybe proposing some PRs/commits to your PR so we can work on this together? I would say let's focus on this, @zhulongcheng:

Overall, let's focus on the sparse file implementation here in this PR. Overall I still have concerns around:

However, by splitting, we can keep the discussion focused! Thoughts?

Particularly, once we split, we can update #1705.
`docs/proposals/201909_bucketfile.md` (outdated):

> ### Future work
>
> Design a new index file for [postings offset entries](https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/index.md#postings-offset-table) and [label offset entries](https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/index.md#label-offset-table) for sequential and random accesses.
A new index-cache / index-header, I guess? Yes! And it might be the first step / treated separately, not future work (:
Removed in 21d2804.
Pulled composing the index-cache out to another issue: #1711
Great! Please commit.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
related to #1471