
docs: add BucketFile proposal #1567

Closed
wants to merge 10 commits

Conversation

zhulongcheng
Contributor

related to #1471

Signed-off-by: zhulongcheng <[email protected]>
@bwplotka
Member

bwplotka commented Oct 1, 2019

Nice, I am back from holidays, so will take a proper look today/this week @zhulongcheng ! Thanks!

@bwplotka bwplotka requested review from bwplotka and domgreen October 1, 2019 09:12
@bwplotka
Member

bwplotka commented Oct 6, 2019

cc @GiedriusS @ppanyukov

@bwplotka bwplotka requested a review from GiedriusS October 6, 2019 11:04
Signed-off-by: zhulongcheng <[email protected]>
Member

@GiedriusS GiedriusS left a comment


Lots of things to read and understand but some initial comments/questions (that might not make sense)


Each BucketFile has a `pages` bitmap, but the bitmap uses very little memory.

For inactive IndexReaders, if the inactive duration (`now.Sub(lastQueryTime)`) is greater than a configured duration (e.g. 15 mins), then the IndexReader will remove in-memory objects and close the file descriptor and mmap in the IndexFile. This will reduce memory usage.
Member

Who will actually track this time?

Contributor Author

IndexReaders and ChunksReader

Member

there will be separate tracking for those two?

Contributor Author

Update: BucketFile will track the time.

IndexReaders hold decoded postings offsets and label offsets in memory, so IndexReaders also need to track the time. If an IndexReader is inactive, it will remove its in-memory postings offsets and label offsets.

ChunksReader does not need to track this.
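
As an illustration only (none of these names exist in the actual Thanos code), the last-query-time tracking and eviction described above could look roughly like this:

```go
package store

import (
	"sync"
	"time"
)

// indexReader is a hypothetical sketch of an IndexReader that tracks its
// last query time and can drop its decoded offsets when it goes inactive.
type indexReader struct {
	mtx           sync.Mutex
	lastQueryTime time.Time

	// Decoded from the index file; dropped again when inactive.
	postingsOffsets map[string][]uint64
	labelOffsets    map[string]uint64
}

// touch records that the reader just served a query.
func (r *indexReader) touch() {
	r.mtx.Lock()
	r.lastQueryTime = time.Now()
	r.mtx.Unlock()
}

// releaseIfInactive drops the in-memory offsets if the reader has not been
// queried for longer than maxInactive (e.g. 15 minutes); they would be
// re-decoded from the local index file on the next query.
func (r *indexReader) releaseIfInactive(maxInactive time.Duration) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	if time.Since(r.lastQueryTime) > maxInactive {
		r.postingsOffsets = nil
		r.labelOffsets = nil
	}
}
```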


For inactive IndexReaders, if the inactive duration (`now.Sub(lastQueryTime)`) is greater than a configured duration (e.g. 15 mins), then the IndexReader will remove in-memory objects and close the file descriptor and mmap in the IndexFile. This will reduce memory usage.

**NOTE**: The `pages` bitmap in BucketFile should always be kept in memory until the server is closed.
Member

Could you elaborate more on this? Why do we want this if files might disappear?

Contributor Author

Inactive IndexReaders just close file descriptors; they do not remove local files.

Inactive IndexReaders may be queried again. Keeping the pages bitmap in memory lets us reuse already-fetched content.
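
To make the pages bitmap concrete, here is a sketch of how fetched 4 KiB pages could be marked and checked, assuming a Roaring bitmap library such as github.com/RoaringBitmap/roaring (the one mentioned later in this thread); all names below are illustrative only:

```go
package store

import "github.com/RoaringBitmap/roaring"

const pageSize = 4 * 1024 // 4 KiB pages

// pageSet tracks which 4 KiB pages of the local sparse file have already
// been fetched from object storage. Illustrative sketch only.
type pageSet struct {
	bm *roaring.Bitmap
}

func newPageSet() *pageSet { return &pageSet{bm: roaring.New()} }

// markFetched records that the byte range [off, off+length) is now on disk.
// Assumes length > 0.
func (p *pageSet) markFetched(off, length int64) {
	first := uint64(off / pageSize)
	last := uint64((off + length - 1) / pageSize)
	p.bm.AddRange(first, last+1) // AddRange covers [start, end).
}

// fetched reports whether every page covering [off, off+length) is on disk,
// i.e. whether the range can be served locally without touching the bucket.
func (p *pageSet) fetched(off, length int64) bool {
	for pg := off / pageSize; pg <= (off+length-1)/pageSize; pg++ {
		if !p.bm.Contains(uint32(pg)) {
			return false
		}
	}
	return true
}
```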

Member

What we miss from the design is a human description of what the pages bitmap is, how large it is, and what the use case is. Can't see any description now. ):

Contributor Author

Use the Roaring bitmaps implementation.

Compressed size: 2.6 bits per int.

1 TiB = 1024 * 1024 * 1024 KiB = 1073741824 KiB
4 KiB pages: 1073741824 / 4 = 268435456 pages
bitmap size: 268435456 pages * 3 bits = 805306368 bits = 805306368 / 8 / 1024 / 1024 MiB = 96 MiB

So if we write 1 TiB to local files, the total size of the bitmaps is 96 MiB.

Contributor Author

@zhulongcheng zhulongcheng Oct 10, 2019

If we write many bytes (e.g. 100 TiB) to local files, the bitmaps will use about 10 GiB of memory.

So I think we also need to close inactive bitmaps. ):


Currently, the store gateway loads all [index.cache.json](https://github.com/thanos-io/thanos/blob/865d5ec710b5651ead08a7dc73864a6f5dbeeaa9/pkg/block/index.go#L40) files into memory on startup, which easily hits OOM situations. On the other hand, some indexes are rarely or never queried, so we don't have to keep these indexes in memory.

The store gateway fetches chunks from object storage for each query request, but fetched chunks are not reused for subsequent requests.
Member

If we mmap everything and create sparse files, then won't this technically mean that there is kind of an assumption that we have "infinite" RAM and address space? How will BucketFile ensure that we won't pass a certain threshold? What will we do in situations under memory pressure? I think that ATM having a common pool of RAM for responses serves us well as a capacity planning mechanism, but this could potentially mean that someone will just ask for random data from a bucket that could be terabytes in size, and we will hold all of that in memory?

Contributor Author

ask for random data from a bucket that could be terabytes in size, and we will hold all of that in memory?

We will not hold all of that in memory.
BucketFile will read a small amount of data (e.g. 1 MiB) from the io.Reader and write it to a local file each time, until the io.Reader reaches EOF.

there is kind of an assumption that we have "infinite" RAM and address space?

If we fetch many blocks (> 1000 blocks) from object storage,
it is possible that the total size of all mmaps exceeds the virtual address space (e.g. 128 TiB) of the store gateway process.
I think Thanos sharding is a good way to resolve this.
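
A sketch of that chunked write, so only one chunk (e.g. 1 MiB) is ever buffered in memory regardless of how large the object is; the function name here is hypothetical:

```go
package store

import (
	"io"
	"os"
)

// writeInChunks streams r into f starting at offset off, one chunk at a
// time, so at most chunkSize bytes are held in memory. Sketch only.
func writeInChunks(f *os.File, off int64, r io.Reader, chunkSize int) (int64, error) {
	buf := make([]byte, chunkSize) // e.g. 1 MiB
	var written int64
	for {
		n, rerr := r.Read(buf)
		if n > 0 {
			if _, werr := f.WriteAt(buf[:n], off+written); werr != nil {
				return written, werr
			}
			written += int64(n)
		}
		if rerr == io.EOF {
			return written, nil
		}
		if rerr != nil {
			return written, rerr
		}
	}
}
```

For purely sequential writes, `io.CopyBuffer` with a 1 MiB buffer behaves the same way.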

Member

Yeah, but typically resilient systems gracefully shed their load, no? I would rather have Thanos Store unmap some files instead of crashing in the middle of the night when someone queries a bunch of disparate data. I am fairly confident that we should probably have some kind of strategy for how to deal with this kind of situation instead of crashing and telling our users that maybe they should split up the Thanos Store nodes.

Member

Fully agree with @GiedriusS we need to think about that case.

Contributor Author

Add a helper struct MmapTracker to manage the total size of all mmaps.

The MmapTracker is a global object for the store gateway.

For each BucketFile:

  1. call MmapTracker.Allocate(size) before mmapping a file
  2. call MmapTracker.Free(size) after unmapping a file

MmapTracker.Allocate returns an error if the resulting size would be greater than the configurable MaxMmapSize (see the sketch below).

How about this way? @bwplotka @GiedriusS
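
A minimal sketch of what such a tracker could look like, just to make the Allocate/Free contract above concrete (hypothetical code, not an actual Thanos type):

```go
package store

import (
	"fmt"
	"sync"
)

// MmapTracker caps the total number of bytes mmapped by the store gateway
// process. Illustrative sketch only.
type MmapTracker struct {
	mtx     sync.Mutex
	current int64
	max     int64 // the configurable MaxMmapSize
}

func NewMmapTracker(maxMmapSize int64) *MmapTracker {
	return &MmapTracker{max: maxMmapSize}
}

// Allocate reserves size bytes before a file is mmapped and fails if the
// global limit would be exceeded, so the caller can back off or evict.
func (t *MmapTracker) Allocate(size int64) error {
	t.mtx.Lock()
	defer t.mtx.Unlock()
	if t.current+size > t.max {
		return fmt.Errorf("mmap limit exceeded: %d + %d > %d bytes", t.current, size, t.max)
	}
	t.current += size
	return nil
}

// Free releases size bytes after a file is unmapped.
func (t *MmapTracker) Free(size int64) {
	t.mtx.Lock()
	t.current -= size
	t.mtx.Unlock()
}
```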

Member

Again, for the design it's really enough to say we will track mmapped files globally per store.

And then, knowing that, there are important design questions to ask, e.g.

  • What if one request takes all of MaxMmapSize?
  • Who unmaps a file? Does it get unmapped after each request?
  • We fetch index and chunk pieces on demand, but once we fetch all of them, are we mmapping the whole file? Potentially a large one?


(BucketIndexFile and BucketChunksFile are represented by `BucketFile` )

1. BucketStore calls `blockSeries` method to retrieve series from a block.
Member

It might be me but I feel like all of this description might be a bit too verbose. We should focus on how BucketFile works

Contributor Author

This helps to understand how BucketFile works with the other components.

Will remove it if it is not needed.

Member

Nice work with this! But yes, this verbosity might actually make this proposal more complex to understand and review. I think general pseudo code is fine, but every detail is not needed (: Probably we should have like 10 steps max, not 60 (: Wonder if we can do this from a high level without tiny details like BucketStore calls blockSeries method to retrieve series from a block. or BucketBlockIndexFile returns postings offsets block? Those are all tiny implementation issues - we can simplify stuff here in the design and focus on the really important questions, like how to make sure this works given limited memory for the Store GW, which is the main goal right now. (:

Member

Essentially what is here is just coding it straight away, but in human language. That's hard to review (: It's similar to if we would just start coding without a design. We might need to agree on high-level things first.

Contributor Author

Ok, will simplify this.

Contributor Author

updated.

Member

@bwplotka bwplotka left a comment

Thanks! It is a good start, but we need to address a few gaps and maybe reduce the complexity added by the very detailed pseudo code. Let me know what you think!








The latest blocks will be queried the most.

Preload the `last N hours` (configurable) of blocks on startup.
Member

Interesting idea.

Member

Preload what? Full block to memory? last N hours of it? That sounds scary for large objects. Even downloading to disk will take ages. 🤔

Contributor Author

Preload the block index: the postings offset table and the label offset table.

The latest blocks are usually small.
For large blocks, users can disable this configuration option.


## Risks

Ensure all providers support `fetch last N bytes` and `fetch everything starting from offset`. Otherwise, we need to know the size of the remote index file before fetching the TOC block. We may get the size by fetching metadata from object storage.
Member

Nice! Can we actually check now? The design phase is exactly the perfect moment for this.

Contributor Author

Providers: Azure, S3, COS, GCS, Swift

All these providers support fetch last N bytes and fetch everything starting from offset.

Some changes needed on the client side:

  1. Upgrade the GCS client pkg to version v0.45.0+
  2. Update the Swift GetRange implementation.

https://github.com/googleapis/google-cloud-go/blob/master/CHANGES.md


Contributor Author

The Risks section has been removed. See 3376912.


### Future work

Design a new index file for [postings offset entries](https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/index.md#postings-offset-table) and [label offset entries](https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/index.md#label-offset-table) for sequential and random accesses.
Member

Future work does not mean the next steps to implement this designed idea - it's rather: what's next if we implement this design?

Contributor Author

This requires designing a new search algorithm and on-disk file format.

Although its goal is to reduce memory, it is independent and can be described in a later design.

Should I describe the details in this design?

Member

@bwplotka bwplotka Oct 10, 2019

Sorry, I thought you meant something else.

No, it should not be part of the design. I would state it as Consider changing index file design for (...), as we first need to agree on this. I am quite sure we would like to be compliant with native TSDB, so we might start this discussion there at some point (:

Contributor Author

Ok, I will make it clear. (:

It is compliant with native TSDB.
We do not change the native TSDB index file.

Add a new thanos-index file to each block.

├── b-000001
│   ├── index
│   ├── thanos-index
. . .

A draft: https://gist.github.com/zhulongcheng/236e45728d70554f26fdb75809120928

Signed-off-by: zhulongcheng <[email protected]>
IndexReaders and BucketFiles will track last query time.

Signed-off-by: zhulongcheng <[email protected]>
Member

@bwplotka bwplotka left a comment

Some comments (: Good work so far, looking forward to slowly start some prototyping.




Each BucketFile has a `pages` bitmap, but the bitmap uses little memory.

For inactive IndexReaders and BucketFiles, if the inactive duration (`now.Sub(lastQueryTime)`) is greater than a configured duration (e.g. 15 mins), then IndexReaders will remove in-memory objects, and BucketFiles will remove in-memory bitmaps, close file descriptors and unmap local files. This will reduce memory usage.
Member

Nice, what about disk space? I would say we should try to have limited space as well.

BTW, can we remove pieces of e.g. the index from memory, or only the full index? Or from disk? E.g. the question is, can we make it sparse for things that are not used.. (Probably too much effort)

Contributor Author

BucketFiles will also remove local files. (will update the doc)
Maybe we can track disk space, just like tracking mmap size.

Contributor Author

@zhulongcheng zhulongcheng Oct 17, 2019

IndexReaders decode the postings offset block and label offset block into in-memory objects; we remove those in-memory objects.

The on-disk index is represented by BucketFile. If a BucketFile is inactive, it will remove the on-disk index.




The max size of an index file is 64 GiB, and the default size of a chunks file is 512 MiB.

So we can set the size of a local index file to 64 GiB, and the size of a local chunks file to 768 (512 * 1.5) MiB.
Member

Do we really need to know/set max size of files here? Why?

Contributor Author

mmap needs a size to allocate virtual address space. If we give it a small size, then when the offset of fetched data is greater than the mmap size, we need to re-mmap.
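
For illustration, reserving the full address range up front over a sparse file (so no re-mmap is needed as more pages are fetched) could look like this on Linux; a sketch using the raw syscall package, with hypothetical names and trimmed error handling:

```go
package store

import (
	"os"
	"syscall"
)

// openSparseMmap creates (or opens) a sparse local file of maxSize bytes
// and maps the whole range read-write, so later fetches can be written at
// any offset without re-mmapping. Illustrative sketch only.
func openSparseMmap(path string, maxSize int64) ([]byte, *os.File, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return nil, nil, err
	}
	// Truncate only sets the logical size; no disk blocks are allocated
	// until pages are actually written (sparse file).
	if err := f.Truncate(maxSize); err != nil {
		f.Close()
		return nil, nil, err
	}
	// On a 64-bit platform int is 64 bits, so e.g. a 64 GiB mapping fits.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(maxSize),
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		f.Close()
		return nil, nil, err
	}
	return data, f, nil
}
```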

3. Fetch everything starting at `start` till `end`:
`Range{start:N, end: M}` -> Request header `Range: bytes=N-M`

Now all providers (Azure, COS, GCS, S3, Swift) support `Fetch last N bytes`, `Fetch everything starting from offset` and `Fetch everything starting at start till end`.
Member

Thanks for the research! Let's first see if we really need to have this working; maybe we can do it without those 3 special range-fetching features?
Can you provide the motivation for them here?

Member

Essentially, index TOC knows exactly where things are in the first place, so we need just that 🤔

Contributor Author

We do not know the size of the remote index file, so we need Feature-1 (Fetch last N bytes) to fetch the TOC.

We know the start of the postings offset table block through the TOC, but we do not know the end of this block, so we need Feature-2 (Fetch everything starting from offset) to fetch that block.
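
For reference, these access patterns map directly onto standard HTTP Range headers, which is what the object storage clients send under the hood. A sketch with plain net/http; the URL and offsets are placeholders, and errors/response bodies are ignored for brevity:

```go
package store

import "net/http"

// rangeGet issues a GET with the given Range header value. Sketch only.
func rangeGet(url, rng string) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", rng)
	return http.DefaultClient.Do(req)
}

func rangeExamples(url string) {
	// Feature-1: fetch the last N bytes (e.g. the TOC) without knowing the file size.
	rangeGet(url, "bytes=-100")
	// Feature-2: fetch everything starting from an offset (e.g. the start of
	// the postings offset table) when the end is unknown.
	rangeGet(url, "bytes=4096-")
	// Feature-3: fetch a bounded range, bytes 4096 through 8191 inclusive.
	rangeGet(url, "bytes=4096-8191")
}
```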

Contributor Author

TODO: check overfetch.

Member

We know the start of the postings offset table block through the TOC, but we do not know the end of this block, so we need Feature-2 (Fetch everything starting from offset) to fetch that block.

Yes, you are right here.

@GiedriusS
Member

How did this end? Could someone give a summary? I think this is quite important.

@bwplotka
Member

bwplotka commented Nov 1, 2019

Yes! Let's get back to this. Sorry for the delay @zhulongcheng, and thanks for the good work (:

I think we need to improve this proposal and split it into smaller ones, as we touch on many improvements that we should add separately! For example, the optimized index-header fetching can be beneficial for the current flow as well! (: This allows iterating on those quicker.

Also, are you fine @zhulongcheng with us maintainers maybe proposing some PRs/commits to your PR so we can work on this together?

I would say let's focus @zhulongcheng on this:

  1. Treat this proposal mainly as being for the goal of adding DISK CACHING to Thanos. In particular, this is one of many possible disk cache formats we can use (: e.g. we could use just https://github.com/peterbourgon/diskv (: I think a disk cache + index cache composed in a sparse file implementation is what Reduce store memory usage by reworking block.indexCache #1471 proposes.
  2. Extract the index-cache construction (the initial header) improvements we talk about in this proposal into a separate issue/proposal. I mean the part with fetching the last item etc. I think it's super valuable, but we don't need the sparse file implementation for it.
  3. Open an issue/proposal for the improvements to the TSDB index file itself you had in future work, @zhulongcheng! This is an incredibly valuable idea that is totally fair (: I would call it make the index file object-storage-friendly (:

Overall, let's focus on the sparse file implementation here in this PR, maybe. I still have concerns around:

  • partitioning requests against object storage
  • granular eviction of items.

However, by splitting, we can keep the discussion focused! Thoughts?

@bwplotka
Member

bwplotka commented Nov 1, 2019

Particularly once we split we can update: #1705


### Future work

Design a new index file for [postings offset entries](https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/index.md#postings-offset-table) and [label offset entries](https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/index.md#label-offset-table) for sequential and random accesses.
Member

A new index-cache / index-header, I guess? Yes! And it might be the first step / treated separately, not future work (:

Contributor Author

@zhulongcheng zhulongcheng Nov 4, 2019

removed. 21d2804

@bwplotka
Member

bwplotka commented Nov 4, 2019

Pulled composing index-cache to another issue: #1711

@zhulongcheng
Contributor Author

zhulongcheng commented Nov 4, 2019

... are you fine with us maintainers maybe proposing some PRs/commits to your PR so we can work on this together?

Great! Please commit.

@stale

stale bot commented Jan 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 11, 2020
@stale stale bot closed this Jan 18, 2020