add proposal for timeseries partitioning in compactor #4843
Conversation
Force-pushed from f5a9c95 to be44266
Force-pushed from be44266 to ddf5c2c
Signed-off-by: Roy Chiang <[email protected]>
Force-pushed from ddf5c2c to 83331c6
LGTM. Just some nits
## Problem and Requirements

Cortex introduced horizontally scaling compactor which allows multiple compactors to compact blocks for a single tenant, sharded by time interval. The compactor is capable of compacting multiple smaller blocks into a larger block, to reduce the the duplicated information in index. The following is an illustration of how the shuffle sharding compactor works, where each arrow represents a single compaction that can be carried out independently.
Suggested change: "to reduce the the duplicated information in index" → "to reduce the duplicated information in index"
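To make "sharded by time interval" in the paragraph above concrete, here is a minimal Go sketch (my own illustration, not Cortex code; the 12-hour range and all identifiers are assumed for the example): blocks are bucketed by the aligned time range they fall into, and each bucket is an independent compaction that a different compactor can pick up.

```go
// Illustration only (not Cortex code): bucket blocks by aligned time range so
// that each bucket can be compacted independently by a different compactor.
package rangesketch

import "time"

type blockMeta struct {
	ID      string
	MinTime time.Time
	MaxTime time.Time
}

// groupByRange buckets blocks by the start of the compaction range that
// contains their MinTime; each bucket is one independently schedulable job.
func groupByRange(blocks []blockMeta, rng time.Duration) map[time.Time][]blockMeta {
	groups := map[time.Time][]blockMeta{}
	for _, b := range blocks {
		start := b.MinTime.Truncate(rng) // e.g. rng = 12 * time.Hour
		groups[start] = append(groups[start], b)
	}
	return groups
}
```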
* handling the 64GB index limit
* reducing the overall compaction time
* reducing the amount of data downloaded
We don't reduce the amount of data downloaded, but it's done in smaller batches
Suggested change: "* reducing the amount of data downloaded" → "* downloading the data in smaller batches"
* handling the 64GB index limit
* reducing the overall compaction time
* reducing the amount of data downloaded
* reducing the time required to compact
The most important thing is to allow the compactor to continue scaling horizontally
### Dynamic Number of Partition

We can also increase/decrease the number of partition without needing the `multiplier` factor. However, if a tenant is sending highly varying number of timeseries or label size, the index size can be very different, resulting in highly dynamic number of partitions. To perform deduplication, we’ll end up having to download all the sub-blocks, and it can be inefficient as less parallelization can be done, and we will spend more time downloading all the unnecessary blocks.
Thanks for putting this here. I was wondering why it has to be this complex.
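To make the trade-off concrete, here is a small sketch (my own illustration, not code from the proposal; it assumes series land in partition `hash(labels) % partitionCount`, which matches the modulo-style grouping shown later in this document): the number of sub-blocks of a source block that one output partition has to download works out to `sourceCount / gcd(sourceCount, outputCount)`, which is 1 exactly when the source partition count divides the output count, the property the `multiplier` scheme guarantees.

```go
// Illustration only: with hash-mod partitioning, an output partition can limit
// its downloads to the source sub-blocks whose residues are compatible with
// its own. That works out to sourceCount / gcd(sourceCount, outputCount).
package partitionsketch

// sourceSubBlocksNeeded returns how many sub-blocks of a source block one
// output partition must read. It is 1 when sourceCount divides outputCount
// (multiplier-constrained counts); for unrelated counts it can be all of them.
func sourceSubBlocksNeeded(sourceCount, outputCount int) int {
	return sourceCount / gcd(sourceCount, outputCount)
}

func gcd(a, b int) int {
	for b != 0 {
		a, b = b, a%b
	}
	return a
}
```

For example, a 4-partition source feeding an 8-partition output needs only 1 sub-block per output partition, while a 5-partition source feeding the same output needs 5/gcd(5, 8) = 5, i.e. every sub-block, which is the inefficiency described above.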
Signed-off-by: Roy Chiang <[email protected]>
Now the planner knows the resulting compaction will have 8 partitions, it can start planning out which groups of blocks can go into a single compaction group. Given that we need 8 partitions in total, the planner will go through the process above to find out what blocks are necessary. Using the above example again, but we have distinct time intervals, T1, T2, and T3. T1 has 2 partitions, T2 has 4 partitions, and T3 has 8 partitions, and we want to produce T1-T3 blocks

![Grouping](/images/proposals/timeseries-partitioning-in-compactor-grouping.png)

```
Compaction Group 1-8
```
Right now for compaction partitioning, we split 8 jobs across 8 compactors to pick up and run the compaction. This is okay for us, but I am not sure if all Cortex users value compaction speed over resource usage.
Probably we can mention, or have, another mode where 1 compaction job creates the 8 blocks at the same time?
If compaction is delayed, the read path needs more resources (store-gateways, queriers, etc). So I would say yes, most users want speedy compaction to avoid spending more resources on the read path.
I agree speed is important, but resources should also be taken into consideration. Let's say a block has 16 shards, so in this case we need to download it and compact it 16 times.
Compare a single compaction that creates 8 blocks to 8 compactions that generate 8 blocks: although the latter is faster, the single compaction should still be better in terms of total CPU time, since no block or index needs to be downloaded or verified multiple times.
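To put rough, purely illustrative numbers on that: if the source blocks for a time range total 100 GB and the output is split into 16 partitions, 16 independent partition jobs download on the order of 16 × 100 GB = 1.6 TB in aggregate and verify the same indexes 16 times, while a single splitting compaction downloads and verifies the 100 GB once, at the cost of running longer on one instance.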
Compaction Group 8-8
T1 - Partition 2-2
T2 - Partition 4-4
T3 - Partition 8-8
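A minimal Go sketch of the mapping implied by the example above (the modulo rule is my reading of the grouping illustration, not code from the proposal): compaction group g out of 8 pulls exactly one partition from each source time range, namely ((g-1) mod m) + 1, where m is that range's partition count.

```go
// Illustration only: reproduce the grouping table from the example, where
// compaction group g (1-based) takes partition ((g-1) % m) + 1 from a source
// time range that has m partitions.
package main

import "fmt"

func sourcePartition(group, sourcePartitions int) int {
	return (group-1)%sourcePartitions + 1
}

func main() {
	ranges := []struct {
		name       string
		partitions int
	}{{"T1", 2}, {"T2", 4}, {"T3", 8}}

	for group := 1; group <= 8; group++ {
		fmt.Printf("Compaction Group %d-8\n", group)
		for _, r := range ranges {
			fmt.Printf("  %s - Partition %d-%d\n", r.name, sourcePartition(group, r.partitions), r.partitions)
		}
	}
}
```

Running it prints, for group 8, exactly the T1 Partition 2-2 / T2 Partition 4-4 / T3 Partition 8-8 combination shown above.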
Let's say we have 8 compaction groups but only 3 compactor instances, so each compactor needs to compact more than one compaction group. If one compactor needs to compact group `1-8` and group `3-8` locally, the source blocks `T1-T3` are the same for the two groups, so do we have a way to ensure we don't download those blocks twice on a single instance?
I feel this scenario is common, because when we need to shard more, say 16 shards, it is probably hard to have 16 compactor instances running.
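One possible shape of the block reuse being asked about, purely as a sketch (all names here are hypothetical, not Cortex APIs): a compactor instance that is assigned several compaction groups sharing source blocks could keep a small cache keyed by block ID, so each block is downloaded at most once per instance.

```go
// Sketch only: per-instance cache so two compaction groups scheduled on the
// same compactor share one download of a common source block.
package blockcache

type blockCache struct {
	paths map[string]string // block ID -> local directory of the downloaded block
}

func newBlockCache() *blockCache {
	return &blockCache{paths: map[string]string{}}
}

// getOrDownload returns the local copy of blockID, calling download (whatever
// fetches a block from object storage) only the first time the block is seen.
func (c *blockCache) getOrDownload(blockID string, download func(id string) (string, error)) (string, error) {
	if p, ok := c.paths[blockID]; ok {
		return p, nil
	}
	p, err := download(blockID)
	if err != nil {
		return "", err
	}
	c.paths[blockID] = p
	return p, nil
}
```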
Would there be any interest in reusing the "splitting" Prometheus compactor as used by Grafana Mimir? It's a straightforward adaptation of the Prometheus compactor such that it can produce multiple blocks from a single compaction. Code (Apache 2, as it's a Prometheus fork) starts here: https://github.com/grafana/mimir-prometheus/blob/main/tsdb/compact.go#L430, and "normal" compaction simply uses shardCount=1. Splitting series into output blocks is performed at [...]
As you can see, it uses [...]. We would be happy to contribute this upstream, if there was wider interest (e.g. Cortex, Thanos), and if the Prometheus project would be interested too.
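For readers who don't follow the link, here is a generic sketch of hash-based series splitting (my illustration of the technique in general, not the Mimir code referenced above): each series is routed to one of shardCount output blocks by hashing its label set.

```go
// Generic sketch of splitting series across output blocks by label hash.
package splitsketch

import "github.com/prometheus/prometheus/model/labels"

// shardFor picks the output block for one series; labels.Labels.Hash is the
// stable hash Prometheus exposes for a label set.
func shardFor(lset labels.Labels, shardCount uint64) uint64 {
	return lset.Hash() % shardCount
}

// splitSeries groups series label sets into shardCount buckets; a real
// splitting compactor would write each series' chunks to the corresponding
// output block instead of collecting label sets.
func splitSeries(series []labels.Labels, shardCount uint64) [][]labels.Labels {
	out := make([][]labels.Labels, shardCount)
	for _, lset := range series {
		s := shardFor(lset, shardCount)
		out[s] = append(out[s], lset)
	}
	return out
}
```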
Hey Peter, thanks for linking the change! It's great work, and I think it makes sense. However, for Cortex to use it, the interface needs to be changed a little bit so that it fits our use case. More specifically, we currently produce only 1 block out of N partitions per compaction, so that we can have better control over the parallelization of the process. Do you mind creating a Prometheus issue, and we can discuss there?
What this PR does:
The proposal for allowing the compactor to produce partitioned TSDB blocks, so that Cortex can work around the 64GB index issue while achieving faster compaction time.
Which issue(s) this PR fixes:
Fixes #4705
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`