add proposal for timeseries partitioning in compactor #4843
Conversation
Force-pushed from f5a9c95 to be44266
Force-pushed from be44266 to ddf5c2c
Signed-off-by: Roy Chiang <[email protected]>
Force-pushed from ddf5c2c to 83331c6
LGTM. Just some nits
## Problem and Requirements

Cortex introduced horizontally scaling compactor which allows multiple compactors to compact blocks for a single tenant, sharded by time interval. The compactor is capable of compacting multiple smaller blocks into a larger block, to reduce the the duplicated information in index. The following is an illustration of how the shuffle sharding compactor works, where each arrow represents a single compaction that can be carried out independently.
Suggested change: "to reduce the the duplicated information in index" → "to reduce the duplicated information in index"
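To make "sharded by time interval" in the paragraph above concrete, here is a minimal Go sketch (my own illustration, not Cortex code; the 12-hour range and all identifiers are assumed for the example): blocks are bucketed by the aligned time range they fall into, and each bucket is an independent compaction that a different compactor can pick up.

```go
// Illustration only (not Cortex code): bucket blocks by aligned time range so
// that each bucket can be compacted independently by a different compactor.
package rangesketch

import "time"

type blockMeta struct {
	ID      string
	MinTime time.Time
	MaxTime time.Time
}

// groupByRange buckets blocks by the start of the compaction range that
// contains their MinTime; each bucket is one independently schedulable job.
func groupByRange(blocks []blockMeta, rng time.Duration) map[time.Time][]blockMeta {
	groups := map[time.Time][]blockMeta{}
	for _, b := range blocks {
		start := b.MinTime.Truncate(rng) // e.g. rng = 12 * time.Hour
		groups[start] = append(groups[start], b)
	}
	return groups
}
```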
* handling the 64GB index limit
* reducing the overall compaction time
* reducing the amount of data downloaded
We don't reduce the amount of data downloaded, but it's done in smaller batches
Suggested change: "* reducing the amount of data downloaded" → "* downloading the data in smaller batches"
* handling the 64GB index limit
* reducing the overall compaction time
* reducing the amount of data downloaded
* reducing the time required to compact
The most important thing is to allow the compactor to continue scaling horizontally
### Dynamic Number of Partition

We can also increase/decrease the number of partition without needing the `multiplier` factor. However, if a tenant is sending highly varying number of timeseries or label size, the index size can be very different, resulting in highly dynamic number of partitions. To perform deduplication, we’ll end up having to download all the sub-blocks, and it can be inefficient as less parallelization can be done, and we will spend more time downloading all the unnecessary blocks.
Thanks for putting this here. I was wondering why it has to be this complex.
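To make the trade-off concrete, here is a small sketch (my own illustration, not code from the proposal; it assumes series land in partition `hash(labels) % partitionCount`, which matches the modulo-style grouping shown later in this document): the number of sub-blocks of a source block that one output partition has to download works out to `sourceCount / gcd(sourceCount, outputCount)`, which is 1 exactly when the source partition count divides the output count, the property the `multiplier` scheme guarantees.

```go
// Illustration only: with hash-mod partitioning, an output partition can limit
// its downloads to the source sub-blocks whose residues are compatible with
// its own. That works out to sourceCount / gcd(sourceCount, outputCount).
package partitionsketch

// sourceSubBlocksNeeded returns how many sub-blocks of a source block one
// output partition must read. It is 1 when sourceCount divides outputCount
// (multiplier-constrained counts); for unrelated counts it can be all of them.
func sourceSubBlocksNeeded(sourceCount, outputCount int) int {
	return sourceCount / gcd(sourceCount, outputCount)
}

func gcd(a, b int) int {
	for b != 0 {
		a, b = b, a%b
	}
	return a
}
```

For example, a 4-partition source feeding an 8-partition output needs only 1 sub-block per output partition, while a 5-partition source feeding the same output needs 5/gcd(5, 8) = 5, i.e. every sub-block, which is the inefficiency described above.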
Signed-off-by: Roy Chiang <[email protected]>
Now the planner knows the resulting compaction will have 8 partitions, it can start planning out which groups of blocks can go into a single compaction group. Given that we need 8 partitions in total, the planner will go through the process above to find out what blocks are necessary. Using the above example again, but we have distinct time intervals, T1, T2, and T3. T1 has 2 partitions, T2 has 4 partitions, and T3 has 8 partitions, and we want to produce T1-T3 blocks

![Grouping](/images/proposals/timeseries-partitioning-in-compactor-grouping.png)

```
Compaction Group 1-8
```
Right now for compaction partitioning, we split 8 jobs across 8 compactors to pick up and run the compaction. This is okay for us, but I am not sure if all Cortex users value compaction speed over resource usage.
Probably we can mention, or have, another mode where 1 compaction job creates the 8 blocks at the same time?
If compaction is delayed, the read path needs more resources (store-gateways, queriers, etc). So I would say yes, most users want speedy compaction to avoid spending more resources on the read path.
I agree speed is important, but resources should also be taken into consideration. Let's say a block has 16 shards, so in this case we need to download it and compact it 16 times.
Compare a single compaction that creates 8 blocks to 8 compactions that generate 8 blocks: although the latter is faster, the single compaction should still be better in terms of total CPU time, since no block or index needs to be downloaded or verified multiple times.
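To put rough, purely illustrative numbers on that: if the source blocks for a time range total 100 GB and the output is split into 16 partitions, 16 independent partition jobs download on the order of 16 × 100 GB = 1.6 TB in aggregate and verify the same indexes 16 times, while a single splitting compaction downloads and verifies the 100 GB once, at the cost of running longer on one instance.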
Compaction Group 8-8
T1 - Partition 2-2
T2 - Partition 4-4
T3 - Partition 8-8
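A minimal Go sketch of the mapping implied by the example above (the modulo rule is my reading of the grouping illustration, not code from the proposal): compaction group g out of 8 pulls exactly one partition from each source time range, namely ((g-1) mod m) + 1, where m is that range's partition count.

```go
// Illustration only: reproduce the grouping table from the example, where
// compaction group g (1-based) takes partition ((g-1) % m) + 1 from a source
// time range that has m partitions.
package main

import "fmt"

func sourcePartition(group, sourcePartitions int) int {
	return (group-1)%sourcePartitions + 1
}

func main() {
	ranges := []struct {
		name       string
		partitions int
	}{{"T1", 2}, {"T2", 4}, {"T3", 8}}

	for group := 1; group <= 8; group++ {
		fmt.Printf("Compaction Group %d-8\n", group)
		for _, r := range ranges {
			fmt.Printf("  %s - Partition %d-%d\n", r.name, sourcePartition(group, r.partitions), r.partitions)
		}
	}
}
```

Running it prints, for group 8, exactly the T1 Partition 2-2 / T2 Partition 4-4 / T3 Partition 8-8 combination shown above.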
Let's say we have 8 compaction groups but only 3 compactor instances, so each compactor needs to compact more than one compaction group. If one compactor needs to compact group `1-8` and group `3-8` locally, the source blocks `T1-T3` are the same for the two groups, so do we have a way to ensure we don't download those blocks twice on a single instance?
I feel this scenario is common, because when we need to shard more, say 16 shards, it is probably hard to have 16 compactor instances running.
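One possible shape of the block reuse being asked about, purely as a sketch (all names here are hypothetical, not Cortex APIs): a compactor instance that is assigned several compaction groups sharing source blocks could keep a small cache keyed by block ID, so each block is downloaded at most once per instance.

```go
// Sketch only: per-instance cache so two compaction groups scheduled on the
// same compactor share one download of a common source block.
package blockcache

type blockCache struct {
	paths map[string]string // block ID -> local directory of the downloaded block
}

func newBlockCache() *blockCache {
	return &blockCache{paths: map[string]string{}}
}

// getOrDownload returns the local copy of blockID, calling download (whatever
// fetches a block from object storage) only the first time the block is seen.
func (c *blockCache) getOrDownload(blockID string, download func(id string) (string, error)) (string, error) {
	if p, ok := c.paths[blockID]; ok {
		return p, nil
	}
	p, err := download(blockID)
	if err != nil {
		return "", err
	}
	c.paths[blockID] = p
	return p, nil
}
```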
Would there be any interest in reusing the "splitting" Prometheus compactor as used by Grafana Mimir? It's a straightforward adaptation of the Prometheus compactor such that it can produce multiple blocks from a single compaction. Code (Apache 2, as it's a Prometheus fork) starts here: https://github.com/grafana/mimir-prometheus/blob/main/tsdb/compact.go#L430, and "normal" compaction simply uses shardCount=1. Splitting series into output blocks is performed at [...]
As you can see, it uses [...]. We would be happy to contribute this upstream, if there was wider interest (e.g. Cortex, Thanos), and if the Prometheus project would be interested too.
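For readers who don't follow the link, here is a generic sketch of hash-based series splitting (my illustration of the technique in general, not the Mimir code referenced above): each series is routed to one of shardCount output blocks by hashing its label set.

```go
// Generic sketch of splitting series across output blocks by label hash.
package splitsketch

import "github.com/prometheus/prometheus/model/labels"

// shardFor picks the output block for one series; labels.Labels.Hash is the
// stable hash Prometheus exposes for a label set.
func shardFor(lset labels.Labels, shardCount uint64) uint64 {
	return lset.Hash() % shardCount
}

// splitSeries groups series label sets into shardCount buckets; a real
// splitting compactor would write each series' chunks to the corresponding
// output block instead of collecting label sets.
func splitSeries(series []labels.Labels, shardCount uint64) [][]labels.Labels {
	out := make([][]labels.Labels, shardCount)
	for _, lset := range series {
		s := shardFor(lset, shardCount)
		out[s] = append(out[s], lset)
	}
	return out
}
```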
Hey Peter, thanks for linking the change! It's great work, and I think it makes sense. However, for Cortex to use it, the interface needs to be changed a little bit so that it fits our use case. More specifically, we currently produce only 1 block out of N partitions per compaction, so that we can have better control over the parallelization of the process. Do you mind creating a Prometheus issue, and we can discuss there?
What this PR does:
The proposal for allowing the compactor to produce partitioned TSDB blocks, so that Cortex can work around the 64GB index issue while achieving faster compaction time.
Which issue(s) this PR fixes:
Fixes #4705
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`