
add proposal for timeseries partitioning in compactor #4843

Merged: 2 commits merged into cortexproject:master on Sep 27, 2022

Conversation

@roystchiang (Contributor) commented on Aug 25, 2022

What this PR does:
This proposal allows the compactor to produce partitioned TSDB blocks, so that Cortex can work around the 64GB index limit while achieving faster compaction times.

Which issue(s) this PR fixes:
Fixes #4705

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@friedrich-at-adobe (Contributor) left a comment

LGTM. Just some nits


## Problem and Requirements

Cortex introduced horizontally scaling compactor which allows multiple compactors to compact blocks for a single tenant, sharded by time interval. The compactor is capable of compacting multiple smaller blocks into a larger block, to reduce the the duplicated information in index. The following is an illustration of how the shuffle sharding compactor works, where each arrow represents a single compaction that can be carried out independently.
Contributor:

Suggested change
Cortex introduced horizontally scaling compactor which allows multiple compactors to compact blocks for a single tenant, sharded by time interval. The compactor is capable of compacting multiple smaller blocks into a larger block, to reduce the the duplicated information in index. The following is an illustration of how the shuffle sharding compactor works, where each arrow represents a single compaction that can be carried out independently.
Cortex introduced horizontally scaling compactor which allows multiple compactors to compact blocks for a single tenant, sharded by time interval. The compactor is capable of compacting multiple smaller blocks into a larger block, to reduce the duplicated information in index. The following is an illustration of how the shuffle sharding compactor works, where each arrow represents a single compaction that can be carried out independently.


* handling the 64GB index limit
* reducing the overall compaction time
* reducing the amount of data downloaded
Contributor:

We don't reduce the amount of data downloaded, but it's done in smaller batches

Suggested change
* reducing the amount of data downloaded
* downloading the data in smaller batches

* handling the 64GB index limit
* reducing the overall compaction time
* reducing the amount of data downloaded
* reducing the time required to compact
Contributor:

The most important thing is to allow the compactor to continue scaling horizontally


### Dynamic Number of Partitions

We can also increase/decrease the number of partitions without needing the `multiplier` factor. However, if a tenant sends a highly varying number of timeseries or label sizes, the index size can vary widely, resulting in a highly dynamic number of partitions. To perform deduplication, we would end up having to download all the sub-blocks, which is inefficient: less parallelization can be done, and we would spend more time downloading unnecessary blocks.
Contributor:

Thanks for putting this here. I was wondering why it has to be this complex.
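
To make the partition-count discussion above concrete, here is a minimal Go sketch, assuming series are assigned to a partition by `hash % partitionCount` as the proposal describes. The hashes and counts are made up for illustration; the point is that when partition counts divide each other (e.g. powers of a `multiplier` such as 2, 4, 8), a series' partition in a larger split fully determines its partition in the smaller splits, which is what keeps deduplication from needing every sub-block.

```go
package main

import "fmt"

// partitionFor returns the partition a series lands in, assuming series are
// placed by hash modulo the partition count (the scheme described in this proposal).
func partitionFor(seriesHash uint64, partitionCount uint64) uint64 {
	return seriesHash % partitionCount
}

func main() {
	// Hypothetical series hashes, for illustration only.
	hashes := []uint64{14, 21, 35, 42}

	for _, h := range hashes {
		// With partition counts that divide each other (2, 4, 8), the 8-way
		// partition index determines the smaller ones: p8 % 4 == p4 and p8 % 2 == p2.
		p2 := partitionFor(h, 2)
		p4 := partitionFor(h, 4)
		p8 := partitionFor(h, 8)
		fmt.Printf("hash=%d -> partitions: 2-way=%d, 4-way=%d, 8-way=%d\n", h, p2, p4, p8)
	}
}
```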

@alvinlin123 merged commit da25aa3 into cortexproject:master on Sep 27, 2022
Now that the planner knows the resulting compaction will have 8 partitions, it can start planning which groups of blocks go into a single compaction group. Given that we need 8 partitions in total, the planner goes through the process above to find out which blocks are necessary. Using the above example again, but with distinct time intervals T1, T2, and T3: T1 has 2 partitions, T2 has 4 partitions, and T3 has 8 partitions, and we want to produce T1-T3 blocks.
![Grouping](/images/proposals/timeseries-partitioning-in-compactor-grouping.png)
```
Compaction Group 1-8
```
Contributor:

Right now, for compaction partitioning, we split the work into 8 jobs for 8 compactors to pick up and run. This is okay for us, but I am not sure whether all Cortex users value compaction speed over resource usage.
Could we mention, or have, another mode where a single compaction job creates all 8 blocks at the same time?

Member:

If compaction is delayed, the read path needs more resources (store-gateways, queriers, etc.). So I would say yes, most users want speedy compaction to avoid spending more resources on the read path.

Contributor:

I agree speed is important, but resources should also be taken into consideration. Let's say a block has 16 shards; in that case we need to download it and compact it 16 times.
Compare a single compaction that creates 8 blocks with 8 compactions that generate 8 blocks: although the latter is faster, the single compaction should still be better in terms of total CPU time, since no block or index needs to be downloaded or verified multiple times.

```
Compaction Group 8-8
T1 - Partition 2-2
T2 - Partition 4-4
T3 - Partition 8-8
```
Contributor:

Let's say we have 8 compaction groups but only 3 compactor instances, so each compactor needs to compact more than one compaction group.
If one compactor needs to compact groups 1-8 and 3-8 locally, the source blocks T1-T3 are the same for the two groups; do we have a way to ensure those blocks are not downloaded twice by a single instance?
I feel this scenario is common, because when we need more shards, e.g. 16, it is probably hard to have 16 compactor instances running.
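
As a companion to the grouping excerpt above, here is a hedged Go sketch of how a planner could derive which source partition each compaction group needs from every time range. The `sourceRange` type and `groupsFor` helper are illustrative names, not the actual Cortex planner API; the output reproduces the Compaction Group 1-8 through 8-8 layout shown in the proposal.

```go
package main

import "fmt"

// sourceRange describes an already-compacted time range and how many
// partitions it was split into. Names are illustrative only.
type sourceRange struct {
	name           string
	partitionCount int
}

// groupsFor builds one compaction group per target partition. Because series
// are assigned by hash modulo the partition count, target partition p only
// needs source partition p % partitionCount from each time range.
func groupsFor(targetPartitions int, sources []sourceRange) [][]string {
	groups := make([][]string, targetPartitions)
	for p := 0; p < targetPartitions; p++ {
		for _, s := range sources {
			src := p % s.partitionCount
			groups[p] = append(groups[p],
				fmt.Sprintf("%s - Partition %d-%d", s.name, src+1, s.partitionCount))
		}
	}
	return groups
}

func main() {
	// T1 has 2 partitions, T2 has 4, T3 has 8, and we want 8 target partitions.
	sources := []sourceRange{{"T1", 2}, {"T2", 4}, {"T3", 8}}
	for p, blocks := range groupsFor(8, sources) {
		fmt.Printf("Compaction Group %d-8\n", p+1)
		for _, b := range blocks {
			fmt.Printf("  %s\n", b)
		}
	}
}
```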

@pstibrany (Contributor):

Would there be any interest in reusing the "splitting" Prometheus compactor as used by Grafana Mimir? It's a straightforward adaptation of the Prometheus compactor so that it can produce multiple blocks from a single compaction.

Code (Apache 2, as it's a Prometheus fork) starts here: https://github.com/grafana/mimir-prometheus/blob/main/tsdb/compact.go#L430, and "normal" compaction simply uses shardCount=1.

Splitting series into output blocks is performed at

As you can see, it uses labels.Hash() % numShards to compute the output block index, as suggested by this proposal too.

We would be happy to contribute this upstream if there were wider interest (e.g. Cortex, Thanos), and if the Prometheus project were interested too.
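
For readers unfamiliar with the splitting compactor described above, here is a small, hedged Go sketch of the core idea, using `labels.Hash() % shardCount` to choose an output block per series. It is a simplification for illustration only, not the Mimir code linked above; the example series are made up.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

// shardIndex picks which output block a series belongs to, mirroring the
// labels.Hash() % shardCount idea described above. With shardCount == 1,
// every series goes to block 0, i.e. a normal (non-splitting) compaction.
func shardIndex(lset labels.Labels, shardCount uint64) uint64 {
	return lset.Hash() % shardCount
}

func main() {
	// Hypothetical series, for illustration only.
	series := []labels.Labels{
		labels.FromStrings("__name__", "http_requests_total", "job", "api"),
		labels.FromStrings("__name__", "http_requests_total", "job", "web"),
		labels.FromStrings("__name__", "up", "job", "api"),
	}

	const shardCount = 8
	for _, lset := range series {
		fmt.Printf("%s -> output block %d of %d\n", lset, shardIndex(lset, shardCount), shardCount)
	}
}
```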

@roystchiang (Contributor, Author):

Hey Peter, thanks for linking the change!

It's great work, and I think it makes sense. However, for Cortex to use it, the interface needs to be changed a little bit, so that it fits our use-case. More specifically, we currently produce only 1 block out of N partitions per compaction, and this is so that we can have better control over the parallelization of the process.

Do you mind creating a Prometheus issue, and we can discuss there?


Successfully merging this pull request may close these issues.

Failure to compact due to maximum index size 64 GiB
6 participants