Spread TSDB head compaction over the configured interval #4364

Merged
merged 5 commits into from
Mar 3, 2023

pracucci
Collaborator

@pracucci pracucci commented Mar 3, 2023

What this PR does

I would like to experiment with spreading the TSDB head compaction over the configured interval. To make this possible, in this PR I propose two changes:

  1. Allow setting -blocks-storage.tsdb.head-compaction-interval to a value higher than 5m (in this PR I propose a max of 15m)
  2. Apply an initial jitter to the first compaction check, and then run it at regular intervals. This should effectively spread the compaction over time across different ingesters.

When running with the default config (1m interval) there should be no practical difference.

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pracucci pracucci marked this pull request as ready for review March 3, 2023 10:13
@pracucci pracucci requested review from a team as code owners March 3, 2023 10:13

for ctx.Err() == nil {
	select {
	case <-ticker.C:
	case <-time.After(nextPeriodicCompaction):
Member

If a forced compaction is triggered, it will restart the wait time, so subsequent compactions will be shifted (contrary to what the help string says).

We could use a Ticker and simply reset its duration after the first compaction.

Collaborator Author

WDYT about 6cb30be ?

@pstibrany
Member

Description makes it sound like we're spreading compactions for all TSDBs. That's not the case. All TSDBs will still be compacted at the same time, just this time is now slightly less aligned to 2h interval.

@pracucci pracucci force-pushed the add-initial-jitter-to-compaction-interval branch from 1efb6d7 to 6cb30be Compare March 3, 2023 10:59
@pstibrany
Member

> Description makes it sound like we're spreading compactions for all TSDBs. That's not the case. All TSDBs will still be compacted at the same time, just this time is now slightly less aligned to 2h interval.

I see that both PR description and CLI flag already mention that this helps across multiple ingesters only. I missed that before. Sorry for the noise.

Signed-off-by: Marco Pracucci <[email protected]>
Member

@pstibrany pstibrany left a comment

Thanks!

Signed-off-by: Marco Pracucci <[email protected]>
@pracucci pracucci merged commit 6030ca6 into main Mar 3, 2023
@pracucci pracucci deleted the add-initial-jitter-to-compaction-interval branch March 3, 2023 14:55
krajorama added a commit that referenced this pull request Mar 6, 2023
* Helm: nginx HPA and tests kubeversion fixes (#4299)

* Helm: fix Kubernetes override for nginx HPA

The template did not take into account the override "kubeVersionOverride".
Fix by using the mimir template implemented for this reason.

Signed-off-by: György Krajcsovits <[email protected]>

* Helm: fix missing Kubernetes version overrides in tests

The golden record tests need a fixed version because helm uses the version
of the default context and can produce different results between
contributor's machine and the CI environment.

Add logic to test build to inject the minimal version if not found in the
values file. Mainly because we cannot have a version override in the
small and large values files.

Signed-off-by: György Krajcsovits <[email protected]>
Co-authored-by: Jon Kartago Lamida <[email protected]>

* Ruler: load more tenants in parallel during startup (#4258)

* Ruler: load more tenants in parallel during startup

* add more tests

* fix lint

* Apply suggestions from code review

Co-authored-by: Marco Pracucci <[email protected]>

* Ingester: fix OOO blocks labelling (#4297)

* Ingester: fix OOO blocks labelling

This fixes a bug where the OutOfOrderExternalLabel
was being added to all blocks instead of the ones coming
from OOO data, when the feature flag was enabled.

* Changelog

* PR number to changelog

* Update previous changelog entry instead

* Ruler: load more tenants in parallel during startup

* fix context

* improve unittest

---------

Co-authored-by: Marco Pracucci <[email protected]>
Co-authored-by: Nicolás Pazos <[email protected]>

* Change language to match the math. (#4356)

* Upgrade mimir-prometheus to get a fast regexp path optimization (#4357)

* Upgrade mimir-prometheus to get a fast regexp path optimization

Signed-off-by: Marco Pracucci <[email protected]>

* Added CHANGELOG entry

Signed-off-by: Marco Pracucci <[email protected]>

---------

Signed-off-by: Marco Pracucci <[email protected]>

* Fix typo in the docs URL for migrating from Cortex (#4358)

* Remove forced paragraph break. (#4359)

* Bump actions/setup-go to v3 to resolve Node.js 12 deprecation warning. (#4361)

* Improve flaky `TestIngesterWithShippingDisabledDeletesBlocksOnlyAfterRetentionExpires` (#4362)

* Use more specific assertion to include more information in test failures.

See #4198.

* Reduce flakiness of test by extending retention period.

This gives the rest of the test more time to retrieve `oldBlocks`
before any of the blocks is removed.

* Add asynchronous validation scaffolding for block upload (#3411)

* Add asynchronous validation scaffolding for block upload

* addressed lint errors

* Update pkg/compactor/block_upload.go

Co-authored-by: Arve Knudsen <[email protected]>

* Update pkg/compactor/block_upload.go

Co-authored-by: Arve Knudsen <[email protected]>

* Update pkg/compactor/block_upload.go

Co-authored-by: Arve Knudsen <[email protected]>

* Update pkg/compactor/block_upload.go

Co-authored-by: Arve Knudsen <[email protected]>

* Update pkg/compactor/block_upload.go

Co-authored-by: Arve Knudsen <[email protected]>

* Update pkg/compactor/block_upload.go

Co-authored-by: Arve Knudsen <[email protected]>

* enable block upload for dev testing

* fixed validation errors, added debug log messages

* fixed cancelled context issue

* changed name of flag to disable complete block upload

* addressed reviewer feedback

* addressed reviewer feedback

* Address some review comments, WIP

* Small spacing cleanup

* Transition to in-memory bucket for block finish test

* Async validation test coordination, adding configuration flags

* Small comment and flag fix

* Swap config strategy, test still needs separation

* Docs + lint

* Review comments, begin separating tests

* Finish validateAndComplete test

* Update docs

* Remove docker compose arguments

* Regenerate rather than modify

* Review, add test for periodicValidationUpdater

* Make validateAndComplete test clearer, add upload meta check

* Set missing cancelContext

* Add sleep as suggested

* Configure data directory

* fixed compactor data dir in e2e test

* Add changelog entry

* Split into two entries

* Missing entry number

* Update CHANGELOG.md

---------

Co-authored-by: Arve Knudsen <[email protected]>
Co-authored-by: Andy Asp <[email protected]>
Co-authored-by: Andy Asp <[email protected]>

* Jsonnet: honor the minimum shard size configured (#4363)

Signed-off-by: Marco Pracucci <[email protected]>

* [CHANGE] Ruler: set default `evaluation-delay-duration` to 1m (#4250)

* change the default evaluation delay of ruler to 1m

* revert the changes in ruler test

* change integration test to set ruler default value

* fix integration tests

* Update integration/configs.go

Co-authored-by: Marco Pracucci <[email protected]>

* Update CHANGELOG.md

Co-authored-by: Peter Štibraný <[email protected]>

---------

Co-authored-by: Marco Pracucci <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>

* [Chore] Update jsonnet manifest create query frontend discovery only when it is necessary (#4353)

* [Chore] update jsonnet manifest, avoid setting querier.frontend-address or creating query-frontend-discovery when deployment mode is microservices or query-scheduler is enabled

* linter and changelog

* Helm: fix parity with jsonnet on query frontend headless service

Do not generate query-frontend-headless service if query scheduler
 is enabled

Signed-off-by: György Krajcsovits <[email protected]>

* Apply suggestions from code review

Co-authored-by: Marco Pracucci <[email protected]>

* correct changelog

* regenerate helm golden files

* Update CHANGELOG.md

---------

Signed-off-by: György Krajcsovits <[email protected]>
Co-authored-by: György Krajcsovits <[email protected]>
Co-authored-by: Marco Pracucci <[email protected]>

* Remove block validation mimirtool changelog entry (#4369)

* Spread TSDB head compaction over the configured interval (#4364)

* Spread TSDB head compaction over the configured interval

Signed-off-by: Marco Pracucci <[email protected]>

* Fixed unit test

Signed-off-by: Marco Pracucci <[email protected]>

* Apply suggestion from code review

Signed-off-by: Marco Pracucci <[email protected]>

* Fix typo in CHANGELOG entry

Signed-off-by: Marco Pracucci <[email protected]>

* Fix typo in CHANGELOG entry

Signed-off-by: Marco Pracucci <[email protected]>

---------

Signed-off-by: Marco Pracucci <[email protected]>

* Fix port number values. (#4368)

* Ruler: change deployment max surge and max unavailable to reduce ownership spillover (#4381)

* Ruler: change deployment max surge and max unavailable to reduce ownership spillover

Signed-off-by: Marco Pracucci <[email protected]>

* Apply suggestions from code review

Co-authored-by: Dimitar Dimitrov <[email protected]>

---------

Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Dimitar Dimitrov <[email protected]>

* Move "Note:" about cross-zone costs to "Costs" (#4370)

This note was in an unrelated section.

Signed-off-by: Oleg Zaytsev <[email protected]>

* Change default -blocks-storage.tsdb.retention-period from 24h to 13h (#4382)

Signed-off-by: Marco Pracucci <[email protected]>

* Support histograms in pkg/storage and update other breakages (#4354)

* Support histograms in pkg/storage and update other breakages

Signed-off-by: Ganesh Vernekar <[email protected]>

---------

Signed-off-by: György Krajcsovits <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Oleg Zaytsev <[email protected]>
Signed-off-by: Ganesh Vernekar <[email protected]>
Co-authored-by: Jon Kartago Lamida <[email protected]>
Co-authored-by: ying-jeanne <[email protected]>
Co-authored-by: Marco Pracucci <[email protected]>
Co-authored-by: Nicolás Pazos <[email protected]>
Co-authored-by: Ursula Kallio <[email protected]>
Co-authored-by: l3ioo <[email protected]>
Co-authored-by: Charles Korn <[email protected]>
Co-authored-by: Vernon Miller <[email protected]>
Co-authored-by: Arve Knudsen <[email protected]>
Co-authored-by: Andy Asp <[email protected]>
Co-authored-by: Andy Asp <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>
Co-authored-by: Dimitar Dimitrov <[email protected]>
Co-authored-by: Oleg Zaytsev <[email protected]>
Co-authored-by: Ganesh Vernekar <[email protected]>
2 participants