
Compactor should ignore dirty data or automatically delete corrupted data #4046

Closed
mikechengwei opened this issue Apr 12, 2021 · 8 comments

Comments

@mikechengwei commented Apr 12, 2021

Thanos, Prometheus and Golang version used:
Thanos version: v0.17.2
Prometheus: v2.11.0
go version go1.14.4

Object Storage Provider:
Ceph S3.
What happened:
Thanos compact is stuck in a continuous restart loop when a block's index file is missing.

Error log:
(screenshot of the error log)
Listing of the Ceph data chunk folder:
(screenshot of the block folder contents)

I think we can actively skip the damaged block and continue the merge; a rough sketch of the idea follows below.

https://github.com/thanos-io/thanos/blob/4bf956deb64b2a12c8c26d28f03a6795d75ada4f/pkg/compact/compact.go#L722-739.

What you expected to happen:
The compactor should not keep restarting.
How to reproduce it (as minimally and precisely as possible):
Chunk/index data is lost from a block in object storage.
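
For illustration, here is a minimal, self-contained Go sketch of the "skip and continue" idea. `openBlockIndex` and `filterDamagedBlocks` are made-up names, not the real functions in compact.go; the actual change would go around the loop linked above.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// openBlockIndex stands in for whatever the compactor does when it opens a
// downloaded block; here it only opens the index file to keep the sketch small.
func openBlockIndex(dir string) (*os.File, error) {
	return os.Open(filepath.Join(dir, "index"))
}

// filterDamagedBlocks tries to open every block in the group. Blocks whose
// index is missing are skipped (and reported) instead of failing the group.
func filterDamagedBlocks(blockDirs []string) (usable, skipped []string, err error) {
	for _, dir := range blockDirs {
		f, err := openBlockIndex(dir)
		if errors.Is(err, fs.ErrNotExist) {
			skipped = append(skipped, dir) // damaged block: leave it out of the merge
			continue
		}
		if err != nil {
			return nil, nil, err // anything else is still a hard error
		}
		f.Close()
		usable = append(usable, dir)
	}
	return usable, skipped, nil
}

func main() {
	// Placeholder paths; real block directories are named by ULID.
	usable, skipped, err := filterDamagedBlocks([]string{"./data/block-a", "./data/block-b"})
	if err != nil {
		fmt.Println("compaction error:", err)
		return
	}
	fmt.Println("merging:", usable, "skipped:", skipped)
}
```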

@mikechengwei mikechengwei changed the title Compactor ignore dirty data or automatically delete corrupted data Compactor should ignore dirty data or automatically delete corrupted data Apr 12, 2021
@wiardvanrij (Member)

I think we can/should do a better sanity check on these blocks, i.e. check whether meta.json, the index and the chunks are present (rough sketch below). Looking at the errors, the code just assumes there is an index file.
Though I'm not so sure how to "deal" with it, as there can be numerous reasons why the index file is missing.
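
Roughly what I mean, as a std-lib-only, illustrative sketch; `sanityCheckBlock` and the layout here are made up and are not existing Thanos code. The real check would live in the block download/verify path.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// sanityCheckBlock verifies that a downloaded block directory has the three
// things the compactor expects: meta.json, an index file, and at least one
// chunk segment under chunks/.
func sanityCheckBlock(dir string) error {
	for _, name := range []string{"meta.json", "index"} {
		if _, err := os.Stat(filepath.Join(dir, name)); err != nil {
			return fmt.Errorf("block %s: missing or unreadable %s: %w", dir, name, err)
		}
	}
	segs, err := filepath.Glob(filepath.Join(dir, "chunks", "*"))
	if err != nil {
		return err
	}
	if len(segs) == 0 {
		return fmt.Errorf("block %s: chunks/ is empty", dir)
	}
	return nil
}

func main() {
	// Placeholder path; real block directories are named by ULID.
	if err := sanityCheckBlock("./data/01EXAMPLEBLOCK"); err != nil {
		fmt.Println("refusing to compact:", err)
		return
	}
	fmt.Println("block looks complete")
}
```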

I'm making some assumptions about the logic here, but the compactor sees the ULID -> downloads the entire folder -> does its thing. In your case it fails because the index file is not there. Yet what if the index file is in the object store, but the compactor simply failed to download all the contents? What if someone made an 'oopsie' and set the wrong permissions on a file, or what if the disk is full? This could even happen when, for example, the sidecar ships its data but fails 'somewhere' and mistakenly misses the index. Then the block is not broken at all and could be fixed.

IMO the compactor's job is pretty much irreversible. 'Dirty data' or even 'corrupted data' should be the exception rather than the standard. So my initial thought here is that we should not make assumptions and, unfortunately, require manual action as a sanity check before proceeding.

Do you perhaps have a proposal on when we could define data as corrupted and 100% not fixable? Or WDYT?

@mikechengwei (Author)

I get your meaning, and I agree with you that 'dirty data' or even 'corrupted data' shouldn't be the standard.

I think we can report the error when we meet corrupted data instead of keeping the container in a restart loop. We should ensure that the application does not crash because of such exceptions. WDYT? @wiardvanrij
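
To make the "not crash" part concrete, here is a tiny illustrative sketch of the control flow; all names are made up, the failure is simulated, and this is not the real compactor loop.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// runCompactionOnce stands in for a single compaction pass; it always fails
// here, simulating a block with a missing index.
func runCompactionOnce() error {
	return errors.New("simulated failure: block index file missing")
}

func main() {
	halted := false // in the real compactor this could back a "halted" gauge metric

	// Bounded loop only so this example terminates; the real loop runs forever.
	for i := 0; i < 3; i++ {
		if err := runCompactionOnce(); err != nil {
			// Record the failure and stay alive instead of exiting, so the
			// container does not enter a restart loop and an operator can
			// inspect or delete the broken block.
			halted = true
			fmt.Println("compaction halted:", err)
			time.Sleep(100 * time.Millisecond)
			continue
		}
		halted = false
	}
	fmt.Println("process still alive, halted =", halted)
}
```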

@mikechengwei (Author)

Sorry, I have been busy recently.
If you agree that we should not crash the process when an exception occurs, I can help improve the code.

@bwplotka (Member)

Thanks a lot for this report. We sometimes hit a similar issue when we have a broken, manually uploaded block: the compactor is halted and awaits a manual decision.

Unfortunately, skipping is not easy right now; we don't have a mechanism for that. Simply continuing, as in the linked PR, does not help either (compaction will pick the same block again in the next iteration). We are discussing Compact v2 today, and I will bring this issue up.
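
Just to sketch one shape such a skip mechanism could take (purely illustrative; the marker file name and helpers are made up for this example, and in the real compactor the marker would be uploaded to object storage rather than written locally): persist a small marker next to the block and have the planner filter marked blocks out, so the same block is not re-selected on every iteration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// noCompactMark is an illustrative marker persisted next to a block; the
// planner would skip any block directory that contains it.
type noCompactMark struct {
	ID     string `json:"id"`
	Reason string `json:"reason"`
}

func markNoCompact(blockDir, reason string) error {
	m := noCompactMark{ID: filepath.Base(blockDir), Reason: reason}
	b, err := json.Marshal(m)
	if err != nil {
		return err
	}
	// Stand-in for an object-store upload.
	return os.WriteFile(filepath.Join(blockDir, "no-compact-mark.json"), b, 0o644)
}

// plan filters out blocks that carry the marker, so a broken block does not
// get picked up again in the next iteration.
func plan(blockDirs []string) []string {
	var out []string
	for _, dir := range blockDirs {
		if _, err := os.Stat(filepath.Join(dir, "no-compact-mark.json")); err == nil {
			continue
		}
		out = append(out, dir)
	}
	return out
}

func main() {
	dir := "./data/broken-block" // placeholder path
	_ = os.MkdirAll(dir, 0o755)
	if err := markNoCompact(dir, "index file missing"); err != nil {
		fmt.Println("mark failed:", err)
		return
	}
	fmt.Println("plan:", plan([]string{dir, "./data/healthy-block"}))
}
```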

@mikechengwei (Author)

OK. I hope we can find a reasonable solution and optimize this; I am willing to help complete it.

stale bot commented Jun 20, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Jun 20, 2021
@mikechengwei mikechengwei reopened this Jun 28, 2021
@stale stale bot removed the stale label Jun 28, 2021

stale bot commented Aug 28, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Aug 28, 2021

stale bot commented Sep 19, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Sep 19, 2021