Store gateway fails to sync block meta.json #1874
The error from the Compactor looks similar to #453. A couple of questions for you:
I'd split it into two issues. First: the store should not crash-loop because of one block with a missing meta.json. In my case I have stores pointed at dozens of TSDB blocks that actually duplicate each other, so one block being partially uploaded or corrupted is not a big deal. The second issue concerns the compactor.
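The "skip, don't crash" behaviour asked for above can be sketched roughly as follows. This is a hypothetical illustration, not Thanos code; the directory layout (`<data-dir>/<block-ulid>/meta.json`) and function names are assumptions:

```python
import json
import pathlib


def load_block_metas(data_dir):
    """Collect parseable meta.json files, skipping damaged blocks.

    A block with a missing, empty, or unparseable meta.json is recorded
    and skipped instead of aborting the whole sync.
    """
    metas, skipped = {}, []
    for block_dir in pathlib.Path(data_dir).iterdir():
        if not block_dir.is_dir():
            continue
        meta_path = block_dir / "meta.json"
        try:
            with open(meta_path) as f:
                metas[block_dir.name] = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            skipped.append(block_dir.name)  # log and move on
    return metas, skipped
```

A caller would then serve queries from `metas` and surface `skipped` via logs or a metric, so one bad block degrades observability rather than availability.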
@zygiss Hi, thanks for responding!
@homelessnessbo Would you like me to restructure this into two separate GitHub issues?
By "S3" most people mean AWS, but there are other storage systems that implement the S3 API (Ceph, OpenStack Swift, MinIO). There are several open bugs about block corruption when using Ceph (#1728, #1345), for instance, so I wondered if you're affected by one of those.
@zygiss Ahh, sorry about the misunderstanding. I am using AWS S3.
We found a similar problem, where the local cache of meta.json contained empty files. This caused our store servers to crash-loop. I verified that the meta.json files that ended up empty on the store were created by the compactor, but the timestamps in GCS were from months earlier. So it seems they were downloaded by the store but written out as zero bytes. Details here: https://gitlab.com/gitlab-com/gl-infra/production/issues/1525
Ack, thanks for the report. This is a really useful finding. I believe an empty meta.json can be left on the filesystem when the store crashes while downloading meta.json. I am on it.
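The standard way to avoid leaving a partial or empty file behind after a crash is the write-to-temp-then-rename pattern; a minimal sketch (illustrative only, not the actual fix that landed):

```python
import os
import tempfile


def atomic_write(path, data):
    """Write bytes so readers never observe a partial or empty file.

    Writes to a temp file in the same directory, flushes it to disk,
    then atomically renames it over the target. A crash mid-write
    leaves only a stray temp file, never a truncated meta.json.
    """
    dir_ = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dir_, prefix=".meta.json.tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before rename
        os.replace(tmp, path)  # atomic within one filesystem
    except BaseException:
        os.unlink(tmp)
        raise
```

The temp file must live in the same directory as the target, since `os.replace` is only atomic within a single filesystem.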
I also encountered the same problem using Aliyun objstore.
When I looked at the data in the Aliyun objstore bucket, I found that:
Fixes: #1874
* Corrupted disk cache for meta.json is handled gracefully.
* Synchronize was not taking into account deletion by removing meta.json.
* Prepare for future implementation of https://thanos.io/proposals/201901-read-write-operations-bucket.md/
* Better observability for the synchronize process.
* More logs for the store startup process.
TODO in a separate PR:
* More observability for index-cache loading / adding time.
Signed-off-by: Bartlomiej Plotka <[email protected]>
Thanos, Prometheus and Golang version used:
Image: quay.io/thanos/thanos:v0.9.0
thanos, version 0.9.0 (branch: HEAD, revision: 0833cad)
build user: circleci@8e3da52515a2
build date: 20191203-17:03:13
go version: go1.13.1
prom/prometheus:v2.12.0
prometheus, version 2.12.0 (branch: HEAD, revision: 43acd0e2e93f9f70c49b2267efa0124f1e759e86)
build user: root@7a9dbdbe0cc7
build date: 20190818-13:53:16
go version: go1.12.8
Object Storage Provider: S3
What happened: The compactor and store gateway entered a crash loop. I checked the validity of each JSON file in the S3 bucket with an ad hoc script; every JSON file appears to be valid JSON. I ran the
thanos bucket verify --repair
tool. It indicated that it fixed a couple of issues, but none of the problems I observed were resolved. I have tried removing some of the bad blocks that the compactor complained about, but it continued crashing on other bad blocks.
What you expected to happen: On the store gateway, I expect some way to identify which block contains the invalid JSON file. On the compactor, I expect the repair tool to fix any bad blocks.
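A block-by-block check like the ad hoc script mentioned above could look like this. It is a sketch under the assumption that the bucket has first been mirrored locally (e.g. with `aws s3 sync`) into a directory of `<block-ulid>/meta.json` paths; it pinpoints which blocks hold empty or unparseable meta.json files:

```python
import json
import pathlib


def find_invalid_meta_files(root):
    """Return (path, reason) pairs for bad meta.json files.

    Flags zero-byte files separately from JSON parse failures, since
    the empty-file case is the symptom reported in this thread.
    """
    bad = []
    for meta in sorted(pathlib.Path(root).glob("*/meta.json")):
        data = meta.read_bytes()
        if len(data) == 0:
            bad.append((str(meta), "empty file"))
            continue
        try:
            json.loads(data)
        except json.JSONDecodeError as e:
            bad.append((str(meta), f"invalid JSON: {e}"))
    return bad
```

Running this over the mirrored bucket names the offending block ULIDs directly, which is the identification step the store gateway logs don't currently provide.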
How to reproduce it (as minimally and precisely as possible): Unsure
Full logs to relevant components:
Compactor Logs:
Store Gateway Logs:
Anything else we need to know:
The bucket contains approximately 2TB of metrics. I would like to be able to recover the metrics in this bucket, but currently, it is unusable since the store gateway is crashing.