Store gateway fails to sync block meta.json #1874

Closed
ryanprobus opened this issue Dec 11, 2019 · 10 comments · Fixed by #1936

ryanprobus commented Dec 11, 2019

Thanos, Prometheus and Golang version used:

Image: quay.io/thanos/thanos:v0.9.0
thanos, version 0.9.0 (branch: HEAD, revision: 0833cad)
build user: circleci@8e3da52515a2
build date: 20191203-17:03:13
go version: go1.13.1

prom/prometheus:v2.12.0
prometheus, version 2.12.0 (branch: HEAD, revision: 43acd0e2e93f9f70c49b2267efa0124f1e759e86)
build user: root@7a9dbdbe0cc7
build date: 20190818-13:53:16
go version: go1.12.8

Object Storage Provider: S3

What happened: The compactor and store gateway entered a crash loop. I checked the validity of each JSON file in the S3 bucket with an ad hoc script; every JSON file appears to be valid JSON. I ran the thanos bucket verify --repair tool, which reported that it fixed a couple of issues, but that did not resolve anything I observed. I have also tried removing some of the bad blocks the compactor complained about, but it kept crashing on other bad blocks.
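
For reference, a minimal sketch of the kind of ad hoc check described above, assuming the AWS SDK for Go v1; the bucket name and region are placeholders, not my actual setup:

```go
// Illustrative ad hoc check: list every meta.json object in the bucket and
// flag any that are empty or not valid JSON.
package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"log"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := s3.New(sess)
	bucket := "my-thanos-bucket" // placeholder

	err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{Bucket: aws.String(bucket)},
		func(page *s3.ListObjectsV2Output, lastPage bool) bool {
			for _, obj := range page.Contents {
				if !strings.HasSuffix(*obj.Key, "/meta.json") {
					continue
				}
				out, err := svc.GetObject(&s3.GetObjectInput{Bucket: aws.String(bucket), Key: obj.Key})
				if err != nil {
					log.Printf("get %s: %v", *obj.Key, err)
					continue
				}
				b, _ := ioutil.ReadAll(out.Body)
				out.Body.Close()
				var v interface{}
				if len(b) == 0 || json.Unmarshal(b, &v) != nil {
					fmt.Printf("INVALID (%d bytes): %s\n", len(b), *obj.Key)
				}
			}
			return true // continue paging
		})
	if err != nil {
		log.Fatal(err)
	}
}
```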

What you expected to happen: On the store gateway, I expect to see some way to identify which block contains the invalid JSON file. On the compactor, I expect the repair tool to fix any bad blocks.

How to reproduce it (as minimally and precisely as possible): Unsure

Full logs to relevant components:
Compactor Logs:

level=warn ts=2019-12-11T20:30:35.543841262Z caller=prober.go:117 msg="changing probe status" status=not-ready reason="error executing compaction: compaction failed: compaction failed for group 0@14986719258624701985: compact blocks [/data/compact/0@14986719258624701985/01DME8R02C5W0EQKNX9HTHBR26 /data/compact/0@14986719258624701985/01DMEFKQAE6HHZPE7YFV74D06V /data/compact/0@14986719258624701985/01DMEPFEJMYW6A0FPCGBJH0QEA /data/compact/0@14986719258624701985/01DMEXB5TCV14HJX1CVJCCXX2Q]: write compaction: chunk 8 not found: reference sequence 0 out of range"
level=info ts=2019-12-11T20:30:35.543939219Z caller=http.go:78 service=http/server component=compact msg="internal server shutdown" err="error executing compaction: compaction failed: compaction failed for group 0@14986719258624701985: compact blocks [/data/compact/0@14986719258624701985/01DME8R02C5W0EQKNX9HTHBR26 /data/compact/0@14986719258624701985/01DMEFKQAE6HHZPE7YFV74D06V /data/compact/0@14986719258624701985/01DMEPFEJMYW6A0FPCGBJH0QEA /data/compact/0@14986719258624701985/01DMEXB5TCV14HJX1CVJCCXX2Q]: write compaction: chunk 8 not found: reference sequence 0 out of range"
level=info ts=2019-12-11T20:30:35.543960437Z caller=prober.go:137 msg="changing probe status" status=not-healthy reason="error executing compaction: compaction failed: compaction failed for group 0@14986719258624701985: compact blocks [/data/compact/0@14986719258624701985/01DME8R02C5W0EQKNX9HTHBR26 /data/compact/0@14986719258624701985/01DMEFKQAE6HHZPE7YFV74D06V /data/compact/0@14986719258624701985/01DMEPFEJMYW6A0FPCGBJH0QEA /data/compact/0@14986719258624701985/01DMEXB5TCV14HJX1CVJCCXX2Q]: write compaction: chunk 8 not found: reference sequence 0 out of range"
level=error ts=2019-12-11T20:30:35.54398373Z caller=main.go:194 msg="running command failed" err="error executing compaction: compaction failed: compaction failed for group 0@14986719258624701985: compact blocks [/data/compact/0@14986719258624701985/01DME8R02C5W0EQKNX9HTHBR26 /data/compact/0@14986719258624701985/01DMEFKQAE6HHZPE7YFV74D06V /data/compact/0@14986719258624701985/01DMEPFEJMYW6A0FPCGBJH0QEA /data/compact/0@14986719258624701985/01DMEXB5TCV14HJX1CVJCCXX2Q]: write compaction: chunk 8 not found: reference sequence 0 out of range"

Store Gateway Logs:

level=debug ts=2019-12-11T19:42:56.423840267Z caller=main.go:101 msg="maxprocs: Updating GOMAXPROCS=[1]: using minimum allowed GOMAXPROCS"
level=info ts=2019-12-11T19:42:56.424073196Z caller=main.go:149 msg="Tracing will be disabled"
level=info ts=2019-12-11T19:42:56.424156986Z caller=factory.go:43 msg="loading bucket configuration"
level=info ts=2019-12-11T19:42:56.620788641Z caller=cache.go:172 msg="created index cache" maxItemSizeBytes=131072000 maxSizeBytes=262144000 maxItems=math.MaxInt64
level=info ts=2019-12-11T19:42:56.621005242Z caller=options.go:20 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-12-11T19:42:56.621309084Z caller=store.go:261 msg="starting store node"
level=info ts=2019-12-11T19:42:56.621513052Z caller=prober.go:127 msg="changing probe status" status=healthy
level=info ts=2019-12-11T19:42:56.62156242Z caller=http.go:53 service=http/server component=store msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2019-12-11T19:42:56.621814375Z caller=store.go:216 msg="initializing bucket store"
level=info ts=2019-12-11T19:45:19.464301604Z caller=http.go:78 service=http/server component=store msg="internal server shutdown" err="bucket store initial sync: sync block: iter: load meta: read meta.json: unexpected end of JSON input"
level=info ts=2019-12-11T19:45:19.464692447Z caller=prober.go:137 msg="changing probe status" status=not-healthy reason="bucket store initial sync: sync block: iter: load meta: read meta.json: unexpected end of JSON input"
level=info ts=2019-12-11T19:45:19.464742786Z caller=grpc.go:117 service=gRPC/server component=store msg="gracefully stopping internal server"
level=info ts=2019-12-11T19:45:19.464781601Z caller=grpc.go:129 service=gRPC/server component=store msg="internal server shutdown" err="bucket store initial sync: sync block: iter: load meta: read meta.json: unexpected end of JSON input"
level=info ts=2019-12-11T19:45:19.464821219Z caller=prober.go:107 msg="changing probe status" status=ready
level=info ts=2019-12-11T19:45:19.464874564Z caller=grpc.go:98 service=gRPC/server component=store msg="listening for StoreAPI gRPC" address=0.0.0.0:10901
level=error ts=2019-12-11T19:45:19.464912706Z caller=main.go:194 msg="running command failed" err="bucket store initial sync: sync block: iter: load meta: read meta.json: unexpected end of JSON input"

Anything else we need to know:
The bucket contains approximately 2TB of metrics. I would like to be able to recover the metrics in this bucket, but currently, it is unusable since the store gateway is crashing.


zygiss commented Dec 12, 2019

The error from the Compactor looks similar to #453.

Couple of questions for you:

  • what is the backing store for your S3 bucket?
  • what version of Thanos produced these blocks? The producer will either be Compactor (if the source blocks are already compacted) or Sidecar (if these are new blocks)

ivan-kiselev commented:

I'd split this into two issues:

The store should not crash-loop because of a single block with a missing or corrupted meta.json. In my case I have stores pointed at dozens of TSDB blocks that largely duplicate each other, so one block being partially uploaded or corrupted is not a big deal.

The other issue concerns the compactor.
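
A minimal sketch of the skip-and-continue behavior being asked for here; the data directory path and function names below are illustrative, not the actual Thanos sync code:

```go
// Illustrative only: scan cached block directories and skip any block whose
// meta.json is missing or unparseable, instead of failing the whole sync.
package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
)

// syncBlocks returns the IDs of blocks whose meta.json parsed successfully;
// unparseable ones are reported and skipped.
func syncBlocks(dir string) ([]string, error) {
	entries, err := ioutil.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var ok []string
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		b, err := ioutil.ReadFile(filepath.Join(dir, e.Name(), "meta.json"))
		var meta map[string]interface{}
		if err != nil || json.Unmarshal(b, &meta) != nil {
			fmt.Fprintf(os.Stderr, "skipping block %s: bad or missing meta.json\n", e.Name())
			continue // keep serving the other blocks
		}
		ok = append(ok, e.Name())
	}
	return ok, nil
}

func main() {
	blocks, err := syncBlocks("/var/thanos/store") // placeholder for --data-dir
	if err != nil {
		panic(err)
	}
	fmt.Println("usable blocks:", blocks)
}
```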


ryanprobus commented Dec 12, 2019

@zygiss Hi, thanks for responding!

  1. We are using an S3 bucket. Can you clarify what you mean by "backing store"?
  2. The set of blocks causing the compactor failure in the logs I provided was produced by the sidecar. However, I also saw bad compactor-produced blocks, which I backed up and removed from the bucket. The sidecar blocks were produced by an older version (v0.4.0), since I only recently upgraded Thanos; initially I did not see any issues.

ryanprobus (author) commented:

@homelessnessbo Would you like me to restructure this into two separate GitHub issues?


zygiss commented Dec 13, 2019

@ryanprobus:

1. We are using an s3 bucket. Can you clarify what you mean by "backing store"?

By "S3", most people mean AWS, but there are other storage systems that implement the S3 API (Ceph, OpenStack Swift, MinIO).

There are several open bugs about block corruption when using Ceph (#1728, #1345), for instance, so I wondered whether you're affected by one of those.


ryanprobus commented Dec 13, 2019

@zygiss Ahh, sorry about the misunderstanding. I am using AWS S3.


SuperQ commented Jan 2, 2020

We found a similar problem, where the local cache of meta.json files contained empty files. This caused our store servers to crash-loop. I wonder if there's a race between the compactor or sidecars adding new meta.json files and the store downloading them.

I verified that the meta.json files that ended up empty on the store were created by the compactor, but the timestamps in GCS were from months earlier. So it seems they were downloaded by the store but written out as zero bytes.

Details here: https://gitlab.com/gitlab-com/gl-infra/production/issues/1525
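
A quick way to confirm zero-byte cached files like the ones described above; this is a hedged sketch, with the store's data directory path as a placeholder:

```go
// Walk the store's local cache and report any empty meta.json files.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dataDir := "/var/thanos/store" // placeholder for the store's --data-dir
	err := filepath.Walk(dataDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return nil // skip unreadable entries rather than aborting
		}
		if !info.IsDir() && info.Name() == "meta.json" && info.Size() == 0 {
			fmt.Println("empty cached meta.json:", path)
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", err)
	}
}
```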


bwplotka commented Jan 2, 2020

Ack, thanks for the report. This is a really useful finding. I believe an empty meta.json can be left on the filesystem if the store crashes while downloading meta.json.

I am on it.
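
For illustration only, and not necessarily what the eventual fix in #1936 does: the classic way to avoid leaving a truncated meta.json behind is to download to a temporary file, fsync, and atomically rename it into place. The helper name below is hypothetical:

```go
// Sketch of the write-to-temp-then-rename pattern that avoids a truncated
// meta.json if the process dies mid-download.
package main

import (
	"io"
	"os"
	"strings"
)

// atomicWriteFile is a hypothetical helper, not an actual Thanos function.
func atomicWriteFile(dst string, r io.Reader) error {
	tmp := dst + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := io.Copy(f, r); err != nil {
		f.Close()
		os.Remove(tmp)
		return err
	}
	if err := f.Sync(); err != nil { // make sure bytes hit disk before rename
		f.Close()
		os.Remove(tmp)
		return err
	}
	if err := f.Close(); err != nil {
		os.Remove(tmp)
		return err
	}
	// Rename is atomic on POSIX filesystems: readers see either the old file
	// or the complete new one, never a half-written meta.json.
	return os.Rename(tmp, dst)
}

func main() {
	meta := `{"ulid":"01DXEXAMPLE","version":1}` // toy payload
	if err := atomicWriteFile("meta.json", strings.NewReader(meta)); err != nil {
		panic(err)
	}
}
```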


yahaa commented Jan 3, 2020

I also encountered the same problem using Aliyun object storage.
Compactor (v0.9.0) logs:

level=warn ts=2020-01-03T03:08:51.82788962Z caller=prober.go:117 msg="changing probe status" status=not-ready reason="error executing compaction: compaction failed: compaction failed for group 0@17628213559036925513: compact blocks [/var/thanos/compactor/compact/0@17628213559036925513/01DXCYAP626GTB43C85XZ6AS3A /var/thanos/compactor/compact/0@17628213559036925513/01DXD56DE29C91EMH303GFD80K /var/thanos/compactor/compact/0@17628213559036925513/01DXDC24P3R8HAM4GSDZWXCKA9 /var/thanos/compactor/compact/0@17628213559036925513/01DXDJXVY3Q04CHAG4XCNSZH3F]: write compaction: chunk 8 not found: reference sequence 0 out of range"
level=info ts=2020-01-03T03:08:51.828156418Z caller=http.go:78 service=http/server component=compact msg="internal server shutdown" err="error executing compaction: compaction failed: compaction failed for group 0@17628213559036925513: compact blocks [/var/thanos/compactor/compact/0@17628213559036925513/01DXCYAP626GTB43C85XZ6AS3A /var/thanos/compactor/compact/0@17628213559036925513/01DXD56DE29C91EMH303GFD80K /var/thanos/compactor/compact/0@17628213559036925513/01DXDC24P3R8HAM4GSDZWXCKA9 /var/thanos/compactor/compact/0@17628213559036925513/01DXDJXVY3Q04CHAG4XCNSZH3F]: write compaction: chunk 8 not found: reference sequence 0 out of range"
level=info ts=2020-01-03T03:08:51.828202681Z caller=prober.go:137 msg="changing probe status" status=not-healthy reason="error executing compaction: compaction failed: compaction failed for group 0@17628213559036925513: compact blocks [/var/thanos/compactor/compact/0@17628213559036925513/01DXCYAP626GTB43C85XZ6AS3A /var/thanos/compactor/compact/0@17628213559036925513/01DXD56DE29C91EMH303GFD80K /var/thanos/compactor/compact/0@17628213559036925513/01DXDC24P3R8HAM4GSDZWXCKA9 /var/thanos/compactor/compact/0@17628213559036925513/01DXDJXVY3Q04CHAG4XCNSZH3F]: write compaction: chunk 8 not found: reference sequence 0 out of range"
level=error ts=2020-01-03T03:08:51.82828033Z caller=main.go:194 msg="running command failed" err="error executing compaction: compaction failed: compaction failed for group 0@17628213559036925513: compact blocks [/var/thanos/compactor/compact/0@17628213559036925513/01DXCYAP626GTB43C85XZ6AS3A /var/thanos/compactor/compact/0@17628213559036925513/01DXD56DE29C91EMH303GFD80K /var/thanos/compactor/compact/0@17628213559036925513/01DXDC24P3R8HAM4GSDZWXCKA9 /var/thanos/compactor/compact/0@17628213559036925513/01DXDJXVY3Q04CHAG4XCNSZH3F]

I looked at the data in the Aliyun object storage bucket and found that:

01DXCYAP626GTB43C85XZ6AS3A has a chunks folder, an index file, and a meta.json file
01DXD56DE29C91EMH303GFD80K has a chunks folder, an index file, and a meta.json file
01DXDC24P3R8HAM4GSDZWXCKA9 has a chunks folder, an index file, and a meta.json file
01DXDJXVY3Q04CHAG4XCNSZH3F has only an index file and a meta.json file (no chunks folder)



bwplotka commented Jan 3, 2020

@yahaa Let's keep this issue focused on Store GW syncing meta.json failures.

For compactor issues, please refer to and comment on #1300.

bwplotka added a commit that referenced this issue Jan 3, 2020
Fixes: #1874

* Corrupted disk cache for meta.json is handled gracefully.
* Synchronization was not taking block deletion (removal of meta.json) into account.
* Prepare for future implementation of https://thanos.io/proposals/201901-read-write-operations-bucket.md/
* Better observability for the synchronize process.
* More logs for the store startup process.

TODO in separate PR:
* More observability for index-cache loading / adding time.

Signed-off-by: Bartlomiej Plotka <[email protected]>