Thanos Store Panic #6934
Comments
@nicolastakashi Can you share the actual panic stacktrace? I think I saw something similar last week |
We have:
and the crash happens here, a few lines later
|
Is the pool returning slices that are too small? |
I rolled back to 0.31 and it seems it's not happening on this version. |
This is also happening to us. The problem only occurs on queries that reach a bit further back in time and rely only on the store's S3 data.
I can also provide traces if that would be helpful. |
How is the store gateway configured? |
@MichaHoffmann This is mine.
|
This is our config:
|
We're also seeing this issue on one of our store gateway sets, out of 6. Prometheus: v2.49.1
|
Still seeing this issue on 0.33.0 |
So far we fixed it ourselves in pkg/store, line 3532:

chunkLen = r.block.estimatedMaxChunkSize
if i+1 < len(pIdxs) {
    if diff = pIdxs[i+1].offset - pIdx.offset; int(diff) < chunkLen {
        chunkLen = int(diff)
    }
}
// Fix: if we are about to read a chunk that is bigger than the buffer capacity,
// we need to make sure we have enough space in the buffer.
if cap(buf) < chunkLen {
    // Put the current buffer back to the pool.
    r.block.chunkPool.Put(&buf)
    // Get a new, bigger buffer from the pool.
    bufPooled, err = r.block.chunkPool.Get(chunkLen)
    if err == nil {
        buf = *bufPooled
    } else {
        buf = make([]byte, chunkLen)
    }
}
// Fix: end
cb := buf[:chunkLen]
n, err = io.ReadFull(bufReader, cb)

From what I was able to debug, the source of the problem is the chunk size defined in the block meta JSON. For some reason it states a smaller size than the real one, so the requested buffer is too small and the store panics. This does not fix the source of the issue, but it at least fixes the store code, which doesn't correctly check its inputs. |
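For readers unfamiliar with the failure mode: re-slicing a buffer past its capacity is exactly the operation that panics in Go, and it is what the check above guards against. The snippet below is a standalone illustration with made-up sizes, not Thanos code.

package main

import "fmt"

func main() {
    // A buffer obtained for an (incorrect) estimated max chunk size.
    buf := make([]byte, 0, 16)

    // The real chunk turns out to be larger than the estimate.
    chunkLen := 32

    defer func() {
        if r := recover(); r != nil {
            // e.g. "slice bounds out of range [:32] with capacity 16"
            fmt.Println("recovered:", r)
        }
    }()

    // Re-slicing beyond cap(buf) panics; the fix above swaps the buffer
    // for a bigger one before doing this.
    cb := buf[:chunkLen]
    _ = cb
}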
We're also seeing this issue on Thanos v0.35.1. |
@martinfrycs1 Thanks, I think your fix makes sense to me. Would you like to create a PR? |
OK, it makes sense to implement it this way then. I didn't have much time to explore the source of this data more closely. I will prepare the PR with this fix. |
I think this should be fixed in Line 88 in 6737c8d.
Pool.Get should not return a buffer smaller than the requested size. I haven't yet figured out where a buffer of the wrong size is being Put.
|
It's not a problem of the pool not working correctly, but of the Thanos code asking for a buffer that is too small. This whole code is inside the for loop. Previously, buffers with a fixed size of 16000 were requested; around version 0.31(?) the code was changed to ask for r.block.estimatedMaxChunkSize instead, which varies and sometimes contains an incorrect value. The code I provided above just makes sure it doesn't blindly trust the input value and instead checks whether a big enough buffer was actually requested before, so a "buffer overflow" doesn't happen. If not, the buffer is returned to the pool and a bigger one is requested. We are running patched versions of Thanos with this change and we have not hit this panic again. |
But if a buffer whose cap does not equal the bucket size has been Put into the pool, the returned buffer's cap could still be less than the requested size. |
From what I understand of the pool code, it's able to hold variable-sized buffers. In this panic, the fix I posted above just checks whether the already obtained buffer is big enough and, if not, switches it for a bigger one. |
Maybe I'm wrong. |
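To make the concern above concrete, here is a minimal sketch of a bucketed bytes pool. It is not the actual Thanos pkg/pool implementation; the type name, fields, and bucket sizes are assumptions chosen for illustration. It shows how a Put that only checks cap(*b) <= bucket size can file an odd-capacity buffer under a larger bucket, so a later Get hands back a slice with a smaller capacity than requested.

package main

import (
    "errors"
    "fmt"
    "sync"
)

// BucketedBytes is a simplified bucketed []byte pool, only for illustration.
type BucketedBytes struct {
    mtx     sync.Mutex
    sizes   []int       // ascending bucket capacities
    buckets [][]*[]byte // pooled buffers per bucket
}

func NewBucketedBytes(sizes []int) *BucketedBytes {
    return &BucketedBytes{sizes: sizes, buckets: make([][]*[]byte, len(sizes))}
}

// Get returns a zero-length slice intended to have cap >= sz.
func (p *BucketedBytes) Get(sz int) (*[]byte, error) {
    p.mtx.Lock()
    defer p.mtx.Unlock()
    for i, bktSize := range p.sizes {
        if sz > bktSize {
            continue
        }
        if n := len(p.buckets[i]); n > 0 {
            b := p.buckets[i][n-1]
            p.buckets[i] = p.buckets[i][:n-1]
            return b, nil // may have cap < sz if Put was too permissive
        }
        b := make([]byte, 0, bktSize)
        return &b, nil
    }
    return nil, errors.New("sz exceeds the largest bucket")
}

// Put files the buffer under the first bucket large enough for its capacity.
// Because it only checks cap(*b) <= bktSize, a buffer smaller than the bucket
// size can end up in that bucket and later be handed out by Get.
func (p *BucketedBytes) Put(b *[]byte) {
    if b == nil {
        return
    }
    p.mtx.Lock()
    defer p.mtx.Unlock()
    for i, bktSize := range p.sizes {
        if cap(*b) > bktSize {
            continue
        }
        *b = (*b)[:0]
        p.buckets[i] = append(p.buckets[i], b)
        return
    }
}

func main() {
    p := NewBucketedBytes([]int{1_000, 16_000})

    // A caller returns a buffer whose capacity (10_000) matches no bucket.
    odd := make([]byte, 0, 10_000)
    p.Put(&odd) // lands in the 16_000 bucket

    // A later caller asks for 15_000 bytes and gets the 10_000-cap buffer back.
    buf, _ := p.Get(15_000)
    fmt.Println("requested 15000, got cap", cap(*buf)) // cap 10000 < 15000
}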
@dominicqi Thanks, this seems like a legit bug and we need to fix it. But the issue found by @martinfrycs1 is also valid and we need to fix the problem of the actual chunk size being larger than the requested buffer size. I think they are separate issues and we can fix them separately. |
The requested size check is there; how it is returning a smaller slice is strange to me.
There must be another thread returning a slice with a smaller size to the pool. I feel p.buckets[i].Put(b) should be done while holding the lock.
|
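One possible hardening in the spirit of the comment above, sketched against the simplified pool shown earlier rather than the real Thanos code: validate the capacity while the pool mutex is held and keep only buffers that exactly fill a bucket.

// Stricter Put, replacing the permissive one in the sketch above: only buffers
// whose capacity exactly matches a bucket size are kept; everything else is
// left to the GC. The check happens while the pool mutex is held.
func (p *BucketedBytes) Put(b *[]byte) {
    if b == nil {
        return
    }
    p.mtx.Lock()
    defer p.mtx.Unlock()
    for i, bktSize := range p.sizes {
        if cap(*b) != bktSize {
            continue
        }
        *b = (*b)[:0]
        p.buckets[i] = append(p.buckets[i], b)
        return
    }
    // Off-size buffer: drop it so Get can trust that every pooled buffer
    // fills its bucket.
}

The trade-off in this sketch is that off-size buffers are garbage collected instead of reused, in exchange for Get being able to rely on every pooled buffer having its bucket's full capacity.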
Looks like I'm also facing this issue. I have 3 stores with the same config for HA, version 0.34.1.
|
Thanos, Prometheus and Golang version used:
Thanos: 0.32.5
Prometheus: 2.48
Object Storage Provider: AWS S3
What happened:
Using Thanos Store with a time filter, from time to time it's crashing with the following message.
What you expected to happen:
Not to crash.
How to reproduce it (as minimally and precisely as possible):
Actually, I have no idea; this is only happening in one shard.
Full logs to relevant components:
Anything else we need to know: