Thanos Store Panic #6934
Comments
@nicolastakashi Can you share the actual panic stacktrace? I think I saw something similar last week |
We have:
and the crash happens here, a few lines later
|
Is the pool returning slices that are too small? |
I rolled back to 0.31 and it seems it's not happening on this version. |
This is also happening to us. The problem only occurs on queries that reach a bit further back in time and rely only on the store's S3 data.
I can also provide traces if that would be helpful. |
How is the store gateway configured? |
@MichaHoffmann This is mine.
|
This is our config:
|
We're also seeing this issue on one of our store gateway sets, out of 6. Prometheus: v2.49.1
|
Still seeing this issue on 0.33.0 |
So far we fixed it ourselves in pkg/store, line 3532:

chunkLen = r.block.estimatedMaxChunkSize
if i+1 < len(pIdxs) {
    if diff = pIdxs[i+1].offset - pIdx.offset; int(diff) < chunkLen {
        chunkLen = int(diff)
    }
}
// Fix: if we are about to read a chunk that is bigger than the buffer capacity,
// we need to make sure we have enough space in the buffer.
if cap(buf) < chunkLen {
    // Put the current buffer back to the pool.
    r.block.chunkPool.Put(&buf)
    // Get a new, bigger buffer from the pool.
    bufPooled, err = r.block.chunkPool.Get(chunkLen)
    if err == nil {
        buf = *bufPooled
    } else {
        buf = make([]byte, chunkLen)
    }
}
// Fix: end
cb := buf[:chunkLen]
n, err = io.ReadFull(bufReader, cb)

From what I was able to debug, the source of the problem is the chunk size defined in the block meta JSON. For some reason it states a smaller size than the real one, so the requested buffer is too small and the store panics. This does not fix the source of the issue, but it at least fixes the store code, which doesn't correctly check its inputs. |
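For readers unfamiliar with the failure mode: re-slicing a buffer past its capacity is exactly the operation that panics in Go, and it is what the check above guards against. The snippet below is a standalone illustration with made-up sizes, not Thanos code.

package main

import "fmt"

func main() {
    // A buffer obtained for an (incorrect) estimated max chunk size.
    buf := make([]byte, 0, 16)

    // The real chunk turns out to be larger than the estimate.
    chunkLen := 32

    defer func() {
        if r := recover(); r != nil {
            // e.g. "slice bounds out of range [:32] with capacity 16"
            fmt.Println("recovered:", r)
        }
    }()

    // Re-slicing beyond cap(buf) panics; the fix above swaps the buffer
    // for a bigger one before doing this.
    cb := buf[:chunkLen]
    _ = cb
}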
We're also seeing this issue on Thanos v0.35.1. |
@martinfrycs1 Thanks, I think your fix makes sense to me. Would you like to create a PR? |
OK, it makes sense to implement it this way then. I didn't have much time to explore the source of this data more closely. I will prepare the PR with this fix. |
I think this should be fixed in Line 88 in 6737c8d.
Pool.Get should not return a buffer smaller than the requested size. I haven't yet figured out where a buffer of the wrong size is being Put.
|
It's not a problem of the pool not working correctly, but of the Thanos code asking for a buffer that is too small. This whole code is inside the for loop. Previously, buffers with a fixed size of 16000 were requested; around version 0.31(?) the code was changed to ask for r.block.estimatedMaxChunkSize instead, which varies and sometimes contains an incorrect value. The code I provided above just makes sure it doesn't blindly trust the input value and instead checks whether a big enough buffer was actually requested before, so a "buffer overflow" doesn't happen. If not, the buffer is returned to the pool and a bigger one is requested. We are running patched versions of Thanos with this change and we have not hit this panic again. |
But if a buffer whose cap does not equal the bucket size has been Put into the pool, the returned buffer's cap could still be less than the requested size. |
From what I understand of the pool code, it's able to hold variable-sized buffers. In this panic, the fix I posted above just checks whether the already obtained buffer is big enough and, if not, switches it for a bigger one. |
Maybe I'm wrong. |
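To make the concern above concrete, here is a minimal sketch of a bucketed bytes pool. It is not the actual Thanos pkg/pool implementation; the type name, fields, and bucket sizes are assumptions chosen for illustration. It shows how a Put that only checks cap(*b) <= bucket size can file an odd-capacity buffer under a larger bucket, so a later Get hands back a slice with a smaller capacity than requested.

package main

import (
    "errors"
    "fmt"
    "sync"
)

// BucketedBytes is a simplified bucketed []byte pool, only for illustration.
type BucketedBytes struct {
    mtx     sync.Mutex
    sizes   []int       // ascending bucket capacities
    buckets [][]*[]byte // pooled buffers per bucket
}

func NewBucketedBytes(sizes []int) *BucketedBytes {
    return &BucketedBytes{sizes: sizes, buckets: make([][]*[]byte, len(sizes))}
}

// Get returns a zero-length slice intended to have cap >= sz.
func (p *BucketedBytes) Get(sz int) (*[]byte, error) {
    p.mtx.Lock()
    defer p.mtx.Unlock()
    for i, bktSize := range p.sizes {
        if sz > bktSize {
            continue
        }
        if n := len(p.buckets[i]); n > 0 {
            b := p.buckets[i][n-1]
            p.buckets[i] = p.buckets[i][:n-1]
            return b, nil // may have cap < sz if Put was too permissive
        }
        b := make([]byte, 0, bktSize)
        return &b, nil
    }
    return nil, errors.New("sz exceeds the largest bucket")
}

// Put files the buffer under the first bucket large enough for its capacity.
// Because it only checks cap(*b) <= bktSize, a buffer smaller than the bucket
// size can end up in that bucket and later be handed out by Get.
func (p *BucketedBytes) Put(b *[]byte) {
    if b == nil {
        return
    }
    p.mtx.Lock()
    defer p.mtx.Unlock()
    for i, bktSize := range p.sizes {
        if cap(*b) > bktSize {
            continue
        }
        *b = (*b)[:0]
        p.buckets[i] = append(p.buckets[i], b)
        return
    }
}

func main() {
    p := NewBucketedBytes([]int{1_000, 16_000})

    // A caller returns a buffer whose capacity (10_000) matches no bucket.
    odd := make([]byte, 0, 10_000)
    p.Put(&odd) // lands in the 16_000 bucket

    // A later caller asks for 15_000 bytes and gets the 10_000-cap buffer back.
    buf, _ := p.Get(15_000)
    fmt.Println("requested 15000, got cap", cap(*buf)) // cap 10000 < 15000
}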
@dominicqi Thanks, this seems like a legit bug and we need to fix it. But the issue found by @martinfrycs1 is also valid and we need to fix the problem of the actual chunk size being larger than the requested buffer size. I think they are separate issues and we can fix them separately. |
The requested size check is there; how it is returning a smaller slice is strange to me.
There must be another thread returning a slice with a smaller size to the pool. I feel p.buckets[i].Put(b) should be done while holding the lock.
|
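One possible hardening in the spirit of the comment above, sketched against the simplified pool shown earlier rather than the real Thanos code: validate the capacity while the pool mutex is held and keep only buffers that exactly fill a bucket.

// Stricter Put, replacing the permissive one in the sketch above: only buffers
// whose capacity exactly matches a bucket size are kept; everything else is
// left to the GC. The check happens while the pool mutex is held.
func (p *BucketedBytes) Put(b *[]byte) {
    if b == nil {
        return
    }
    p.mtx.Lock()
    defer p.mtx.Unlock()
    for i, bktSize := range p.sizes {
        if cap(*b) != bktSize {
            continue
        }
        *b = (*b)[:0]
        p.buckets[i] = append(p.buckets[i], b)
        return
    }
    // Off-size buffer: drop it so Get can trust that every pooled buffer
    // fills its bucket.
}

The trade-off in this sketch is that off-size buffers are garbage collected instead of reused, in exchange for Get being able to rely on every pooled buffer having its bucket's full capacity.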
Looks like I'm also facing this issue. I have 3 stores with the same config for HA, version 0.34.1.
|
Thanos, Prometheus and Golang version used:
Thanos: 0.32.5
Prometheus: 2.48
Object Storage Provider: AWS S3
What happened:
Using Thanos Store with a time filter, from time to time it's crashing with the following message.
What you expected to happen:
Not to crash.
How to reproduce it (as minimally and precisely as possible):
Actually, I have no idea; this is only happening in one shard.
Full logs to relevant components:
Anything else we need to know: