panic: runtime error: index out of range in the ingester with native histograms #5576
Comments
Can confirm that this also happens with mimir v2.9.0 and otel collector v0.82.0. Pushing downsampled native histograms to mimir eventually triggers this error, which causes the ingester to crash. It will then continue to crash on startup until the WAL is wiped. Every time I've tried upgrading to collector v0.82.0 to make use of the downscaling, this issue appears within 24hrs. We're producing native histograms via OTEL client libs (js/node), and the spanmetrics connector in the otel collector.
Add experimental support for testing otel collector together with Mimir. For reproducing #5576 Signed-off-by: György Krajcsovits <[email protected]>
Hi, I've started to work on a reproduction. Is a very basic otel collector configuration such as the one in https://github.com/grafana/mimir/pull/5968/files sufficient? (I'll keep working at it, but don't want to go down the wrong path.)
@aptomaKetil Hi, do you still have this problem? Is there any chance you could share a corrupt WAL I could analyze? I've got a simple setup going: a random observations generator (0m to 5m) observing into a scale-20 exponential histogram -> otel collector -> remote write -> Mimir 2.9. But so far nothing.
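For reference, a scale-20 exponential histogram places bucket boundaries at powers of base = 2^(2^-scale) ≈ 1.00000066, so observations land in extremely fine-grained buckets. Below is a minimal Go sketch of that index computation, assuming the standard OpenTelemetry base-2 mapping; the function name and sample values are illustrative only, not taken from the setup above.

package main

import (
	"fmt"
	"math"
)

// bucketIndex returns the bucket index of a positive value in a base-2
// exponential histogram at the given scale, assuming the usual OpenTelemetry
// convention that bucket i covers (base^i, base^(i+1)] with base = 2^(2^-scale).
func bucketIndex(value float64, scale int) int {
	return int(math.Ceil(math.Log2(value)*math.Exp2(float64(scale)))) - 1
}

func main() {
	// At scale 20 the bucket boundaries grow by a factor of only ~1.00000066,
	// i.e. roughly 0.000066% of relative resolution per bucket.
	fmt.Printf("base at scale 20: %.9f\n", math.Exp2(math.Exp2(-20)))

	// A few observation latencies between 0 and 5 minutes, in seconds
	// (made-up values for illustration).
	for _, v := range []float64{0.25, 30, 299.9} {
		fmt.Printf("value %7.2f -> bucket index %d\n", v, bucketIndex(v, 20))
	}
}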
Hi @krajorama, I'll fire up a dev cluster on 2.9 and see if I can still reproduce.
I can still reproduce with 2.9 and collector 0.84. I've kept the broken WAL (~135MB zipped) and can share it in private if you can suggest a sensible way of doing so. Pasting my collector config; I've removed all the trace stuff for brevity:
exporters:
prometheusremotewrite:
auth:
authenticator: basicauth/mimir
endpoint: <snip>
target_info:
enabled: false
extensions:
basicauth/mimir:
client_auth:
password: ${mimir_password}
username: ${mimir_username}
health_check: {}
memory_ballast:
size_in_percentage: 40
processors:
batch: {}
filter:
error_mode: ignore
metrics:
metric:
- Len(data_points) == 0
k8sattributes:
extract:
labels:
- from: pod
key: app.kubernetes.io/name
tag_name: __app
- from: pod
key: app
tag_name: __app
- from: pod
key: k8s-app
tag_name: __app
- from: pod
key: app.kubernetes.io/version
tag_name: __version
metadata:
- k8s.namespace.name
- k8s.pod.name
- k8s.node.name
pod_association:
- sources:
- from: connection
memory_limiter:
check_interval: 5s
limit_percentage: 80
spike_limit_percentage: 25
resource:
attributes:
- action: upsert
from_attribute: __app
key: service.name
- action: upsert
from_attribute: __version
key: service.version
- action: insert
from_attribute: k8s.pod.name
key: service.instance.id
- action: insert
from_attribute: k8s.namespace.name
key: service.namespace
- action: insert
from_attribute: k8s.node.name
key: host.name
transform:
error_mode: ignore
metric_statements:
- context: datapoint
statements:
- set(attributes["cluster"], resource.attributes["deployment.environment"])
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
service:
extensions:
- health_check
- basicauth/mimir
pipelines:
metrics:
exporters:
- prometheusremotewrite
processors:
- memory_limiter
- filter
- k8sattributes
- resource
- transform
- batch
receivers:
- otlp
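For context on what the prometheusremotewrite exporter produces from a pipeline like this: OTLP exponential histograms and Prometheus native histograms share the same base-2 bucketing, with the OTLP scale mapped to the native histogram schema (which Prometheus caps at 8, so a high-scale histogram gets downscaled by merging buckets) and bucket counts delta-encoded across spans. The sketch below is a rough illustration of that shape using hypothetical, simplified types of my own, not the actual prompb or collector structs.

package main

import "fmt"

// span is a simplified stand-in for a native histogram bucket span: a run of
// Length consecutive buckets, offset (in buckets) from the previous run.
type span struct {
	Offset int32
	Length uint32
}

// nativeHistogram is a hypothetical, simplified stand-in for the remote-write
// native histogram message: Schema plays the role of the OTLP exponential
// histogram scale, and bucket counts travel as deltas inside the spans.
type nativeHistogram struct {
	Schema         int32
	ZeroCount      uint64
	PositiveSpans  []span
	PositiveDeltas []int64
}

// toDeltas delta-encodes absolute bucket counts: the first value is absolute,
// each following value is the change from its predecessor.
func toDeltas(counts []uint64) []int64 {
	deltas := make([]int64, len(counts))
	var prev int64
	for i, c := range counts {
		deltas[i] = int64(c) - prev
		prev = int64(c)
	}
	return deltas
}

func main() {
	// Four contiguous populated buckets from an OTLP exponential histogram.
	counts := []uint64{3, 5, 5, 2}
	h := nativeHistogram{
		Schema:         8, // a scale-20 OTLP histogram would be reduced to schema 8
		ZeroCount:      1,
		PositiveSpans:  []span{{Offset: 100, Length: 4}},
		PositiveDeltas: toDeltas(counts),
	}
	fmt.Printf("%+v\n", h)
}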
Hi, if you have a Google account I can share a Drive folder you could upload to (or the other way around). Please drop me an email.
I've just built Prometheus from main and tried to load your WAL:
As suspected, the bug is in Prometheus, not Mimir. I'm opening an issue in Prometheus to link this to.
Opened a PR with a unit test that triggers this case: prometheus/prometheus#12838
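For readers who want intuition for the panic: counter-reset detection compares every populated bucket of an incoming native histogram against the matching bucket of the previous one, walking the two span layouts in lockstep, and indexing one histogram's bucket slice with a position derived from the other's layout without re-checking bounds is the kind of slip that can produce an index-out-of-range panic like the one reported here. The sketch below is a deliberately simplified illustration of the idea only; it uses hypothetical types, and the real counterResetInAnyBucket works on span layouts directly rather than expanding them.

package main

import "fmt"

// span is a simplified version of the native histogram span layout: a run of
// Length consecutive buckets, starting Offset buckets after the previous run.
type span struct {
	Offset int32
	Length uint32
}

// expandBuckets flattens spans plus bucket counts into a map keyed by absolute
// bucket index, which sidesteps the tricky lockstep iteration entirely.
func expandBuckets(spans []span, counts []int64) map[int32]int64 {
	out := make(map[int32]int64, len(counts))
	idx, i := int32(0), 0
	for _, s := range spans {
		idx += s.Offset
		for j := uint32(0); j < s.Length; j++ {
			out[idx] = counts[i]
			idx++
			i++
		}
	}
	return out
}

// counterResetInAnyBucketSketch reports whether any bucket count decreased
// between the previous and the current histogram, which native histogram
// ingestion treats as a counter reset.
func counterResetInAnyBucketSketch(prevSpans []span, prevCounts []int64, curSpans []span, curCounts []int64) bool {
	prev := expandBuckets(prevSpans, prevCounts)
	cur := expandBuckets(curSpans, curCounts)
	for idx, c := range cur {
		if p, ok := prev[idx]; ok && c < p {
			return true
		}
	}
	return false
}

func main() {
	// The previous sample had buckets at indexes 0 and 1; the new sample's
	// layout is shifted and one overlapping bucket shrank, so a reset is found.
	prev := []span{{Offset: 0, Length: 2}}
	cur := []span{{Offset: 1, Length: 2}}
	fmt.Println(counterResetInAnyBucketSketch(prev, []int64{5, 7}, cur, []int64{6, 2}))
}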
@aptomaKetil I have a potential fix in Prometheus: prometheus/prometheus#12838. With the fix I could load the WAL into Prometheus. Next I'll have to create a Mimir build with the fix for you to test.
@aptomaKetil I've pushed a test image under this name: grafana/mimir:krajo-test-prom-12838-on-2.9-cb50fed07. It is the same as 2.9, but with my proposed fix backported. From branch/PR: #6016
@krajorama Thanks! I've rolled this out in our dev environment. Unfortunately it can take a while for the issue to show up (nearly 24 hours last time), but I'll keep you updated.
No issues to report so far; I've rolled this out for our prod metrics as well now.
Thank you. I think we should wait a couple of days, which means this will probably not make it into Mimir 2.10. I'll negotiate a patch release if we can get more confirmation plus a merge into Prometheus upstream. cc release shepherds @colega @pstibrany
Still no issues to report.
This brings in prometheus/prometheus#12838 Fixes #5576 Signed-off-by: György Krajcsovits <[email protected]>
So far 2.10.1 is not on the cards, unless we have a security vulnerability that triggers it.
No problem, I can stay on the test build until the next release. Thanks for looking into this!
We did have a security fix after all, so we're doing a 2.10.1 release, which will include this: https://github.com/grafana/mimir/releases/tag/mimir-2.10.1
Describe the bug
Ingesting native histograms can cause a panic in counterResetInAnyBucket, which then corrupts the WAL.
To Reproduce
Steps to reproduce the behavior: push metrics to Mimir through the otel collector's prometheusremotewrite exporter as native histograms.
Expected behavior
No panics.
Environment
Additional Context
I decided to kick the tires on the histogram downsampling (from open-telemetry/opentelemetry-collector-contrib#24026), as we'd like to start using these from the OTEL libs. Reconfiguring the collector to write downsampled native histograms immediately led to the below error in the mimir ingesters. After restarting I then got the same error during WAL replay, and ended up having to wipe the WALs completely. I see there has been a previous PR for counterResetInAnyBucket that looks very similar, in prometheus/prometheus#12357.
Logs: