Commit 8cbf559

feat(operator): Add alert for discarded samples (grafana#13512)
1 parent 7307346 commit 8cbf559

4 files changed: +69 -1 lines changed

operator/CHANGELOG.md

+1
@@ -2,6 +2,7 @@

 ## Release 5.9.5

+- [13512](https://github.com/grafana/loki/pull/13512) **xperimental**: feat(operator): Add alert for discarded samples
 - [13497](https://github.com/grafana/loki/pull/13497) **xperimental**: fix(operator): Remove duplicate conditions from status

 ## Release 5.9.4

operator/docs/lokistack/sop.md

+33-1
@@ -309,6 +309,38 @@ The query queue is currently under high load.

 - Increase the number of queriers

+## Loki Discarded Samples Warning
+
+### Impact
+
+Loki is discarding samples (log entries) because they fail validation. This alert only fires for errors that are not retryable. This means that the discarded samples are lost.
+
+### Summary
+
+Loki can reject log entries (samples) during submission when they fail validation. This happens on a per-stream basis, so only the specific samples or streams failing validation are lost.
+
+The possible validation errors are documented in the [Loki documentation](https://grafana.com/docs/loki/latest/operations/request-validation-rate-limits/#validation-errors). This alert only fires for the validation errors that are not retryable, which means that discarded samples are permanently lost.
+
+The alerting can only show the affected Loki tenant. Since Loki 3.1.0 more detailed information about the affected streams is provided in an error message emitted by the distributor component.
+
+This information can be used to pinpoint the application sending the offending logs. For some of the validations there are configuration parameters that can be tuned in LokiStack's `limits` structure, if the messages should be accepted. Usually it is recommended to fix the issue either on the emitting application (if possible) or by changing collector configuration to fix non-compliant messages before sending them to Loki.
+
+### Severity
+
+`Warning`
+
+### Access Required
+
+- Console access to the cluster
+- View access in the namespace where the LokiStack is deployed
+  - OpenShift
+    - `openshift-logging` (LokiStack)
+
+### Steps
+
+- View detailed log output from the Loki distributors to identify affected streams
+- Decide on further steps depending on log source and validation error
+
 ## Lokistack Storage Schema Warning

 ### Impact
@@ -332,4 +364,4 @@ The schema configuration does not contain the most recent schema version and nee

 ### Steps

-- Add a new object storage schema V13 with a future EffectiveDate
+- Add a new object storage schema V13 with a future EffectiveDate

operator/internal/manifests/internal/alerts/prometheus-alerts.yaml

+18
@@ -175,6 +175,24 @@ groups:
     for: 15m
     labels:
       severity: warning
+  - alert: LokiDiscardedSamplesWarning
+    annotations:
+      message: |-
+        Loki in namespace {{ $labels.namespace }} is discarding samples in the "{{ $labels.tenant }}" tenant during ingestion.
+        Samples are discarded because of "{{ $labels.reason }}" at a rate of {{ .Value | humanize }} samples per second.
+      summary: Loki is discarding samples during ingestion because they fail validation.
+      runbook_url: "[[ .RunbookURL]]#Loki-Discarded-Samples-Warning"
+    expr: |
+      sum by(namespace, tenant, reason) (
+        irate(loki_discarded_samples_total{
+          reason!="rate_limited",
+          reason!="per_stream_rate_limit",
+          reason!="stream_limit"}[2m])
+      )
+      > 0
+    for: 15m
+    labels:
+      severity: warning
   - alert: LokistackSchemaUpgradesRequired
     annotations:
       message: |-
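
The selection logic of the `LokiDiscardedSamplesWarning` expression can be sketched in Python to make the label matchers and the aggregation easier to follow. This is only an illustration under stated assumptions, not how Prometheus evaluates the rule: the function name `alerting_groups`, the `pod` label, and the sample rates are hypothetical, and the per-series rates are taken as given instead of being derived with `irate`.

```python
from collections import defaultdict

# Reasons excluded by the matchers in the alert expression. The runbook
# describes the alert as firing only for errors that are not retryable,
# so these three are the ones left out.
EXCLUDED_REASONS = {"rate_limited", "per_stream_rate_limit", "stream_limit"}


def alerting_groups(rates):
    """Mimic `sum by(namespace, tenant, reason) (...) > 0`.

    `rates` is an iterable of (labels, per_second_rate) pairs, where
    `labels` is a dict of label names to values (hypothetical input shape).
    """
    totals = defaultdict(float)
    for labels, rate in rates:
        if labels.get("reason") in EXCLUDED_REASONS:
            continue  # dropped by the reason!="..." matchers
        key = (
            labels.get("namespace", ""),
            labels.get("tenant", ""),
            labels.get("reason", ""),
        )
        totals[key] += rate  # all other labels (e.g. pod) are summed away
    # The `> 0` comparison keeps only groups that are discarding samples.
    return {key: total for key, total in totals.items() if total > 0}


# Hypothetical per-series rates from two distributor pods.
sample = [
    ({"namespace": "my-ns", "tenant": "application", "reason": "line_too_long", "pod": "distributor-0"}, 1.5),
    ({"namespace": "my-ns", "tenant": "application", "reason": "line_too_long", "pod": "distributor-1"}, 0.5),
    ({"namespace": "my-ns", "tenant": "application", "reason": "rate_limited", "pod": "distributor-0"}, 40.0),
    ({"namespace": "my-ns", "tenant": "audit", "reason": "line_too_long", "pod": "distributor-0"}, 0.0),
]
result = alerting_groups(sample)
print(result)
```

With this input only the `("my-ns", "application", "line_too_long")` group remains: the `rate_limited` series is filtered out regardless of its rate, and the `audit` tenant is dropped because its summed rate is zero.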

operator/internal/manifests/internal/alerts/testdata/test.yaml

+17
@@ -63,6 +63,9 @@ tests:
       - series: 'loki_logql_querystats_latency_seconds_bucket{namespace="my-ns", job="querier", route="my-route", le="+Inf"}'
         values: '0+100x20'

+      - series: 'loki_discarded_samples_total{namespace="my-ns", tenant="application", reason="line_too_long"}'
+        values: '0x5 0+120x25 3000'
+
     alert_rule_test:
       - eval_time: 16m
         alertname: LokiRequestErrors
@@ -177,3 +180,17 @@ tests:
               summary: "The read path has high volume of queries, causing longer response times."
               message: "The read path is experiencing high load."
               runbook_url: "[[ .RunbookURL ]]#Loki-Read-Path-High-Load"
+      - eval_time: 22m
+        alertname: LokiDiscardedSamplesWarning
+        exp_alerts:
+          - exp_labels:
+              namespace: my-ns
+              tenant: application
+              severity: warning
+              reason: line_too_long
+            exp_annotations:
+              message: |-
+                Loki in namespace my-ns is discarding samples in the "application" tenant during ingestion.
+                Samples are discarded because of "line_too_long" at a rate of 2 samples per second.
+              summary: Loki is discarding samples during ingestion because they fail validation.
+              runbook_url: "[[ .RunbookURL]]#Loki-Discarded-Samples-Warning"
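
The expected "2 samples per second" and the `eval_time` of 22m can be checked by hand from the input series `'0x5 0+120x25 3000'`. The Python sketch below works through that arithmetic under several assumptions that are not visible in this diff: a sample and rule evaluation interval of 1m, promtool's expanding notation where `a+bxn` yields `n+1` values, and a simplified `irate` that uses the last two samples and ignores counter resets. The helper names `expand` and `irate` are hypothetical.

```python
import re

# Matches expanding-notation tokens such as '0x5' or '0+120x25'.
TOKEN = re.compile(r"(-?\d+(?:\.\d+)?)([+-]\d+(?:\.\d+)?)?x(\d+)")


def expand(notation):
    """Expand a series such as '0x5 0+120x25 3000' into a list of values."""
    values = []
    for token in notation.split():
        match = TOKEN.fullmatch(token)
        if match is None:
            values.append(float(token))  # a plain literal sample
            continue
        start = float(match.group(1))
        step = float(match.group(2) or 0)
        repeats = int(match.group(3))
        # 'a+bxn' means the start value plus n further samples.
        values.extend(start + i * step for i in range(repeats + 1))
    return values


def irate(values, minute, interval_seconds=60.0):
    """Simplified instant rate from the two most recent samples."""
    previous, last = values[minute - 1], values[minute]
    return (last - previous) / interval_seconds


values = expand("0x5 0+120x25 3000")

# Rate seen by the alert at eval_time 22m (sample index 22 at a 1m interval).
rate_at_eval = irate(values, 22)

# First minute with a positive rate, plus the rule's `for: 15m`.
first_positive = next(m for m in range(1, len(values)) if irate(values, m) > 0)
earliest_firing = first_positive + 15

print(rate_at_eval, first_positive, earliest_firing)
```

Under these assumptions the counter grows by 120 per minute, which is 2 samples per second, and the rate first becomes positive at minute 7, so the 15 minute `for` clause is satisfied at minute 22, matching the `eval_time` in the test.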
