Commit 4547f1d

feat(operator): Add alert for discarded samples
1 parent bd20171 commit 4547f1d

2 files changed, +51 -1 lines changed

operator/docs/lokistack/sop.md

+33 -1
@@ -309,6 +309,38 @@ The query queue is currently under high load.
 
 - Increase the number of queriers
 
+## Loki Discarded Samples Warning
+
+### Impact
+
+Loki is discarding samples (log entries) because they fail validation. This alert only fires for errors that are not retryable. This means that the discarded samples are lost.
+
+### Summary
+
+Loki can reject log entries (samples) during submission when they fail validation. This happens on a per-stream basis, so only the specific samples or streams failing validation are lost.
+
+The possible validation errors are documented in the [Loki documentation](https://grafana.com/docs/loki/latest/operations/request-validation-rate-limits/#validation-errors). This alert only fires for the validation errors that are not retryable, which means that discarded samples are permanently lost.
+
+The alerting can only show the affected Loki tenant. Since Loki 3.1.0 more detailed information about the affected streams is provided in an error message emitted by the distributor component.
+
+This information can be used to pinpoint the application sending the offending logs. For some of the validations there are configuration parameters that can be tuned in LokiStack's `limits` structure, if the messages should be accepted. Usually it is recommended to fix the issue either on the emitting application (if possible) or by changing collector configuration to fix non-compliant messages before sending them to Loki.
+
+### Severity
+
+`Warning`
+
+### Access Required
+
+- Console access to the cluster
+- View access in the namespace where the LokiStack is deployed
+  - OpenShift
+    - `openshift-logging` (LokiStack)
+
+### Steps
+
+- View detailed log output from the Loki distributors to identify affected streams
+- Decide on further steps depending on log source and validation error
+
 ## Lokistack Storage Schema Warning
 
 ### Impact
@@ -332,4 +364,4 @@ The schema configuration does not contain the most recent schema version and nee
 
 ### Steps
 
-- Add a new object storage schema V13 with a future EffectiveDate
+- Add a new object storage schema V13 with a future EffectiveDate
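
The new SOP section mentions that some validations can be relaxed through LokiStack's `limits` structure. As a rough sketch of what that could look like, the fragment below raises the maximum line size for a single tenant. The resource name `lokistack-dev`, the tenant key `application`, and the value are illustrative assumptions, and the field names follow the LokiStack `v1` API as I understand it, so they should be checked against the installed CRD before use. It is a partial manifest meant to be merged into an existing LokiStack, not a complete resource.

```yaml
# Sketch only: merge into an existing LokiStack resource.
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: lokistack-dev            # hypothetical name
  namespace: openshift-logging
spec:
  limits:
    tenants:
      application:               # assumed tenant reported by the alert
        ingestion:
          maxLineSize: 512000    # illustrative value, in bytes
```

Raising a limit only makes sense when the rejected messages should really be accepted. As the SOP text notes, fixing the emitting application or the collector configuration is usually the preferred route.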

operator/internal/manifests/internal/alerts/prometheus-alerts.yaml

+18
@@ -175,6 +175,24 @@ groups:
     for: 15m
     labels:
       severity: warning
+  - alert: LokiDiscardedSamplesWarning
+    annotations:
+      message: |-
+        Loki in namespace {{ $labels.namespace }} is discarding samples in the "{{ $labels.tenant }}" tenant during ingestion.
+        Samples are discarded because of "{{ $labels.reason }}" at a rate of {{ .Value | humanize }} samples per second.
+      summary: Loki is discarding samples during ingestion because they fail validation.
+      runbook_url: "[[ .RunbookURL]]#Loki-Discarded-Samples-Warning"
+    expr: |
+      sum by(namespace, tenant, reason) (
+        irate(loki_discarded_samples_total{
+          reason!="rate_limited",
+          reason!="per_stream_rate_limit",
+          reason!="stream_limit"}[2m])
+      )
+      > 0
+    for: 15m
+    labels:
+      severity: warning
   - alert: LokistackSchemaUpgradesRequired
     annotations:
       message: |-
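
One way to check that the new expression ignores the retryable reasons is a `promtool` unit test. The sketch below is not part of this commit. It copies the alert expression inline rather than loading the operator's templated rule file, and the namespace, tenant, and series values are made-up test inputs. It feeds two counters that both grow by 60 per minute and expects only the `line_too_long` series to appear in the result, at 1 sample per second.

```yaml
# Hypothetical promtool unit test; run with: promtool test rules <this-file>
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Non-retryable validation error: should be reported.
      - series: 'loki_discarded_samples_total{namespace="openshift-logging", tenant="application", reason="line_too_long"}'
        values: '0+60x10'
      # Retryable rate-limit error: should be filtered out by the matchers.
      - series: 'loki_discarded_samples_total{namespace="openshift-logging", tenant="application", reason="rate_limited"}'
        values: '0+60x10'
    promql_expr_test:
      - expr: |
          sum by(namespace, tenant, reason) (
            irate(loki_discarded_samples_total{
              reason!="rate_limited",
              reason!="per_stream_rate_limit",
              reason!="stream_limit"}[2m])
          )
          > 0
        eval_time: 5m
        exp_samples:
          # irate uses the last two samples: (300 - 240) / 60s = 1
          - labels: '{namespace="openshift-logging", tenant="application", reason="line_too_long"}'
            value: 1
```

Testing the expression with `promql_expr_test` avoids having to match the rendered annotation text exactly, which an `alert_rule_test` block would require.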
