control-service: add namespace label to Prometheus alerts (#442)
Add the Kubernetes namespace where the data job execution happened
as a label to the alerts generated by the Prometheus rules. This label
was initially dropped from the alerts for simplicity, but it turned out
that there are some use cases where it could be useful.

Testing done: tested the modified rules on a live Prometheus server
where the rules are already deployed and verified that the Kubernetes
namespace appears as a label.

Signed-off-by: Tsvetomir Palashki <[email protected]>
tpalashki authored Oct 25, 2021
1 parent 45ad788 commit a3ae86e
Showing 2 changed files with 11 additions and 5 deletions.
8 changes: 7 additions & 1 deletion projects/control-service/CHANGELOG.md
@@ -12,10 +12,16 @@ MAJOR.MINOR - dd.MM.yyyy
* **Breaking Changes**


+ 1.3 - 25.10.2021
+ ----
+ * **Improvement**
+ * Add Kubernetes namespace as label to notification alerts.
+
+
1.3 - 21.10.2021
----
* **Bug fixes**
- * Clean up metrics when data jobs are deleted
+ * Clean up metrics when data jobs are deleted.


1.3 - 08.10.2021
@@ -494,7 +494,7 @@ alerting:
(avg by(data_job) (taurus_datajob_termination_status)
* on(data_job) group_left(email_notified_on_success)
avg by(data_job, email_notified_on_success) (taurus_datajob_info{email_notified_on_success!=""}) == bool 0)
- * on(data_job) group_left(job_name)
+ * on(data_job) group_left(job_name, namespace)
topk by(data_job) (1, label_replace(kube_job_status_completion_time, "data_job", "$1", "job_name", "(.*)-.*")) != 0,
"execution_id", "$1", "job_name", "(.*)")`}}
JobDelay:
@@ -535,7 +535,7 @@ alerting:
avg by(data_job, email_notified_on_platform_error) (taurus_datajob_info{email_notified_on_platform_error!=""})
<
on(data_job)
- group_right(email_notified_on_platform_error)
+ group_right(email_notified_on_platform_error, namespace)
-(min by(data_job) (taurus_datajob_notification_delay) * 60),
"execution_id", "$1", "data_job", "(.*)")`}}
JobFailurePlatform:
@@ -572,7 +572,7 @@ alerting:
(max by(data_job) (taurus_datajob_termination_status)
* on(data_job) group_left(email_notified_on_platform_error)
avg by(data_job, email_notified_on_platform_error) (taurus_datajob_info{email_notified_on_platform_error!=""}) == bool 1)
- * on(data_job) group_left(job_name)
+ * on(data_job) group_left(job_name, namespace)
topk by(data_job) (1, label_replace(kube_job_failed * on(job_name) group_left() kube_job_status_start_time, "data_job", "$1", "job_name", "(.*)-.*")) != 0,
"execution_id", "$1", "job_name", "(.*)"),
"short_execution_id", "$1", "execution_id", "([a-zA-Z -_]{1,58}).*")`}}
@@ -610,7 +610,7 @@ alerting:
(max by(data_job) (taurus_datajob_termination_status)
* on(data_job) group_left(email_notified_on_user_error)
avg by(data_job, email_notified_on_user_error) (taurus_datajob_info{email_notified_on_user_error!=""}) == bool 3)
- * on(data_job) group_left(job_name)
+ * on(data_job) group_left(job_name, namespace)
topk by(data_job) (1, label_replace(kube_job_status_start_time, "data_job", "$1", "job_name", "(.*)-.*")) != 0,
"execution_id", "$1", "job_name", "(.*)"),
"short_execution_id", "$1", "execution_id", "([a-zA-Z -_]{1,58}).*")`}}
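The label propagation this commit relies on can be sketched with a reduced query. This is a simplified illustration, not one of the deployed rules: it assumes that `kube_job_status_start_time` carries a `namespace` label, which is consistent with the testing note in the commit message. In a many-to-one match, `group_left(...)` copies the listed labels from the single matching right-hand series onto the result, so adding `namespace` to the list is what makes it appear on the alert.

```promql
# Simplified sketch, not the deployed rule.
# Left side: one series per data job, carrying only the data_job label.
max by(data_job) (taurus_datajob_termination_status)
  # Match on data_job and copy job_name and namespace from the right-hand series.
  * on(data_job) group_left(job_name, namespace)
  # Right side: the Kubernetes job with the latest start time per data job;
  # data_job is derived from job_name by stripping the trailing "-<suffix>".
  topk by(data_job) (1, label_replace(kube_job_status_start_time, "data_job", "$1", "job_name", "(.*)-.*"))
```

Without `namespace` in the `group_left` list, the result would keep only `data_job` and `job_name`, which matches the behavior before this change.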