
control-service: add namespace label to Prometheus alerts #442

Merged
merged 1 commit into main from person/tpalashki/add-namespace on Oct 25, 2021

Conversation

tpalashki
Contributor

Add the Kubernetes namespace where the data job execution happened
as a label to the alerts generated by the Prometheus rules. This label
was initially dropped from the alerts for simplicity, but it turned out
that there are some use cases where it could be useful.

Testing done: tested the modified rules on a live Prometheus server
where the rules are already deployed and verified that the Kubernetes
namespace appears as a label. Example result from the rule evaluation:

{data_job="some-job", email_notified_on_success="[email protected]", execution_id="some-job-1633932000", job_name="some-job-1633932000", namespace="data-pipelines"}

Signed-off-by: Tsvetomir Palashki [email protected]

@tpalashki force-pushed the person/tpalashki/add-namespace branch from 7a44fc4 to 42f6d11 on October 25, 2021 04:50
@antoniivanov
Collaborator

It's not very clear to me why we need the namespace here.

@ivakoleva self-requested a review on October 25, 2021 08:05
@@ -494,7 +494,7 @@ alerting:
  (avg by(data_job) (taurus_datajob_termination_status)
  * on(data_job) group_left(email_notified_on_success)
  avg by(data_job, email_notified_on_success) (taurus_datajob_info{email_notified_on_success!=""}) == bool 0)
- * on(data_job) group_left(job_name)
+ * on(data_job) group_left(job_name, namespace)
Contributor

Grouping alerts by target email and source namespace, correct?
Is this PR covering a use-case of multiple deployments on the same k8s cluster?

Contributor Author

> Grouping alerts by target email and source namespace, correct?

This particular grouping is by job name and namespace.

> Is this PR covering a use-case of multiple deployments on the same k8s cluster?

Single vs multiple clusters should not matter if the deployments are in different namespaces. If the deployments are in the same namespace and they contain a job with the same name, it might be hard to distinguish between the two jobs.
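
To make the label propagation concrete, a simplified sketch of the vector-matching pattern (not the exact rule from the chart; it assumes an info-style metric such as taurus_datajob_info exposes one series per data_job carrying the job_name and namespace labels):

```promql
# group_left(job_name, namespace) copies the listed labels from the
# right-hand ("one") side onto every matching sample on the left, so the
# resulting alert series carries both job_name and namespace.
avg by(data_job) (taurus_datajob_termination_status)
  * on(data_job) group_left(job_name, namespace)
    max by(data_job, job_name, namespace) (taurus_datajob_info)
```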

Collaborator

I don't think we support multiple deployments in the same namespace, since the builder job names, the cron job names (the job deployment), Kubernetes secrets, and other resources are namespace-specific (not deployment-specific).

But supporting multiple deployments on the same cluster is certainly important, as it's normal for multiple tenants to share the same cluster.

@tpalashki
Contributor Author

This particular requirement comes from one of the internal adopters of the VDK Control Service.

Regardless, having the namespace as a label in each alert could be useful in the case of multiple deployments of the Control Service (in different namespaces). It is common that the entire Kubernetes cluster is governed by a single Alertmanager. In this case, it may be useful to be able to determine the Control Service instance from which an alert came.
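
For illustration, a hedged sketch of how a shared Alertmanager could make use of the label (the receiver names here are hypothetical; the namespace value is the one from the example in the description):

```yaml
# Hypothetical routing fragment: alerts from different Control Service
# namespaces are grouped separately and sent to different receivers.
route:
  receiver: default
  group_by: [namespace, data_job]
  routes:
    - matchers:
        - namespace="data-pipelines"
      receiver: data-pipelines-oncall
receivers:
  - name: default
  - name: data-pipelines-oncall
```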

@antoniivanov
Collaborator

> This particular requirement comes from one of the internal adopters of the VDK Control Service.
>
> Regardless, having the namespace as a label in each alert could be useful in the case of multiple deployments of the Control Service (in different namespaces). It is common that the entire Kubernetes cluster is governed by a single Alertmanager. In this case, it may be useful to be able to determine the Control Service instance from which an alert came.

That makes sense to me.

@tpalashki merged commit a3ae86e into main on Oct 25, 2021
@tpalashki deleted the person/tpalashki/add-namespace branch on October 25, 2021 11:55