Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alertmanager: Skip starting the Alertmanager for Grafana tenants unless they have a promoted, non-default configuration #10491

Merged

Conversation

santihernandezc
Copy link
Contributor

@santihernandezc santihernandezc commented Jan 21, 2025

Description

We currently run an Alertmanager for each tenant in the database, no matter whether their configurations are usable or not. This results in many unused Alertmanagers consuming resources.

This PR adds the option to skip initializing Alertmanagers for Grafana Alertmanager tenants without a usable Grafana Alertmanager configuration.

The new -alertmanager.grafana-alertmanager-conditionally-skip-tenant-suffix argument takes a string that will be compared against tenants' IDs. If it matches, the Alertmanager will only be initialized if there's a promoted, non-default, non-empty Grafana Alertmanager configuration for that tenant.

Testing locally

I tested this by spinning up two Alertmanagers:

  • mimir-1 built from main
  • mimir-1-new build from this branch

I added 600 tenants to each one:

  • 200 with promoted, non-default configuration
  • 200 with non-promoted, non-default configuration
  • 200 with promoted, default configuration
Both Alertmanagers owned 600 tenants, but only 200 of them were active in the Alertmanager build from this branch.

Tenants owned per AM.

Screenshot 2025-01-23 at 2 48 41 PM

Tenant config reloads per Alertmanager.

Screenshot 2025-01-23 at 2 44 16 PM
The memory footprint was reduced by roughly half in the Alertmanager filtering out the Grafana Alertmanager tenants without a usable configuration.

avg_over_time(go_memstats_heap_inuse_bytes{job=~"mimir-1.*"}[1m])

Screenshot 2025-01-23 at 4 17 08 PM
CPU time in the new Alertmanager was decreased by roughly ~45%, goroutines by ~55%.

sum by (job) (rate(process_cpu_seconds_total{job=~"mimir-1.*"}[5m]))

Screenshot 2025-01-23 at 4 20 19 PM

go_goroutines{job=~"mimir-1.*"}

Screenshot 2025-01-23 at 4 22 01 PM

@santihernandezc santihernandezc requested review from a team as code owners January 21, 2025 17:24
@santihernandezc santihernandezc marked this pull request as draft January 21, 2025 17:24
@santihernandezc santihernandezc force-pushed the santihernandezc/dont_run_alertmanager_default_config branch from 5f2c772 to c2e7145 Compare January 22, 2025 14:48
@santihernandezc santihernandezc changed the title (PoC) Alertmanager: Don't start Alertmanager if the config is default (PoC) Alertmanager: Skip starting the Alertmanager for Grafana tenants unless they have a promoted, non-default configuration Jan 22, 2025
@santihernandezc santihernandezc force-pushed the santihernandezc/dont_run_alertmanager_default_config branch from a32f18d to 27219a7 Compare January 22, 2025 16:07
@santihernandezc santihernandezc force-pushed the santihernandezc/dont_run_alertmanager_default_config branch from 27219a7 to 21c4fc9 Compare January 22, 2025 16:08
@santihernandezc santihernandezc force-pushed the santihernandezc/dont_run_alertmanager_default_config branch from ffbc07f to 1da54f9 Compare January 22, 2025 16:21
@santihernandezc santihernandezc changed the title (PoC) Alertmanager: Skip starting the Alertmanager for Grafana tenants unless they have a promoted, non-default configuration Alertmanager: Skip starting the Alertmanager for Grafana tenants unless they have a promoted, non-default configuration Jan 22, 2025
@santihernandezc santihernandezc marked this pull request as ready for review January 22, 2025 16:55
Copy link
Contributor

github-actions bot commented Jan 22, 2025

💻 Deploy preview deleted.

Comment on lines +703 to +704
_, initSkipped := amInitSkipped[userID]
if !exists || initSkipped {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the skipped Alertmanager was already running, I'm deleting its data and adding it to the map of Alertmanagers to stop.

Comment on lines +731 to +736
// If the Grafana configuration is either default, not promoted, or empty, use the Mimir configuration.
if !cfgs.Grafana.Promoted || cfgs.Grafana.Default || cfgs.Grafana.RawConfig == "" {
level.Debug(am.logger).Log("msg", "using mimir config", "user", cfgs.Mimir.User)
isGrafanaTenant := am.cfg.GrafanaAlertmanagerTenantSuffix != "" && strings.HasSuffix(cfgs.Mimir.User, am.cfg.GrafanaAlertmanagerTenantSuffix)
return cfg, !isGrafanaTenant, nil
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to remove the granular debug log lines to shorten the function.

@@ -239,6 +239,8 @@ Usage of ./cmd/mimir/mimir:
Enables periodic cleanup of alertmanager stateful data (notification logs and silences) from object storage. When enabled, data is removed for any tenant that does not have a configuration. (default true)
-alertmanager.grafana-alertmanager-compatibility-enabled
[experimental] Enable routes to support the migration and operation of the Grafana Alertmanager.
-alertmanager.grafana-tenant-suffix string
Copy link
Contributor

@titolins titolins Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a super nitpick, but the flag name is bit generic IMO. Feels like it's missing an action there - e.g. skip-init-for-grafana-tenant-suffix. WDYT?

Copy link
Contributor

@titolins titolins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Copy link
Member

@JacobsonMT JacobsonMT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

Copy link
Contributor

@titolins titolins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@santihernandezc santihernandezc merged commit 8f44959 into main Jan 24, 2025
30 checks passed
@santihernandezc santihernandezc deleted the santihernandezc/dont_run_alertmanager_default_config branch January 24, 2025 10:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants