Cleanup obsolete local files for alertmanager. #3910
Good job! I'm a bit scared about the alertmanagersMtx locking complexity in MultitenantAlertmanager. I'm wondering if it's time to properly address it (with a per-tenant alertmanager lock and a state to CAS on it, so we can safely handle all cases). I'm also open to working on such a refactoring.
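For illustration only, the per-tenant lock plus CAS-able state suggested above might look roughly like the sketch below. None of these names (`tenantEntry`, `registry`, `tryTransition`, the state constants) come from Cortex; they are hypothetical, and the sketch only shows the shape of the idea, not a proposed implementation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Hypothetical lifecycle states for a single tenant's Alertmanager.
const (
	stateStopped int32 = iota
	stateStarting
	stateRunning
	stateStopping
)

// tenantEntry pairs a per-tenant lock with an atomically updated state.
type tenantEntry struct {
	mtx   sync.Mutex // would guard longer operations on this tenant only
	state int32      // read and written via sync/atomic
}

// tryTransition moves the tenant from one state to another using
// compare-and-swap. It reports false if the tenant was not in `from`.
func (t *tenantEntry) tryTransition(from, to int32) bool {
	return atomic.CompareAndSwapInt32(&t.state, from, to)
}

// registry hands out one entry per tenant. The outer lock guards only the map,
// so work on one tenant does not block work on another.
type registry struct {
	mtx     sync.Mutex
	tenants map[string]*tenantEntry
}

func newRegistry() *registry {
	return &registry{tenants: map[string]*tenantEntry{}}
}

// entry returns the existing entry for userID, creating it on first use.
func (r *registry) entry(userID string) *tenantEntry {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	e, ok := r.tenants[userID]
	if !ok {
		e = &tenantEntry{}
		r.tenants[userID] = e
	}
	return e
}

func main() {
	r := newRegistry()
	e := r.entry("user-1")
	fmt.Println(e.tryTransition(stateStopped, stateStarting)) // true: first caller wins
	fmt.Println(e.tryTransition(stateStopped, stateStarting)) // false: state already changed
}
```

With this shape, a concurrent "start" and "delete" for the same tenant would race on a single CAS rather than on a shared map lock, which is one way to read the suggestion.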
Force-pushed from b3d987d to abecf4c.
I've pushed the next version of the PR, which implements the following changes:
I'd be interested to hear opinions on these design decisions. (Tests are not yet updated to cover this, so please ignore tests for now.) One con of this approach is that reverting to the previous version of AM will ignore existing snapshot files in the per-tenant directory. Hopefully moving from the new AM version back to the previous one is rare.
Good job! Changes LGTM. I've left a couple of nits and a comment about testing.
CHANGELOG.md (outdated)
@@ -2,6 +2,8 @@

## master / unreleased

* [CHANGE] Alertmanager: clean obsolete local files after Alertmanager is no longer running for removed or resharded user. #3910
I'm not sure I understand this CHANGELOG entry. Also remember to mention that the path where each tenant alertmanager data is stored has changed.
Only a quick pass over. I had a question though:
> there is a migration procedure, that moves existing nflog:, silences: and templates into per-tenant directory on startup.
I'm wondering what the benefit is from having the migration - as there is some complexity here we could avoid. Is listing out the files for a particular user difficult in some way? Or perhaps are we worried about the number of files in a directory?
Not saying one way or the other, just curious on the reasoning.
The rationale behind a single directory per tenant is to make it easy to delete a tenant's files when needed. The previous version of this PR only deleted the silences and nflog files, but today I found that I had missed the template files, which led to the idea of using a per-tenant directory. The reason for migrating is to avoid losing notification and silence state. Even though this state is propagated by other alertmanagers via gossip, that takes some time, and in the meantime missing silences could cause spurious notifications. Note that this problem still exists when downgrading to the previous AM, which doesn't understand the new structure.
Force-pushed from 69a9383 to e432d82.
PR has changed since it was approved, please take a look again.
PR is now ready for review. Since the original version, it also changes how files are stored locally. I've also extended the unit tests to cover the use of templates, which was missing.
Signed-off-by: Peter Štibraný <[email protected]>
Add test for migration. Fix test for deletion of unused dirs. Signed-off-by: Peter Štibraný <[email protected]>
Force-pushed from 5f6bdc5 to de67abb.
Good job, LGTM! I left a few nits just to make the log messages a bit clearer (some log messages are generic and the final user may be lost reading them).
func (am *MultitenantAlertmanager) newAlertmanager(userID string, amConfig *amconfig.Config, rawCfg string) (*Alertmanager, error) {
	reg := prometheus.NewRegistry()

	tenantDir := am.getTenantDirectory(userID)
	err := os.MkdirAll(tenantDir, 0777)
Suggested change:
-	err := os.MkdirAll(tenantDir, 0777)
+	err := os.MkdirAll(tenantDir, os.ModePerm)

	for userID, files := range st {
		tenantDir := am.getTenantDirectory(userID)
		err := os.MkdirAll(tenantDir, 0777)
Suggested change:
-		err := os.MkdirAll(tenantDir, 0777)
+		err := os.MkdirAll(tenantDir, os.ModePerm)
While it is the same value, the intent is not the same. os.ModePerm is simply all permission bits set, and it is defined next to the other Mode* constants for the higher bits. Note that there is no such constant for files (0666, without the exec bit).
After the fact since this is merged, but why do we need 0777 instead of 0755? Every one of these causes some pain for anyone using source scanning tools for :( I know we have some other similar permissions, so there may be a good reason, but want to understand if so.
The reason for using wide permissions is mostly for consistency with vendored Prometheus code, which already uses very wide permissions (see prometheus/prometheus#7782). Cortex users that want reduced permissions need to use umask to do so. (I’m not quite sure how to do that in Kubernetes though).
LGTM, just a unit test nit.
Co-authored-by: Marco Pracucci <[email protected]> Signed-off-by: Peter Štibraný <[email protected]>
Force-pushed from bcf060a to 6746059.
* Cleanup obsolete local files for alertmanager. Signed-off-by: Peter Štibraný <[email protected]>
* CHANGELOG.md Signed-off-by: Peter Štibraný <[email protected]>
* Comment. Signed-off-by: Peter Štibraný <[email protected]>
* Don't ignore directories. Log error when deletion fails instead. Signed-off-by: Peter Štibraný <[email protected]>
* Address review feedback. Signed-off-by: Peter Štibraný <[email protected]>
* Move per-tenant state into tenant directory to simplify cleanup. Signed-off-by: Peter Štibraný <[email protected]>
* Move migration to separate function. Add test for migration. Fix test for deletion of unused dirs. Signed-off-by: Peter Štibraný <[email protected]>
* Store templates to correct place. Signed-off-by: Peter Štibraný <[email protected]>
* CHANGELOG.md Signed-off-by: Peter Štibraný <[email protected]>
* Verify that templates are stored properly into correct location. Signed-off-by: Peter Štibraný <[email protected]>
* Comments. Signed-off-by: Peter Štibraný <[email protected]>
* Comments. Signed-off-by: Peter Štibraný <[email protected]>
* Apply suggestions from code review Co-authored-by: Marco Pracucci <[email protected]> Signed-off-by: Peter Štibraný <[email protected]>
* Review feedback. Signed-off-by: Peter Štibraný <[email protected]>
Co-authored-by: Marco Pracucci <[email protected]>
What this PR does: This PR implements cleanup of local files on the alertmanager when AM is no longer running for a given user.
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]