Feature Request: Force disable database secrets engine #5293

Closed
fraajad opened this issue Sep 6, 2018 · 16 comments
Labels
community-sentiment Tracking high-profile issues from the community performance secret/database

Comments

@fraajad

fraajad commented Sep 6, 2018

Is your feature request related to a problem? Please describe.
I ran into a few panics from database credential revocation and wish it was easier to recover from them. First #4846 while on 0.10.4, so I upgraded to 0.11.0 and immediately hit #5262. I was able to get Vault running by turning the Linux clock back, but then had trouble resolving the issue so I could get a running system again. I didn't see the fix in #5262 at the time, and all I could come up with was creating a build of Vault that had revocation removed and using it to disable the database mount.

Describe the solution you'd like
If there was something like vault lease revoke -force for the database mount, i.e. vault secrets disable -force database, that could destroy the mount without doing the revocation, it would help remove a mount that is not working or is misconfigured.

Describe alternatives you've considered
It's possible that once all the unexpected returns are accounted for, this won't be an issue that comes up anymore.

Explain any additional use-cases
While testing the database secret engine a user sometimes puts the wrong configs in and just wants to return to a clean slate without caring about what is left on the test DB.

@scallister

I hit this and got pretty stuck. I was hitting two problems. The first was that I couldn't revoke leases. The Vault CLI, however, has a -f force flag and a -prefix flag (required when forcing). That allowed me to delete all the leases:

vault lease revoke -f -prefix database/creds/

That will probably work for most people. However, I had a second problem. I had two vault clusters that replicate between each other, and my VAULT_ADDR was pointed at the secondary replication cluster, not the primary. Once I pointed at the primary I was able to clean out the leases and disable the database engine properly.

@catsby catsby added bug Used to indicate a potential bug secret/database labels Nov 8, 2019
@michelvocks
Contributor

Hi @fraajad!

I think @scallister described the solution very well. You can use vault lease revoke -f -prefix to revoke all leases from a specific secrets engine without caring about existing users in the remote database. With vault list sys/leases/lookup you can also browse existing leases and see if and where they exist.
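For illustration, that sequence could be wrapped in a small helper. This is a hedged sketch, not anything the Vault CLI ships: `force_disable_mount` is a made-up function, and the `VAULT_BIN` override exists only so the sketch can be exercised without a live cluster (in practice it is just `vault`).

```shell
# Hypothetical helper wrapping the sequence described above:
# force-revoke every lease under the mount's creds/ prefix, then
# disable the mount. VAULT_BIN defaults to the real vault binary
# and is parameterised only so the sketch can be dry-run.
force_disable_mount() {
  mount=$1
  "${VAULT_BIN:-vault}" lease revoke -f -prefix "${mount}/creds/" || return 1
  "${VAULT_BIN:-vault}" secrets disable "${mount}"
}

# Illustrative usage against a real cluster:
#   force_disable_mount database
```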

I will close this issue for now, since I don't see a reason for an additional command which would basically do the same as vault lease revoke. Feel free to open a new issue if you think otherwise.

Cheers,
Michel

@stepps

stepps commented Feb 19, 2020

I have stumbled on this issue after unsuccessfully trying to disable some unreachable mysql and mongo secret engines (These are of the kind deprecated in 0.7.1):

vault lease revoke -f -prefix mongo-r9/
Warning! Force-removing leases can cause Vault to become out of sync with secret engines!
Error force revoking leases with prefix mongo-r9/creds: context deadline exceeded

This is a pretty big inconvenience, because when Vault retries the mongo credential revocations multiple times, it reaches the configured throughput of the DynamoDB backend and becomes unresponsive.

Eventually the maximum number of retries is reached and Vault stabilizes, but I have pretty hefty downtimes unless I scale DynamoDB exponentially.

@stepps

stepps commented Feb 20, 2020

By reading the logs I found out that, even while timing out, each run would revoke some leases.
From there it was easy to put the revoke command in a loop and delete the secret engine I was stuck on.
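The loop described above might look something like the following. This is a sketch under stated assumptions: `retry_until_ok` is a made-up helper, the mount path and the retry cap are illustrative, and it relies on the observed behaviour that each timed-out run still clears some leases before the deadline, so repeated runs converge.

```shell
# Run a command repeatedly until it exits 0, giving up after $1 attempts.
# Each failed run of the force-revoke still clears some leases before
# the context deadline, so repeated runs make forward progress.
retry_until_ok() {
  max=$1; shift
  tries=0
  until "$@"; do
    tries=$((tries+1))
    [ "$tries" -ge "$max" ] && return 1
  done
  return 0
}

# Illustrative usage against a real cluster (a pause between attempts
# may be wise to avoid hammering the storage backend):
#   retry_until_ok 100 vault lease revoke -f -prefix mongo-r9/ \
#     && vault secrets disable mongo-r9/
```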

@blairdrummond

So we did this, and it eventually kinda works, but it took literally days for the vault lease revoke -f -prefix secret_engine/ loop to remove all the secrets. Would it be possible to re-open this and have the engine disabled and abandon revoking the leases? This is really an issue in cases where the remote system was decommissioned.

@heatherezell
Contributor

> So we did this, and it eventually kinda works, but it took literally days for the vault lease revoke -f -prefix secret_engine/ loop to remove all the secrets. Would it be possible to re-open this and have the engine disabled and abandon revoking the leases? This is really an issue in cases where the remote system was decommissioned.

Thanks for coming back to chime in on this! We can re-open this, for sure, and re-evaluate possible ways to help ameliorate the pain.

@aphorise
Contributor

aphorise commented Sep 1, 2022

There is also another approach to dealing with this: removing the mount reference via recovery mode. That could be scripted with jq and other toolchains like that, if you have the unseal / recovery keys. While this may incur downtime of a few seconds to load into recovery mode and perform the needed actions, it's more predictable IMO.

See the Support KB they have: Recovery using recovery-mode - Disabling & Deleting Mounts

You can do just the disable portion and worry about the deletions incrementally, in smaller portions via vault delete ..., if you do not have other recursive methods available to your store natively (like consul kv delete -recurse vault/...).

In the case of integrated storage (Raft), when you've stopped the service (for recovery), you can also use bbolt-compatible utilities (like boltdb) to perform the deletion directly on the boltdb file, which I believe is likely the fastest approach to dealing with the recursion. The only drawback is that you'd either need to perform this on all the nodes, or first scale down to a single-leader-only cluster, perform the action, then scale back up. It could still take minutes, as opposed to hours or days.

The issue here is two fold:

  1. How failures of forced revocations are dealt with by the plugin, whether that's implemented correctly, and even if so, what happens when the store refuses to delete?
  2. Deletion cannot be performed in a single go due to the recursion, so it needs to run more or less in the background in capped batches (taking even longer, and second to other activity).
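Point 2, capped background-friendly batches, could be sketched like this. Everything here is hypothetical, not a real Vault feature: `delete_in_batches` is a made-up helper, and the per-key delete command is passed in as an assumption (e.g. `vault delete`).

```shell
# Read key paths on stdin and delete each with the given command,
# pausing after every $1 keys so the clean-up stays second to other
# activity instead of saturating the storage backend.
delete_in_batches() {
  batch=$1; shift
  n=0
  while IFS= read -r key; do
    "$@" "$key" || echo "failed: $key" >&2
    n=$((n+1))
    if [ $((n % batch)) -eq 0 ]; then
      sleep 1   # throttle between batches
    fi
  done
}

# Illustrative usage, assuming some command that lists the lease keys:
#   list_all_keys | delete_in_batches 50 vault delete
```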

What's more, assuming you have raw_storage_mode enabled (that might be a bit of an issue), another approach is to force-delete the correlating /sys/raw/... paths of secrets as much as possible, at which point there'd be nothing preventing the disabling of that empty mount.

Anyway, I'd be keen to hear if any of these notes may have helped anyone else.

@AbdullahAlShaad

Is this feature available? Can we delete the secrets engine and its leases when the backend database is deleted or a connection with the backend database cannot be established?

@maxb
Contributor

maxb commented Jun 6, 2023

> While this may encounter a down-time of a few seconds to load into recovery mode and perform the needed actions

A few seconds? That sounds extraordinarily optimistic to me. You're talking about restarting services, providing unseal/recovery keys, running various commands, and then getting Vault restarted back into production mode. IMO that's a number of minutes even for a well-practiced, experienced Vault operator.

Also, whilst I really appreciate that Vault does have options such as recovery mode and sys/raw, they are incredibly powerful tools that are risky to use unless you have intimate knowledge of Vault internals.

They really shouldn't be the go-to answer for what is a not particularly rare operational issue.

I think there's a far simpler way forward here ... the revoke-force operation we have today attempts to revoke the lease, and then deletes it anyway if the revocation fails. This means that if the problem with the environment is such that each revocation attempt fails slowly, it is a user-unfriendly solution. A variant of force revocation that skips the revocation attempt entirely and simply deletes the lease records would address the "it is slow" part of the problem.

Then, the other part of the problem is to make it easier for Vault operators to discover what they need to do when a secrets engine disable fails. The easier option would be a more detailed, verbose error message. The nicer, more polished option would be forced lease revocation as a built-in part of the secrets engine disable operation, as per the original feature request here.

@aphorise
Contributor

aphorise commented Jun 6, 2023

> ... the revoke-force operation we have today attempts to revoke the lease, and then deletes it anyway if the revocation fails.

This is also an excellent point, referring to the API /sys/leases/revoke-force/:prefix, or the CLI vault lease revoke -force -prefix .... Put simply: perform a force revoke on the related mount (by path, role, etc.) and thereafter attempt to disable the mount. That will be massively faster than the default behavior, which attempts both of those steps for you, with the final disable step contingent on the first portion (the revocations) succeeding.

@maxb
Contributor

maxb commented Jun 6, 2023

> This is also an excellent point referring to the API: /sys/leases/revoke-force/:prefix or CLI: vault lease revoke -force -prefix .... Put simply perform a force revoke on the related mount (either by path, role, etc) and thereafter attempt to disable the mount which will be massively faster than the default behavior that's to attempt to try both of those for you with the last disable step being contingent on the first portion (revocations) succeeding.

I feel like the intent of my comment may not have been understood. The point I was trying to make is that the existing revoke-force operation is not a fully satisfactory solution, because it may take an extreme amount of time if each revocation to be processed fails slowly.

@maxb
Contributor

maxb commented Jun 7, 2023

See recent conversation in #9420 for an example of a user who was blocked because of slowly-failing revocation attempts when using revoke-force. In that case, the resolution was to manipulate other Vault configuration to turn the slow failures into fast failures.

@aphorise
Contributor

aphorise commented Jun 8, 2023

My most common experience of this is with PKI certificates, where consumers surpass 5 or 10 million certs that they have no accounting of, with no sense of how long all those certs took to generate nor how far back the oldest may be. Separating by role, or even using different mounts for different TLDs / sub-domains, can help. While there are generally progress indicators when disabling mounts (with PKI, at TRACE level), in my opinion it's not reasonable to expect a recursive process to eventually complete while other loads and activities are ongoing at the same time.

Thinking aloud, these sorts of clean-up and even export-type activities may be better pursued by standby nodes (1 or 2 out of a set), or offline entirely via snapshots that a next leader could then negotiate on, allowing others to replicate accordingly.

@maxb
Contributor

maxb commented Jun 8, 2023

I'm not really convinced that's applicable to this issue, though. The challenges of PKI are somewhat different from those of databases.

To my mind, this issue is really tracking two different things, neither of which are shared with the PKI secrets engine:

  1. vault secrets disable on a database engine is quite prone to failing in a non-obvious way which requires an administrator to discover and execute some flavour of vault lease revoke command. It would potentially be nice to give users EITHER easier discovery of what they need to do OR even a new switch for the vault secrets disable command which just takes care of it.

  2. Revoking database leases can involve reaching out to external services, so a single lease revocation can take an extremely long time - easily running in to Vault request processing timeouts. IMO, it would be nice to offer a new enhanced flavour of force-revoking leases, which totally skips even the attempt to execute backend specific destruction logic.

@heatherezell
Contributor

Hi folks, is this still an issue in recent versions of Vault? Can we clarify the current state of the issue so we can bubble it up as needed? Thanks!

@raskchanky
Contributor

Since it's been a while with no response, I'm going to close this issue. Please reopen if there's more to add here.
