Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent overwriting states with other unrelated states #7026

Closed
wants to merge 2 commits into from

Conversation

apparentlymart
Copy link
Contributor

Through real-world use of Terraform in production I have encountered a variety of "gotchas" relating to state management. This PR is an attempt to tackle one of these where I think I have a reasonable path forward: protecting against overwriting an remote states.

One common way this occurs is through clumsy use of terraform remote config. One variant of this is to forget to enable remote state and then try to enable it afterwards:

  • terraform apply
  • oh whoops! I forgot to enable remote state and now I have duplicated my resources
  • terraform destroy (deletes all of the stuff in the local state, leaving an empty state)
  • terraform remote config ...
  • whoops! I've now overwritten that remote state with my empty local state.

Another variant is when switching from one remote to another:

  • terraform remote config ...
  • terraform apply
  • terraform remote config ...
  • whoops! I've now overwritten the second remote state with the contents of the first.

This PR first introduces the idea of "state lineage", which allows us to determine when one state is an earlier or later version of another state vs. when the two states are entirely unrelated. Comparing Serial is only expected to produce meaningful results when Lineage matches.

Building on this, when syncing local and remote state, fail with an error if the two states are of different lineage. This includes running terraform remote config to enable or re-configure remote state when a local state is already present.

As a special case, if local cache has different lineage but it is entirely empty of resources then just silently drop it and replace it with the remote state, since the loss of an empty state is trivial to recover from and this will reduce friction when users are bootstrapping a new config.

The primary goal here is to return an error when the user seems to be accidentally doing something dangerous, with as little impact as possible to "legitimate" workflows. In future version of Terraform we may make more fundamental changes to these features to help the user not make these mistakes in the first place, but this is intended as a short-term fix to reduce the risk of a state-related catastrophe.

This is the same set of changes as from #6540 except that the change to prevent applying a stale plan is removed, since it no longer works in light of the plan behavior change introduced in #6811.

@apparentlymart apparentlymart changed the title F remote state safety Prevent overwriting states with other unrelated states Jun 6, 2016
@bigkraig
Copy link
Contributor

bigkraig commented Jun 9, 2016

Yes!!

@phinze
Copy link
Contributor

phinze commented Jun 9, 2016

This LGTM on initial review - would love to get @mitchellh's thoughts on overall history and UX here as well as @jen20's eyes since he's been spending a lot of time upgrading state for 0.7!

@bigkraig
Copy link
Contributor

bigkraig commented Jun 9, 2016

@phinze @jen20 Since we are talking about state and 0.7, how about a look at #6732 too? It's preventing us from using 0.7

f.Close()
if err != nil {
t.Fatalf("err: %s", err)
}
Copy link
Contributor

@mitchellh mitchellh Jun 9, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The helper testStateFile should be used here to save you a lot of lines and cleanup. :)

func testStateFile(t *testing.T, s *terraform.State) string {

@phinze
Copy link
Contributor

phinze commented Jun 9, 2016

@bigkraig i know @jen20 was working on remote state today - i'll make sure we address that issue before the next RC 👍

@mitchellh
Copy link
Contributor

I made a few comments that need to be addressed. One other thing I'd recommend is actually documenting some of this behavior in perhaps a "state" internals page for: https://www.terraform.io/docs/internals/index.html

That's probably outside the scope of this PR but its something that I think that should be documented soon since the state is slowly becoming more complex [to protect users]

The lineage of a state is an identifier shared by a set of states whose
serials are meaningfully comparable because they are produced by
progressive Refresh/Apply operations from the same initial empty state.

This is initialized as a type-4 (random) UUID when a new state is
initialized and then preserved on all other changes.

Since states before this change will not have lineage but users may wish
to set a lineage for an existing state in order to get the safety
benefits it will grow to imply, an empty lineage is considered to be
compatible with all lineages.
Accidentally losing a state with resources in it can be anywhere from
annoying to catastrophic. A common cause of such a problem is accidentally
clobbering a remote state with an unrelated local state or vice-versa.

Here we introduce safety checks that exploit the new "lineage" concept
to ensure that once a state location is established it can only be
updated with new states from the same lineage as the initial state.

We also make the remote state caching mechanism treat a lineage mismatch
as a confict, thus ensuring that automatic state syncing will stop if
the user manages to get into a broken, mismatched state.

As a special exception, we allow *empty* states to be "clobbered"
regardless of lineage. This enables remote state to be configured
easily in the common case where earlier user actions have implicitly
caused an empty local state to be created, and also allows the remote
state left behind from a config that has been destroyed to be overwritten
by a new, unrelated state from a different config.

The priority here is preventing actions that should never occur, so the
UX is not polished. Later we may wish to make changes at higher
levels of abstraction to either prevent these situations from arising
in the first place (e.g. making remote configuration automatic based on
config) or giving the user more guidance on resolution.
@apparentlymart
Copy link
Contributor Author

Thanks for all the feedback! It looks like the feedback was on the remote state checking part, and I probably won't have time to iterate on this one more time before 0.7 final is cut, so I split the lineage part off into #7109 in the hope that it can get included in 0.7 in isolation. That will have no effect on the UX but will get this in the state format so that the UX change(s) can follow in a subsequent releases.

@lbernail
Copy link

I overwrote a remote state earlier today by mistake
I was able to recover thanks to versioning on s3

When trying to understand when a bogus local state would overwrite the remote one (my understanding is that today it is only based on serial number) I discovered the new lineage parameter in the state file which lead me to this PR

I must say that this will be really helpful!

@apparentlymart
Copy link
Contributor Author

I believe there are some more fundamental changes to remote state coming in the near future, and this is a pretty old patch anyway, so I'm going to close this out for now and maybe revisit this later once it's clearer how state management is changing.

@mitchellh mitchellh deleted the f-remote-state-safety branch December 13, 2016 20:45
@ghost
Copy link

ghost commented Apr 18, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 18, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants