storage: err on snapshot of large ranges #7788

Merged: 2 commits merged into cockroachdb:master from the snap branch on Jul 18, 2016
Conversation

@dt (Member) commented Jul 12, 2016

This change is Reviewable

@petermattis (Collaborator)

Review status: 0 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


storage/replica_raftstorage.go, line 260 [r2] (raw file):

  rangeID := r.RangeID

  if r.needsSplitBySizeLocked() {

I was imagining we'd only do this if the snapshot size was excessive (i.e. 3-4x the target range size), but perhaps it is always better to delay a snapshot if splitting needs to be done. @bdarnell?
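For readers skimming this thread, here is a minimal runnable sketch of the control flow being debated: refuse to build a snapshot while the range is over its split size and return a retryable error instead. The type, field names, and error value below are hypothetical stand-ins for Replica, needsSplitBySizeLocked, and raft.ErrSnapshotTemporarilyUnavailable; this is not the actual diff.

package main

import (
	"errors"
	"fmt"
)

// errSnapshotTemporarilyUnavailable stands in for raft.ErrSnapshotTemporarilyUnavailable.
var errSnapshotTemporarilyUnavailable = errors.New("snapshot temporarily unavailable")

// replica is a hypothetical stand-in for storage.Replica with just enough
// state to show the shape of the size check.
type replica struct {
	sizeBytes      int64
	splitThreshold int64
}

// needsSplitBySize mirrors the role of needsSplitBySizeLocked in the change.
func (r *replica) needsSplitBySize() bool {
	return r.sizeBytes > r.splitThreshold
}

// snapshot refuses to serialize an oversized range and instead returns a
// retryable error, so the caller tries again after the range has split.
func (r *replica) snapshot() ([]byte, error) {
	if r.needsSplitBySize() {
		return nil, errSnapshotTemporarilyUnavailable
	}
	return []byte("snapshot data"), nil
}

func main() {
	r := &replica{sizeBytes: 128 << 20, splitThreshold: 64 << 20}
	if _, err := r.snapshot(); err != nil {
		fmt.Println("snapshot refused:", err)
	}
}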



@tamird (Contributor) commented Jul 12, 2016

Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 258 [r2] (raw file):

// Snapshot requires that the replica lock is held.
func (r *Replica) Snapshot() (raftpb.Snapshot, error) {
  rangeID := r.RangeID

can be below the size check


storage/replica_raftstorage_test.go, line 34 [r2] (raw file):

const valSize = 1 << 10 // 1 KiB

func fillTestRange(t testing.TB, rep *Replica, rangeID roachpb.RangeID, snapSize int) {

isn't rangeID always rep's range ID? snapSize is also not a useful name in this "general" method


storage/replica_raftstorage_test.go, line 70 [r2] (raw file):

  fillTestRange(t, rep, rangeID, snapSize*2)

  if _, err := rep.Snapshot(); err == nil {

small nit: check that this is the right error, not just non-nil



@dt (Member, Author) commented Jul 12, 2016

Review status: 2 of 4 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 258 [r2] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

can be below the size check

Done.

storage/replica_raftstorage_test.go, line 34 [r2] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

isn't rangeID always rep's range ID? snapSize is also not a useful name in this "general" method

Done.

storage/replica_raftstorage_test.go, line 70 [r2] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

small nit: check that this is the right error, not just non-nil

Done.


@tamird (Contributor) commented Jul 12, 2016

Reviewed 2 of 2 files at r4.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


storage/replica_raftstorage_test.go, line 73 [r4] (raw file):

  if _, err := rep.Snapshot(); err != raft.ErrSnapshotTemporarilyUnavailable {
      t.Fatal("snapshot of a very large range should fail")

log the unexpected error
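A runnable sketch of the assertion being asked for here, using hypothetical stand-ins (fakeReplica, errSnapshotTemporarilyUnavailable) rather than the real Replica and raft error; the point is to check for the specific sentinel error and include whatever unexpected error came back in the failure message.

package storage_test

import (
	"errors"
	"testing"
)

// Hypothetical stand-ins so the assertion style is runnable on its own; the
// real test exercises *Replica and raft.ErrSnapshotTemporarilyUnavailable.
var errSnapshotTemporarilyUnavailable = errors.New("snapshot temporarily unavailable")

type fakeReplica struct{ oversized bool }

func (r *fakeReplica) Snapshot() ([]byte, error) {
	if r.oversized {
		return nil, errSnapshotTemporarilyUnavailable
	}
	return []byte("snapshot"), nil
}

// TestOversizedSnapshotRefused checks for the specific sentinel error and, as
// requested above, logs the unexpected error if something else is returned.
func TestOversizedSnapshotRefused(t *testing.T) {
	rep := &fakeReplica{oversized: true}
	if _, err := rep.Snapshot(); err != errSnapshotTemporarilyUnavailable {
		t.Fatalf("snapshot of a very large range should be temporarily unavailable; got %v", err)
	}
}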



@tamird (Contributor) commented Jul 12, 2016

Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage_test.go, line 85 [r4] (raw file):

  defer stopper.Stop()

  const snapSize = 1 << 25 // 32 MiB

remove?



@bdarnell (Contributor)

:lgtm:


Reviewed 1 of 1 files at r1, 1 of 3 files at r2, 1 of 2 files at r4, 1 of 1 files at r5.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 260 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

I was imagining we'd only do this if the snapshot size was excessive (i.e 3-4x the target range size), but perhaps it is always better to delay a snapshot if splitting needs to be done. @bdarnell?

We use the unmodified target size in `replicateQueue`; whatever we use should be consistent between the two. (this is just a fallback in case the range has grown between the decision to replicate and the time that the snapshot is actually sent)

I think it's better to be strict about this than to allow a multiple of the target size. Until we have streaming snapshots, we already have a 3-4x amplification factor in memory, and allowing 64MB * 4 * 4 = 1GB of memory to be consumed by a snapshot of a single range doesn't sound like a good idea.
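For concreteness, the arithmetic behind that 1GB figure; the multipliers are the assumptions used in this discussion, not measured values.

package main

import "fmt"

func main() {
	const targetRangeSizeMB = 64      // target range size
	const allowedSizeMultiple = 4     // hypothetical "allow up to 4x target" policy being argued against
	const inMemoryAmplification = 4   // rough amplification until snapshots are streamed
	// Prints 1024 MB, i.e. roughly 1 GB consumed by a snapshot of a single range.
	fmt.Printf("worst-case memory for one snapshot: %d MB\n",
		targetRangeSizeMB*allowedSizeMultiple*inMemoryAmplification)
}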



@petermattis (Collaborator)

Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 260 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We use the unmodified target size in replicateQueue; whatever we use should be consistent between the two. (this is just a fallback in case the range has grown between the decision to replicate and the time that the snapshot is actually sent)

I think it's better to be strict about this than to allow a multiple of the target size. Until we have streaming snapshots, we already have a 3-4x amplification factor in memory, and allowing 64MB * 4 * 4 = 1GB of memory to be consumed by a snapshot of a single range doesn't sound like a good idea.

True. My concern was that this limit will block raft from recovering an existing replica of a large range via a snapshot until that range has split.


@bdarnell (Contributor)

Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 260 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

True. My concern was that this limit will block raft from recovering an existing replica of a large range via a snapshot until that range has split.

Hmm. Yeah, it might be worth using a higher limit here since we could be in a state where we can't split until we've recovered a lagging follower.


@tamird (Contributor) commented Jul 13, 2016

Reviewed 1 of 1 files at r5.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.



@petermattis (Collaborator)

Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 260 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Hmm. Yeah, it might be worth using a higher limit here since we could be in a state where we can't split until we've recovered a lagging follower.

How could we be in that situation? We'll only trim the raft log to the quorum commit index. If there is a lagging replica we're already in a danger zone.


@bdarnell (Contributor)

Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 260 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

How could we be in that situation? We'll only trim the raft log to the quorum commit index. If there is a lagging replica we're already in a danger zone.

It's possible in edge cases when recovering from certain near-death experiences. For example, in a range on nodes A, B, and C, B dies, then some time later C dies (without any replica changes in between). B is revived, but C is unrecoverable. This range can recover if and only if A can send a snapshot to B.

I can't come up with any situation in which this matters that doesn't involve multiple failures, so maybe it's not a situation we want to promise to handle, but there are cases where we could theoretically recover but this limit would prevent it.



@dt force-pushed the snap branch 2 times, most recently from 457d172 to a5e1af3 on July 14, 2016 at 19:07
@dt (Member, Author) commented Jul 14, 2016

Review status: 0 of 5 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 260 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It's possible in edge cases when recovering from certain near-death experiences. For example, in a range on nodes A, B, and C, B dies, then some time later C dies (without any replica changes in between). B is revived, but C is unrecoverable. This range can recover if and only if A can send a snapshot to B.

I can't come up with any situation in which this matters that doesn't involve multiple failures, so maybe it's not a situation we want to promise to handle, but there are cases where we could theoretically recover but this limit would prevent it.

Just to be careful, I've bumped it up to 2x split size.
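A minimal sketch of that relaxed threshold: only refuse to snapshot once the range is more than twice the split size. The function and parameter names are hypothetical, not taken from the diff.

package main

import "fmt"

// exceedsSnapshotLimit returns true once the range is more than twice the
// split threshold, which is when a snapshot would be refused under the
// relaxed limit described above.
func exceedsSnapshotLimit(rangeSizeBytes, splitThresholdBytes int64) bool {
	return rangeSizeBytes > 2*splitThresholdBytes
}

func main() {
	const splitThreshold = 64 << 20 // 64 MiB
	fmt.Println(exceedsSnapshotLimit(100<<20, splitThreshold)) // false: below 2x, snapshot allowed
	fmt.Println(exceedsSnapshotLimit(200<<20, splitThreshold)) // true: above 2x, snapshot refused
}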

storage/replica_raftstorage_test.go, line 73 [r4] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

log the unexpected error

Done.


@petermattis (Collaborator)

:lgtm:


Review status: 0 of 5 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.



@dt force-pushed the snap branch 5 times, most recently from 4659d9f to f13a7d3 on July 18, 2016 at 20:17
If a range needs to be split, return an err rather than attempting to generate a snapshot.
This avoids generating excessively large snapshots.

Suggested in cockroachdb#7581.
@dt merged commit 7841a4a into cockroachdb:master on Jul 18, 2016
@dt deleted the snap branch on July 18, 2016 at 20:37
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 9, 2017

In a privately reported user issue, we've seen that [our attempts](cockroachdb#7788)
at [preventing large snapshots](cockroachdb#7581)
can result in replica unavailability. Our current approach to limiting
large snapshots assumes that it's ok to block snapshots indefinitely
while waiting for a range to first split. Unfortunately, this can create
a dependency cycle where a range requires a snapshot to split (because it
can't achieve an up-to-date quorum without it) but isn't allowed to perform
a snapshot until its size is reduced below the threshold. This can result
in unavailability even when a majority of replicas remain live.

Currently, we still need this snapshot size limit because unbounded snapshots
can result in OOM errors that crash entire nodes. However, once snapshots
are streamed from disk to disk, never needing to buffer in memory on the
sending or receiving side, we should be able to remove any snapshot size
limit (see cockroachdb#16954).

As a holdover, this change introduces a `permitLargeSnapshots` flag on a
replica which is set when the replica is too large to snapshot but observes
splits failing. When set, the flag allows snapshots to ignore the size
limit until the snapshot goes through and splits are able to succeed
again.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 12, 2017

Release note (bug fix): Fixed a scenario where a range that is too big
to snapshot can lose availability even with a majority of nodes alive.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 15, 2017

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 15, 2017

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 20, 2017
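The commit message above describes a permitLargeSnapshots escape hatch that is set when splits keep failing on an oversized range. The following is a minimal, runnable sketch of how such a flag could gate the size limit; the fields and methods are hypothetical, not the actual implementation in those commits.

package main

import "fmt"

// replica carries hypothetical fields mirroring the described mechanism: a
// size-based snapshot limit plus a permitLargeSnapshots escape hatch that is
// set once splits keep failing on an oversized range.
type replica struct {
	sizeBytes            int64
	splitThresholdBytes  int64
	permitLargeSnapshots bool
}

// tooLargeToSnapshot applies the size limit unless the escape hatch is set.
func (r *replica) tooLargeToSnapshot() bool {
	return r.sizeBytes > 2*r.splitThresholdBytes && !r.permitLargeSnapshots
}

// onSplitFailed is where the escape hatch would be flipped, so a lagging
// follower can still be caught up and the split can eventually succeed.
func (r *replica) onSplitFailed() {
	if r.sizeBytes > 2*r.splitThresholdBytes {
		r.permitLargeSnapshots = true
	}
}

func main() {
	r := &replica{sizeBytes: 300 << 20, splitThresholdBytes: 64 << 20}
	fmt.Println("snapshot blocked before split failure:", r.tooLargeToSnapshot()) // true
	r.onSplitFailed()
	fmt.Println("snapshot blocked after split failure: ", r.tooLargeToSnapshot()) // false
}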