storage: err on snapshot of large ranges #7788

Merged: 2 commits merged into cockroachdb:master from the snap branch on Jul 18, 2016
Conversation

@dt (Member) commented Jul 12, 2016

This change is Reviewable

@petermattis (Collaborator)

Review status: 0 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


storage/replica_raftstorage.go, line 260 [r2] (raw file):

  rangeID := r.RangeID

  if r.needsSplitBySizeLocked() {

I was imagining we'd only do this if the snapshot size was excessive (i.e. 3-4x the target range size), but perhaps it is always better to delay a snapshot if splitting needs to be done. @bdarnell?
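For readers skimming this thread, here is a minimal runnable sketch of the control flow being debated: refuse to build a snapshot while the range is over its split size and return a retryable error instead. The type, field names, and error value below are hypothetical stand-ins for Replica, needsSplitBySizeLocked, and raft.ErrSnapshotTemporarilyUnavailable; this is not the actual diff.

package main

import (
	"errors"
	"fmt"
)

// errSnapshotTemporarilyUnavailable stands in for raft.ErrSnapshotTemporarilyUnavailable.
var errSnapshotTemporarilyUnavailable = errors.New("snapshot temporarily unavailable")

// replica is a hypothetical stand-in for storage.Replica with just enough
// state to show the shape of the size check.
type replica struct {
	sizeBytes      int64
	splitThreshold int64
}

// needsSplitBySize mirrors the role of needsSplitBySizeLocked in the change.
func (r *replica) needsSplitBySize() bool {
	return r.sizeBytes > r.splitThreshold
}

// snapshot refuses to serialize an oversized range and instead returns a
// retryable error, so the caller tries again after the range has split.
func (r *replica) snapshot() ([]byte, error) {
	if r.needsSplitBySize() {
		return nil, errSnapshotTemporarilyUnavailable
	}
	return []byte("snapshot data"), nil
}

func main() {
	r := &replica{sizeBytes: 128 << 20, splitThreshold: 64 << 20}
	if _, err := r.snapshot(); err != nil {
		fmt.Println("snapshot refused:", err)
	}
}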



@tamird (Contributor) commented Jul 12, 2016

Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 258 [r2] (raw file):

// Snapshot requires that the replica lock is held.
func (r *Replica) Snapshot() (raftpb.Snapshot, error) {
  rangeID := r.RangeID

can be below the size check


storage/replica_raftstorage_test.go, line 34 [r2] (raw file):

const valSize = 1 << 10 // 1 KiB

func fillTestRange(t testing.TB, rep *Replica, rangeID roachpb.RangeID, snapSize int) {

isn't rangeID always rep's range ID? snapSize is also not a useful name in this "general" method


storage/replica_raftstorage_test.go, line 70 [r2] (raw file):

  fillTestRange(t, rep, rangeID, snapSize*2)

  if _, err := rep.Snapshot(); err == nil {

small nit: check that this is the right error, not just non-nil



@dt (Member, Author) commented Jul 12, 2016

Review status: 2 of 4 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 258 [r2] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

can be below the size check

Done.

storage/replica_raftstorage_test.go, line 34 [r2] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

isn't rangeID always rep's range ID? snapSize is also not a useful name in this "general" method

Done.

storage/replica_raftstorage_test.go, line 70 [r2] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

small nit: check that this is the right error, not just non-nil

Done.


@tamird (Contributor) commented Jul 12, 2016

Reviewed 2 of 2 files at r4.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


storage/replica_raftstorage_test.go, line 73 [r4] (raw file):

  if _, err := rep.Snapshot(); err != raft.ErrSnapshotTemporarilyUnavailable {
      t.Fatal("snapshot of a very large range should fail")

log the unexpected error
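A runnable sketch of the assertion being asked for here, using hypothetical stand-ins (fakeReplica, errSnapshotTemporarilyUnavailable) rather than the real Replica and raft error; the point is to check for the specific sentinel error and include whatever unexpected error came back in the failure message.

package storage_test

import (
	"errors"
	"testing"
)

// Hypothetical stand-ins so the assertion style is runnable on its own; the
// real test exercises *Replica and raft.ErrSnapshotTemporarilyUnavailable.
var errSnapshotTemporarilyUnavailable = errors.New("snapshot temporarily unavailable")

type fakeReplica struct{ oversized bool }

func (r *fakeReplica) Snapshot() ([]byte, error) {
	if r.oversized {
		return nil, errSnapshotTemporarilyUnavailable
	}
	return []byte("snapshot"), nil
}

// TestOversizedSnapshotRefused checks for the specific sentinel error and, as
// requested above, logs the unexpected error if something else is returned.
func TestOversizedSnapshotRefused(t *testing.T) {
	rep := &fakeReplica{oversized: true}
	if _, err := rep.Snapshot(); err != errSnapshotTemporarilyUnavailable {
		t.Fatalf("snapshot of a very large range should be temporarily unavailable; got %v", err)
	}
}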



@tamird (Contributor) commented Jul 12, 2016

Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage_test.go, line 85 [r4] (raw file):

  defer stopper.Stop()

  const snapSize = 1 << 25 // 32 MiB

remove?



@bdarnell (Contributor)

:lgtm:


Reviewed 1 of 1 files at r1, 1 of 3 files at r2, 1 of 2 files at r4, 1 of 1 files at r5.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 260 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

I was imagining we'd only do this if the snapshot size was excessive (i.e 3-4x the target range size), but perhaps it is always better to delay a snapshot if splitting needs to be done. @bdarnell?

We use the unmodified target size in `replicateQueue`; whatever we use should be consistent between the two. (this is just a fallback in case the range has grown between the decision to replicate and the time that the snapshot is actually sent)

I think it's better to be strict about this than to allow a multiple of the target size. Until we have streaming snapshots, we already have a 3-4x amplification factor in memory, and allowing 64MB * 4 * 4 = 1GB of memory to be consumed by a snapshot of a single range doesn't sound like a good idea.
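For concreteness, the arithmetic behind that 1GB figure; the multipliers are the assumptions used in this discussion, not measured values.

package main

import "fmt"

func main() {
	const targetRangeSizeMB = 64      // target range size
	const allowedSizeMultiple = 4     // hypothetical "allow up to 4x target" policy being argued against
	const inMemoryAmplification = 4   // rough amplification until snapshots are streamed
	// Prints 1024 MB, i.e. roughly 1 GB consumed by a snapshot of a single range.
	fmt.Printf("worst-case memory for one snapshot: %d MB\n",
		targetRangeSizeMB*allowedSizeMultiple*inMemoryAmplification)
}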



@petermattis (Collaborator)

Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 260 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We use the unmodified target size in replicateQueue; whatever we use should be consistent between the two. (this is just a fallback in case the range has grown between the decision to replicate and the time that the snapshot is actually sent)

I think it's better to be strict about this than to allow a multiple of the target size. Until we have streaming snapshots, we already have a 3-4x amplification factor in memory, and allowing 64MB * 4 * 4 = 1GB of memory to be consumed by a snapshot of a single range doesn't sound like a good idea.

True. My concern was that this limit will block raft from recovering an existing replica of a large range via a snapshot until that range has split.


@bdarnell (Contributor)

Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 260 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

True. My concern was that this limit will block raft from recovering an existing replica of a large range via a snapshot until that range has split.

Hmm. Yeah, it might be worth using a higher limit here since we could be in a state where we can't split until we've recovered a lagging follower.


@tamird (Contributor) commented Jul 13, 2016

Reviewed 1 of 1 files at r5.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.



@petermattis (Collaborator)

Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 260 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Hmm. Yeah, it might be worth using a higher limit here since we could be in a state where we can't split until we've recovered a lagging follower.

How could we be in that situation? We'll only trim the raft log to the quorum commit index. If there is a lagging replica we're already in a danger zone.


@bdarnell (Contributor)

Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 260 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

How could we be in that situation? We'll only trim the raft log to the quorum commit index. If there is a lagging replica we're already in a danger zone.

It's possible in edge cases when recovering from certain near-death experiences. For example, in a range on nodes A, B, and C, B dies, then some time later C dies (without any replica changes in between). B is revived, but C is unrecoverable. This range can recover if and only if A can send a snapshot to B.

I can't come up with any situation in which this matters that doesn't involve multiple failures, so maybe it's not a situation we want to promise to handle, but there are cases where we could theoretically recover but this limit would prevent it.



@dt force-pushed the snap branch 2 times, most recently from 457d172 to a5e1af3 on July 14, 2016 at 19:07
@dt (Member, Author) commented Jul 14, 2016

Review status: 0 of 5 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 260 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It's possible in edge cases when recovering from certain near-death experiences. For example, in a range on nodes A, B, and C, B dies, then some time later C dies (without any replica changes in between). B is revived, but C is unrecoverable. This range can recover if and only if A can send a snapshot to B.

I can't come up with any situation in which this matters that doesn't involve multiple failures, so maybe it's not a situation we want to promise to handle, but there are cases where we could theoretically recover but this limit would prevent it.

Just to be careful, I've bumped it up to 2x split size.
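A minimal sketch of that relaxed threshold: only refuse to snapshot once the range is more than twice the split size. The function and parameter names are hypothetical, not taken from the diff.

package main

import "fmt"

// exceedsSnapshotLimit returns true once the range is more than twice the
// split threshold, which is when a snapshot would be refused under the
// relaxed limit described above.
func exceedsSnapshotLimit(rangeSizeBytes, splitThresholdBytes int64) bool {
	return rangeSizeBytes > 2*splitThresholdBytes
}

func main() {
	const splitThreshold = 64 << 20 // 64 MiB
	fmt.Println(exceedsSnapshotLimit(100<<20, splitThreshold)) // false: below 2x, snapshot allowed
	fmt.Println(exceedsSnapshotLimit(200<<20, splitThreshold)) // true: above 2x, snapshot refused
}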

storage/replica_raftstorage_test.go, line 73 [r4] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

log the unexpected error

Done.


@petermattis (Collaborator)

:lgtm:


Review status: 0 of 5 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.



@dt force-pushed the snap branch 5 times, most recently from 4659d9f to f13a7d3 on July 18, 2016 at 20:17
If a range needs to be split, return an err rather than attempting to generate a snapshot.
This avoids generating excessively large snapshots.

Suggested in cockroachdb#7581.
@dt merged commit 7841a4a into cockroachdb:master on Jul 18, 2016
@dt deleted the snap branch on July 18, 2016 at 20:37
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 9, 2017

In a privately reported user issue, we've seen that [our attempts](cockroachdb#7788)
at [preventing large snapshots](cockroachdb#7581)
can result in replica unavailability. Our current approach to limiting
large snapshots assumes that it's ok to block snapshots indefinitely
while waiting for a range to first split. Unfortunately, this can create
a dependency cycle where a range requires a snapshot to split (because it
can't achieve an up-to-date quorum without it) but isn't allowed to perform
a snapshot until its size is reduced below the threshold. This can result
in unavailability even when a majority of replicas remain live.

Currently, we still need this snapshot size limit because unbounded snapshots
can result in OOM errors that crash entire nodes. However, once snapshots
are streamed from disk to disk, never needing to buffer in memory on the
sending or receiving side, we should be able to remove any snapshot size
limit (see cockroachdb#16954).

As a holdover, this change introduces a `permitLargeSnapshots` flag on a
replica which is set when the replica is too large to snapshot but observes
splits failing. When set, the flag allows snapshots to ignore the size
limit until the snapshot goes through and splits are able to succeed
again.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 12, 2017

Release note (bug fix): Fixed a scenario where a range that is too big
to snapshot can lose availability even with a majority of nodes alive.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 15, 2017

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 15, 2017

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 20, 2017
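The commit message above describes a permitLargeSnapshots escape hatch that is set when splits keep failing on an oversized range. The following is a minimal, runnable sketch of how such a flag could gate the size limit; the fields and methods are hypothetical, not the actual implementation in those commits.

package main

import "fmt"

// replica carries hypothetical fields mirroring the described mechanism: a
// size-based snapshot limit plus a permitLargeSnapshots escape hatch that is
// set once splits keep failing on an oversized range.
type replica struct {
	sizeBytes            int64
	splitThresholdBytes  int64
	permitLargeSnapshots bool
}

// tooLargeToSnapshot applies the size limit unless the escape hatch is set.
func (r *replica) tooLargeToSnapshot() bool {
	return r.sizeBytes > 2*r.splitThresholdBytes && !r.permitLargeSnapshots
}

// onSplitFailed is where the escape hatch would be flipped, so a lagging
// follower can still be caught up and the split can eventually succeed.
func (r *replica) onSplitFailed() {
	if r.sizeBytes > 2*r.splitThresholdBytes {
		r.permitLargeSnapshots = true
	}
}

func main() {
	r := &replica{sizeBytes: 300 << 20, splitThresholdBytes: 64 << 20}
	fmt.Println("snapshot blocked before split failure:", r.tooLargeToSnapshot()) // true
	r.onSplitFailed()
	fmt.Println("snapshot blocked after split failure: ", r.tooLargeToSnapshot()) // false
}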