
storage: protect ComputeChecksum commands from replaying #29067

Merged 1 commit into cockroachdb:master on Aug 27, 2018

Conversation

@benesch (Contributor) commented Aug 26, 2018

Previously, a ComputeChecksum command could apply twice with the same
ID. Consider the following sequence of events:

  1. DistSender sends a ComputeChecksum request to a replica.
  2. The request is successfully evaluated and proposed, but a connection
     error occurs.
  3. DistSender retries the request, leaving the checksum ID unchanged!

This would result in two ComputeChecksum commands with the same checksum
ID in the Raft log. Somewhat amazingly, this typically wasn't
problematic. If all replicas were online and reasonably up-to-date,
they'd see the first ComputeChecksum command, compute its checksum, and
store it in the checksums map. When they saw the duplicated
ComputeChecksum command, they'd see that a checksum with that ID already
existed and ignore it. In effect, only the first ComputeChecksum command
for a given checksum ID mattered.

The problem occurred when one replica saw one ComputeChecksum command but
not the other. There were two ways this could occur. A replica could go
offline after computing the checksum the first time; when it came back
online, it would have an empty checksum map, and the checksum computed
for the second ComputeChecksum command would be recorded instead. Or a
replica could receive a snapshot that advanced it past one
ComputeChecksum but not the other. In both cases, the replicas could
spuriously fail a consistency check.
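
To make that failure mode concrete, here is a minimal Go sketch of the
pre-fix apply-side behavior. The type and method names are illustrative
assumptions, not the actual CockroachDB identifiers:

    package storage

    import "github.com/google/uuid"

    // checksum stands in for a replica's computed checksum result.
    type checksum [32]byte

    type replica struct {
        checksums map[uuid.UUID]checksum // checksum ID -> result
    }

    // applyComputeChecksum sketches the pre-fix behavior: a command whose
    // checksum ID is already recorded is silently ignored. That made the
    // duplicate command harmless -- until a restart emptied the map or a
    // snapshot skipped one of the two commands, after which the replica
    // recorded a checksum computed over a different range state than its
    // peers, spuriously failing the consistency check.
    func (r *replica) applyComputeChecksum(id uuid.UUID, sum checksum) {
        if _, ok := r.checksums[id]; ok {
            return // duplicate checksum ID: ignore
        }
        r.checksums[id] = sum
    }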

A very similar problem occurred with range merges because ComputeChecksum
requests are incorrectly ranged (see #29002). That means DistSender
might split a ComputeChecksum request in two. Consider what happens when
a consistency check occurs immediately after a merge: the
ComputeChecksum request is generated using the up-to-date, post-merge
descriptor, but DistSender might have the pre-merge descriptors cached,
and so it splits the batch in two. Both halves of the batch would get
routed to the same range, and both halves would have the same command
ID, resulting in the same duplicated ComputeChecksum command problem.

The fix for these problems is to assign the checksum ID when the
ComputeChecksum request is evaluated. If the request is retried, it will
be properly assigned a new checksum ID. Note that we don't need to worry
about reproposals causing duplicate commands, as the MaxLeaseIndex check
prevents proposals from replaying.
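
As a rough sketch of the fix, with assumed names (the real evaluation
function, trigger type, and response plumbing differ):

    package batcheval

    import "github.com/google/uuid"

    // computeChecksumTrigger stands in for the replicated side effect
    // that carries the checksum ID through Raft to each replica.
    type computeChecksumTrigger struct {
        ChecksumID uuid.UUID
    }

    // evalComputeChecksum mints the checksum ID at evaluation time and
    // returns it both to the caller (so the consistency check can collect
    // the result) and in the trigger that applies through Raft. A retried
    // request is re-evaluated and thus gets a fresh ID; a reproposal of
    // the same evaluation reuses this ID but is rejected by the
    // MaxLeaseIndex check, so the command cannot apply twice.
    func evalComputeChecksum() (uuid.UUID, computeChecksumTrigger) {
        id := uuid.New()
        return id, computeChecksumTrigger{ChecksumID: id}
    }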

The version compatibility story here is straightforward. The
ReplicaChecksumVersion is bumped, so v2.0 nodes will turn
ComputeChecksum requests proposed by v2.1 nodes into no-ops, and
vice-versa. The consistency queue will spam some complaints into the log
about this--it will time out while collecting checksums--but this will
stop as soon as all nodes have been upgraded to the new version.†

Note that this commit takes the opportunity to migrate
storagebase.ReplicatedEvalResult.ComputeChecksum from
roachpb.ComputeChecksumRequest to a dedicated
storagebase.ComputeChecksum message. Separate types are more in line
with how the merge/split/change replicas triggers work and avoid
shipping unnecessary fields through Raft. Note that even though this
migration changes logic downstream of Raft, it's safe. v2.1 nodes will
turn any ComputeChecksum commands that were committed by v2.0 nodes into
no-ops, and vice-versa, but the only effect of this will be some
temporary consistency queue spam. As an added bonus, because we're
guaranteed that we'll never see duplicate v2.1-style ComputeChecksum
commands, we can properly fatal if we ever see a ComputeChecksum request
with a checksum ID that we've already computed.
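
Concretely, the v2.1 apply path can now treat a repeated checksum ID as a
fatal condition rather than ignoring it. A self-contained sketch, again
with assumed names (the real code fatals through CockroachDB's log
package with a context):

    package storage

    import (
        "log"

        "github.com/google/uuid"
    )

    type checksum [32]byte

    type replica struct {
        checksums map[uuid.UUID]checksum
    }

    // With checksum IDs minted at evaluation time, seeing a known ID
    // again can only indicate a bug, so fail hard instead of silently
    // dropping the command as the pre-fix code did.
    func (r *replica) applyComputeChecksum(id uuid.UUID, sum checksum) {
        if _, ok := r.checksums[id]; ok {
            log.Fatalf("checksum ID %s already recorded", id)
        }
        r.checksums[id] = sum
    }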

† It would be possible to put the late-ID allocation behind a cluster
version to avoid the log spam, but that amounts to allowing v2.1 to
initiate known-buggy consistency checks. A bit of log spam seems
preferable.

Fix #28995.

@benesch requested review from tbg and a team August 26, 2018 05:31
@cockroach-teamcity (Member) commented

This change is Reviewable

@tbg (Member) left a comment

:lgtm:

Reviewed 11 of 11 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/batcheval/cmd_compute_checksum.go, line 49 at r1 (raw file):

	if args.Version != ReplicaChecksumVersion {
		log.Errorf(ctx, "Incompatible versions: e=%d, v=%d", ReplicaChecksumVersion, args.Version)

I would downgrade this from Errorf to Infof and make the message more benign.

log.Infof(ctx, "ignoring checksum request due to version mismatch (request: %d, local: %d)", args.Version, ReplicaChecksumVersion)

@benesch (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/batcheval/cmd_compute_checksum.go, line 49 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I would downgrade this from Errorf to Infof and make the message more benign.

log.Infof(ctx, "ignoring checksum request due to version mismatch (request: %d, local: %d)", args.Version, ReplicaChecksumVersion)

Good idea. Done.



@benesch (Contributor, Author) commented Aug 27, 2018

bors r=tschottdorf

@tbg (Member) left a comment

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@craig (bot) commented Aug 27, 2018

Merge conflict (retrying...)

@benesch (Contributor, Author) commented Aug 27, 2018

Uh, bors crashed.

bors r=tschottdorf

craig bot pushed a commit that referenced this pull request Aug 27, 2018
29067: storage: protect ComputeChecksum commands from replaying r=tschottdorf a=benesch

29083: storage: fix raft snapshots that span merges and splits r=tschottdorf a=benesch

The code that handles Raft snapshots that span merges did not account
for snapshots that spanned merges AND splits. Handle this case by
allowing snapshot subsumption even when the snapshot's end key does not
exactly match the end of an existing replica. See the commits within the
patch for details.

Fix #29080.

Release note: None

29117: opt: fix LookupJoinDef interning, add tests r=RaduBerinde a=RaduBerinde

Fixing an omission I noticed in internLookupJoinDef and adding missing
tests for interning defs.

Release note: None

Co-authored-by: Nikhil Benesch <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
@craig (bot) commented Aug 27, 2018

Build succeeded

@craig (bot) merged commit c952cb3 into cockroachdb:master Aug 27, 2018
@benesch deleted the consistency-safe branch August 27, 2018 16:54
craig bot pushed a commit that referenced this pull request Aug 27, 2018
29126: backport-2.1: another round of merge bug fixes r=tschottdorf a=benesch

Backport:
  * 1/1 commits from "storage: fix raft snapshots that span merges and splits" (#29083)
  * 1/1 commits from "storage: deflake TestStoreRangeMergeReadoptedBothFollowers" (#29084)
  * 1/1 commits from "storage: protect ComputeChecksum commands from replaying" (#29067)
  * 1/1 commits from "storage: make ComputeChecksum a point request" (#29079)

Please see individual PRs for details.

/cc @cockroachdb/release


Co-authored-by: Nikhil Benesch <[email protected]>
Successfully merging this pull request may close these issues: roachtest: clearrange failed on master (#28995)