
storage: CollectChecksum should not be ranged #29002

Closed
tbg opened this issue Aug 23, 2018 · 0 comments · Fixed by #29079
tbg commented Aug 23, 2018

Seen in #28995, quoting for convenience:

Two more observations: ComputeChecksumRequest seems to be implemented in a bad way. It uses a key range and it seems like it could be split by DistSender:

func (*ComputeChecksumRequest) flags() int  { return isWrite | isRange }

Not sure what will happen if that actually occurs. Nothing good! I think we'll get this error from DistSender:

} else if lOK != rOK {
	return errors.Errorf("can not combine %T and %T", valLeft, valRight)
}

so it doesn't seem completely terrible, but either way this is bad for splits, and it's even weirder for merges, because the command will then declare a write over only part of the keyspace. Again, this seems fine because the command really is a no-op (and it shouldn't have to declare any part of the keyspace), but there's some cleanup to do.
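
For reference, the cleanup that eventually landed in #29079 amounts to dropping the isRange flag so that DistSender routes the request to a single range. A minimal sketch of the resulting declaration, assuming the remaining flags stay unchanged:

func (*ComputeChecksumRequest) flags() int { return isWrite }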

I was also worried about something else but checked that it isn't a problem: We have an optimization that avoids sending noops through Raft, and ComputeChecksum looked like a noop. But the code does the right thing and actually sends it through Raft.

// 1. proposal.command == nil indicates that the evaluation was a no-op
// and that no Raft command needs to be proposed.
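
In fragment form, and with hypothetical names, the guard that comment describes looks roughly like this; ComputeChecksum attaches a replicated side effect during evaluation, so it takes the propose branch:

if proposal.command == nil {
	// Evaluation was a no-op: finish the proposal locally without going through Raft.
	finishLocally(proposal)
} else {
	// ComputeChecksum lands here and is proposed through Raft as usual.
	propose(proposal)
}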

tbg added the C-cleanup (Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.) and A-kv-client (Relating to the KV client and the KV interface.) labels Aug 23, 2018
tbg added this to the 2.2 milestone Aug 23, 2018
benesch added a commit to benesch/cockroach that referenced this issue Aug 27, 2018

storage: protect ComputeChecksum commands from replaying

Previously, a ComputeChecksum command could apply twice with the same
ID. Consider the following sequence of events:

  1. DistSender sends a ComputeChecksum request to a replica.
  2. The request is successfully evaluated and proposed, but a connection
     error occurs.
  3. DistSender retries the request, leaving the checksum ID unchanged!

This would result in two ComputeChecksum commands with the same checksum
ID in the Raft log. Somewhat amazingly, this typically wasn't
problematic. If all replicas were online and reasonably up-to-date,
they'd see the first ComputeChecksum command, compute its checksum, and
store it in the checksums map. When they saw the duplicated
ComputeChecksum command, they'd see that a checksum with that ID already
existed and ignore it. In effect, only the first ComputeChecksum command
for a given checksum ID mattered.
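
To make that first-writer-wins behavior concrete, here is a small, self-contained illustration (made-up types and names; the real map is keyed by checksum ID on each replica):

package main

import "fmt"

// applyComputeChecksum mimics the behavior described above: the first command
// seen for a given checksum ID records its checksum, and any later command
// with the same ID is ignored.
func applyComputeChecksum(checksums map[string]string, id, checksum string) {
	if _, ok := checksums[id]; ok {
		return // a checksum with this ID already exists: ignore the duplicate
	}
	checksums[id] = checksum
}

func main() {
	checksums := map[string]string{}
	applyComputeChecksum(checksums, "id-1", "checksum-from-first-command")
	applyComputeChecksum(checksums, "id-1", "checksum-from-duplicate-command")
	fmt.Println(checksums["id-1"]) // prints: checksum-from-first-command
}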

The problem occurred when one replica saw one ComputeChecksum command but
not the other. There were two ways this could occur. A replica could go
offline after computing the checksum the first time; when it came back
online, it would have an empty checksum map, and the checksum computed
for the second ComputeChecksum command would be recorded instead. Or a
replica could receive a snapshot that advanced it past one
ComputeChecksum but not the other. In both cases, the replicas could
spuriously fail a consistency check.

A very similar problem occurred with range merges because ComputeChecksum
requests are incorrectly ranged (see cockroachdb#29002). That means DistSender
might split a ComputeChecksum request in two. Consider what happens when
a consistency check occurs immediately after a merge: the
ComputeChecksum request is generated using the up-to-date, post-merge
descriptor, but DistSender might have the pre-merge descriptors cached,
and so it splits the batch in two. Both halves of the batch would get
routed to the same range, and both halves would have the same command
ID, resulting in the same duplicated ComputeChecksum command problem.

The fix for these problems is to assign the checksum ID when the
ComputeChecksum request is evaluated. If the request is retried, it will
be properly assigned a new checksum ID. Note that we don't need to worry
about reproposals causing duplicate commands, as the MaxLeaseIndex
prevents proposals from replay.
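
As a self-contained sketch of the idea (hypothetical helper; a stand-in for however the real evaluation path allocates its checksum ID):

package main

import (
	"crypto/rand"
	"fmt"
	"log"
)

// newChecksumID allocates a fresh random ID at evaluation time. Because the ID
// is no longer supplied by the client request, a DistSender retry of the same
// request ends up with a distinct ID.
func newChecksumID() string {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		log.Fatal(err)
	}
	return fmt.Sprintf("%x", b)
}

func main() {
	// Two evaluations of the "same" retried request now produce different IDs,
	// so duplicate commands in the Raft log no longer collide in the checksums map.
	fmt.Println(newChecksumID())
	fmt.Println(newChecksumID())
}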

The version compatibility story here is straightforward. The
ReplicaChecksumVersion is bumped, so v2.0 nodes will turn
ComputeChecksum requests proposed by v2.1 nodes into a no-op, and
vice-versa. The consistency queue will spam some complaints into the log
about this--it will time out while collecting checksums--but this will
stop as soon as all nodes have been upgraded to the new version.†

Note that this commit takes the opportunity to migrate
storagebase.ReplicatedEvalResult.ComputeChecksum from
roachpb.ComputeChecksumRequest to a dedicated
storagebase.ComputeChecksum message. Separate types are more in line
with how the merge/split/change replicas triggers work and avoid
shipping unnecessary fields through Raft. Note that even though this
migration changes logic downstream of Raft, it's safe. v2.1 nodes will
turn any ComputeChecksum commands that were committed by v2.0 nodes into
no-ops, and vice-versa, but the only effect of this will be some
temporary consistency queue spam. As an added bonus, because we're
guaranteed that we'll never see duplicate v2.1-style ComputeChecksum
commands, we can properly fatal if we ever see a ComputeChecksum request
with a checksum ID that we've already computed.
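
In fragment form, and with made-up names, that assertion could look something like:

if _, ok := r.checksums[cc.ChecksumID]; ok {
	log.Fatalf(ctx, "duplicate ComputeChecksum command for checksum ID %s", cc.ChecksumID)
}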

† It would be possible to put the late-ID allocation behind a cluster
version to avoid the log spam, but that amounts to allowing v2.1 to
initiate known-buggy consistency checks. A bit of log spam seems
preferable.

Fix cockroachdb#28995.
benesch added a commit to benesch/cockroach that referenced this issue Aug 27, 2018

storage: make ComputeChecksum a point request

ComputeChecksum was previously implemented as a range request, which
meant it could be split by DistSender, resulting in two ComputeChecksum
requests with identical IDs! If those split ranges get routed to the
same range (e.g. because the ranges were just merged), spurious checksum
failures could occur (cockroachdb#28995). Plus, the ComputeChecksum request would
not actually look at the range boundaries in the request header; it
always operated on the range's entire keyspace at the time the request
was applied.

The fix is simple: make ComputeChecksum a point request. There are no
version compatibility issues here; nodes with this commit are simply
smarter about routing ComputeChecksum requests to only one range.

Fix cockroachdb#29002.

Release note: None
craig bot pushed a commit that referenced this issue Aug 27, 2018
29067: storage: protect ComputeChecksum commands from replaying r=tschottdorf a=benesch

Fix #28995.

29083: storage: fix raft snapshots that span merges and splits r=tschottdorf a=benesch

The code that handles Raft snapshots that span merges did not account
for snapshots that spanned merges AND splits. Handle this case by
allowing snapshot subsumption even when the snapshot's end key does not
exactly match the end of an existing replica. See the commits within the
patch for details.

Fix #29080.

Release note: None

29117: opt: fix LookupJoinDef interning, add tests r=RaduBerinde a=RaduBerinde

Fixing an omission I noticed in internLookupJoinDef and adding missing
tests for interning defs.

Release note: None

Co-authored-by: Nikhil Benesch <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
craig bot pushed a commit that referenced this issue Aug 28, 2018
29079: storage: make ComputeChecksum a point request r=tschottdorf a=benesch

Fix #29002.

29145: workload: make split concurrency constant r=nvanbenschoten a=nvanbenschoten

I had originally made the split concurrency for workload
dynamic, based on the concurrency of the workload itself.
This turned out to be a bad idea as it allowed for too
much contention during pre-splitting and resulted in
lots of split retries. The end result was that splits
slowed down over time instead of staying at a constant
rate.

This change makes the split concurrency constant like it
already is with fixture restoration:

https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/workloadccl/fixture.go#L449

This results in pre-splits on large clusters being more stable
and taking much less time (~50%).

Release note: None
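
As an illustrative sketch of that design choice (a fixed worker pool instead of one sized to the workload's own concurrency; the names and the constant are made up):

package main

import (
	"fmt"
	"sync"
)

const splitConcurrency = 16 // fixed, regardless of how many workload workers run

func main() {
	splitPoints := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < splitConcurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range splitPoints {
				// Issue the pre-split for point p here; bounding the number of
				// concurrent splitters limits contention and split retries.
				_ = p
			}
		}()
	}
	for p := 0; p < 1000; p++ {
		splitPoints <- p
	}
	close(splitPoints)
	wg.Wait()
	fmt.Println("pre-splitting done")
}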

Co-authored-by: Nikhil Benesch <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot closed this as completed in #29079 Aug 28, 2018