Last update: April 10, 2019
Original author: Nikhil Benesch
This document serves as an end-to-end description of the implementation of range merges. The target reader is someone who is reasonably familiar with core but unfamiliar with either the "how" or the "why" of the range merge implementation.
The most complete documentation, of course, is in the code, tests, and the surrounding comments, but those pieces are necessarily split across several files and packages. That scattered knowledge is centralized here, without excessive detail that is likely to become stale.
A range merge begins when two adjacent ranges are selected to be merged together. For example, suppose our adjacent ranges are P and Q, somewhere in the middle of the keyspace:
--+-----+-----+--
  |  P  |  Q  |
--+-----+-----+--
We'll call P the left-hand side (LHS) of the merge, and Q the right-hand side (RHS) of the merge. For reasons that will become clear later, we also refer to P as the subsuming range and Q as the subsumed range.
The merge is coordinated by the LHS. The coordinator begins by verifying that a) the two ranges are, in fact, adjacent, and b) that the replica sets of the two ranges are aligned. Replica set alignment is a term that is currently only relevant to merges; it means that the set of stores with replicas of the LHS exactly matches the set of stores with replicas of the RHS. For example, this replica set is aligned:
Store 1 Store 2 Store 3 Store 4
+-----+ +-----+ +-----+ +-----+
| P Q | | P Q | |     | | P Q |
+-----+ +-----+ +-----+ +-----+
By requiring replica set alignment, the merge operation is reduced to a metadata update, albeit a tricky one, as the stores that will have a copy of the merged range PQ already have all the constituent data, by virtue of having a copy of both P and Q before the merge begins. Note that replicas of P and Q do not need to be fully up-to-date before the merge begins; they'll be caught up as necessary during the transfer of power.
After verifying that the merge is sensible, the coordinator transactionally updates the implicated range descriptors, adjusting P's range descriptor so that it extends to Q's end and deleting Q's range descriptor.
Then, the coordinator needs to atomically move responsibility for the data in the RHS to the LHS. This is tricky, as the lease on the LHS may be held by a different store than the lease on the RHS. The coordinator notifies the RHS that it is about to be subsumed and that it is prohibited from serving any additional read or write traffic. Only when the coordinator has received an acknowledgement from every replica of the RHS, indicating that no traffic is possibly being served on the RHS, does the coordinator commit the merge transaction.
Like with splits, the merge transaction is committed with a special "commit trigger" that instructs the receiving store to update its in-memory bookkeeping to match the updates to the range descriptors in the transaction. The moment the merge transaction is considered committed, the merge is complete!
At the time of writing, merges are only initiated by the merge queue, which is responsible both for locating ranges that are in need of a merge and aligning their replica sets before initiating the merge.
The remaining sections cover each of these steps in more detail.
Not any two ranges can be merged. The first and most obvious criterion is that the two ranges must be adjacent. Suppose a simplified cluster that has only three ranges, A, B, and C:
+-----+-----+-----+
| A | B | C |
+-----+-----+-----+
Ranges A and B can be merged, as can ranges B and C, but not ranges A and C, as they are not adjacent. Note that adjacent ranges are equivalently referred to as "neighbors", as in, range B is range A's right-hand neighbor.
The second criterion is that the replica sets must be aligned. To illustrate, consider a four node cluster with the default 3x replication. The allocator has attempted to balance ranges as evenly as possible:
Node 1 Node 2 Node 3 Node 4
+-----+ +-----+ +-----+ +-----+
| P Q | |  P  | |  Q  | | P Q |
+-----+ +-----+ +-----+ +-----+
Notice how node 2 does not have a copy of Q, and node 3 does not have a copy of P. These replica sets are considered "misaligned." Aligning them requires rebalancing Q from node 3 to node 2, or rebalancing P from node 2 to node 3:
Node 1 Node 2 Node 3 Node 4
+-----+ +-----+ +-----+ +-----+
| P Q | | P Q | |     | | P Q |
+-----+ +-----+ +-----+ +-----+
Node 1 Node 2 Node 3 Node 4
+-----+ +-----+ +-----+ +-----+
| P Q | |     | | P Q | | P Q |
+-----+ +-----+ +-----+ +-----+
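To make the definition concrete, here is a minimal standalone sketch of the alignment check, using simplified stand-ins for the descriptor types rather than the real roachpb definitions:

package main

import (
	"fmt"
	"sort"
)

// StoreID and rangeDesc are simplified stand-ins for the real descriptor
// types; only the replica membership matters for this check.
type StoreID int

type rangeDesc struct {
	stores []StoreID // stores that hold a replica of the range
}

// replicaSetsAligned reports whether two ranges have replicas on exactly the
// same set of stores.
func replicaSetsAligned(lhs, rhs rangeDesc) bool {
	if len(lhs.stores) != len(rhs.stores) {
		return false
	}
	a := append([]StoreID(nil), lhs.stores...)
	b := append([]StoreID(nil), rhs.stores...)
	sort.Slice(a, func(i, j int) bool { return a[i] < a[j] })
	sort.Slice(b, func(i, j int) bool { return b[i] < b[j] })
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	p := rangeDesc{stores: []StoreID{1, 2, 4}} // P from the diagram above
	q := rangeDesc{stores: []StoreID{1, 3, 4}} // Q from the diagram above
	fmt.Println(replicaSetsAligned(p, q))      // false: misaligned
}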
We explored an alternative merge implementation that did not require aligned replica sets, but found it to be unworkable. See the misaligned replica sets misstep for details.
A merge is initiated by sending an AdminMerge request to a range. As with other admin commands, DistSender will automatically route the request to the leaseholder of the range, but there is no guarantee that the store will retain its lease while the admin command is executing.
Note that an AdminMerge request takes no arguments, as there is no choice in which ranges will be merged. The recipient of the AdminMerge will always be the LHS, i.e., the subsuming range, and its right neighbor at the moment the AdminMerge command begins executing will always be the RHS, i.e., the subsumed range.
It would have been reasonable to have instead used the RHS to coordinate the merge. That is, the RHS would have been the subsuming range, and the LHS would have been the subsumed range. Using the LHS to coordinate, however, yields a nice symmetry with splits, where the range that coordinates a split becomes the LHS of the split. Maintaining this symmetry means that a range's start key never changes during its lifetime, while its end key may change arbitrarily in response to splits and merges.
There is another reason to prefer using the LHS to coordinate involving an oddity of key encoding and range bounds. It is trivial for a range to send a request to its right neighbor, as it simply addresses the request to its end key, but it is difficult to send a request to its left neighbor, as there is no function to get the key that immediately precedes the range's start key. See the key encoding oddities section of the appendix for details.
At the time of writing, only the merge queue initiates merges, and it does so by bypassing DistSender and invoking the AdminMerge command directly on the local replica. At some point in the future, we may wish to expose manual merges via SQL, at which point the SQL layer will need to send proper AdminMerge requests through the KV API.
At present, AdminMerge requests are subject to a small race. It is possible for the ranges implicated by an AdminMerge request to split or merge between when the client decides to send an AdminMerge request and when the AdminMerge request is processed.
For example, suppose the client decides that P and Q should be merged and sends an AdminMerge request to P. It is possible that, before the AdminMerge request is processed, P splits into P1 and P2. The AdminMerge request will thus result in P1 and P2 merging together, and not the desired P and Q.
The race could have been avoided if the AdminMerge request required that the descriptors for the implicated ranges were provided as arguments to the request. Then the merge could be aborted if the merge transaction discovered that either of the implicated ranges did not match the corresponding descriptor in the AdminMerge request arguments, forming a sort of optimistic lock.
Fortunately, the race is rare in practice. If it proves to be a problem, the scheme described above would be easy to implement while maintaining backwards compatibility.
The merge transaction piggybacks on CockroachDB's serializability to provide much of the necessary synchronization for the bookkeeping updates. For example, merges cannot occur concurrently with any splits or replica changes on the implicated ranges: the merge transaction naturally conflicts with the corresponding split and change-replicas transactions, as all of them attempt to write updated range descriptors. No additional code was needed to enforce this, as our standard transaction conflict detection mechanisms kick in here (write intents, the timestamp cache, the span latch manager, etc.).
Note that there was one surprising synchronization problem that was not immediately handled by serializability. See range descriptor generations for details.
The standard KV operations that the merge transaction performs are:
- Reading the LHS descriptor and RHS descriptor, and verifying that their replica sets are aligned.
- Updating the local and meta copy of the LHS descriptor to reflect the widened end key.
- Deleting the local and meta copy of the RHS descriptor.
- Writing an entry to the system.rangelog table.
These operations are the essence of a merge, and in fact update all necessary on-disk data! All the remaining complexity exists to update in-memory metadata while the cluster is live.
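Putting those operations together, the transactional portion can be sketched roughly as follows. The txn interface, descriptor keys, and helper names are illustrative stand-ins, not CockroachDB's real KV API; the freeze and the commit trigger, covered in the following sections, are omitted:

package mergetxnsketch

import "errors"

// rangeDesc is a simplified stand-in for roachpb.RangeDescriptor.
type rangeDesc struct {
	startKey, endKey string
	stores           []int // store IDs holding replicas
}

// txn abstracts the handful of transactional operations the sketch needs.
type txn interface {
	getDesc(descKey string) (rangeDesc, error) // read a descriptor copy
	putDesc(descKey string, d rangeDesc) error // update a descriptor copy
	delDesc(descKey string) error              // delete a descriptor copy
	logRangeEvent(event string) error          // append to system.rangelog
}

func sameStores(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	seen := make(map[int]bool, len(a))
	for _, s := range a {
		seen[s] = true
	}
	for _, s := range b {
		if !seen[s] {
			return false
		}
	}
	return true
}

// runMergeTxn performs the merge transaction's KV operations for adjacent
// ranges P (the LHS) and Q (the RHS).
func runMergeTxn(t txn, lhsLocalKey, lhsMetaKey, rhsLocalKey, rhsMetaKey string) error {
	lhs, err := t.getDesc(lhsLocalKey)
	if err != nil {
		return err
	}
	rhs, err := t.getDesc(rhsLocalKey)
	if err != nil {
		return err
	}
	// Verify the merge is sensible: adjacency and replica set alignment.
	if lhs.endKey != rhs.startKey || !sameStores(lhs.stores, rhs.stores) {
		return errors.New("ranges not adjacent or replica sets not aligned")
	}
	// Widen the LHS descriptor to cover the RHS's keyspace. The local copy is
	// written first so that the transaction record is created on the LHS.
	widened := lhs
	widened.endKey = rhs.endKey
	if err := t.putDesc(lhsLocalKey, widened); err != nil {
		return err
	}
	if err := t.putDesc(lhsMetaKey, widened); err != nil {
		return err
	}
	// Delete the local and meta copies of the RHS descriptor.
	if err := t.delDesc(rhsLocalKey); err != nil {
		return err
	}
	if err := t.delDesc(rhsMetaKey); err != nil {
		return err
	}
	return t.logRangeEvent("merge")
}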
Note that the merge transaction's KV operations are not fundamentally dependent on one another and so could theoretically be performed in any order. There are, however, several implementation details that enforce some ordering constraints.
First, the merge transaction record needs to be located on the LHS. Specifically, the transaction record needs to live on the subsuming range, as the commit trigger that actually applies the merge to the replica's in-memory state runs on the range where the transaction record lives. The transaction record is created on the range that the transaction writes first; therefore, the merge transaction is careful to update the local copy of the LHS descriptor as its first operation, since the local copy of the LHS descriptor lives on the LHS.
UPDATE in v21.1:
As of v21.1, the merge transaction uses a locking read on the LHS range's transaction record as its first course of action. This eliminates thrashing and livelock in the face of concurrent merge attempts, which we typically only see in testing scenarios but which we'd like to handle effectively. It also has the effect of locating the transaction record on the LHS earlier in the transaction. Because of this, the merge transaction no longer needs to be quite as deliberate about the order in which it updates descriptors later on because that order does not impact the location of the transaction record.
Second, the merge transaction must ensure that, when it issues the delete request to remove the local copy of the RHS descriptor, the resulting intent is actually written to disk. (See the transfer of power subsection for why this is required.) Thanks to transactional pipelining, KV writes can return early, before their intents have actually been laid down. The intents are not required to make it to disk until the moment before the transaction commits. The merge transaction simply disables pipelining to avoid this hazard.
As the last step before the commit, the merge transaction needs to freeze the RHS, then wait for every replica of the RHS to apply all outstanding commands. This ensures that, when the merge commits, every LHS replica can blindly assume that it has perfectly up-to-date data for the RHS. To quickly recap, this is guaranteed because 1) the replica sets were aligned when the merge transaction started, 2) rebalances that would misalign the replica sets will conflict with the merge transaction, causing one of the transactions to abort, 3) the RHS is frozen and cannot process any new commands, and 4) every replica of the RHS is caught up on all commands. The process of freezing the RHS and waiting for every replica to catch up is covered more thoroughly in the transfer of power subsection.
Finally, the merge transaction commits, attaching a special merge commit trigger to the end transaction request. This trigger has four responsibilities:
- It ensures the end transaction request knows which intents it can resolve locally. Intents that live on the RHS range would naively appear to belong to a different range than the one containing the transaction record (i.e., the LHS), but if the merge is committing then the LHS is subsuming the RHS and thus the intents can be resolved locally.

  In fact, it's critical that these intents are considered local, because local intents are resolved synchronously while remote intents are resolved asynchronously. We need to maintain the invariant that, when a store boots up and discovers an intent on its local copy of a range descriptor, it can simply ignore the intent. Because we enforce that these intents are resolved synchronously with the commit of the merge transaction, we are guaranteed that, if we see an intent on a local range descriptor, this replica has not yet applied the EndTransaction request for the merge transaction, and it is therefore safe to load the replica. If the intent were instead resolved asynchronously, we could observe the state where the EndTransaction request for the merge had applied but the intent resolution had not, in which case we would attempt to load both the post-merge subsuming replica and the subsumed replica, which would overlap and crash the node.

- It adjusts the LHS's MVCCStats to incorporate the subsumed range's MVCCStats.

- It copies necessary range-ID local data from the RHS to the LHS, rekeying each key to use the LHS's range ID. At the moment, the only necessary data is the transaction abort span.

- It attaches a merge payload to the replicated proposal result. When each replica of the LHS applies the command, it will notice the merge payload and adjust the store's in-memory state to match the changes to the on-disk state that were committed by the transaction. This entails atomically removing the RHS replica from the store and widening the LHS replica.

  This operation involves a delicate dance of acquiring store locks and locks for both replicas, in a certain order, at various points in the Raft command application flow. The details are too intricate to be worth describing here, especially considering that these tangled interactions between a store and its replicas are due for a refactor. The best thing to do, if you're interested in the details, is to trace through all references to storagepb.ReplicatedEvalResult.Merge.
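A heavily simplified standalone sketch of that in-memory adjustment is below. The Store and Replica types and the single mutex are stand-ins; as noted above, the real code performs this inside Raft command application with a far more careful locking dance:

package triggersketch

import "sync"

// Store and Replica are simplified stand-ins for the real types.
type RangeID int

type Replica struct {
	RangeID          RangeID
	StartKey, EndKey string
}

type Store struct {
	mu       sync.Mutex
	replicas map[RangeID]*Replica
}

// applyMergePayload atomically removes the subsumed RHS replica from the
// store's in-memory bookkeeping and widens the subsuming LHS replica,
// mirroring the on-disk changes the merge transaction already committed.
func (s *Store) applyMergePayload(lhsID, rhsID RangeID) {
	s.mu.Lock()
	defer s.mu.Unlock()
	lhs, rhs := s.replicas[lhsID], s.replicas[rhsID]
	if lhs == nil || rhs == nil {
		// Replica set alignment guarantees both replicas exist here; the
		// real code treats a missing replica as a fatal error.
		return
	}
	lhs.EndKey = rhs.EndKey   // widen the LHS over the RHS's keyspace
	delete(s.replicas, rhsID) // drop the RHS from the store's replica map
}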
The trickiest part of the merge transaction is making the LHS responsible for the keyspace that was previously owned by the RHS. The transfer of power must be performed atomically; otherwise, all manner of consistency violations can occur. This is hopefully obvious, but here's a quick example to drive the point home. Suppose P and Q simultaneously consider themselves responsible for the key q. Q could then allow a write to q at time 1 at the same time that P allowed a read of q at time 2. Consistency requires that either the read see the write, as the read is executing at a higher timestamp, or that the write is bumped to time 3. But because P and Q have separate span latch managers, no synchronization will occur, and the read might fail to see the write!
The transfer of power is complicated by the fact that there is no guarantee that the leases on the LHS and the RHS will be aligned, nor is there any straightforward way to provide such a guarantee. (Aligned leaseholders would allow the merge transaction to use a simple in-memory lock to make the transfer of power atomic.) The leaseholder of either range might fail at any moment, at which point the lease can be acquired, after it expires, by any other live member of the range.
Since aligning the leaseholders is infeasible, the merge transaction implements what is essentially a distributed lock.
The lock is initialized by sending a Subsume request to the RHS. This is a single-purpose request that exists solely for use in the merge transaction. It is unlikely to ever be useful in another situation, and (ab)uses several implementation details to provide the necessary synchronization.
When the Subsume request returns, the RHS has made three important promises:
- There are no commands in flight.
- Any future requests will block until the merge transaction completes. If the merge transaction commits, the requests will be bounced with a RangeNotFound error. If the merge transaction aborts, the requests will be processed as normal.
- If the RHS loses its lease, the new leaseholder will adhere to promises 1 and 2.
The Subsume request provides promise 1 by declaring that it reads and writes all addressable keys in the range. This is a bald-faced lie, as the Subsume request only reads one key and writes no keys, but it forces synchronization with all latches in the span latch manager, as no other commands can possibly execute in parallel with a command that claims to write all keys.
It provides promise 2 by flipping a bit on the replica that indicates that a subsumption is in progress. When the bit is active, the replica blocks processing of all requests.
Importantly, the bit needs to be cleared when the merge transaction completes, so that the requests are not blocked forever. This is the responsibility of the merge watcher goroutine. Determining whether a transaction has committed or not is conceptually simple, but the details are brutally complicated. See the transaction record GC section for details.
Note that the Subsume request only runs on the leaseholder, and therefore the merge bit is only set on the leaseholder and the watcher goroutine only runs on the leaseholder. This is perfectly safe, as none of the follower replicas can process requests.
Promise 3 is actually not provided by the Subsume request at all, but by a hook in the lease acquisition code path. Whenever a replica acquires a lease, it checks to see whether its local descriptor has a deletion intent. If it does, it can infer that a subsumption is in progress, as nothing else leaves a deletion intent on a range descriptor. In that case, the replica, instead of serving traffic, flips the merge bit and launches its own merge watcher goroutine, just as the Subsume command would have. This means there can actually be multiple replicas of the RHS with the merge bit set and a merge watcher goroutine running—assuming the old leaseholder did not crash but lost its lease for other reasons—but this does not cause any problems.
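As a rough standalone sketch, the merge bit can be modeled as a channel that pending requests wait on and that the watcher goroutine closes once the merge transaction's fate is known. The structure and names here are illustrative, not the real replica code:

package subsumesketch

import (
	"context"
	"errors"
	"sync"
)

// mergeGate models the per-replica "merge bit": while a merge is pending,
// incoming requests block until the merge transaction completes.
type mergeGate struct {
	mu      sync.Mutex
	pending chan struct{} // non-nil while a subsumption is in progress
}

// beginMerge is called by Subsume (or by the lease acquisition hook when it
// sees a deletion intent on the local descriptor). It returns a function the
// watcher goroutine calls once the fate of the merge transaction is known.
func (g *mergeGate) beginMerge() (finish func()) {
	g.mu.Lock()
	defer g.mu.Unlock()
	ch := make(chan struct{})
	g.pending = ch
	return func() {
		g.mu.Lock()
		defer g.mu.Unlock()
		if g.pending == ch {
			g.pending = nil
		}
		close(ch)
	}
}

// waitForMerge blocks a request while a subsumption is in progress.
func (g *mergeGate) waitForMerge(ctx context.Context) error {
	g.mu.Lock()
	ch := g.pending
	g.mu.Unlock()
	if ch == nil {
		return nil // no merge pending; proceed as normal
	}
	select {
	case <-ch:
		// The merge completed. In the real code, a committed merge bounces
		// the request with a RangeNotFound-style error so DistSender retries
		// on the subsuming range; an aborted merge lets it proceed here.
		return errors.New("merge complete: retry request")
	case <-ctx.Done():
		return ctx.Err()
	}
}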
A replica of the LHS might be advanced over the command that commits a merge by a Raft snapshot. That means that all the complicated bookkeeping that normally takes place when a replica processes a command with a non-nil ReplicatedEvalResult.Merge is entirely skipped! Most problematically, the snapshot will need to widen the receiving replica, but there will be a replica of the RHS in the way—remember, this is guaranteed by replica set alignment. In fact, the snapshot could be advancing over several merges, or a combination of several merges and splits, in which case there will be several RHSes to subsume.
This turns out to be relatively straightforward to handle. If an initialized replica receives a snapshot that widens it, it can infer that a merge occurred, and it simply subsumes all replicas that are overlapped by the snapshot in one shot. This requires the same delicate synchronization dance, mentioned at the end of the merge transaction section, to update bookkeeping information. After all, applying a widening snapshot is simply the bulk version of applying a merge command directly. The details are too complicated to go into here, but you can begin your own exploration by starting with the call to Replica.maybeAcquireSnapshotMergeLock and tracing how the returned subsumedRepls value is used.
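The core of that computation is just an interval-overlap check against the snapshot's new bounds, sketched standalone below with simplified stand-in types:

package snapsketch

// replica is a minimal stand-in; this sketches only the overlap computation
// that decides which local replicas a widening snapshot subsumes.
type replica struct {
	rangeID  int
	startKey string
	endKey   string
}

// subsumedReplicas returns the initialized replicas (other than the snapshot
// recipient) whose keyspans fall under the snapshot's new, wider bounds. Each
// corresponds to the RHS of some merge the snapshot is skipping over and must
// be destroyed as the snapshot is applied.
func subsumedReplicas(all []replica, recipientID int, snapStart, snapEnd string) []replica {
	var subsumed []replica
	for _, r := range all {
		if r.rangeID == recipientID {
			continue
		}
		// Half-open interval overlap: [startKey, endKey) vs. [snapStart, snapEnd).
		if r.startKey < snapEnd && snapStart < r.endKey {
			subsumed = append(subsumed, r)
		}
	}
	return subsumed
}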
The merge queue, like most of the other queues in the system, runs on every store and periodically scans all replicas for which that store holds the lease. For each replica, the merge queue evaluates whether it should be merged with its right neighbor. Looking rightward is a natural fit for the reasons described in the key encoding oddities section of the appendix.
In some ways, the merge queue has an easy job. For any given range P and its right neighbor Q, the merge queue synthesizes the hypothetical merged range PQ and asks whether the split queue would immediately split that merged range. If the split queue would immediately split PQ, then obviously P and Q should not be merged; otherwise, the ranges should be merged! This means that any improvement to our split heuristics also improves our merge heuristics with essentially no extra work. For example, load-based splitting hardly required any changes to the merge queue.
Note that, to avoid thrashing, ranges at or above the minimum size threshold (8MB) are never considered for a merge. The minimum size threshold is configurable on a per-zone basis.
Unfortunately, constructing the hypothetical merged range PQ requires information about Q that only Q's leaseholder maintains, like the amount of load that Q is currently experiencing. The merge queue must send a RangeStats RPC to collect this information from Q's leaseholder, because there is no guarantee that the current store is Q's leaseholder—or that the current store even has a replica of Q at all.
To prevent unnecessary RPC chatter, the merge queue uses several heuristics to avoid sending RangeStats requests when it seems like the merge is unlikely to be permitted. For example, if it determines that P and Q store different tables, the split between them is mandatory and it won't bother sending the RangeStats requests. Similarly, if P is above the minimum size threshold, it doesn't bother asking about Q.
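A condensed standalone sketch of the merge queue's decision for a range P and its right neighbor Q follows. The stats fields, zone thresholds, and the wouldSplit helper are simplified stand-ins for the real split and merge heuristics:

package queuesketch

// rangeStats is a simplified stand-in for the size and load statistics the
// merge queue considers. For Q, these come back from a RangeStats RPC to Q's
// leaseholder.
type rangeStats struct {
	bytes int64
	qps   float64
}

// zoneConfig carries the per-zone thresholds consulted by the heuristics.
type zoneConfig struct {
	rangeMinBytes int64
	rangeMaxBytes int64
}

// wouldSplit stands in for the split queue's heuristic: would a range with
// these stats be split immediately? (Load-based splitting is elided.)
func wouldSplit(s rangeStats, zone zoneConfig) bool {
	return s.bytes >= zone.rangeMaxBytes
}

// shouldMerge mirrors the merge queue's core question: synthesize the
// hypothetical merged range PQ and ask whether the split queue would undo
// the merge right away.
func shouldMerge(p, q rangeStats, zone zoneConfig) bool {
	// Ranges already at or above the minimum size threshold are left alone
	// to avoid thrashing.
	if p.bytes >= zone.rangeMinBytes || q.bytes >= zone.rangeMinBytes {
		return false
	}
	merged := rangeStats{bytes: p.bytes + q.bytes, qps: p.qps + q.qps}
	return !wouldSplit(merged, zone)
}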
There was one knotty race that was not immediately eliminated by transaction serializability. Suppose we have our standard aligned replica set situation:
Store 1 Store 2 Store 3 Store 4
+-----+ +-----+ +-----+ +-----+
| P Q | | P Q | |     | | P Q |
+-----+ +-----+ +-----+ +-----+
In an unfortunate twist of fate, a rebalance of P from store 2 to store 3 begins at the same time as a merge of P and Q begins. Let's quickly cover the valid outcomes of this race.
- The rebalance commits before the merge. The merge must abort, as the replica sets of P and Q are no longer aligned.

- The merge commits before the rebalance starts. The rebalance should voluntarily abort, as the decision to rebalance P needs to be updated in light of the merge. It is not, however, a correctness problem if the rebalance commits; it simply results in rebalancing a larger range than may have been intended.

- The merge commits before the rebalance ends, but after the rebalance has sent a preemptive snapshot to store 3. The rebalance must abort, as otherwise the preemptive snapshot it sent to store 3 is a ticking time bomb.

  To see why, suppose the rebalance commits. Since the preemptive snapshot predates the commit of the merge transaction, the new replica on store 3 will need to be streamed the Raft command that commits the merge transaction. But applying this merge command is disastrous, as store 3 does not have a replica of Q to merge! This is a very subtle way in which replica set alignment can be subverted.
Guaranteeing the correct outcome in case 1 is easy. The merge transaction simply checks for replica set alignment by transactionally reading the range descriptor for P and the range descriptor for Q and verifying that they list the same replicas. Serializability guarantees the rest.
Case 2 is similarly easy to handle. The rebalance transaction simply verifies that the range descriptor used to make the rebalance decision matches the range descriptor that it reads transactionally.
Case 3, however, has an extremely subtle pitfall. It seems like the solution for case 2 should apply: simply abort the transaction if the range descriptor changes between when the preemptive snapshot is sent and when the rebalance transaction starts. But, as it turns out, this is not quite foolproof. What if, between when the preemptive snapshot is sent and when the rebalance transaction starts, P and Q merge together and then split at exactly the same key? The range descriptor for P will look entirely unchanged to the rebalance transaction!
The solution was to add a generation counter to the range descriptor:
message RangeDescriptor {
// ...
// generation is incremented on every split and every merge, i.e., whenever
// the end_key of this range changes. It is initialized to zero when the range
// is first created.
optional int64 generation = 6;
}
It is no longer possible for a range descriptor to be unchanged by a sequence of splits and merges, as every split and merge will bump the generation counter. Rebalances can thus detect if a merge commits between when the preemptive snapshot is sent and when the transaction begins, and abort accordingly.
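A small standalone sketch of how the generation counter closes the case 3 race: the rebalance remembers the descriptor that justified the preemptive snapshot and re-checks it, generation included, inside the rebalance transaction. The types and names are illustrative:

package gensketch

// rangeDescriptor is a simplified stand-in; generation is bumped on every
// split and merge, as in the proto above.
type rangeDescriptor struct {
	startKey, endKey string
	generation       int64
}

// descriptorUnchanged reports whether the descriptor read inside the
// rebalance transaction still matches the one used to make the rebalance
// decision (and to send the preemptive snapshot). Comparing bounds alone is
// not enough: a merge followed by a split at the same key restores the
// bounds but not the generation.
func descriptorUnchanged(decision, current rangeDescriptor) bool {
	return decision.startKey == current.startKey &&
		decision.endKey == current.endKey &&
		decision.generation == current.generation
}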
An early implementation allowed merges between ranges with misaligned replica sets. The intent was to simplify matters by avoiding replica rebalancing.
Consider again our example misaligned replica set:
Store 1 Store 2 Store 3 Store 4
+-----+ +-----+ +-----+ +-----+
| P Q | |  P  | |  Q  | | P Q |
+-----+ +-----+ +-----+ +-----+
P: (s1, s2, s4)
Q: (s1, s3, s4)
Note that there are two perspectives shown here. The store boxes represent the replicas that are actually present on that store, from the perspective of the store itself. The descriptor tuples at the bottom represent the stores that are considered to be members of the range, from the perspective of the most recently committed range descriptor.
Now, to merge P and Q in this situation without aligning their replica sets, store 2 needed to be provided with a copy of Q's data. To accomplish this, a copy of Q's data was stashed in the merge trigger, and P would write this data into its store when applying the merge trigger.
There was, sadly, a large and unforeseen problem with lagging replicas. Suppose store 2 loses network connectivity a moment before ranges P and Q are merged. Note that store 2 is not required for P and Q to merge, because only a quorum is required on the LHS to commit a merge. Now the situation looks like this:
Store 1 Store 2 Store 3 Store 4
+-----+ xxxxxxx +-----+ +-----+
| PQ  | |  P  | |     | | PQ  |
+-----+ xxxxxxx +-----+ +-----+
PQ: (s1, s2, s4)
There is nothing stopping the newly merged PQ range from immediately splitting into P and Q'. Note that P is the same range as the original P (i.e., it has the same range ID), so store 2's stale replica is still considered a member of P, while Q' is a new range, with a new range ID, that is unrelated to Q:
Store 1 Store 2 Store 3 Store 4
+-----+ xxxxxxx +-----+ +-----+
| P Q'| |  P  | |     | | P Q'|
+-----+ xxxxxxx +-----+ +-----+
P: (s1, s2, s4)
Q': (s1, s2, s4)
When store 2 comes back online, it will start catching up on missed messages. But notice how the meta ranges consider store 2 to be a member of Q', because it was a member of P before the split. The leaseholder for Q' will notice that store 2's replica is out of date and send over a snapshot so that store 2 can initialize its replica... and all that might happen before store 2 manages to apply the merge command for PQ. If so, applying the merge command for PQ will explode, because the keyspace of the merged range PQ intersects with the keyspace of Q'!
By requiring aligned replica sets, we sidestep this problem. The RHS is, in effect, a lock on the post-merge keyspace. Suppose we find ourselves in the analogous situation with replica sets aligned:
Store 1 Store 2 Store 3 Store 4
+-----+ xxxxxxx +-----+ +-----+
| P Q'| | P Q | |     | | P Q'|
+-----+ xxxxxxx +-----+ +-----+
P: (s1, s2, s4)
Q': (s1, s2, s4)
Here, PQ split into P and Q' immediately after merging, but notice how store 2 has a replica of both P and Q because we required replica set alignment during the merge. That replica of Q prevents store 2 from initializing a replica of Q' until either store 2's replica of P applies the merge command (to PQ) and the split command (to P and Q'), or store 2's replica of P is rebalanced away.
Per the discussion in the last section, we use the replica of the RHS as a lock on the keyspace extension. This means that we need to be careful not to GC this replica too early.
It's easiest to see why this is a problem if we consider the case where one replica is extremely slow in applying a merge:
Store 1 Store 2 Store 3 Store 4
+-----+ +-----+ +-----+ +-----+
| PQ  | | PQ  | |     | | P Q |
+-----+ +-----+ +-----+ +-----+
PQ: (s1, s2, s4)
Here, P and Q have just merged. Store 4 hasn't yet processed the merge while stores 1 and 2 have.
The replica GC queue is continually scanning for replicas that are no longer a member of their range. What if the replica GC queue on store 4 scans its replica of Q at this very moment? It would notice that the Q range has been merged away and, conceivably, conclude that Q could be garbage collected. This would be disastrous, as when P finally applied the merge trigger it would no longer have a replica of Q to subsume!
One potential solution would be for the replica GC queue to refuse to GC replicas for ranges that have been merged away. But that could result in replicas getting permanently stuck. Suppose that, before store 4 applies the merge transaction, the PQ range is rebalanced from store 4 to store 3:
Store 1 Store 2 Store 3 Store 4
+-----+ +-----+ +-----+ +-----+
| PQ  | | PQ  | | PQ  | | P Q |
+-----+ +-----+ +-----+ +-----+
PQ: (s1, s2, s3)
Store 4's replica of P will likely never hear about the merge, as it is no longer a member of the range and therefore not receiving any additional Raft messages from the leader, so it will never subsume Q. The replica GC queue must be capable of garbage collecting Q in this case. Otherwise Q will be stuck on store 4 forever, permanently preventing the store from ever acquiring a new replica that overlaps with that keyspace.
Solving this problem turns out to be quite tricky. What the replica GC queue wants to know when it discovers that Q's range has been subsumed is whether the local replica of Q might possibly still be subsumed by its local left neighbor P. It can't just ask the local P whether it's about to apply a merge, since P might be lagging behind, as it is here, and have no idea that a merge is about to occur.
So the problem reduces to proving that P cannot apply a merge trigger that will subsume Q. The chosen approach is to fetch the current range descriptor for P from the meta index. If that descriptor exactly matches the local descriptor, thanks to range descriptor generations, we are assured that there are no merge triggers that P has yet to apply, and Q can safely be GC'd.
Note that it is possible to form long chains of replicas that can only be GC'd from left to right; the GC queue is not aware of these dependencies and therefore processes such chains extremely inefficiently (i.e., by processing replicas in an arbitrary order instead of the necessary order). These chains turn out to be extremely rare in practice.
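A standalone sketch of the check described above, with the meta lookup abstracted behind a function argument and simplified stand-in types:

package gcsketch

// rangeDescriptor is a simplified stand-in carrying only what the check needs.
type rangeDescriptor struct {
	rangeID          int64
	startKey, endKey string
	generation       int64
}

// canGCSubsumedReplica reports whether a local replica of a subsumed range Q
// can safely be garbage collected. localLeftNeighbor is this store's replica
// of the range immediately to Q's left (P in the example above); lookupMeta
// fetches the current committed descriptor covering a key from the meta index.
func canGCSubsumedReplica(localLeftNeighbor rangeDescriptor,
	lookupMeta func(key string) rangeDescriptor) bool {
	// Range start keys never change, so this finds P's current descriptor.
	current := lookupMeta(localLeftNeighbor.startKey)
	// If the local copy exactly matches the meta copy (same range, same
	// bounds, same generation), P has applied every merge that could subsume
	// Q, so Q's data will never be needed by a pending merge trigger here.
	return current.rangeID == localLeftNeighbor.rangeID &&
		current.endKey == localLeftNeighbor.endKey &&
		current.generation == localLeftNeighbor.generation
}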
There is one additional subtlety here. Suppose we have two adjacent ranges, O and Q. O has just split into O and P, but store 3 is lagging and has not yet processed the split.
STATE 1
Store 1 Store 2 Store 3 Store 4
+-------+ +-------+ +-------+ +-------+
| O P Q | | O P Q | | O   Q | |       |
+-------+ +-------+ +-------+ +-------+
O: (s1, s2, s3)
P: (s1, s2, s3)
Q: (s1, s2, s3)
At this point, suppose the leader for the new range P decides that store 3 will need a snapshot to catch up, and starts sending the snapshot over the network. This will be important later. At the same time, P and Q merge while store 3 is still lagging.
It may seem strange that this merge is permitted, but notice how the replica sets are aligned according to the descriptors, even though store 3 does not physically have a replica of P yet. Here's the new state of the world:
STATE 2
Store 1 Store 2 Store 3 Store 4
+-------+ +-------+ +-------+ +-------+
| O PQ  | | O PQ  | | O   Q | |       |
+-------+ +-------+ +-------+ +-------+
O: (s1, s2, s3)
PQ: (s1, s2, s3)
Finally, O is rebalanced from store 3 to store 4 and garbage collected on store 4:
STATE 3
Store 1 Store 2 Store 3 Store 4
+-------+ +-------+ +-------+ +-------+
| O PQ  | | O PQ  | |   Q   | |   O   |
+-------+ +-------+ +-------+ +-------+
O: (s1, s2, s4)
PQ: (s1, s2, s3)
The replica GC queue might reasonably think that store 3's replica of Q is out of date, as Q has no left neighbor that could subsume it. But, at any moment in time, store 3 could finish receiving the snapshot for P that was started between state 1 and state 2. Crucially, this snapshot predates the merge, so it will need to apply the merge trigger... and the replica for Q had better be present on the store!
This hazard is avoided by requiring that all replicas of the LHS are initialized before a merge begins. This prevents a transition from state 1 to state 2, as the merge of P and Q cannot occur until store 3 initializes its replica of P. The AdminMerge command will wait a few seconds in the hope that store 3 catches up quickly; otherwise, it will refuse to launch the merge transaction. It is therefore impossible to end up in a dangerous state, like state 3, and it is thus safe for the replica GC queue to GC Q if its left neighbor is generationally up to date.
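A standalone sketch of that pre-flight wait is below. The polling interface and timings are illustrative; the real implementation asks the relevant stores whether their replicas are initialized and gives up after a short timeout:

package initsketch

import (
	"context"
	"errors"
	"time"
)

// waitForInitialized polls until every store in the replica set reports an
// initialized replica of the range, or gives up after the timeout. isInit is
// a stand-in for the per-store check the real implementation performs.
func waitForInitialized(ctx context.Context, stores []int,
	isInit func(storeID int) bool, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		allInit := true
		for _, s := range stores {
			if !isInit(s) {
				allInit = false
				break
			}
		}
		if allInit {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("replicas not initialized; refusing to start merge")
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(50 * time.Millisecond): // illustrative poll interval
		}
	}
}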
The merge watcher goroutine needs to wait until the merge transaction completes, and determine whether or not the transaction committed or aborted. This turns out to be brutally complicated, thanks to the aggressive garbage collection of transaction records.
The watcher goroutine begins by sending a PushTxn request to the merge transaction. It can easily discover the necessary arguments for the PushTxn request, that is, the ID and key of the current merge transaction record, because they're recorded in the intent that the merge transaction has left on the RHS's local copy of the descriptor.
If the PushTxn request reports that the merge transaction committed, we're guaranteed that the merge transaction did, in fact, complete. That means that we can mark the RHS replica as destroyed, bounce all requests back to DistSender (where they'll get retried on the subsuming range), and clean up the watcher goroutine. The RHS replica will be cleaned up either when the LHS replica subsumes it or when the replica GC queue notices that it has been abandoned.
If the PushTxn request instead reports that the merge transaction aborted, we're not guaranteed that the merge transaction actually aborted. The merge transaction may have committed so quickly that its transaction record was garbage collected before our PushTxn request arrived. The PushTxn incorrectly interprets this state to mean that the transaction was aborted, when, in fact, it was committed and GCed. To be fair, we're somewhat abusing the PushTxn request here. Outside of merges, a PushTxn request is only sent when a pending intent is discovered, and transaction records can't be GCed until all their intents have been resolved.
So we need some way to determine whether a merge transaction was actually aborted. What we do is look for the effects of the merge transaction in meta2. If the merge aborted, we'll see our own range descriptor, with our range ID, in meta2. By contrast, if the merge committed, we'll see a range descriptor for a different range in meta2.
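A standalone sketch of that disambiguation, with the push result and the meta2 read abstracted behind stand-in arguments:

package watchersketch

// rangeDesc is a simplified stand-in carrying only what the check needs.
type rangeDesc struct {
	rangeID  int64
	startKey string
}

// mergeCommitted decides whether the merge transaction committed. A COMMITTED
// result from PushTxn is trustworthy; an ABORTED result is ambiguous, because
// the record may have been GC'd after a commit, so we look for the merge's
// effects in meta2 instead. lookupMeta2 is a stand-in for reading the current
// committed descriptor covering a key from the meta index.
func mergeCommitted(pushReportedCommit bool, rhsDesc rangeDesc,
	lookupMeta2 func(key string) rangeDesc) bool {
	if pushReportedCommit {
		return true
	}
	cur := lookupMeta2(rhsDesc.startKey)
	// If the merge aborted, meta2 still contains the RHS's own descriptor;
	// if it committed, the RHS's keyspace now belongs to a different range.
	return cur.rangeID != rhsDesc.rangeID
}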
This complexity is extremely unfortunate, and turns what should be a simple goroutine, spawned on the RHS leaseholder for every merge transaction,
go func() {
    <-txn.Done() // wait for the merge transaction to complete
    if txn.Committed() {
        repl.MarkDestroyed("replica subsumed")
    }
    repl.UnblockRequests()
}()
into 150 lines of hard-to-follow code.
The largest conceptual incongruity with the current merge implementation is the fact that it requires unanimous consent from all replicas, instead of a majority quorum, like everything else in Raft. Further confusing matters, only the RHS needs unanimous consent; a merge can proceed with only majority consent from the LHS. In fact, it's even a bit more subtle: while only a majority of the LHS replicas need to vote on the merge command, all LHS and RHS replicas need to confirm that they are initialized for the merge to start.
There is no theoretical reason that merges need unanimous consent, but the complexity of the implementation quickly skyrockets without it. For example, suppose you adjusted the transfer of power so that only a majority of replicas on the RHS need to be fully up to date before the merge commits. Now, when applying the merge trigger, the LHS needs to check to see if its copy of the RHS is up to date; if it's not, the LHS needs to throw away its entire Raft state and demand a snapshot from the leader because its copy of the RHS may never catch up (its Raft group may have already been destroyed by this point). This is both unsightly—our code is worse off every time we reach into Raft—and less efficient than the existing implementation, as it requires sending a potentially multi-megabyte snapshot if one replica of the RHS is just a little bit behind in applying the latest commands.
It's possible that these problems could be mitigated while retaining the ability to merge with a minority of replicas offline, but an obvious solution did not present itself. On the bright side, having too many ranges is unlikely to cause acute performance problems; that is, a situation where a merge is critical to the health of a cluster is difficult to imagine. Unlike large ranges, which can appear suddenly and require an immediate split or log truncation, merges are only required when there are on the order of tens of thousands of excessively small ranges, which takes a long time to build up.
This section is a recap of the various mechanisms, which are described in detail above, that work together to ensure that merges do not violate consistency.
The first broad safety mechanism is replica set alignment, which is required so that every store participating in the merge has a copy of both halves of the data in the merged range. Replica sets are first optimistically aligned by the merge queue. The replicas might drift apart, e.g., because the ranges in question were also targeted for a rebalance by the replicate queue, so the merge transaction verifies that the replica sets are still aligned from within the transaction. If a concurrent split or rebalance were to occur on the implicated ranges, transactional isolation kicks in and aborts one of the transactions, so we know that the replica sets are still aligned at the moment that the merge commits.
Crucially, we need to maintain alignment until the merge applies on all replicas that were present at the time of the merge. This is enforced by refusing to GC a replica of the RHS of a merge unless it can be proven that the store does not have a replica of the LHS that predates the merge, nor will it acquire a replica of the LHS that predates the merge. Proving that it does not currently have a replica of the LHS that predates the merge is fairly straightforward: we simply prove that the local left neighbor's generation is the newest generation, as indicated by the LHS's meta descriptor. Proving that the store will never acquire a replica of the LHS that predates the merge is harder— there could be a snapshot in flight that the LHS is entirely unaware of. So instead we require that replicas of the LHS in a merge are initialized on every store before the merge can begin.
The second broad safety mechanism is the range freeze. This ensures that the subsuming range and the subsumed range do not serve traffic at the same time, which would lead to clear consistency violations. The mechanism works by tying the freeze to the lifetime of the merge transaction; the merge will not commit until all replicas of the RHS are verified to be frozen, and the replicas of the RHS will not unfreeze unless the merge transaction is verified to be aborted. Lease transfers are freeze-aware, so the freeze will persist even if the lease moves around on the RHS during the merge or if the leaseholder restarts. The implementation of the freeze (ab)uses the span latch manager, to flush out in-flight commands on the RHS; an intent on the local range descriptor, to ensure the freeze persists if the lease is transferred; and an RPC that repeatedly polls the RHS to wait until it is fully caught up.
The appendix contains digressions that are not directly pertinent to range merges, but are not covered in documentation elsewhere.
Lexicographic ordering of keys of unbounded length has the interesting property that it is always possible to construct the key that immediately succeeds a given key, but it is not always possible to construct the key that immediately precedes a given key.
In the following diagrams \xNN represents a byte whose value in hexadecimal is NN. The null byte is thus \x00 and the maximal byte is thus \xff.
Now consider a sequence of keys that has no gaps:
a
a\x00
a\x00\x00
No gaps means that there are no possible keys that can sort between any of the members of the sequence. For example, there is, simply, no key that sorts between a and a\x00.
Because we can construct such a sequence, we must have next and previous operations over the sequence, which, given a key, construct the immediately following key and the immediately preceding key, respectively. We can see from the diagram that the next operation appends a null byte (\x00), while the previous operation strips off that null byte.
But what if we want to perform the previous operation on a key that does not end in a null byte? For example, what is the key that immediately precedes b? It's not a, because a\x00 sorts between a and b. Similarly, it's not a\xff, because a\xff\xff sorts between a\xff and b. This process continues inductively until we conclude that the key that immediately precedes b is a\xff\xff\xff..., where there are an infinite number of trailing \xff bytes.
It is not possible to represent this key in CockroachDB without infinite space. You could imagine designing the key encoding with an additional bit that means, "pretend this key has an infinite number of trailing maximal bytes," but CockroachDB does not have such a bit.
The upshot is that it is trivial to advance in the keyspace using purely lexical operations, but it is impossible to reverse in the keyspace with purely lexical operations.
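A tiny standalone sketch of the lexical next operation described above (illustrative only, not CockroachDB's actual key utilities):

package main

import "fmt"

// next returns the key that immediately follows k in lexicographic order over
// unbounded-length keys: append a null byte. There is no analogous prev for
// keys that do not end in a null byte, since the true predecessor of "b" would
// be "a" followed by infinitely many \xff bytes.
func next(k []byte) []byte {
	out := make([]byte, len(k)+1)
	copy(out, k)
	// out[len(k)] is already 0x00; make() zero-initializes the extra byte.
	return out
}

func main() {
	fmt.Printf("%q\n", next([]byte("a"))) // "a\x00": nothing sorts between "a" and "a\x00"
}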
This problem pervades the system. Given a range that spans from StartKey, inclusive, to EndKey, exclusive, it is trivial to address a request to the following range, but not to the preceding range. To route a request to a range, we must construct a key that lives inside that range. Constructing such a key for the following range is trivial, as the end key of a range is, by definition, contained in the following range. But constructing such a key for the preceding range would require constructing the key that immediately precedes StartKey, which is not possible with CockroachDB's key encoding.