"Add description of Raft log, update write scenario, and emphasize network and disk i/o bottlenecks"
jseldess committed Aug 10, 2018
1 parent 12e0759 commit 2a85b3b
Showing 1 changed file with 11 additions and 7 deletions.
18 changes: 11 additions & 7 deletions v2.0/performance-tuning.md
@@ -52,10 +52,11 @@ Concept | Description
**Node** | An individual machine running CockroachDB. Many nodes join together to create your cluster.
**Range** | CockroachDB stores all user data (tables, indexes, etc.) and almost all system data in a giant sorted map of key-value pairs. This keyspace is divided into "ranges", contiguous chunks of the keyspace, so that every key can always be found in a single range.<br><br>From a SQL perspective, a table and its secondary indexes initially map to a single range, where each key-value pair in the range represents a single row in the table (also called the primary index because the table is sorted by the primary key) or a single row in a secondary index. As soon as a range reaches 64 MiB in size, it splits into two ranges. This process continues as the table and its indexes continue growing.
**Replica** | CockroachDB replicates each range (3 times by default) and stores each replica on a different node.
**Leaseholder** | For each range, one of the replicas holds the "range lease". This replica, referred to as the "leaseholder", is the one that receives and coordinates all read and write requests for the range.<br><br>Unlike writes, read requests access the leaseholder and send the results to the client without needing to coordinate with any of the other range replicas. This reduces the network round trips involved and is possible because the leaseholder is guaranteed to be up-to-date due to the fact that all write requests also go to the leaseholder.
**Raft Leader** | For each range, one of the replicas is the "leader" for write requests. Via the [Raft consensus protocol](architecture/replication-layer.html#raft), this replica ensures that a majority of replicas (the leader and enough followers) agree, based on their Raft logs, before committing the write. The Raft leader is almost always the same replica as the leaseholder.
**Raft Log** | Each replica of a range has a time-ordered log of writes to the range that the replicas have agreed on. This log exists on-disk with each replica and is the source of truth for consistent replication.
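The Raft log concept above can be sketched as a tiny model (illustrative Python with assumed names, not CockroachDB's actual internals): each replica keeps a time-ordered log of agreed-upon writes, and replaying that log in order reconstructs the replica's key-value state, which is why the log is the source of truth for consistent replication.

```python
# Illustrative model of a Raft log (assumed names; not CockroachDB's
# actual internals): a time-ordered log of writes kept by every replica.

class RaftLog:
    def __init__(self):
        self.entries = []               # ordered (index, command) pairs

    def append(self, command):
        index = len(self.entries) + 1   # indexes are 1-based and monotonic
        self.entries.append((index, command))
        return index

def apply_log(log):
    # Replaying the agreed-upon entries in order reconstructs the replica's
    # key-value state; identical logs yield identical state on every replica.
    state = {}
    for _, (key, value) in log.entries:
        state[key] = value
    return state
```

Because every replica applies the same entries in the same order, any replica whose log matches the leader's can serve a consistent view of the data.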

As mentioned above, when a query is executed, the cluster routes the request to the leaseholder for the range containing the relevant data. If the query touches multiple ranges, the request goes to multiple leaseholders. For a read request, only the leaseholder of the relevant range retrieves the data. For a write request, the Raft consensus protocol dictates that a majority of the replicas of the relevant range must agree before the write is committed.
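As a rough illustration of this routing, here is a hypothetical sketch (the leaseholder map and helper names are invented for the example, not CockroachDB code): the gateway works out which range each key falls in and fans the query out to the corresponding leaseholders.

```python
# Hypothetical routing sketch: each range is identified by its start key,
# and a query touching multiple ranges fans out to multiple leaseholders.
# The map below and all names are illustrative placeholders.

LEASEHOLDERS = {"a": 1, "m": 2, "t": 3}  # range start key -> leaseholder node id

def leaseholder_for(key):
    # A key belongs to the range with the greatest start key <= key,
    # so every key can always be found in exactly one range.
    owner = None
    for start in sorted(LEASEHOLDERS):
        if start <= key:
            owner = start
    return LEASEHOLDERS[owner]

def route(keys):
    # Group the keys a query touches by the node that must serve them.
    by_node = {}
    for k in keys:
        by_node.setdefault(leaseholder_for(k), []).append(k)
    return by_node
```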

Let's consider how these mechanics play out in some hypothetical queries.

@@ -81,6 +82,8 @@ If the query is received by the node that has the leaseholder for the relevant r

<img src="{{ 'images/v2.0/perf_tuning_concepts2.png' | relative_url }}" alt="Perf tuning concepts" style="max-width:100%" />

In addition to SQL tuning techniques

#### Write scenario

Now imagine a simple write scenario where a query is executed against node 3 to write to table 1:
@@ -91,8 +94,8 @@ In this case:

1. Node 3 (the gateway node) receives the request to write to table 1.
2. The leaseholder for table 1 is on node 1, so the request is routed there.
3. The leaseholder is the same replica as the Raft leader (as is typical), so it simultaneously appends the write to its Raft log and notifies its follower replicas on nodes 2 and 3. The followers also append the write to their Raft logs.
4. As soon as one follower acknowledges the Raft log change (and thus a majority of replicas agree based on identical Raft logs), the write is committed to the key-values on the agreeing replicas. In this diagram, the follower on node 2 acknowledged the write, but it could just as well have been the follower on node 3.
5. Node 1 returns acknowledgement of the commit to node 3.
6. Node 3 responds to the client.
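Steps 3 and 4 above can be condensed into a small simulation (a sketch with assumed helper names, not CockroachDB code): the leader appends the write to its own Raft log, the followers append it to theirs, and the write commits as soon as a majority of replicas, counting the leader itself, have acknowledged the log change.

```python
# Minimal simulation of the commit rule in steps 3-4 (illustrative only):
# logs are plain lists of commands; `acked` marks which followers'
# acknowledgements have arrived (a slow follower's ack may still be in flight).

def replicate_write(command, leader_log, follower_logs, acked):
    leader_log.append(command)        # step 3: leader appends to its own log
    votes = 1                         # the leader counts toward the majority
    for log, ack in zip(follower_logs, acked):
        log.append(command)           # followers append to their Raft logs
        if ack:
            votes += 1
    total = 1 + len(follower_logs)
    return votes > total // 2         # step 4: committed once a majority agree
```

With three replicas, a single follower acknowledgement is enough: two of three replicas agree, and the remaining follower catches up from its log shortly afterward.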

@@ -102,9 +105,10 @@ Just as in the read scenario, if the write request is received by the node that

#### Network and I/O bottlenecks

With the above examples in mind, it's always important to consider network latency and disk I/O as potential performance bottlenecks. In summary:

- For reads, hops between the gateway node and the leaseholder add latency.
- For writes, hops between the gateway node and the leaseholder/Raft leader, and hops between the leaseholder/Raft leader and Raft followers, add latency. In addition, since Raft log entries are persisted to disk before a write is committed, disk I/O is important.
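These bullets can be turned into a back-of-the-envelope latency model (all function names and numbers here are made-up placeholders; it ignores request processing time and models only the hops and the Raft log disk sync):

```python
# Rough latency model for the bullets above (illustrative placeholders only).

def read_latency_ms(gateway_to_leaseholder_rtt):
    # A read pays the round trip between the gateway node and the leaseholder.
    return gateway_to_leaseholder_rtt

def write_latency_ms(gateway_to_leaseholder_rtt, follower_rtts, fsync_ms):
    # A write additionally waits for enough follower acknowledgements to
    # reach a majority, plus the disk sync that persists the Raft log entry.
    replicas = len(follower_rtts) + 1       # followers + the leaseholder
    needed_acks = replicas // 2             # follower acks needed for majority
    quorum_wait = sorted(follower_rtts)[needed_acks - 1]  # fastest majority
    return gateway_to_leaseholder_rtt + quorum_wait + fsync_ms
```

Note that the commit waits only for the fastest majority: with one nearby follower (0.5 ms) and one distant follower (10 ms), the distant replica's latency does not appear in the write path at all.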

## Single-region deployment

