Interpreting the Results
This section presents results obtained by running a geo-replicated benchmark on AWS, as explained in the previous section, and provides insights on how to interpret the graphs. In all graphs generated by `fab plot`, each dot represents the average of several runs and the error bars represent one standard deviation. Remember that the benchmarking scripts run one client per node (on the same machine as the node) and that the total input rate is shared equally across all clients.
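For illustration, the per-client submission rate is simply the total input rate divided by the number of clients; a minimal sketch of this split (the function name is ours, not an identifier from the scripts):

```python
# Illustration only: with one client co-located on each node, every
# client submits an equal share of the target load.
def rate_per_client(total_rate: int, nodes: int) -> int:
    # e.g. 60_000 tx/s over 20 nodes -> 3_000 tx/s per client
    return total_rate // nodes
```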
We benchmark a testbed spanning 5 AWS data centers (N. Virginia, California, Stockholm, Tokyo, and Sydney), and each validator runs on an `m5d.8xlarge` instance; the content of our `settings.json` file looks as follows:
```json
{
    "key": {
        "name": "aws",
        "path": "/absolute/key/path"
    },
    "ports": {
        "consensus": 8000,
        "mempool": 7000,
        "front": 6000
    },
    "repo": {
        "name": "hotstuff",
        "url": "https://github.com/asonnino/hotstuff.git",
        "branch": "main"
    },
    "instances": {
        "type": "m5d.8xlarge",
        "regions": ["us-east-1", "eu-north-1", "ap-southeast-2", "us-west-1", "ap-northeast-1"]
    }
}
```
We set the following benchmark and node parameters (in the fabfile) for all benchmarks:
```python
bench_params = {
    'nodes': [...],
    'rate': [...],
    'tx_size': 512,
    'faults': 0,
    'duration': 300,
    'runs': 2,
}
node_params = {
    'consensus': {
        'timeout_delay': 30_000,
        'sync_retry_delay': 100_000,
        'max_payload_size': 1_000,
        'min_block_delay': 100
    },
    'mempool': {
        'queue_capacity': 100_000,
        'sync_retry_delay': 100_000,
        'max_payload_size': 500_000,
        'min_block_delay': 100
    }
}
```
Once you have enough results, you can aggregate and plot them. Running `fab plot` aggregates the results of multiple runs of `fab remote` and produces 3 graphs, called 'latency', 'tps', and 'robustness'. The two key metrics we want to measure are throughput (how much load the system can handle) and latency (how quickly a transaction completes). The folder `results` contains the raw data used to generate the plots below; feel free to use them as a baseline for your future systems.
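As a rough illustration of what this aggregation does (a sketch of the idea, not the actual plotting code), each configuration yields one dot whose value is the mean across runs, with one standard deviation as the error bar:

```python
import statistics

# Sketch: each plotted dot is the mean over several runs of `fab remote`,
# and the error bar is one standard deviation around that mean.
def aggregate(samples: list[float]) -> tuple[float, float]:
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, stdev

# e.g. end-to-end latencies (in seconds) from two runs
print(aggregate([0.95, 1.10]))  # -> (1.025, ~0.106)
```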
The first graph ('latency') plots latency versus throughput. It shows that latency stays low up to a fairly sharp threshold, after which it drastically increases. Determining this threshold is crucial to understanding the limits of the system. For instance, the graph shows that a 20-node deployment can easily process up to 60,000 tx/s without impacting latency.
At low contention, latency is constant and throughput varies simply by changing the load. This happens because there is a fixed minimum cost to commit a transaction, and at low contention the queuing delay is zero, hence 'whatever comes in, comes out directly'. At high contention, however, throughput is constant (and sometimes decreases) and latency varies simply by changing the load. This is because the system is overloaded, so adding more load makes the wait queues grow without bound. Even more counterintuitively, the latency will seem to vary based on the length of the experiment. This is an artifact of the ever-growing queues, and it is the reason why we should run our benchmarks for a long time (we configured our `bench_params` to run each benchmark for 5 minutes).
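This behavior matches a textbook queueing model. As a modeling assumption (the node does not compute anything like this), an M/M/1 queue with service rate `mu` and arrival rate `lam` has expected latency `1 / (mu - lam)`: nearly flat while `lam` is far below `mu`, then diverging as the system saturates:

```python
# M/M/1 intuition (a modeling assumption, not code from the repository):
# latency is roughly the fixed service cost at low load, and diverges as
# the arrival rate approaches the service rate -- the regime where the
# wait queues grow without bound.
def expected_latency(mu: float, lam: float) -> float:
    if lam >= mu:
        return float('inf')  # overloaded: queues grow indefinitely
    return 1.0 / (mu - lam)

for lam in (10_000, 40_000, 59_000, 59_990):
    print(f'{lam} tx/s -> {expected_latency(mu=60_000, lam=lam):.6f} s')
```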
The reason why the latency is slightly higher at low load (i.e., 10,000 tx/s) is that our choice of `node_params` works well for high loads but is not optimal for low loads (the block size is probably too large).
One final challenge is comparing apples to apples across different deployments of the system. The difficulty is again that latency and throughput are interdependent; as a result, a throughput-versus-number-of-nodes chart can be tricky to produce fairly. The way to do it is to define a maximum latency and measure the throughput at that point, instead of simply pushing every system to its peak throughput (where latency is meaningless).
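A minimal sketch of this methodology (our own helper, not part of the repository): given measured (rate, latency) points for one deployment, report the highest throughput whose latency stays under the chosen budget:

```python
# Hypothetical helper: pick the highest measured throughput whose latency
# stays under a fixed budget, instead of the (meaningless) peak throughput.
def throughput_under_slo(points: list[tuple[int, float]], max_latency: float) -> int:
    admissible = [rate for rate, latency in points if latency <= max_latency]
    return max(admissible, default=0)

# e.g. illustrative points for a 20-node run (rate in tx/s, latency in s)
points = [(20_000, 0.9), (40_000, 0.95), (60_000, 1.0), (80_000, 8.0)]
print(throughput_under_slo(points, max_latency=1.0))  # -> 60000
```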
The second graph ('tps') plots the maximum achievable throughput under a maximum latency for different numbers of nodes. This graph shows that, in a geo-replicated environment, we can process up to 60,000 tx/s with 20 nodes while keeping latency around 1 second.
The last graph ('robustness') plots the throughput versus the input rate (and provides no information on latency). This graph is somewhat redundant given the other two, but it clearly shows the threshold where the system saturates and throughput decreases. This threshold is crucial to determine how to configure a rate limiter that blocks excess transactions from entering the system. For instance, we can see that a 30-node deployment can process about 45,000 tx/s before saturating; we should thus ensure that no more than 45,000 tx/s enter the mempool, or the system may become unstable.
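As an illustration of such a rate limiter (a sketch of the idea, not a component of this repository), a token bucket sized from the saturation point would reject excess transactions before they reach the mempool:

```python
import time

# Sketch of a token-bucket rate limiter sized from the 'robustness' graph
# (e.g. 45_000 tx/s for a 30-node deployment). Illustration only.
class RateLimiter:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill tokens for the elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True  # admit the transaction into the mempool
        return False     # excess transaction: drop or push back

limiter = RateLimiter(rate=45_000, burst=1_000)
```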
We can draw a number of observations from our experiments.
Under high load, there is a probability that one node (usually the one in Australia, which is the most isolated from the rest of the network) falls behind and never manages to sync up with the others. When this happens, latency greatly increases and throughput decreases, since the slow node keeps being elected as leader and is slow to propose. The higher the input rate, the greater the probability that one node falls behind. This is why we sometimes see huge error bars in the graphs: performance is good during "lucky runs" when all nodes stay synchronized, but tumbles when they don't.
We also observe that there is a fairly sharp threshold beyond which the probability that one node falls behind increases (e.g., around 60,000 tx/s for 20 nodes). This sensitivity is also why performance with 4 nodes is much lower than with 20 nodes: the threshold is pushed back in larger networks because more nodes share the total load.
We found that catching up on missed consensus blocks is very fast and rarely needed; however, a lot of time (sometimes over 20 seconds) is spent by the mempool catching up on missing payloads. What desynchronizes the node in Australia is missing past payloads. These observations suggest that building a better mempool would likely improve performance.
About 2-3x of the performance is due to a careful choice of a number of parameters internal to the mempool and consensus (the `node_params`).
We observed that the system is particularly sensitive to `min_block_delay` and `max_payload_size` in both the consensus and mempool instantiations.
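A simple way to explore this sensitivity (a hypothetical sweep; the candidate values are illustrative, not tuned recommendations) is to grid-search these two knobs and benchmark each combination under the same latency budget:

```python
from itertools import product

# Hypothetical grid search over the two most sensitive parameters.
# Each combination would be plugged into node_params and benchmarked
# with a separate `fab remote` run.
base = {'min_block_delay': 100, 'max_payload_size': 1_000}
for delay, payload in product([50, 100, 200], [500, 1_000, 2_000]):
    candidate = dict(base, min_block_delay=delay, max_payload_size=payload)
    print(candidate)  # substitute: launch a benchmark with these values
```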