Interpreting the Results
This section presents results obtained by running a geo-replicated benchmark on AWS, as explained in the previous section, and provides insights on how to interpret the graphs. In all graphs generated by `fab plot`, each dot represents the average of several runs and the error bars represent one standard deviation. Remember that the benchmarking scripts run one client per node (on the same machine as the node) and that the total input rate is shared equally across all clients.
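For illustration, the per-client submission rate is simply the total input rate divided by the number of clients; a minimal sketch of this split (the function name is ours, not an identifier from the scripts):

```python
# Illustration only: with one client co-located on each node, every
# client submits an equal share of the target load.
def rate_per_client(total_rate: int, nodes: int) -> int:
    # e.g. 60_000 tx/s over 20 nodes -> 3_000 tx/s per client
    return total_rate // nodes
```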
We benchmark a testbed spanning 5 AWS data centers (N. Virginia, California, Stockholm, Tokyo, and Sydney), and each validator runs on an `m5d.8xlarge` instance; the content of our `settings.json` file looks as follows:
```json
{
    "key": {
        "name": "aws",
        "path": "/absolute/key/path"
    },
    "ports": {
        "consensus": 8000,
        "mempool": 7000,
        "front": 6000
    },
    "repo": {
        "name": "hotstuff",
        "url": "https://github.com/asonnino/hotstuff.git",
        "branch": "main"
    },
    "instances": {
        "type": "m5d.8xlarge",
        "regions": ["us-east-1", "eu-north-1", "ap-southeast-2", "us-west-1", "ap-northeast-1"]
    }
}
```
We set the following benchmark and node parameters (in the fabfile) for all benchmarks:
```python
bench_params = {
    'nodes': [...],
    'rate': [...],
    'tx_size': 512,
    'faults': 0,
    'duration': 300,
    'runs': 2,
}
node_params = {
    'consensus': {
        'timeout_delay': 30_000,
        'sync_retry_delay': 100_000,
        'max_payload_size': 1_000,
        'min_block_delay': 100
    },
    'mempool': {
        'queue_capacity': 100_000,
        'sync_retry_delay': 100_000,
        'max_payload_size': 500_000,
        'min_block_delay': 100
    }
}
```
Once you have enough results, you can aggregate and plot them. Running `fab plot` aggregates the results of multiple runs of `fab remote` and produces 3 graphs, called 'latency', 'tps', and 'robustness'. The two key metrics we want to measure are throughput (how much load the system can handle) and latency (how quickly a transaction completes). The folder `results` contains the raw data used to generate the plots below; feel free to use them as a baseline for your future systems.
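As a rough illustration of what this aggregation does (a sketch of the idea, not the actual plotting code), each configuration yields one dot whose value is the mean across runs, with one standard deviation as the error bar:

```python
import statistics

# Sketch: each plotted dot is the mean over several runs of `fab remote`,
# and the error bar is one standard deviation around that mean.
def aggregate(samples: list[float]) -> tuple[float, float]:
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, stdev

# e.g. end-to-end latencies (in seconds) from two runs
print(aggregate([0.95, 1.10]))  # -> (1.025, ~0.106)
```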
The first graph ('latency') plots latency versus throughput. It shows that latency stays low up to a fairly sharp threshold, after which it drastically increases. Determining this threshold is crucial to understanding the limits of the system. For instance, the graph shows that a 20-node deployment can easily process up to 60,000 tx/s without impacting latency.
At low contention, latency is constant and throughput varies simply by changing the load. This happens because there is a fixed minimum cost to commit a transaction, and at low contention the queuing delay is zero, hence 'whatever comes in, comes out directly'. At high contention, however, throughput is constant (and sometimes decreases) and latency varies simply by changing the load. This is because the system is overloaded, so adding more load makes the wait queues grow without bound. Even more counterintuitively, the latency will seem to vary based on the length of the experiment. This is an artifact of the ever-growing queues, and it is the reason why we should run our benchmarks for a long time (we configured our `bench_params` to run each benchmark for 5 minutes).
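This behavior matches a textbook queueing model. As a modeling assumption (the node does not compute anything like this), an M/M/1 queue with service rate `mu` and arrival rate `lam` has expected latency `1 / (mu - lam)`: nearly flat while `lam` is far below `mu`, then diverging as the system saturates:

```python
# M/M/1 intuition (a modeling assumption, not code from the repository):
# latency is roughly the fixed service cost at low load, and diverges as
# the arrival rate approaches the service rate -- the regime where the
# wait queues grow without bound.
def expected_latency(mu: float, lam: float) -> float:
    if lam >= mu:
        return float('inf')  # overloaded: queues grow indefinitely
    return 1.0 / (mu - lam)

for lam in (10_000, 40_000, 59_000, 59_990):
    print(f'{lam} tx/s -> {expected_latency(mu=60_000, lam=lam):.6f} s')
```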
The reason why the latency is slightly higher at low load (i.e., 10,000 tx/s) is that our choice of `node_params` works well for high loads but is not optimal for low loads (the block size is probably too large).
One final challenge is comparing apples to apples across different deployments of the system. The difficulty is again that latency and throughput are interdependent; as a result, a throughput-versus-number-of-nodes chart can be tricky to produce fairly. The way to do it is to define a maximum latency and measure the throughput at that point, instead of simply pushing every system to its peak throughput (where latency is meaningless).
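A minimal sketch of this methodology (our own helper, not part of the repository): given measured (rate, latency) points for one deployment, report the highest throughput whose latency stays under the chosen budget:

```python
# Hypothetical helper: pick the highest measured throughput whose latency
# stays under a fixed budget, instead of the (meaningless) peak throughput.
def throughput_under_slo(points: list[tuple[int, float]], max_latency: float) -> int:
    admissible = [rate for rate, latency in points if latency <= max_latency]
    return max(admissible, default=0)

# e.g. illustrative points for a 20-node run (rate in tx/s, latency in s)
points = [(20_000, 0.9), (40_000, 0.95), (60_000, 1.0), (80_000, 8.0)]
print(throughput_under_slo(points, max_latency=1.0))  # -> 60000
```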
The second graph ('tps') plots the maximum achievable throughput under a maximum latency for different numbers of nodes. This graph shows that, in a geo-replicated environment, we can process up to 60,000 tx/s with 20 nodes while keeping latency around 1 second.
The last graph ('robustness') plots the throughput versus the input rate (and provides no information on latency). This graph is somewhat redundant given the other two, but it clearly shows the threshold where the system saturates and throughput decreases. This threshold is crucial to determine how to configure a rate limiter that blocks excess transactions from entering the system. For instance, we can see that a 30-node deployment can process about 45,000 tx/s before saturating; we should thus ensure that no more than 45,000 tx/s enter the mempool, or the system may become unstable.
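As an illustration of such a rate limiter (a sketch of the idea, not a component of this repository), a token bucket sized from the saturation point would reject excess transactions before they reach the mempool:

```python
import time

# Sketch of a token-bucket rate limiter sized from the 'robustness' graph
# (e.g. 45_000 tx/s for a 30-node deployment). Illustration only.
class RateLimiter:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill tokens for the elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True  # admit the transaction into the mempool
        return False     # excess transaction: drop or push back

limiter = RateLimiter(rate=45_000, burst=1_000)
```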
We can draw a number of observations from our experiments.
Under high load, there is a probability that one node (usually the one in Australia, which is the most isolated from the rest of the network) falls behind and never manages to sync up with the others. When this happens, latency greatly increases and throughput decreases, since the slow node keeps being elected as leader and is slow to propose. The higher the input rate, the greater the probability that one node falls behind. This is why we sometimes see huge error bars in the graphs: performance is good during "lucky runs" when all nodes stay synchronized, but tumbles when they don't.
We also observe that there is a fairly sharp threshold beyond which the probability that one node falls behind increases (e.g., around 60,000 tx/s for 20 nodes). This sensitivity is also why performance with 4 nodes is much lower than with 20 nodes: the threshold is pushed back in larger networks because more nodes share the total load.
We found that catching up on missed consensus blocks is very fast and rarely needed; however, a lot of time (sometimes over 20 seconds) is spent by the mempool catching up on missing payloads. What desynchronizes the node in Australia is missing past payloads. These observations suggest that building a better mempool would likely improve performance.
About 2-3x of the performance is due to a careful choice of a number of parameters internal to the mempool and consensus (the `node_params`).
We observed that the system is particularly sensitive to `min_block_delay` and `max_payload_size` in both the consensus and mempool instantiations.
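A simple way to explore this sensitivity (a hypothetical sweep; the candidate values are illustrative, not tuned recommendations) is to grid-search these two knobs and benchmark each combination under the same latency budget:

```python
from itertools import product

# Hypothetical grid search over the two most sensitive parameters.
# Each combination would be plugged into node_params and benchmarked
# with a separate `fab remote` run.
base = {'min_block_delay': 100, 'max_payload_size': 1_000}
for delay, payload in product([50, 100, 200], [500, 1_000, 2_000]):
    candidate = dict(base, min_block_delay=delay, max_payload_size=payload)
    print(candidate)  # substitute: launch a benchmark with these values
```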