This is the environment for the benchmark:
machine | OS | Compiler | Build | CPU | Memory | Disk |
---|---|---|---|---|---|---|
machineA | Windows 10 Pro 64-bit | MSVC 19.00.24215.1 for x86 | Release | i7, 8 (4×2) cores, 4.0GHz | 16GB | SSD |
machineB | macOS High Sierra 10.13.6 64-bit | Apple LLVM version 10.0.0 (clang-1000.10.44.4) | -O3 | i7, 8 (4×2) cores, 2.2GHz | 16GB | SSD |
linux | CentOS 7 64-bit | g++ (GCC) 8.3.1 20190311 | -O3 | Intel Xeon, 8 cores, 3.2GHz | 16GB | SSD |
Home router: TP-Link TL-WR880N, 450Mbps. Latency:
- machineA --> machineB : 1~100ms (unstable)
- machineB --> machineA : 1~5ms (stable)
The leader is deployed on machineA; all the followers are deployed on machineB. The startup flags are listed below.

Leader startup flags:

--do_heartbeat=false --iterating_wait_timeo_us=2000000 --port=10010 --leader_append_entries_rpc_timeo_ms=5000 --leader_commit_entries_rpc_timeo_ms=5000 --client_cq_num=2 --client_thread_num=2 --notify_cq_num=2 --notify_cq_threads=4 --call_cq_num=2 --call_cq_threads=2 --iterating_threads=2 --client_pool_size=400000

Follower startup flags:

--checking_heartbeat=false --iterating_wait_timeo_us=50000 --disorder_msg_timeo_ms=100000 --port=${port} --notify_cq_num=1 --notify_cq_threads=4 --call_cq_num=1 --call_cq_threads=4 --iterating_threads=2
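The flags follow the --name=value convention used by gflags. As an illustration only (this is not the project's actual startup code), a few of the flags above could be declared and parsed like this:

```cpp
// Illustrative sketch: declaring and parsing --name=value style flags with gflags.
// The defaults mirror some of the flags listed above; not the project's real code.
#include <gflags/gflags.h>
#include <iostream>

DEFINE_int32(port, 10010, "Port the service listens on");
DEFINE_int32(iterating_threads, 2, "Number of iterating worker threads");
DEFINE_bool(checking_heartbeat, true, "Whether followers check the leader's heartbeat");

int main(int argc, char* argv[]) {
    gflags::ParseCommandLineFlags(&argc, &argv, /*remove_flags=*/true);
    std::cout << "port=" << FLAGS_port
              << " iterating_threads=" << FLAGS_iterating_threads
              << " checking_heartbeat=" << std::boolalpha << FLAGS_checking_heartbeat
              << std::endl;
    return 0;
}
```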
The benchmark uses the TestLeaderServiceClient.Benchmark test case in src\gtest\service\test_leader_service.h with the following arguments:

--gtest_filter=TestLeaderServiceClient.Benchmark --client_write_timo_ms=5000 --benchmark_client_cq_num=1 --benchmark_client_polling_thread_num_per_cq=2 --leader_svc_benchmark_req_count=80000 --benchmark_client_entrusting_thread_num=1
In other words, the benchmark sends all the requests asynchronously and measures the overall write throughput (read performance is not considered here). The factors under test are the data length and the number of followers.
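As a rough illustration of what "measure overall write throughput" means, here is a minimal, self-contained sketch. SendWrite is a made-up stand-in for the real asynchronous gRPC write; the actual benchmark lives in the gtest case above:

```cpp
// Sketch of the measurement idea: issue requests concurrently, wait for all of
// them to finish, then report completed-requests / elapsed-seconds.
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

// Hypothetical stand-in for one write round trip; the sleep simulates
// network + server time.
void SendWrite() { std::this_thread::sleep_for(std::chrono::microseconds(50)); }

int main() {
    const int kReqCount = 80000;  // mirrors --leader_svc_benchmark_req_count
    const int kThreads  = 8;      // concurrent senders
    std::atomic<int> next{0};

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t)
        workers.emplace_back([&] {
            while (next.fetch_add(1) < kReqCount) SendWrite();
        });
    for (auto& w : workers) w.join();

    double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    std::cout << "throughput: " << kReqCount / secs << " req/s\n";
    return 0;
}
```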
First of all, the latency of a single request comprises three parts:
- the round-trip time on the network.
- the time spent waiting to be processed on the server side, especially under heavy load. This is usually implemented as a queue holding the pending requests.
- the business-logic processing time, which in Raft usually consists of:
  - appending logs.
  - replicating the request to the majority of the cluster.

If we saturate the server, the time spent on the second part increases drastically, resulting in a high average latency. That number is meaningless, though: a latency measurement that includes part 2 cannot truly reflect the real processing ability of the server. All it tells us is that the server is already doing its best, and any server reaches that point once it hits its processing limit. So when we talk about latency here, we only focus on the time spent on part 1 + part 3.
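To make the decomposition concrete, here is a tiny illustrative calculation; the numbers are made up, not measured:

```cpp
// Made-up numbers, only to show why a saturated server's end-to-end latency is
// dominated by queueing (part 2) and why we report part 1 + part 3 instead.
#include <iostream>

int main() {
    double rtt_ms        = 1.5;   // part 1: network round trip
    double queue_ms      = 40.0;  // part 2: waiting in the server-side queue under saturation
    double processing_ms = 1.0;   // part 3: append log + replicate to the majority

    std::cout << "end-to-end latency: " << rtt_ms + queue_ms + processing_ms << " ms\n";
    std::cout << "reported latency  : " << rtt_ms + processing_ms << " ms (part 1 + part 3)\n";
    return 0;
}
```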
- The latency results for the Windows-Mac case are unreliable since the network between the two machines is unstable.
- The throughput results under Linux are generally ~50%-70% higher than the ones above.
- The latency under Linux is stable at 1ms~2ms for all the above cases.
There is also an image for a quick start on Linux on Tencent Cloud; please ask the author explicitly to share it, since it cannot be made public at the moment.
First, let's write an example to see the performance when there is no business logic at all, only the gRPC framework:
- leader: receives requests from the client and broadcasts them to its followers; here we have 2 followers.
- follower: a ping-pong server that does nothing but return an empty message upon receiving a request from the leader.
The leader's code is here.
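For completeness, here is a minimal sketch of what the follower's ping-pong server could look like with synchronous gRPC. The proto names (EchoService, PingReq, PingResp) and the generated header are placeholders, not the project's real definitions:

```cpp
// Hypothetical ping-pong follower: replies immediately with an empty message.
// "echo.grpc.pb.h", EchoService, PingReq and PingResp are placeholder names.
#include <memory>
#include <grpcpp/grpcpp.h>
#include "echo.grpc.pb.h"

class PingPongImpl final : public EchoService::Service {
    grpc::Status Ping(grpc::ServerContext* /*ctx*/,
                      const PingReq* /*req*/,
                      PingResp* /*resp*/) override {
        return grpc::Status::OK;  // return an empty message, no other work
    }
};

int main() {
    PingPongImpl service;
    grpc::ServerBuilder builder;
    builder.AddListeningPort("0.0.0.0:10010", grpc::InsecureServerCredentials());
    builder.RegisterService(&service);
    std::unique_ptr<grpc::Server> server = builder.BuildAndStart();
    server->Wait();
    return 0;
}
```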
With the same environment and deployment as the Windows-Mac test case, this yields a throughput of ~20,000/s (2w/s). So, according to this experiment, we can almost conclude that the bottleneck is in the gRPC framework itself. Better practices for utilizing gRPC are still hard to figure out, since gRPC does not perform as well as you might imagine.