batch messages can be too large in new raft client implementation #9714
Comments
How about removing the size limit and keeping only the batch count limit? Then the total size of a batch won't be too exorbitant, and we can save CPU time.
Perhaps we should benchmark what a reasonably good size is.
I believe #8926 introduced the regression, /cc @sticnarf. The raft client doesn't check the raft messages' context, so it's highly possible that ReadIndex requests with a lot of key ranges exceed the 10 MiB limit. The raft client only allows at most 128 requests in a batch, so if the ranges' size exceeds 80K per message, it triggers the error. This also explains why the error has occurred more often since v5.0.0.
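To illustrate the arithmetic in the comment above, here is a minimal self-contained sketch (not TiKV code; the struct and field names are made up) of how an estimate that ignores the ReadIndex context can stay under the 10 MiB gRPC limit while the encoded batch exceeds it:

```rust
// Hypothetical sketch of the size-accounting gap described above; the names
// (RaftMessageLite, MAX_BATCH_COUNT) are illustrative, not TiKV's actual types.
const MAX_BATCH_COUNT: usize = 128; // batch count limit mentioned in the comment
const GRPC_MSG_LIMIT: usize = 10 * 1024 * 1024; // 10 MiB gRPC message limit

struct RaftMessageLite {
    body_len: usize, // bytes the estimator does account for
    ctx_len: usize,  // ReadIndex key ranges carried in the context, *not* counted
}

fn main() {
    // 128 messages, each carrying ~80 KiB of ReadIndex key ranges in its context.
    let msgs: Vec<RaftMessageLite> = (0..MAX_BATCH_COUNT)
        .map(|_| RaftMessageLite { body_len: 512, ctx_len: 80 * 1024 })
        .collect();

    let estimated: usize = msgs.iter().map(|m| m.body_len).sum();
    let actual: usize = msgs.iter().map(|m| m.body_len + m.ctx_len).sum();

    assert!(estimated < GRPC_MSG_LIMIT); // the batcher believes it is fine
    assert!(actual > GRPC_MSG_LIMIT);    // but the encoded batch exceeds 10 MiB
    println!("estimated = {}, actual = {}", estimated, actual);
}
```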
@BusyJay Does it happen only when there are replica reads? The additional contexts should only exist when using replica reads.
Due to tikv#9714, the context field could contain large key ranges, which could make the estimate very inaccurate. Signed-off-by: tonyxuqqi <[email protected]>
…push (#11056) * raft_client: check context size in BatchMessageBuffer::push Due to #9714, the context field could contain large key ranges, which could make the estimate very inaccurate. Signed-off-by: tonyxuqqi <[email protected]> * add extra_ctx into the check as well Signed-off-by: tonyxuqqi <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]>
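A rough sketch of the approach named in that commit title, assuming a simplified buffer (the types below are illustrative, not TiKV's actual BatchMessageBuffer): count the context bytes during push, and flush first if adding the message would push the batch over the limit.

```rust
// Hypothetical types; only meant to show the push-time check, not TiKV's code.
struct Msg {
    body_len: usize, // estimated size of the message body
    ctx_len: usize,  // size of the context (e.g. ReadIndex key ranges)
}

struct BatchBuffer {
    msgs: Vec<Msg>,
    size: usize,
    limit: usize,
}

impl BatchBuffer {
    fn push(&mut self, msg: Msg) {
        // Include the context bytes in the estimate.
        let msg_size = msg.body_len + msg.ctx_len;
        if !self.msgs.is_empty() && self.size + msg_size > self.limit {
            self.flush(); // send the current batch before it grows too large
        }
        self.size += msg_size;
        self.msgs.push(msg);
    }

    fn flush(&mut self) {
        // In TiKV this would hand the batch to the gRPC sender; here we just reset.
        self.msgs.clear();
        self.size = 0;
    }
}

fn main() {
    let mut buf = BatchBuffer { msgs: Vec::new(), size: 0, limit: 10 * 1024 * 1024 };
    buf.push(Msg { body_len: 512, ctx_len: 80 * 1024 });
    println!("buffered size = {}", buf.size);
}
```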
I printed the whole messages in a test case that can reproduce the issue frequently. It turns out it's because too many entries are batched into one message. A batch with 30 messages can contain about 45,199 entries. Each entry has about 11 bytes of extra overhead, which means there are at least 497,189 extra bytes, very close to the default extra buffer of 524,288. Given that we now collect the full byte length from the pb message, I feel confident disabling the hard size limit at the connection level to solve the issue for all cases. We still need a way to monitor abnormal message sizes, though.
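A quick back-of-the-envelope check of those numbers (plain Rust, not TiKV code):

```rust
// ~45,199 entries with ~11 bytes of uncounted overhead each almost fill the
// default 512 KiB extra buffer, matching the figures reported in the comment.
fn main() {
    let entries: u64 = 45_199;          // entries observed in one 30-message batch
    let overhead_per_entry: u64 = 11;   // extra bytes per entry missed by the estimate
    let extra_buffer: u64 = 524_288;    // default slack (512 KiB)

    let uncounted = entries * overhead_per_entry; // 497,189 bytes
    assert_eq!(uncounted, 497_189);
    assert!(uncounted < extra_buffer);  // close to, but still within, the slack
    println!("uncounted = {}, slack left = {}", uncounted, extra_buffer - uncounted);
}
```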
If there are many entries in a message, the estimated size of the message can be way smaller than the actual size. This PR fixes the error by also counting index and term in the estimation. It also removes the hard limit, as the estimation is close enough. Close tikv#9714. Signed-off-by: Jay Lee <[email protected]>
* add test case Signed-off-by: Jay Lee <[email protected]> * count term and index tag If there are many entries in a message, the estimated size of the message can be way smaller than the actual size. This PR fixes the error by also counting index and term in the estimation. It also removes the hard limit, as the estimation is close enough. Close #9714. Signed-off-by: Jay Lee <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]>
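A minimal sketch of the estimation idea described in that commit, i.e. also counting the protobuf tag and varint bytes of each entry's term and index. The varint math is standard protobuf encoding; the field layout assumed here is illustrative, not taken from eraftpb or from the PR itself.

```rust
// Length of a protobuf varint for value v (1 byte per 7 bits of payload).
fn varint_len(mut v: u64) -> usize {
    let mut n = 1;
    while v >= 0x80 {
        v >>= 7;
        n += 1;
    }
    n
}

// Per-entry size estimate that also counts the term and index fields,
// instead of only the data payload.
fn entry_size_estimate(term: u64, index: u64, data_len: usize) -> usize {
    let term_bytes = 1 + varint_len(term);                     // tag + varint value
    let index_bytes = 1 + varint_len(index);                   // tag + varint value
    let data_bytes = 1 + varint_len(data_len as u64) + data_len; // tag + length prefix + payload
    term_bytes + index_bytes + data_bytes
}

fn main() {
    // Even a small entry carries several bytes of term/index overhead,
    // which the old estimate ignored.
    println!("{}", entry_size_estimate(7, 1_000_000, 64));
}
```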
Signed-off-by: Jay Lee <[email protected]>
* cherry pick #11493 to release-5.0 Signed-off-by: ti-srebot <[email protected]> * solve conflict Signed-off-by: Jay Lee <[email protected]> * Ref #9714. Signed-off-by: Jay Lee <[email protected]> Co-authored-by: Jay <[email protected]> Co-authored-by: Jay Lee <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]> Co-authored-by: qupeng <[email protected]>
…push (#11056) (#11065) close #9714, ref #9714, ref #11056 Signed-off-by: ti-srebot <[email protected]> Co-authored-by: tonyxuqqi <[email protected]> Co-authored-by: Yilin Chen <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]>
close #9714, ref #11493 Signed-off-by: ti-srebot <[email protected]> Signed-off-by: Yilin Chen <[email protected]> Signed-off-by: Jay Lee <[email protected]> Co-authored-by: Jay <[email protected]> Co-authored-by: Yilin Chen <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]> Co-authored-by: Jay Lee <[email protected]>
close #9714, ref #11493 Signed-off-by: ti-srebot <[email protected]> Co-authored-by: Jay <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]>
Bug Report
The new raft client implementation tends to batch more messages than before, and we found that it exceeded the max gRPC message size in internal tests.
Calculating the message size using Message::compute_size should make it correct at the cost of extra CPU usage. We may also check which fields are being missed, because even when the max batch size (128) is reached, there was about 1K of extra size per message.
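For reference, a sketch of the trade-off described above, assuming the rust-protobuf Message trait implemented by the generated kvproto types; the cheap_estimate helper is hypothetical, not an existing TiKV function.

```rust
// compute_size() walks every field, so it is exact but costs CPU on each push;
// a hand-rolled estimate is fast but can miss bytes such as the context key
// ranges or per-entry term/index tags.
use protobuf::Message;

fn exact_size<M: Message>(msg: &M) -> u64 {
    // Exact encoded size (return type is u32 in protobuf 2.x, hence the cast).
    msg.compute_size() as u64
}

// Hypothetical cheap estimator: only counts the payload lengths it knows about.
fn cheap_estimate(payload_lens: &[usize]) -> u64 {
    payload_lens.iter().map(|&l| l as u64).sum()
}
```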