
batch messages can be too large in new raft client implementation #9714

Closed
BusyJay opened this issue Feb 25, 2021 · 5 comments · Fixed by #11056 or #11493
Labels
affects-5.1, affects-5.2, affects-5.3, component/gRPC, severity/major, type/bug

Comments

BusyJay (Member) commented Feb 25, 2021

Bug Report

The new raft client implementation tends to batch more messages than before, and we found it exceeding the gRPC maximum message size in internal tests.

[2021/02/25 17:42:42.658 +08:00] [ERROR] [raft_client.rs:431] ["connection aborted"] [addr=] [receiver_err="Some(RpcFailure(RpcStatus { status: 8-RESOURCE_EXHAUSTED, details: Some(\"Sent message larger than max (10705556 vs. 10485760)\") }))"] [sink_error=Some(RemoteStopped)] [store_id=7]

Calculating the message size with Message::compute_size would make the check exact, at the cost of extra CPU usage. We should also check which fields the estimate misses: even when the max batch size (128) was reached, there was about 1 KiB of extra size per message.
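
For illustration, here is a minimal sketch of such an exact size check, assuming a hypothetical batch buffer and rust-protobuf's Message::compute_size; the names SizedBatch and MAX_BATCH_BYTES are illustrative, not TiKV's, and the 10 MiB budget matches gRPC's default maximum message size.

```rust
// Hypothetical sketch of an exact byte-budget check using Message::compute_size.
// Not the TiKV implementation; names are illustrative.
use protobuf::Message;

const MAX_BATCH_BYTES: u64 = 10 * 1024 * 1024; // gRPC default max message size

struct SizedBatch<M: Message> {
    msgs: Vec<M>,
    size: u64,
}

impl<M: Message> SizedBatch<M> {
    fn new() -> Self {
        SizedBatch { msgs: Vec::new(), size: 0 }
    }

    /// Pushes `msg` into the batch, or hands it back when adding it would
    /// exceed the byte budget so the caller can flush and retry.
    fn push(&mut self, msg: M) -> Result<(), M> {
        // compute_size walks the whole message, so it is exact but costs CPU.
        let msg_size = msg.compute_size() as u64;
        if !self.msgs.is_empty() && self.size + msg_size > MAX_BATCH_BYTES {
            return Err(msg);
        }
        self.size += msg_size;
        self.msgs.push(msg);
        Ok(())
    }
}
```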

BusyJay added the component/gRPC and type/bug labels on Feb 25, 2021
hicqu (Contributor) commented Feb 26, 2021

How about removing the size limit and keeping only the batch count limit? The total size of a batch still won't be exorbitant, and we can save the CPU time spent on compute_size.

BusyJay (Member, Author) commented Feb 26, 2021

Perhaps we should benchmark what a reasonably good batch size is.

BusyJay (Member, Author) commented Sep 30, 2021

I believe #8926 introduced the regression, /cc @sticnarf.

The raft client doesn't check the raft messages' context, so it's quite possible for a ReadIndex request that carries many key ranges to push a batch past the 10 MiB limit. The raft client allows at most 128 messages in a batch, so once the ranges add roughly 80 KiB per message (128 × 80 KiB ≈ 10 MiB), the error is triggered.

This also explains why the error has occurred more often since v5.0.0 than it did before.

sticnarf (Contributor) commented Oct 8, 2021

@BusyJay Does it happen only when there are replica reads? The additional contexts should only exist when using replica reads.

tonyxuqqi added a commit to tonyxuqqi/tikv that referenced this issue Oct 14, 2021
Due to tikv#9714, the context field can contain large key ranges, which can make the size estimate very inaccurate.

Signed-off-by: tonyxuqqi <[email protected]>
ti-chi-bot added a commit that referenced this issue Oct 15, 2021
…push (#11056)

* raft_client: check context size in BatchMessageBuffer::push

Due to #9714, the context field can contain large key ranges, which can make the size estimate very inaccurate.

Signed-off-by: tonyxuqqi <[email protected]>

* add extra_ctx into check as well

Signed-off-by: tonyxuqqi <[email protected]>

Co-authored-by: Ti Chi Robot <[email protected]>
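
For context, a rough sketch of what counting the context and extra_ctx bytes in the per-message estimate could look like; the accessor names are assumptions modeled on rust-protobuf-generated kvproto code, not the actual BatchMessageBuffer logic.

```rust
// Illustrative only: include the previously ignored context bytes when
// estimating a RaftMessage's size. Accessor names follow the style of
// rust-protobuf-generated kvproto code and may differ from the real ones.
fn estimated_msg_size(msg: &kvproto::raft_serverpb::RaftMessage) -> usize {
    let mut size = msg.get_start_key().len() + msg.get_end_key().len();
    // Entry payloads dominate the size of most messages.
    for e in msg.get_message().get_entries() {
        size += e.get_data().len();
    }
    // A ReadIndex request can carry many key ranges in the context field,
    // which the old estimate did not count.
    size += msg.get_message().get_context().len();
    size += msg.get_extra_ctx().len();
    size
}
```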
BusyJay (Member, Author) commented Nov 25, 2021

I printed the whole messages in a test case that reproduces the issue frequently. It turns out the cause is too many entries being batched into one message. A batch with 30 messages can contain about 45,199 entries. Each entry carries about 11 extra bytes of overhead, which means there are at least 497,189 extra bytes, very close to the default extra buffer of 524,288 bytes.

Now that we collect all byte lengths from the pb message, I feel confident about disabling the hard size limit at the connection level to solve the issue for good. We still need a way to monitor abnormal message sizes, though.
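
As a sanity check of the numbers above, here is a small hypothetical sketch of a per-entry overhead estimate that also counts the index and term fields, each encoded as a one-byte field tag plus a varint; the helper names are made up for illustration.

```rust
// Hypothetical sketch: per-entry overhead that counts the index and term
// fields (one-byte field tag plus a varint each), which the old estimate
// ignored.
fn varint_len(mut v: u64) -> usize {
    // Protobuf varints carry 7 bits of payload per byte.
    let mut n = 1;
    while v >= 0x80 {
        v >>= 7;
        n += 1;
    }
    n
}

/// Estimated wire size of one raft entry: the payload plus the tag/varint
/// overhead of its index and term fields.
fn entry_size_estimate(data_len: usize, index: u64, term: u64) -> usize {
    data_len + 1 + varint_len(index) + 1 + varint_len(term)
}

fn main() {
    // With an index around 10_000_000 and a term around 100, index/term alone
    // add 1 + 4 + 1 + 1 = 7 bytes, before the data field's own tag and length
    // prefix.
    assert_eq!(entry_size_estimate(0, 10_000_000, 100), 7);
}
```

With the data field's tag and length prefix and the entry type on top of that, roughly 11 bytes of per-entry overhead is plausible for the workload described above.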

BusyJay reopened this on Nov 25, 2021
BusyJay added a commit to BusyJay/tikv that referenced this issue Nov 26, 2021
If there are many entries in a message, the estimated size of the message
can be much smaller than the actual size. This PR fixes the error by
also counting index and term in the estimation.

It also removes the hard limit, as the estimation is now close enough.

Close tikv#9714.

Signed-off-by: Jay Lee <[email protected]>
ti-chi-bot added a commit that referenced this issue Dec 1, 2021
* add test case

Signed-off-by: Jay Lee <[email protected]>

* count term and index tag

If there are many entries in a message, the estimated size of the message
can be much smaller than the actual size. This PR fixes the error by
also counting index and term in the estimation.

It also removes the hard limit, as the estimation is now close enough.

Close #9714.

Signed-off-by: Jay Lee <[email protected]>

Co-authored-by: Ti Chi Robot <[email protected]>
BusyJay added a commit to ti-srebot/tikv that referenced this issue Dec 17, 2021
Signed-off-by: Jay Lee <[email protected]>
ti-chi-bot added a commit that referenced this issue Dec 17, 2021
* cherry pick #11493 to release-5.0

Signed-off-by: ti-srebot <[email protected]>

* solve conflict

Signed-off-by: Jay Lee <[email protected]>

* Ref #9714.

Signed-off-by: Jay Lee <[email protected]>

Co-authored-by: Jay <[email protected]>
Co-authored-by: Jay Lee <[email protected]>
Co-authored-by: Ti Chi Robot <[email protected]>
Co-authored-by: qupeng <[email protected]>
jebter added the affects-5.1, affects-5.2, and affects-5.3 labels on Jan 18, 2022
ti-chi-bot added a commit that referenced this issue Feb 10, 2022
…push (#11056) (#11065)

close #9714, ref #9714, ref #11056

Signed-off-by: ti-srebot <[email protected]>

Co-authored-by: tonyxuqqi <[email protected]>
Co-authored-by: Yilin Chen <[email protected]>
Co-authored-by: Ti Chi Robot <[email protected]>
ti-chi-bot added a commit that referenced this issue Feb 10, 2022
close #9714, ref #11493

Signed-off-by: ti-srebot <[email protected]>
Signed-off-by: Yilin Chen <[email protected]>
Signed-off-by: Jay Lee <[email protected]>

Co-authored-by: Jay <[email protected]>
Co-authored-by: Yilin Chen <[email protected]>
Co-authored-by: Ti Chi Robot <[email protected]>
Co-authored-by: Jay Lee <[email protected]>
ti-chi-bot added a commit that referenced this issue Feb 18, 2022
close #9714, ref #11493

Signed-off-by: ti-srebot <[email protected]>

Co-authored-by: Jay <[email protected]>
Co-authored-by: Ti Chi Robot <[email protected]>