backoff pd api when fails #6556
Comments
ref tikv/pd#6556, close #14964 Signed-off-by: Ryan Leung <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv/pd#6556, close tikv#14964 Signed-off-by: Ryan Leung <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com> Signed-off-by: tonyxuqqi <[email protected]>
And for TiKV's pd client, @rleungx has increased the retry interval in tikv/tikv#14954. If that does not work, we may need to consider adding backoff and increasing the max retry time. cc @rleungx
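For reference, here is a minimal Go sketch of the two knobs being discussed in this comment: a per-attempt retry interval and a maximum retry count. The `retryConfig` type, the `retry` helper, and the defaults are illustrative assumptions, not the actual tikv pd client code.

```go
// Illustrative sketch of the two retry knobs discussed in this thread: a
// fixed per-attempt interval and a maximum retry count. Names and defaults
// are made up; the real settings live in tikv's pd client.
package main

import (
    "errors"
    "fmt"
    "time"
)

type retryConfig struct {
    Interval   time.Duration // "increase the retry interval" tunes this
    MaxRetries int           // "increase the max retry time" tunes this
}

// retry calls fn until it succeeds or the retry budget is exhausted,
// sleeping a fixed interval between attempts.
func retry(cfg retryConfig, fn func() error) error {
    var err error
    for i := 0; i < cfg.MaxRetries; i++ {
        if err = fn(); err == nil {
            return nil
        }
        time.Sleep(cfg.Interval)
    }
    return fmt.Errorf("gave up after %d attempts: %w", cfg.MaxRetries, err)
}

func main() {
    cfg := retryConfig{Interval: 300 * time.Millisecond, MaxRetries: 4}
    err := retry(cfg, func() error { return errors.New("pd not ready") })
    fmt.Println(err)
}
```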
@nolouch thanks for the reply. The behavior we observe is excessive GetMembers calls, a few thousand QPS, from TiDB components to PD when the PD leader is already having issues. It would be great to add some backoff for this particular scenario: since the PD leader is already struggling at that point, any further load could make things worse.
Got it. On the TiDB side, all requests should already have a backoff mechanism via client-go's backoff, but some paths may not be covered, like the one you mentioned.
BTW, with tikv/tikv#13673 (pd-client v2), we do not need to retry inside the client, so no inner backoff is needed to reduce the requests. That should significantly improve this problem.
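For contrast, a rough sketch of the pd-client v2 idea mentioned above: the client performs a single attempt and surfaces the error, leaving any retry or backoff policy to the caller. `memberClient` and its `GetMembers` method are hypothetical stand-ins, not the real v2 API.

```go
// Hypothetical sketch of the "no inner retry" idea: the client does one
// attempt and surfaces the error; retry and backoff policy belong to callers.
package main

import (
    "context"
    "errors"
    "fmt"
)

var errLeaderUnavailable = errors.New("pd leader unavailable")

// memberClient is an illustrative stand-in for a v2-style PD client.
type memberClient struct{ healthy bool }

// GetMembers issues exactly one request; it never loops internally.
func (c *memberClient) GetMembers(ctx context.Context) ([]string, error) {
    if !c.healthy {
        return nil, errLeaderUnavailable
    }
    return []string{"pd-0", "pd-1", "pd-2"}, nil
}

func main() {
    c := &memberClient{healthy: false}
    // The caller decides what to do with the failure: back off, switch
    // endpoints, or give up. The client itself generates no extra load.
    if _, err := c.GetMembers(context.Background()); err != nil {
        fmt.Println("caller handles:", err)
    }
}
```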
…15191) ref tikv/pd#6556, close #15184 The store heartbeat is reported periodically, so there is no need to retry it - do not retry the store heartbeat - rename `remain_reconnect_count` to `remain_request_count` - fix some metrics Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv/pd#6556, close tikv#15184 Signed-off-by: ti-chi-bot <[email protected]>
…15191) (#15231) ref tikv/pd#6556, close #15184 The store heartbeat is reported periodically, so there is no need to retry it - do not retry the store heartbeat - rename `remain_reconnect_count` to `remain_request_count` - fix some metrics Signed-off-by: ti-chi-bot <[email protected]> Signed-off-by: nolouch <[email protected]> Co-authored-by: ShuNing <[email protected]> Co-authored-by: nolouch <[email protected]>
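A hedged Go sketch of the idea behind the #15191 change described above (the real change is in TiKV's Rust pd client; every name and constant below is illustrative): periodic store heartbeats are sent once and never retried, since the next tick will report again, while other requests consume a bounded `remain_request_count`-style budget.

```go
// Illustrative-only sketch: store heartbeats are periodic, so a failed one is
// simply dropped (the next tick replaces it), while other requests consume a
// bounded request budget instead of retrying indefinitely.
package main

import (
    "errors"
    "fmt"
)

const maxRemainRequestCount = 3 // illustrative budget, not the real constant

type request struct {
    name      string
    retryable bool // false for store heartbeat: the next periodic tick replaces it
}

// send runs do at most once for non-retryable requests, and at most
// maxRemainRequestCount times otherwise.
func send(r request, do func() error) error {
    remain := maxRemainRequestCount
    if !r.retryable {
        remain = 1 // one shot only
    }
    var err error
    for ; remain > 0; remain-- {
        if err = do(); err == nil {
            return nil
        }
    }
    return fmt.Errorf("%s failed, not retrying further: %w", r.name, err)
}

func main() {
    fail := func() error { return errors.New("pd unreachable") }
    fmt.Println(send(request{name: "store_heartbeat", retryable: false}, fail))
    fmt.Println(send(request{name: "get_region", retryable: true}, fail))
}
```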
TiDB side: #6978 tries to reduce the GetMembers requests. Preliminary test results: the RPC calls were reduced from 3.22k to 170 ops, which is related to the number of TiDB instances and the client requests that trigger checkLeader. This reduction could be more significant in larger clusters, and more tests are necessary to ensure that no further issues arise.
close #5739, ref #6556 Signed-off-by: Ryan Leung <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
TiKV side: tikv/tikv#15429 tries to reduce the retries on the TiKV side (with no workload on any TiDB). Detailed tests can be found in the PR; the before/after comparison is shown there.
ref #6556 Signed-off-by: husharp <[email protected]>
ref tikv/pd#6556, close #15428 pd_client: add store-level backoff for the reconnect retries Signed-off-by: nolouch <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
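A hedged sketch of what store-level backoff for reconnect retries could look like (the actual change is in TiKV's `pd_client`; the struct, methods, and durations below are assumptions): each store keeps its own next-allowed-reconnect time, doubling the delay on failure and resetting it on success, so a flapping connection cannot produce a tight reconnect loop.

```go
// Illustrative sketch of per-store reconnect backoff: each store tracks when
// it is next allowed to reconnect and how long to wait, doubling the delay on
// failure and resetting it on success. Not the actual tikv pd_client code.
package main

import (
    "fmt"
    "time"
)

const (
    baseDelay = 100 * time.Millisecond
    maxDelay  = 5 * time.Second
)

type reconnectBackoff struct {
    next  time.Time     // earliest time the next reconnect may start
    delay time.Duration // current backoff delay
}

// allow reports whether a reconnect may be attempted now.
func (b *reconnectBackoff) allow(now time.Time) bool { return !now.Before(b.next) }

// onFailure schedules the next attempt further out, up to maxDelay.
func (b *reconnectBackoff) onFailure(now time.Time) {
    if b.delay == 0 {
        b.delay = baseDelay
    } else if b.delay *= 2; b.delay > maxDelay {
        b.delay = maxDelay
    }
    b.next = now.Add(b.delay)
}

// onSuccess clears the backoff so a healthy store reconnects immediately.
func (b *reconnectBackoff) onSuccess() { *b = reconnectBackoff{} }

func main() {
    var b reconnectBackoff
    now := time.Now()
    for i := 0; i < 4; i++ {
        fmt.Println("allowed:", b.allow(now))
        b.onFailure(now) // simulate a failed reconnect attempt
        now = now.Add(50 * time.Millisecond)
    }
    b.onSuccess()
    fmt.Println("after success, allowed:", b.allow(now))
}
```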
ref tikv/pd#6556, close tikv#15428 Signed-off-by: ti-chi-bot <[email protected]>
…15191) (#15232) ref tikv/pd#6556, close #15184 The store heartbeat is reported periodically, so there is no need to retry it - do not retry the store heartbeat - rename `remain_reconnect_count` to `remain_request_count` - fix some metrics Signed-off-by: ti-chi-bot <[email protected]> Signed-off-by: nolouch <[email protected]> Co-authored-by: ShuNing <[email protected]> Co-authored-by: nolouch <[email protected]>
ref tikv/pd#6556, close #15428 pd_client: add store-level backoff for the reconnect retries Signed-off-by: ti-chi-bot <[email protected]> Signed-off-by: nolouch <[email protected]> Co-authored-by: ShuNing <[email protected]> Co-authored-by: nolouch <[email protected]>
ref tikv/pd#6556, close #15428 pd_client: add store-level backoff for the reconnect retries Signed-off-by: ti-chi-bot <[email protected]> Signed-off-by: nolouch <[email protected]> Co-authored-by: ShuNing <[email protected]> Co-authored-by: nolouch <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Enhancement Task
Particularly, the PD GetMembers request should have a backoff; otherwise it could overload PD and prevent it from recovering from temporary issues.
For example, in v6.5.1: https://github.com/tikv/pd/blob/v6.5.1/client/base_client.go#L306
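To make the request concrete, below is a minimal sketch of wrapping the member update in exponential backoff with jitter. `getMembers` and `updateMemberWithBackoff` are hypothetical stand-ins, not the code at the linked `base_client.go` line, and the durations are assumptions.

```go
// A hedged sketch of backing off the GetMembers retry: exponential backoff
// plus jitter so a degraded PD leader is not hit by a tight retry loop.
package main

import (
    "context"
    "errors"
    "fmt"
    "math/rand"
    "time"
)

// getMembers is a stand-in for the real RPC; it always fails here so the
// backoff behavior is visible.
func getMembers(ctx context.Context) error {
    return errors.New("pd leader not serving")
}

// updateMemberWithBackoff retries getMembers with capped exponential backoff
// and jitter, and stops when the context is cancelled or attempts run out.
func updateMemberWithBackoff(ctx context.Context, maxAttempts int) error {
    delay := 200 * time.Millisecond
    const maxDelay = 3 * time.Second
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = getMembers(ctx); err == nil {
            return nil
        }
        // Sleep between half the current delay and the full delay (jitter).
        sleep := delay/2 + time.Duration(rand.Int63n(int64(delay/2)))
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(sleep):
        }
        if delay *= 2; delay > maxDelay {
            delay = maxDelay
        }
    }
    return fmt.Errorf("update member failed after %d attempts: %w", maxAttempts, err)
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()
    fmt.Println(updateMemberWithBackoff(ctx, 5))
}
```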