Return # of state transition when generating last replication tasks #4352

Merged
merged 3 commits into from
May 19, 2023
Conversation

Contributor @wxing1292 commented May 17, 2023:

What changed?

  • Return # of state transitions when generating last replication tasks
  • When force-replicating workflows, rate limit by state transition count

Why?
The caller gets a better idea of how many state transitions will happen when applying these replication tasks (including catch-up).

How did you test it?
N/A

Potential risks
N/A

Is hotfix candidate?
N/A

@wxing1292 wxing1292 requested review from meiliang86 and yux0 May 17, 2023 17:27
@wxing1292 wxing1292 requested a review from a team as a code owner May 17, 2023 17:27
// ignore NotFound error
switch err.(type) {
case nil:
_ = rateLimiter.ReserveN(time.Now(), int(resp.StateTransitionCount))
Contributor @yux0 commented May 17, 2023:

If the ST count exceeds the burst size, it won't reserve the tokens.

Contributor Author:

"// The returned Reservation's OK() method returns false if n exceeds the Limiter's burst size."

You are definitely right, but what should the behavior be?
Returning an error is not a valid option (I assume).
Logging an error?

}

if r.mutableState.GetExecutionState().State == enumsspb.WORKFLOW_EXECUTION_STATE_COMPLETED {
return &tasks.SyncWorkflowStateTask{
// TaskID, VisibilityTimestamp is set by shard
WorkflowKey: r.mutableState.GetWorkflowKey(),
Version: lastItem.GetVersion(),
-	}, nil
+	}, 1, nil
} else {
return &tasks.HistoryReplicationTask{
// TaskID, VisibilityTimestamp is set by shard
WorkflowKey: r.mutableState.GetWorkflowKey(),
FirstEventID: executionInfo.LastFirstEventId,
NextEventID: lastItem.GetEventId() + 1,
Version: lastItem.GetVersion(),
Contributor:

Does it make sense to include StateTransitionCount as part of HistoryReplicationTask?

Contributor Author:

No; the state transition count is basically how many LWTs / transactions this workflow has caused so far.

This metric can be useful for predicting the # of LWTs.

Do you have any use case for this metric?

Contributor:

Cleaner IMO; it avoids the function's return types becoming (a, b, c, ...). I'll leave it up to you.

Contributor Author:

This is a Go feature, not a bug ...

switch err.(type) {
case nil:
stateTransitionCount := resp.StateTransitionCount
for stateTransitionCount > 0 {
Contributor:

Puzzled by the need for the for loop here. Not knowing how rateLimiter is implemented: would ReserveN(, stateTransitionCount) fail if stateTransitionCount > rateLimiter.Burst()?

Contributor Author:

Yes, the returned reservation's OK() will return false.

Contributor:

Do we know the upper bound of the ratio between stateTransitionCount and replication tasks? If we know it, say X, we could probably set the burst to X*RPS instead.

Contributor Author:

The upper limit on the number of history events is 50K.
On average, I would say 2 history events per LWT, so the upper limit on the # of STs per workflow is around 25K.

Using 25K as the burst is meaningless.

@@ -252,24 +254,37 @@ func (a *activities) checkHandoverOnce(ctx context.Context, waitRequest waitHand
return readyShardCount == len(resp.Shards), nil
}

-func (a *activities) generateWorkflowReplicationTask(ctx context.Context, wKey definition.WorkflowKey) error {
+func (a *activities) generateWorkflowReplicationTask(ctx context.Context, rateLimiter quotas.RateLimiter, wKey definition.WorkflowKey) error {
+	if err := rateLimiter.WaitN(ctx, 1); err != nil {
Contributor:

rateLimiter.WaitN(ctx, 1) -> rateLimiter.Wait(ctx)?

for stateTransitionCount > 0 {
token := util.Min(int(stateTransitionCount), rateLimiter.Burst())
stateTransitionCount -= int64(token)
_ = rateLimiter.ReserveN(time.Now(), token)
Contributor @hehaifengcn commented May 17, 2023:

Do you still intend to wait on the tokens? I think ReserveN only returns a reservation; you still need to call time.Sleep(r.Delay()) to wait, according to the doc.

Contributor Author:

Since we need to call the API to know the # of LWTs (tokens to consume), the logic here should just reserve those tokens; the next call (the WaitN function) will be blocked.

Contributor:

I see. Can you add some comments to aid understanding? Are you suggesting that the WaitN(, 1) at line 258 will wait on all reserved tokens? The WaitN comment says "WaitN blocks until lim permits n events to happen.", so I assume it will just consume 1 token?

Contributor Author:

> Are you suggesting WaitN(, 1) at line 258 will wait on all reserved tokens?

Please take a look at the rate limiter doc: https://pkg.go.dev/golang.org/x/time/rate#Limiter.ReserveN

@@ -574,6 +574,7 @@ message GenerateLastHistoryReplicationTasksRequest {
}

message GenerateLastHistoryReplicationTasksResponse {
int64 state_transition_count = 1;
Contributor:

any upgrade/downgrade concern?

Contributor Author:

An old caller will get 0 here.

The system still uses WaitN(ctx, 1) (before making the API call),
so for the old logic this will be a no-op.

For the new logic, the tokens consumed will be 1 per generate-task call + n state transitions from the call result.

Contributor:

> "old caller will get 0 here"

You mean a new history client calling an old history service?

Contributor Author:

Yes.

@wxing1292 wxing1292 merged commit ed49604 into temporalio:master May 19, 2023
@wxing1292 wxing1292 deleted the generate-replicaion-task-hint branch May 19, 2023 08:14