
Add special rate limiter for namespace replication inducing APIs #4455

Conversation

bergundy
Member

@bergundy bergundy commented Jun 7, 2023

Added special rate limiting for APIs that may insert replication tasks into the namespace replication queue.
The replication queue is used to propagate critical failover messages, and this limit prevents flooding the
queue and delaying failovers.

Closes #4240
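
As a minimal sketch of the idea (the two API names below are the ones discussed in this PR; the limiter shape, rates, and all other names are illustrative assumptions, not the PR's actual code): replication-inducing calls get their own small token bucket, so they can neither exhaust nor be starved by regular traffic.

package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

// replicationInducingAPIs mirrors the set of APIs this PR singles out.
var replicationInducingAPIs = map[string]bool{
	"UpdateNamespace":                  true,
	"UpdateWorkerBuildIdCompatibility": true,
}

// apiLimiter keeps a generous default bucket plus a deliberately tight
// bucket for replication-inducing calls (both rates are placeholders).
type apiLimiter struct {
	defaultLimiter     *rate.Limiter
	replicationLimiter *rate.Limiter
}

// Allow picks a bucket by API name.
func (l *apiLimiter) Allow(api string) bool {
	if replicationInducingAPIs[api] {
		return l.replicationLimiter.Allow()
	}
	return l.defaultLimiter.Allow()
}

func main() {
	l := &apiLimiter{
		defaultLimiter:     rate.NewLimiter(rate.Limit(2400), 2400),
		replicationLimiter: rate.NewLimiter(rate.Limit(20), 20),
	}
	fmt.Println(l.Allow("UpdateNamespace"))        // tight bucket
	fmt.Println(l.Allow("StartWorkflowExecution")) // default bucket
}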

@bergundy bergundy requested a review from a team as a code owner June 7, 2023 22:41
// The limit is evenly distributed among available internal-frontend service instances. If this is set, it
// overwrites the per instance limit configured with "internal-frontend.namespaceRPS.namespaceReplicationInducingAPIs".
// This config is EXPERIMENTAL and may be changed or removed in a later release.
InternalFrontendGlobalNamespaceNamespaceReplicationInducingAPIsRPS = "internal-frontend.globalNamespaceRPS.namespaceReplicationInducingAPIs"
Member

I don't think we need this, do we? Can we just set it to unlimited or 1000 RPS? The worker role does call UpdateWorkerBuildIdCompatibility, but that's limited by the rate of the scavenger, and for UpdateNamespace we probably don't want to slow things down.

Member Author

We have to put some value there, so we might as well have this dynamic config option.

Member

Eh, each option adds more mental overhead. If we agree it doesn't make sense, I'd prefer not to have an option.

Member Author

We'll make a function that returns 0 instead of making it configurable.
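
(For illustration only, since the actual helper isn't shown here: dynamic config values are typically consumed through getter functions, so a hard-coded getter can stand in for a config key that was deliberately not added. All names below are hypothetical.)

package dynamicconfigsketch

// IntPropertyFn is the usual shape of a dynamic config getter.
type IntPropertyFn func() int

// ConstantIntFn builds a getter that always yields v, with no config lookup.
func ConstantIntFn(v int) IntPropertyFn {
	return func() int { return v }
}

// A constant getter takes the place of the dropped internal-frontend key.
var internalFrontendReplicationInducingRPS = ConstantIntFn(0)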

// FrontendMaxNamespaceNamespaceReplicationInducingAPIsRPSPerInstance is a per host/per namespace RPS limit for
// namespace replication inducing APIs (e.g. UpdateNamespace, UpdateWorkerBuildIdCompatibility).
// This config is EXPERIMENTAL and may be changed or removed in a later release.
FrontendMaxNamespaceNamespaceReplicationInducingAPIsRPSPerInstance = "frontend.namespaceRPS.namespaceReplicationInducingAPIs"
Member

I think the overall plan is to deprecate the non-global ones and keep only the global ones, since they're easier to configure (you don't have to think about how many instances you have, and the rate is divided automatically as the cluster scales). So for new keys, we could just add a global one.

Member Author

Sure, I thought the global one was still experimental.
Let's leave this one for now and remove it once global becomes the default.

Member

Or we could do it now and have less work later? There's no backwards compatibility to worry about.
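
(Illustrative sketch of the operational win: the operator sets a single cluster-wide RPS and each frontend derives its local share from cluster membership. The division scheme below is an assumption, not a quote of the implementation.)

package main

import "fmt"

// perInstanceRPS shares a cluster-wide limit across frontend instances.
func perInstanceRPS(globalRPS float64, frontendInstances int) float64 {
	if frontendInstances < 1 {
		frontendInstances = 1 // membership can briefly report zero instances
	}
	return globalRPS / float64(frontendInstances)
}

func main() {
	// Scaling from 3 to 6 frontends halves each instance's share; the
	// operator never retunes the config.
	fmt.Println(perInstanceRPS(1200, 3)) // 400
	fmt.Println(perInstanceRPS(1200, 6)) // 200
}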

// FrontendNamespaceReplicationInducingAPIsRPS limits the per second request rate for namespace replication inducing
// APIs (e.g. UpdateNamespace, UpdateWorkerBuildIdCompatibility).
// This config is EXPERIMENTAL and may be changed or removed in a later release.
FrontendNamespaceReplicationInducingAPIsRPS = "frontend.rps.namespaceReplicationInducingAPIs"
Member

This seems not that useful. The point is that the replication queue is a global resource, so we really want one "global" total limit (automatically divided across frontends) and one global per-namespace limit that's set lower (to prevent any one namespace from hogging the entire global limit).

Member Author

Agreed, I'll remove this.

Member

But now there's no overall limit, only per-namespace limits. We need both if we want this to be effective.

(The naming is confusing: in this context "global" means "across all frontends", not "across all namespaces". We want an "overall global" limit, i.e. one that applies across all namespaces and across all frontends.)
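
(A minimal sketch of the two layers being asked for, with hypothetical types and rates: a call must clear both its namespace's bucket and the shared overall bucket, so no single namespace can hog the queue and the combined rate stays capped.)

package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

// layeredLimiter sketches the "both limits" idea: overall is shared across
// all namespaces and all frontends; each perNamespace entry is set lower.
type layeredLimiter struct {
	overall      *rate.Limiter
	perNamespace map[string]*rate.Limiter
}

func (l *layeredLimiter) Allow(namespace string) bool {
	ns, ok := l.perNamespace[namespace]
	if !ok {
		return false // real code would create per-namespace limiters lazily
	}
	// Caveat: if the overall bucket rejects, the per-namespace token is
	// already spent; a production limiter would reserve and cancel instead.
	return ns.Allow() && l.overall.Allow()
}

func main() {
	l := &layeredLimiter{
		overall: rate.NewLimiter(rate.Limit(10), 10),
		perNamespace: map[string]*rate.Limiter{
			"ns1": rate.NewLimiter(rate.Limit(2), 2),
		},
	}
	fmt.Println(l.Allow("ns1")) // true while both buckets have tokens
}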

// The replication queue is used to propagate critical failover messages and this mapping prevents flooding the
// queue and delaying failover.
NamespaceReplicationInducingAPIToPriority = map[string]int{
"UpdateNamespace": 0,
Member

What about RegisterNamespace?

Member Author

We don't care about that one; it's called once per namespace, so it can't flood the per-namespace replication queue.

Member

? There is no per-namespace queue; there's only one queue.

Member Author

Yes, I was wrong.

@bergundy bergundy force-pushed the namespace-replication-inducing-rate-limiter branch from 00aa856 to 5658607 on June 8, 2023 18:59
Contributor

@wxing1292 wxing1292 left a comment

If the intention is to limit the call rate of, e.g., UpdateNamespace, why not simply rate limit RegisterNamespace / UpdateNamespace to a rate/burst of 0.00166666666 (one request per ~10 minutes) and 1?

This PR seems to give special handling to the workflow versioning story along with the namespace APIs.
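
(For concreteness, those numbers in golang.org/x/time/rate terms, purely illustrative: 0.00166666666 tokens per second is one call roughly every 600 seconds, and burst 1 means no unused capacity is banked.)

package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// 1 / 0.00166666666 ≈ 600, i.e. one token every ~10 minutes, with at
	// most one token stored.
	l := rate.NewLimiter(rate.Limit(0.00166666666), 1)
	fmt.Println(l.Allow()) // true: consumes the single burst token
	fmt.Println(l.Allow()) // false: next token is ~10 minutes away
}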

@yux0
Contributor

yux0 commented Jun 9, 2023

Have you considered using a separate queue?

@bergundy
Member Author

bergundy commented Jun 9, 2023

Have you considered using a separate queue?

Many times, but we decided not to for the time being.
I believe we'll end up there eventually.

@yux0
Contributor

yux0 commented Jun 9, 2023

Have you considered using a separate queue?

Many times, but we decided not to for the time being. I believe we'll end up there eventually.

If that's the case, why not do it now? Then we wouldn't need to worry about the namespace replication queue being flooded.

Comment on lines +459 to +460
// guessedSetId := hashBuildId(buildId)
// return guessedSetId, nil
Member

Is this intentional?

@bergundy bergundy merged commit d44aa1e into temporalio:master Jun 10, 2023
@bergundy bergundy deleted the namespace-replication-inducing-rate-limiter branch June 10, 2023 03:36
Closes issue: Add rate limiting to UpdateWorkerBuildIdCompatibility