-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add balanced scoring middleware to improve client-side load-balancing based on server responses #208
Conversation
Generate changelog in
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 8 files at r1, 5 of 5 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @advayakrishna)
changelog/@unreleased/pr-208.v2.yml, line 4 at r3 (raw file):
improvement: description: Add balanced scoring middleware to improve client-side load-balancing based on server responses
Can you add a bit more about what to expect from the behavior change?
conjure-go-client/httpclient/client.go, line 58 at r3 (raw file):
metricsMiddleware Middleware uris []string
is this field still used or does the new middleware encapsulate it?
conjure-go-client/httpclient/client_builder.go, line 113 at r3 (raw file):
time.Now().UnixNano
Isn't this calling Now()
only once and just providing the function to convert that (static) time to nanos? Should this be func() int64 { return time.Now().UnixNano() }
?
conjure-go-client/httpclient/internal/balanced_scorer.go, line 32 at r3 (raw file):
) type BalancedURIScoringMiddleware interface {
nit: Should we remove "Balanced" from the interface name to indicate that the API is not tied to how it calculates the order?
conjure-go-client/httpclient/internal/balanced_scorer.go, line 37 at r3 (raw file):
} var _ BalancedURIScoringMiddleware = (*balancedScorer)(nil)
not necessary, asserted by the constructor
conjure-go-client/httpclient/internal/balanced_scorer.go, line 48 at r3 (raw file):
} func NewBalancedURIScoringMiddleware(uris []string, nanoClock func() int64) BalancedURIScoringMiddleware {
Can you add a comment indicating what this is doing and a link to the Java implementation that it copies? Let's also specify a bit about the algorithm: 5xx is 100 times as bad as 4xx, how the summing of inflight and errors works, etc
conjure-go-client/httpclient/internal/balanced_scorer.go, line 86 at r3 (raw file):
resp, err := next.RoundTrip(req) if resp == nil || err != nil { return nil, err
If we get an error here (e.g. connection refused
), should we record that somewhere in the score?
conjure-go-client/httpclient/internal/balanced_scorer_test.go, line 33 at r3 (raw file):
} func TestBalancedScoring(t *testing.T) {
maybe add an unstarted server to test connection refused
conjure-go-client/httpclient/internal/course_exponential_decay_reservoir_test.go, line 29 at r3 (raw file):
} r := NewCourseExponentialDecayReservoir(clock, 10) assert.InDelta(t, r.Get(), 0.0, 0.001)
nit: for all these assertions, the arg order should be expected, actual
which affects the failure messages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @advayakrishna)
a discussion (no related file):
Small performance regression (<5%) when client is used with a single URI
Should we check the length and short-circuit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 3 of 10 files reviewed, 10 unresolved discussions (waiting on @bmoylan)
a discussion (no related file):
Previously, bmoylan (Brad Moylan) wrote…
Small performance regression (<5%) when client is used with a single URI
Should we check the length and short-circuit?
I added logic to do this, it improved things a little bit
changelog/@unreleased/pr-208.v2.yml, line 4 at r3 (raw file):
Previously, bmoylan (Brad Moylan) wrote…
Can you add a bit more about what to expect from the behavior change?
Yeah will do
conjure-go-client/httpclient/client.go, line 58 at r3 (raw file):
Previously, bmoylan (Brad Moylan) wrote…
is this field still used or does the new middleware encapsulate it?
No, can be removed
conjure-go-client/httpclient/client_builder.go, line 113 at r3 (raw file):
Previously, bmoylan (Brad Moylan) wrote…
time.Now().UnixNano
Isn't this calling
Now()
only once and just providing the function to convert that (static) time to nanos? Should this befunc() int64 { return time.Now().UnixNano() }
?
Yeah you're right will fix
conjure-go-client/httpclient/internal/balanced_scorer.go, line 32 at r3 (raw file):
Previously, bmoylan (Brad Moylan) wrote…
nit: Should we remove "Balanced" from the interface name to indicate that the API is not tied to how it calculates the order?
Yeah
conjure-go-client/httpclient/internal/balanced_scorer.go, line 37 at r3 (raw file):
Previously, bmoylan (Brad Moylan) wrote…
not necessary, asserted by the constructor
K
conjure-go-client/httpclient/internal/balanced_scorer.go, line 48 at r3 (raw file):
Previously, bmoylan (Brad Moylan) wrote…
Can you add a comment indicating what this is doing and a link to the Java implementation that it copies? Let's also specify a bit about the algorithm: 5xx is 100 times as bad as 4xx, how the summing of inflight and errors works, etc
Yeah
conjure-go-client/httpclient/internal/balanced_scorer.go, line 86 at r3 (raw file):
Previously, bmoylan (Brad Moylan) wrote…
If we get an error here (e.g.
connection refused
), should we record that somewhere in the score?
Yeah
conjure-go-client/httpclient/internal/balanced_scorer_test.go, line 33 at r3 (raw file):
Previously, bmoylan (Brad Moylan) wrote…
maybe add an unstarted server to test connection refused
Added
conjure-go-client/httpclient/internal/course_exponential_decay_reservoir_test.go, line 29 at r3 (raw file):
Previously, bmoylan (Brad Moylan) wrote…
nit: for all these assertions, the arg order should be
expected, actual
which affects the failure messages
Swapped
conjure-go-client/httpclient/client.go, line 164 at r4 (raw file):
maybe a comment over this saying it has to come before decoding errors so we can access the raw status code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 7 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @advayakrishna)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few small comments, but otherwise looks great. Thanks for contributing this feature!
I do have some hesitation about this breaking our internal transactional client, which contains similar middleware but pins to hosts until there is an error in order to guarantee all requests route to a single host for each transaction. Dialogue exposes a way to change this routing scheme which we may want to do. See the StickyEndpointChannels from dialogue which wraps the balanced score tracker to provide this guarantee.
This change makes it apparent that we need to update the request retrier, since that also contains code to detect failures similar to this PR. There is some coupling between the retrier and this change, but from reading through, I don't believe there are any conflicts as the uris
provided to the retrier are static. However I still think there is some work left to consolidate both of these paths.
scores[uri] = info.computeScore() | ||
} | ||
// Pre-shuffle to avoid overloading first URI when no request are in-flight | ||
rand.Shuffle(len(uris), func(i, j int) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually required given the URIs are retrieved by iterating over the uriInfos
map where the order is not guaranteed? I suppose it can't hurt to keep, but one thing to note is that this rand.Shuffle
uses the global rand source, so unless the clients application seeds the global source then the values returned will always be the same for the same input. We don't see this given the unordered nature of a map, but if we decide to keep this, it might be best to do one of the following:
- take a the source as part of the constructor
- create a new random generator with a custom seed (based on current time?)
User: u.User, | ||
Host: u.Host, | ||
} | ||
return uCopy.String() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: Not needed now, but we could create a reverse index of URL to URI rather than re-construct the URL string each time. There may be edge cases I am not thinking about, but just noting for future reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The case I couldn't figure out how to handle is when the path is updated as a RequestParam
, then the URL would not be the same as the base URI provided
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe all you care about is the desired host you're connecting to and thus you could leave off the path (maybe a wild assumption). You could pre-parse the baseURI provided and store all elements that you grab from the request URL which can be used as the key for the reverse index.
Something like this
type urlToUri map[url.URL]string
func New(baseURIs []string) {
// ...
for _, uri := range baseURIs {
parsedURL, _ := url.Parse(uri)
urlToUri[url.URL{
Scheme: parsedURL.Scheme,
Opaque: parsedURL.Opaque,
User: parsedURL.User,
Host: parsedURL.Host,
}] = uri
}
// ...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just re-iterating that we don't need to do this now as it's just an optimization, but could be something we do down the road
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah makes sense
} | ||
|
||
func isGlobalQosStatus(statusCode int) bool { | ||
return statusCode == StatusCodeRetryOther || statusCode == StatusCodeUnavailable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also add the check for !tooManyRequests
(i.e. 429)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dialogue code is effectively (308 || 429 || 503) && !429
which I reduced to 308 || 503
so this should be equivalent.
lastDecay int64 | ||
nanoClock func() int64 | ||
decayIntervalNanoseconds int64 | ||
mu sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you re-order this struct so the mutex sits on top of the value it's protecting, which appears to just be value
(docs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Before this PR
CGR client does not account for server responses when load-balancing across multiple URIs and just randomizes order of URIs when retrying.
This can lead to undesirably retrying a URI many times even when the server is unavailable, such as during a node restart.
After this PR
Add middleware based on Dialogue's 'balanced' load balancing strategy that uses an exponentially decaying reservoir to track recent response errors and prioritize URIs with fewer in-flight requests and recent failures.
Fixes #194
Benchmark results:
==COMMIT_MSG==
Add balanced scoring middleware to improve client-side load-balancing based on server responses
==COMMIT_MSG==
Possible downsides?
Small performance regression (<5%) when client is used with a single URI
This change is