Add balanced scoring middleware to improve client-side load-balancing based on server responses #208

Merged
merged 11 commits into from
Sep 24, 2021

Conversation

advayakrishna
Contributor

@advayakrishna advayakrishna commented Sep 21, 2021

Before this PR

The CGR client does not account for server responses when load-balancing across multiple URIs; it just randomizes the order of URIs when retrying.

This can lead to retrying a URI many times even when its server is unavailable, such as during a node restart.

After this PR

Add middleware based on Dialogue's 'balanced' load balancing strategy that uses an exponentially decaying reservoir to track recent response errors and prioritize URIs with fewer in-flight requests and recent failures.
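
To illustrate the idea of an exponentially decaying reservoir, here is a minimal sketch and not the code in this PR. It assumes a single goroutine, an injected nanosecond clock, and a half-life parameter; the names `decayingCounter`, `update`, and `get` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// decayingCounter is a hypothetical, single-goroutine stand-in for an
// exponentially decaying reservoir of recent failures.
type decayingCounter struct {
	value     float64
	lastNanos int64
	halfLife  int64 // nanoseconds for the stored value to halve (assumed parameter)
	nanoClock func() int64
}

func newDecayingCounter(nanoClock func() int64, halfLifeNanos int64) *decayingCounter {
	return &decayingCounter{lastNanos: nanoClock(), halfLife: halfLifeNanos, nanoClock: nanoClock}
}

// decay scales value by 0.5^(elapsed/halfLife) for the time since the last decay.
func (c *decayingCounter) decay() {
	now := c.nanoClock()
	elapsed := now - c.lastNanos
	if elapsed <= 0 {
		return
	}
	c.value *= math.Pow(0.5, float64(elapsed)/float64(c.halfLife))
	c.lastNanos = now
}

// update decays the current value, then adds a new failure weight.
func (c *decayingCounter) update(weight float64) {
	c.decay()
	c.value += weight
}

// get returns the decayed value as of the current clock reading.
func (c *decayingCounter) get() float64 {
	c.decay()
	return c.value
}

func main() {
	var now int64
	c := newDecayingCounter(func() int64 { return now }, 1000)
	c.update(8)
	now = 2000           // two half-lives later
	fmt.Println(c.get()) // prints 2
}
```

Old failures fade out on their own, so a URI that recovers is gradually preferred again without any explicit reset.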

Fixes #194

Benchmark results:

benchmark                                                                  old ns/op     new ns/op     delta
BenchmarkAllocWithBytesBufferPool/NoByteBufferPool/count=1-16              93798         98068         +4.55%
BenchmarkAllocWithBytesBufferPool/NoByteBufferPool/count=10-16             931710        991491        +6.42%
BenchmarkAllocWithBytesBufferPool/NoByteBufferPool/count=100-16            9291976       9621014       +3.54%
BenchmarkAllocWithBytesBufferPool/WithByteBufferPool/count=1-16            91936         96066         +4.49%
BenchmarkAllocWithBytesBufferPool/WithByteBufferPool/count=10-16           903859        977292        +8.12%
BenchmarkAllocWithBytesBufferPool/WithByteBufferPool/count=100-16          9042090       9767101       +8.02%
BenchmarkUnavailableURIs/OneAvailableServer/count=10-16                    949802        963606        +1.45%
BenchmarkUnavailableURIs/OneAvailableServer/count=100-16                   9191157       9566737       +4.09%
BenchmarkUnavailableURIs/OneAvailableServer/count=1000-16                  91739632      96296739      +4.97%
BenchmarkUnavailableURIs/FourAvailableServers/count=10-16                  943496        1009042       +6.95%
BenchmarkUnavailableURIs/FourAvailableServers/count=100-16                 9426424       9882186       +4.83%
BenchmarkUnavailableURIs/FourAvailableServers/count=1000-16                93641142      100149462     +6.95%
BenchmarkUnavailableURIs/OneOutOfFourUnavailableServers/count=10-16        1224338       1003200       -18.06%
BenchmarkUnavailableURIs/OneOutOfFourUnavailableServers/count=100-16       12311181      10127655      -17.74%
BenchmarkUnavailableURIs/OneOutOfFourUnavailableServers/count=1000-16      122187011     102564133     -16.06%
BenchmarkUnavailableURIs/OneOutOfThreeUnavailableServers/count=10-16       1310920       1015474       -22.54%
BenchmarkUnavailableURIs/OneOutOfThreeUnavailableServers/count=100-16      13159841      10126297      -23.05%
BenchmarkUnavailableURIs/OneOutOfThreeUnavailableServers/count=1000-16     130296842     100600932     -22.79%
BenchmarkUnavailableURIs/OneOutOfTwoUnavailableServers/count=10-16         1474337       993779        -32.59%
BenchmarkUnavailableURIs/OneOutOfTwoUnavailableServers/count=100-16        14900138      9999934       -32.89%
BenchmarkUnavailableURIs/OneOutOfTwoUnavailableServers/count=1000-16       147037762     100873436     -31.40%

benchmark                                                                  old allocs     new allocs     delta
BenchmarkAllocWithBytesBufferPool/NoByteBufferPool/count=1-16              105            105            +0.00%
BenchmarkAllocWithBytesBufferPool/NoByteBufferPool/count=10-16             1051           1051           +0.00%
BenchmarkAllocWithBytesBufferPool/NoByteBufferPool/count=100-16            10519          10517          -0.02%
BenchmarkAllocWithBytesBufferPool/WithByteBufferPool/count=1-16            104            104            +0.00%
BenchmarkAllocWithBytesBufferPool/WithByteBufferPool/count=10-16           1041           1041           +0.00%
BenchmarkAllocWithBytesBufferPool/WithByteBufferPool/count=100-16          10416          10416          +0.00%
BenchmarkUnavailableURIs/OneAvailableServer/count=10-16                    1051           1051           +0.00%
BenchmarkUnavailableURIs/OneAvailableServer/count=100-16                   10514          10516          +0.02%
BenchmarkUnavailableURIs/OneAvailableServer/count=1000-16                  105175         105167         -0.01%
BenchmarkUnavailableURIs/FourAvailableServers/count=10-16                  1051           1121           +6.66%
BenchmarkUnavailableURIs/FourAvailableServers/count=100-16                 10517          11215          +6.64%
BenchmarkUnavailableURIs/FourAvailableServers/count=1000-16                105194         112162         +6.62%
BenchmarkUnavailableURIs/OneOutOfFourUnavailableServers/count=10-16        1475           1121           -24.00%
BenchmarkUnavailableURIs/OneOutOfFourUnavailableServers/count=100-16       14868          11218          -24.55%
BenchmarkUnavailableURIs/OneOutOfFourUnavailableServers/count=1000-16      146090         112173         -23.22%
BenchmarkUnavailableURIs/OneOutOfThreeUnavailableServers/count=10-16       1611           1121           -30.42%
BenchmarkUnavailableURIs/OneOutOfThreeUnavailableServers/count=100-16      16192          11216          -30.73%
BenchmarkUnavailableURIs/OneOutOfThreeUnavailableServers/count=1000-16     160540         112181         -30.12%
BenchmarkUnavailableURIs/OneOutOfTwoUnavailableServers/count=10-16         1877           1121           -40.28%
BenchmarkUnavailableURIs/OneOutOfTwoUnavailableServers/count=100-16        18845          11217          -40.48%
BenchmarkUnavailableURIs/OneOutOfTwoUnavailableServers/count=1000-16       188889         112191         -40.60%

benchmark                                                                  old bytes     new bytes     delta
BenchmarkAllocWithBytesBufferPool/NoByteBufferPool/count=1-16              7990          8046          +0.70%
BenchmarkAllocWithBytesBufferPool/NoByteBufferPool/count=10-16             79878         80515         +0.80%
BenchmarkAllocWithBytesBufferPool/NoByteBufferPool/count=100-16            803411        806921        +0.44%
BenchmarkAllocWithBytesBufferPool/WithByteBufferPool/count=1-16            7898          7950          +0.66%
BenchmarkAllocWithBytesBufferPool/WithByteBufferPool/count=10-16           78979         79677         +0.88%
BenchmarkAllocWithBytesBufferPool/WithByteBufferPool/count=100-16          791049        795755        +0.59%
BenchmarkUnavailableURIs/OneAvailableServer/count=10-16                    79910         80433         +0.65%
BenchmarkUnavailableURIs/OneAvailableServer/count=100-16                   799244        806119        +0.86%
BenchmarkUnavailableURIs/OneAvailableServer/count=1000-16                  7998631       8057533       +0.74%
BenchmarkUnavailableURIs/FourAvailableServers/count=10-16                  80503         82500         +2.48%
BenchmarkUnavailableURIs/FourAvailableServers/count=100-16                 805206        824364        +2.38%
BenchmarkUnavailableURIs/FourAvailableServers/count=1000-16                8069882       8258605       +2.34%
BenchmarkUnavailableURIs/OneOutOfFourUnavailableServers/count=10-16        117464        82460         -29.80%
BenchmarkUnavailableURIs/OneOutOfFourUnavailableServers/count=100-16       1183451       825758        -30.22%
BenchmarkUnavailableURIs/OneOutOfFourUnavailableServers/count=1000-16      11607855      8265169       -28.80%
BenchmarkUnavailableURIs/OneOutOfThreeUnavailableServers/count=10-16       129511        82496         -36.30%
BenchmarkUnavailableURIs/OneOutOfThreeUnavailableServers/count=100-16      1299289       823909        -36.59%
BenchmarkUnavailableURIs/OneOutOfThreeUnavailableServers/count=1000-16     12871795      8266066       -35.78%
BenchmarkUnavailableURIs/OneOutOfTwoUnavailableServers/count=10-16         152420        82647         -45.78%
BenchmarkUnavailableURIs/OneOutOfTwoUnavailableServers/count=100-16        1529061       823380        -46.15%
BenchmarkUnavailableURIs/OneOutOfTwoUnavailableServers/count=1000-16       15334106      8261072       -46.13%

==COMMIT_MSG==
Add balanced scoring middleware to improve client-side load-balancing based on server responses
==COMMIT_MSG==

Possible downsides?

Small performance regression (<5%) when client is used with a single URI


@changelog-app

changelog-app bot commented Sep 21, 2021

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Add balanced scoring middleware to improve client-side load-balancing based on server responses

Check the box to generate changelog(s)

  • Generate changelog entry

Contributor

@bmoylan bmoylan left a comment

Reviewed 3 of 8 files at r1, 5 of 5 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @advayakrishna)


changelog/@unreleased/pr-208.v2.yml, line 4 at r3 (raw file):

improvement:
  description: Add balanced scoring middleware to improve client-side load-balancing
    based on server responses

Can you add a bit more about what to expect from the behavior change?


conjure-go-client/httpclient/client.go, line 58 at r3 (raw file):

	metricsMiddleware      Middleware

	uris                          []string

is this field still used or does the new middleware encapsulate it?


conjure-go-client/httpclient/client_builder.go, line 113 at r3 (raw file):

time.Now().UnixNano

Isn't this calling Now() only once and just providing the function to convert that (static) time to nanos? Should this be func() int64 { return time.Now().UnixNano() }?
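
The concern follows from how Go evaluates method values: in `time.Now().UnixNano`, the receiver expression `time.Now()` is evaluated once, when the method value is created. A small sketch of the difference, using the hypothetical helper names `frozenClock` and `liveClock`:

```go
package main

import (
	"fmt"
	"time"
)

// frozenClock returns a method value: time.Now() is evaluated once, here,
// and the returned function keeps reporting that same instant.
func frozenClock() func() int64 {
	return time.Now().UnixNano
}

// liveClock returns a closure that calls time.Now() on every invocation.
func liveClock() func() int64 {
	return func() int64 { return time.Now().UnixNano() }
}

func main() {
	frozen, live := frozenClock(), liveClock()
	f1, l1 := frozen(), live()
	time.Sleep(10 * time.Millisecond)
	f2, l2 := frozen(), live()
	fmt.Println("frozen changed:", f2 != f1) // false: the receiver was bound once
	fmt.Println("live changed:", l2 != l1)
}
```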


conjure-go-client/httpclient/internal/balanced_scorer.go, line 32 at r3 (raw file):

)

type BalancedURIScoringMiddleware interface {

nit: Should we remove "Balanced" from the interface name to indicate that the API is not tied to how it calculates the order?


conjure-go-client/httpclient/internal/balanced_scorer.go, line 37 at r3 (raw file):

}

var _ BalancedURIScoringMiddleware = (*balancedScorer)(nil)

not necessary, asserted by the constructor


conjure-go-client/httpclient/internal/balanced_scorer.go, line 48 at r3 (raw file):

}

func NewBalancedURIScoringMiddleware(uris []string, nanoClock func() int64) BalancedURIScoringMiddleware {

Can you add a comment indicating what this is doing and a link to the Java implementation that it copies? Let's also specify a bit about the algorithm: 5xx is 100 times as bad as 4xx, how the summing of inflight and errors works, etc
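
A rough sketch of the kind of scoring such a comment would describe, under stated assumptions: each URI's score is its in-flight request count plus its rounded recent-failure total, a 5xx adds a full failure weight while a 4xx adds one hundredth of it, and lower scores are tried first. The names `uriInfo`, `recordStatus`, and `orderByScore` and the weight value 10 are hypothetical, and decay of `recentFailures` over time is omitted.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// failureWeight is an assumed penalty for a server-side (5xx) failure.
const failureWeight = 10.0

// uriInfo holds the per-URI state that feeds the score.
type uriInfo struct {
	inflight       int
	recentFailures float64 // would decay over time in a real reservoir
}

// recordStatus adds a penalty: a 5xx counts 100 times as much as a 4xx.
func (u *uriInfo) recordStatus(statusCode int) {
	switch {
	case statusCode >= 500:
		u.recentFailures += failureWeight
	case statusCode >= 400:
		u.recentFailures += failureWeight / 100
	}
}

// computeScore sums in-flight requests and rounded recent failures.
func (u *uriInfo) computeScore() int {
	return u.inflight + int(math.Round(u.recentFailures))
}

// orderByScore returns URIs sorted from lowest (best) to highest score.
func orderByScore(infos map[string]*uriInfo) []string {
	uris := make([]string, 0, len(infos))
	for uri := range infos {
		uris = append(uris, uri)
	}
	sort.SliceStable(uris, func(i, j int) bool {
		return infos[uris[i]].computeScore() < infos[uris[j]].computeScore()
	})
	return uris
}

func main() {
	infos := map[string]*uriInfo{
		"https://a.example": {inflight: 2},
		"https://b.example": {},
		"https://c.example": {},
	}
	infos["https://b.example"].recordStatus(503)
	infos["https://c.example"].recordStatus(404)
	// Scores are a=2, b=10, c=0.
	fmt.Println(orderByScore(infos)) // [https://c.example https://a.example https://b.example]
}
```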


conjure-go-client/httpclient/internal/balanced_scorer.go, line 86 at r3 (raw file):

	resp, err := next.RoundTrip(req)
	if resp == nil || err != nil {
		return nil, err

If we get an error here (e.g. connection refused), should we record that somewhere in the score?


conjure-go-client/httpclient/internal/balanced_scorer_test.go, line 33 at r3 (raw file):

}

func TestBalancedScoring(t *testing.T) {

maybe add an unstarted server to test connection refused


conjure-go-client/httpclient/internal/course_exponential_decay_reservoir_test.go, line 29 at r3 (raw file):

	}
	r := NewCourseExponentialDecayReservoir(clock, 10)
	assert.InDelta(t, r.Get(), 0.0, 0.001)

nit: for all these assertions, the arg order should be expected, actual which affects the failure messages

Contributor

@bmoylan bmoylan left a comment

Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @advayakrishna)

a discussion (no related file):

Small performance regression (<5%) when client is used with a single URI

Should we check the length and short-circuit?


Contributor Author

@advayakrishna advayakrishna left a comment

Reviewable status: 3 of 10 files reviewed, 10 unresolved discussions (waiting on @bmoylan)

a discussion (no related file):

Previously, bmoylan (Brad Moylan) wrote…

Small performance regression (<5%) when client is used with a single URI

Should we check the length and short-circuit?

I added logic to do this; it improved things a little bit.



changelog/@unreleased/pr-208.v2.yml, line 4 at r3 (raw file):

Previously, bmoylan (Brad Moylan) wrote…

Can you add a bit more about what to expect from the behavior change?

Yeah will do


conjure-go-client/httpclient/client.go, line 58 at r3 (raw file):

Previously, bmoylan (Brad Moylan) wrote…

is this field still used or does the new middleware encapsulate it?

No, can be removed


conjure-go-client/httpclient/client_builder.go, line 113 at r3 (raw file):

Previously, bmoylan (Brad Moylan) wrote…
time.Now().UnixNano

Isn't this calling Now() only once and just providing the function to convert that (static) time to nanos? Should this be func() int64 { return time.Now().UnixNano() }?

Yeah you're right will fix


conjure-go-client/httpclient/internal/balanced_scorer.go, line 32 at r3 (raw file):

Previously, bmoylan (Brad Moylan) wrote…

nit: Should we remove "Balanced" from the interface name to indicate that the API is not tied to how it calculates the order?

Yeah


conjure-go-client/httpclient/internal/balanced_scorer.go, line 37 at r3 (raw file):

Previously, bmoylan (Brad Moylan) wrote…

not necessary, asserted by the constructor

K


conjure-go-client/httpclient/internal/balanced_scorer.go, line 48 at r3 (raw file):

Previously, bmoylan (Brad Moylan) wrote…

Can you add a comment indicating what this is doing and a link to the Java implementation that it copies? Let's also specify a bit about the algorithm: 5xx is 100 times as bad as 4xx, how the summing of inflight and errors works, etc

Yeah


conjure-go-client/httpclient/internal/balanced_scorer.go, line 86 at r3 (raw file):

Previously, bmoylan (Brad Moylan) wrote…

If we get an error here (e.g. connection refused), should we record that somewhere in the score?

Yeah


conjure-go-client/httpclient/internal/balanced_scorer_test.go, line 33 at r3 (raw file):

Previously, bmoylan (Brad Moylan) wrote…

maybe add an unstarted server to test connection refused

Added


conjure-go-client/httpclient/internal/course_exponential_decay_reservoir_test.go, line 29 at r3 (raw file):

Previously, bmoylan (Brad Moylan) wrote…

nit: for all these assertions, the arg order should be expected, actual which affects the failure messages

Swapped

@bmoylan
Contributor

bmoylan commented Sep 24, 2021


conjure-go-client/httpclient/client.go, line 164 at r4 (raw file):

		// the raw response.
		c.metricsMiddleware,
		c.uriScorer,

maybe a comment over this saying it has to come before decoding errors so we can access the raw status code

@bmoylan
Contributor

bmoylan commented Sep 24, 2021

Looks good to me! Not going to approve yet so @nmiyake and/or @tabboud have a chance to look before merging.

Contributor

@bmoylan bmoylan left a comment

Reviewed 7 of 7 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @advayakrishna)

Contributor

@tabboud tabboud left a comment

Left a few small comments, but otherwise looks great. Thanks for contributing this feature!

I do have some hesitation about this breaking our internal transactional client, which contains similar middleware but pins to hosts until there is an error in order to guarantee all requests route to a single host for each transaction. Dialogue exposes a way to change this routing scheme which we may want to do. See the StickyEndpointChannels from dialogue which wraps the balanced score tracker to provide this guarantee.

This change makes it apparent that we need to update the request retrier, since that also contains code to detect failures similar to this PR. There is some coupling between the retrier and this change, but from reading through, I don't believe there are any conflicts as the uris provided to the retrier are static. However I still think there is some work left to consolidate both of these paths.

scores[uri] = info.computeScore()
}
// Pre-shuffle to avoid overloading first URI when no request are in-flight
rand.Shuffle(len(uris), func(i, j int) {
Contributor

Is this actually required, given that the URIs are retrieved by iterating over the uriInfos map, where the order is not guaranteed? I suppose it can't hurt to keep, but note that this rand.Shuffle uses the global rand source, so unless the client's application seeds the global source, the values returned will always be the same for the same input. We don't see this given the unordered nature of a map, but if we decide to keep this, it might be best to do one of the following:

  • take the source as part of the constructor
  • create a new random generator with a custom seed (based on current time?)
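
One way the second option could look, sketched under assumptions: the scorer owns a `*rand.Rand` built from a caller-supplied seed and guards it with a mutex, since a generator created with `rand.New` is not safe for concurrent use. The `shuffler` type and its methods are hypothetical names, not part of this PR.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// shuffler owns a private random source instead of relying on the global one.
type shuffler struct {
	mu  sync.Mutex // guards rng
	rng *rand.Rand
}

// newShuffler builds a shuffler from an explicit seed, so production code
// can pass a time-based seed and tests can pass a fixed one.
func newShuffler(seed int64) *shuffler {
	return &shuffler{rng: rand.New(rand.NewSource(seed))}
}

// shuffled returns a shuffled copy of uris, leaving the input untouched.
func (s *shuffler) shuffled(uris []string) []string {
	out := append([]string(nil), uris...)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rng.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}

func main() {
	s := newShuffler(time.Now().UnixNano())
	fmt.Println(s.shuffled([]string{"https://a.example", "https://b.example", "https://c.example"}))
}
```

Accepting the seed (or the source) in the constructor keeps tests deterministic while letting callers choose how randomness is seeded.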

User: u.User,
Host: u.Host,
}
return uCopy.String()
Contributor

note: Not needed now, but we could create a reverse index of URL to URI rather than re-construct the URL string each time. There may be edge cases I am not thinking about, but just noting for future reference.

Contributor Author

The case I couldn't figure out how to handle is when the path is updated via a RequestParam: the URL would then no longer match the base URI provided.

Contributor

I believe all you care about is the desired host you're connecting to and thus you could leave off the path (maybe a wild assumption). You could pre-parse the baseURI provided and store all elements that you grab from the request URL which can be used as the key for the reverse index.
Something like this

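import "net/url"

// urlToURI maps a normalized base URL (scheme, opaque, user, host) to the
// base URI string it was parsed from.
var urlToURI = map[string]string{}

func New(baseURIs []string) {
	// ...
	for _, uri := range baseURIs {
		parsedURL, err := url.Parse(uri)
		if err != nil {
			continue // skip base URIs that do not parse
		}
		key := url.URL{
			Scheme: parsedURL.Scheme,
			Opaque: parsedURL.Opaque,
			User:   parsedURL.User,
			Host:   parsedURL.Host,
		}
		// Key on the string form: url.URL holds a *url.Userinfo pointer,
		// so struct keys would compare user info by pointer identity.
		urlToURI[key.String()] = uri
	}
	// ...
}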

Contributor

Just re-iterating that we don't need to do this now as it's just an optimization, but could be something we do down the road

Contributor Author

Yeah makes sense

}

func isGlobalQosStatus(statusCode int) bool {
return statusCode == StatusCodeRetryOther || statusCode == StatusCodeUnavailable
Contributor

Should we also add the check for !tooManyRequests (i.e. 429)?

Contributor Author

Dialogue code is effectively (308 || 429 || 503) && !429 which I reduced to 308 || 503 so this should be equivalent.
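
The reduction can be checked exhaustively. A small sketch, where `dialogueStyle`, `reduced`, and `equivalent` are hypothetical names and the numeric status codes are the ones quoted in the comment above:

```go
package main

import "fmt"

const (
	statusRetryOther      = 308
	statusTooManyRequests = 429
	statusUnavailable     = 503
)

// dialogueStyle mirrors the quoted form: (308 || 429 || 503) && !429.
func dialogueStyle(code int) bool {
	isQos := code == statusRetryOther || code == statusTooManyRequests || code == statusUnavailable
	return isQos && code != statusTooManyRequests
}

// reduced is the simplified form: 308 || 503.
func reduced(code int) bool {
	return code == statusRetryOther || code == statusUnavailable
}

// equivalent reports whether both predicates agree for every code in [100, 599].
func equivalent() bool {
	for code := 100; code <= 599; code++ {
		if dialogueStyle(code) != reduced(code) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(equivalent()) // true
}
```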

lastDecay int64
nanoClock func() int64
decayIntervalNanoseconds int64
mu sync.Mutex
Contributor

Can you re-order this struct so the mutex sits on top of the value it's protecting, which appears to just be value (docs)
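
A sketch of the layout being asked for, assuming (as the comment suggests) that `value` is the only field the mutex guards. The other field names mirror the snippet above; the type name `reservoir` and the `add`/`get` methods are hypothetical and leave out decay entirely.

```go
package main

import (
	"fmt"
	"sync"
)

type reservoir struct {
	lastDecay                int64
	nanoClock                func() int64
	decayIntervalNanoseconds int64

	// mu guards value; placing it directly above the field it protects
	// documents the relationship for readers.
	mu    sync.Mutex
	value float64
}

// add increases value under the lock.
func (r *reservoir) add(delta float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.value += delta
}

// get reads value under the lock.
func (r *reservoir) get() float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.value
}

func main() {
	r := &reservoir{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.add(1)
		}()
	}
	wg.Wait()
	fmt.Println(r.get()) // 100
}
```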

Contributor

@tabboud tabboud left a comment

LGTM

Successfully merging this pull request may close these issues.

Improve http client retry behavior across requests when a URI(s) is unavailable
4 participants