
storage: Fix handling of convergesScore in selection of best candidates #23036

Merged · 1 commit · Feb 26, 2018

Conversation

a-robinson
Contributor

Also add testing of rebalancing in clusters with differently sized
localities as a follow-up to #22412.

Release note (bug fix): Fix the occasional selection of sub-optimal
rebalance targets.


This should be cherry-picked to release-2.0 due to the bug I found in (candidateList).best() while adding the test. I'm pretty shocked that the bug has been there for so long without being hit by any test case, or noticed given how unusual it looks.

The new test fails miserably on release-1.1 (see below), but reliably succeeds now due to #22412.

For fun, the output of running this against release-1.1:

--- FAIL: TestAllocatorRebalanceDifferentLocalitySizes (0.01s)
	allocator_test.go:2685: 1: RebalanceTarget([(n2,s2):? (n3,s3):? (n5,s5):?]) expected s1; got <nil>:
	allocator_test.go:2685: 2: RebalanceTarget([(n1,s1):? (n4,s4):? (n5,s5):?]) expected s3; got <nil>:
	allocator_test.go:2685: 4: RebalanceTarget([(n1,s1):? (n5,s5):? (n6,s6):?]) expected s3; got <nil>:
	allocator_test.go:2685: 5: RebalanceTarget([(n2,s2):? (n5,s5):? (n6,s6):?]) expected s3; got <nil>:
	allocator_test.go:2685: 6: RebalanceTarget([(n3,s3):? (n5,s5):? (n6,s6):?]) expected s1; got <nil>:
	allocator_test.go:2685: 7: RebalanceTarget([(n4,s4):? (n5,s5):? (n6,s6):?]) expected s1; got <nil>:
	allocator_test.go:2752: 3: RebalanceTarget([(n1,s1):? (n3,s3):? (n6,s6):?]) expected s9; got store_id:5 attrs:<> node:<node_id:5 address:<network_field:"" address_field:"" > attrs:<> locality:<tiers:<key:"locale" value:"3" > > ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > > capacity:<capacity:100 available:90 used:0 logical_bytes:10 range_count:10 lease_count:0 writes_per_second:0 bytes_per_replica:<p10:0 p25:0 p50:0 p75:0 p90:0 > writes_per_replica:<p10:0 p25:0 p50:0 p75:0 p90:0 > > : {"Target":"s5, valid:true, constraint:1.00, converges:1, balance:1.00(ranges=1, bytes=0.00, writes=0.00), rangeCount:10, logicalBytes:10 B, writesPerSecond:0.00, details:(diversity=1.00, preferred=0)","Existing":"[\ns6, valid:true, constraint:1.00, converges:1, balance:1.00(ranges=1, bytes=0.00, writes=0.00), rangeCount:20, logicalBytes:20 B, writesPerSecond:0.00, details:(diversity=1.00, preferred=0)\ns1, valid:true, constraint:1.00, converges:0, balance:-1.00(ranges=-1, bytes=0.00, writes=0.00), rangeCount:50, logicalBytes:50 B, writesPerSecond:0.00, details:(diversity=1.00, preferred=0)\ns3, valid:true, constraint:1.00, converges:0, balance:-1.00(ranges=-1, bytes=0.00, writes=0.00), rangeCount:50, logicalBytes:50 B, writesPerSecond:0.00, details:(diversity=1.00, preferred=0)]","RangeBytes":0,"RangeWritesPerSecond":0}
	allocator_test.go:2752: 8: RebalanceTarget([(n5,s5):? (n6,s6):? (n7,s7):?]) expected s9; got <nil>:
	allocator_test.go:2752: 13: RebalanceTarget([(n1,s1):? (n5,s5):? (n10,s10):?]) expected s0; got store_id:9 attrs:<> node:<node_id:9 address:<network_field:"" address_field:"" > attrs:<> locality:<tiers:<key:"locale" value:"4" > > ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > > capacity:<capacity:100 available:70 used:0 logical_bytes:30 range_count:30 lease_count:0 writes_per_second:0 bytes_per_replica:<p10:0 p25:0 p50:0 p75:0 p90:0 > writes_per_replica:<p10:0 p25:0 p50:0 p75:0 p90:0 > > : {"Target":"s9, valid:true, constraint:1.00, converges:1, balance:1.00(ranges=1, bytes=0.00, writes=0.00), rangeCount:30, logicalBytes:30 B, writesPerSecond:0.00, details:(diversity=1.00, preferred=0)","Existing":"[\ns5, valid:true, constraint:1.00, converges:1, balance:1.00(ranges=1, bytes=0.00, writes=0.00), rangeCount:10, logicalBytes:10 B, writesPerSecond:0.00, details:(diversity=1.00, preferred=0)\ns10, valid:true, constraint:1.00, converges:0, balance:-1.00(ranges=-1, bytes=0.00, writes=0.00), rangeCount:40, logicalBytes:40 B, writesPerSecond:0.00, details:(diversity=1.00, preferred=0)\ns1, valid:true, constraint:1.00, converges:0, balance:-1.00(ranges=-1, bytes=0.00, writes=0.00), rangeCount:50, logicalBytes:50 B, writesPerSecond:0.00, details:(diversity=1.00, preferred=0)]","RangeBytes":0,"RangeWritesPerSecond":0}
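
For reference, the fixed prefix check in (candidateList).best() ends up with roughly the following shape. This is a minimal, runnable sketch with stand-in types, not the committed code; the real candidate struct and candidate.less live in pkg/storage/allocator_scorer.go, and the field types here are assumptions for illustration:

```go
package main

import "fmt"

// Stand-ins for the allocator types; field names follow the review
// discussion below, field types are assumptions for illustration.
type candidate struct {
	necessary      bool
	diversityScore float64
	convergesScore int
	rangeCount     int
}

type candidateList []candidate

// best returns the prefix of candidates that tie the top candidate on
// necessary, diversityScore, and convergesScore. It requires the input
// to already be sorted best-first (by candidate.less in the real code).
func (cl candidateList) best() candidateList {
	if len(cl) <= 1 {
		return cl
	}
	for i := 1; i < len(cl); i++ {
		if cl[i].necessary == cl[0].necessary &&
			cl[i].diversityScore == cl[0].diversityScore &&
			cl[i].convergesScore == cl[0].convergesScore {
			continue
		}
		return cl[:i]
	}
	return cl
}

func main() {
	cl := candidateList{
		{necessary: true, diversityScore: 1, convergesScore: 1, rangeCount: 10},
		{necessary: true, diversityScore: 1, convergesScore: 1, rangeCount: 20},
		{necessary: true, diversityScore: 1, convergesScore: 0, rangeCount: 50},
	}
	// The first two candidates tie on all three fields despite different
	// range counts, so best() keeps both; the third drops out on
	// convergesScore.
	fmt.Println(len(cl.best())) // 2
}
```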

@a-robinson requested review from BramGruneir, petermattis and a team on February 24, 2018 at 04:18

@petermattis
Collaborator

Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/storage/allocator_scorer.go, line 312 at r1 (raw file):

		if cl[i].necessary != cl[0].necessary ||
			cl[i].diversityScore < cl[0].diversityScore ||
			cl[i].convergesScore < cl[0].convergesScore {

Could/should we be using candidate.less() here?

Also, I think the previous code was trying to handle the case where cl[i].diversityScore > cl[0].diversityScore, but this code doesn't. Is that a problem?



@a-robinson
Contributor Author

Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/storage/allocator_scorer.go, line 312 at r1 (raw file):

Could/should we be using candidate.less() here?

No. This function requires that its input is already sorted by candidate.less, so there isn't much to be learned by calling candidate.less() here. We could call candidate.compare() with some threshold that its return value must exceed, but that threshold would just be a magic number and would make the code less readable.

The goal of this function is to pick out all the candidates that are the "best", where "best" includes some wiggle room: not just the absolute single best store, but also others that are similarly good yet may have a few more replicas on them. If every rebalance action deterministically chose the absolute single best store, then cluster-wide rebalancing would get bottlenecked on that store (and would have overwhelmed it before snapshot rate limiting was added). Instead, we randomly choose from among the best in candidateList.selectGood() to spread out where snapshots are going.

Also, I think the previous code was trying to handle the case where cl[i].diversityScore > cl[0].diversityScore, but this code doesn't. Is that a problem?

It isn't possible for cl[i].diversityScore > cl[0].diversityScore unless cl[i].necessary == false and cl[0].necessary == true, in which case we would have already returned earlier due to the necessary condition. And before the necessary code was added, it would have been completely impossible for cl[i].diversityScore > cl[0].diversityScore due to the required sorted order of the input.
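
As a rough illustration of that flow, reusing the stand-in types from the sketch above (the real selectGood takes an allocatorRand and its selection logic is more involved; this is a simplified sketch, not the actual implementation):

```go
import "math/rand"

// selectGood narrows the sorted list to the "best" prefix and then
// picks one of those candidates at random, so repeated rebalance
// decisions spread their snapshots across several similarly good
// stores instead of piling onto a single one.
func (cl candidateList) selectGood(r *rand.Rand) *candidate {
	if len(cl) == 0 {
		return nil
	}
	best := cl.best()
	return &best[r.Intn(len(best))]
}
```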



@petermattis
Collaborator

:lgtm:


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/storage/allocator_scorer.go, line 312 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Could/should we be using candidate.less() here?

No. This function requires that its input is already sorted by candidate.less, so there isn't much to be learned by calling candidate.less() here. We could call candidate.compare() with some threshold that its return value must exceed, but that threshold would just be a magic number and would make the code less readable.

The goal of this function is to pick out all the candidates that are the "best", where "best" includes some wiggle room: not just the absolute single best store, but also others that are similarly good yet may have a few more replicas on them. If every rebalance action deterministically chose the absolute single best store, then cluster-wide rebalancing would get bottlenecked on that store (and would have overwhelmed it before snapshot rate limiting was added). Instead, we randomly choose from among the best in candidateList.selectGood() to spread out where snapshots are going.

Also, I think the previous code was trying to handle the case where cl[i].diversityScore > cl[0].diversityScore, but this code doesn't. Is that a problem?

It isn't possible for cl[i].diversityScore > cl[0].diversityScore unless cl[i].necessary == false and cl[0].necessary == true, in which case we would have already returned earlier due to the necessary condition. And before the necessary code was added, it would have been completely impossible for cl[i].diversityScore > cl[0].diversityScore due to the required sorted order of the input.

Ok, I guess the inequality comparisons are confusing to read. Would it be equivalent to structure this loop body as:

if cl[i].necessary == cl[0].necessary &&
  cl[i].diversityScore == cl[0].diversityScore &&
  cl[i].convergesScore == cl[0].convergesScore {
  continue
}
return cl[:i]

This would make it clearer to me that we're trying to find the prefix that is equal on necessary, diversityScore and convergesScore.



@a-robinson
Contributor Author

TFTR! I've also deflaked the two flaky test cases by slightly changing the test structure.


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion.


pkg/storage/allocator_scorer.go, line 312 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ok, I guess the inequality comparisons are confusing to read. Would it be equivalent to structure this loop body as:

if cl[i].necessary == cl[0].necessary &&
  cl[i].diversityScore == cl[0].diversityScore &&
  cl[i].convergesScore == cl[0].convergesScore {
  continue
}
return cl[:i]

This would make it clearer to me that we're trying to find the prefix that is equal on necessary, diversityScore and convergesScore.

Sure, done. Also cleaned up candidateList.worst() similarly.
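
For reference, the analogous worst() restructuring would look roughly like this, again with the stand-in types from the earlier sketch (the real function also special-cases invalid candidates first; this is a sketch, not the committed code):

```go
// worst mirrors best() from the back of the list: walk backward and
// keep the suffix that ties the last candidate on the same three
// fields, i.e. the candidates that are all equally bad.
func (cl candidateList) worst() candidateList {
	if len(cl) <= 1 {
		return cl
	}
	last := len(cl) - 1
	for i := last - 1; i >= 0; i-- {
		if cl[i].necessary == cl[last].necessary &&
			cl[i].diversityScore == cl[last].diversityScore &&
			cl[i].convergesScore == cl[last].convergesScore {
			continue
		}
		return cl[i+1:]
	}
	return cl
}
```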



@petermattis
Collaborator

:lgtm:


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


