changefeedccl: use new bulk oracle for changefeed planning #120077
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andyyang890, @dt, and @jayshrivastava)
pkg/ccl/changefeedccl/changefeed_dist.go
line 360 at r1 (raw file):
var useBulkOracle = settings.RegisterBoolSetting(
    settings.ApplicationLevel,
    "changefeed.balanced_distribution.enabled",
@andyyang890 I keep going back and forth on whether it's better to be consistent with the backup naming or rename it to something that differentiates it from the existing distribution setting above. @dt do you have opinions on naming?
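For context, the excerpt above is the start of a bool cluster-setting registration; a minimal sketch of how the full registration plausibly continues is below. The description string and trailing options are assumptions, not the PR's actual code; the default of true matches the PR description's "enabled by default".

var useBulkOracle = settings.RegisterBoolSetting(
    settings.ApplicationLevel,
    "changefeed.balanced_distribution.enabled", // later renamed to changefeed.random_replica_selection.enabled
    "use the bulk oracle to randomly choose among eligible replicas when planning changefeed work", // illustrative description
    true, // enabled by default
)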
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andyyang890, @dt, and @rharding6373)
pkg/ccl/changefeedccl/changefeed_dist.go
line 360 at r1 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
@andyyang890 I keep going back and forth on whether it's better to be consistent with the backup naming or rename it to something that differentiates it from the existing distribution setting above. @dt do you have opinions on naming?
I think we should call it something else because we used to have a setting changefeed.balance_range_distribution.enable, which is very similar.
pkg/ccl/changefeedccl/changefeed_dist.go
line 399 at r1 (raw file):
planCtx := dsp.NewPlanningCtxWithOracle(ctx, execCtx.ExtendedEvalContext(), nil, /* planner */ blankTxn, sql.DistributionType(distMode), oracle, locFilter)
spanPartitions, err := dsp.PartitionSpans(ctx, planCtx, trackedSpans)
I think we need to add a little bit more for this change to be effective.
From what I understand, we get a list of spans from PartitionSpans, which does not contain replica info. Then we eventually pass this list of spans to the dist sender, and the dist sender makes its own decision as to which replica we end up reading from. The aggregator starts the kvfeed here:
ca.eventProducer, ca.kvFeedDoneCh, ca.errCh, err = ca.startKVFeed(ctx, spans, kvFeedHighWater, needsInitialScan, feed, pool, limit, opts)
which starts the rangefeed here:
cockroach/pkg/ccl/changefeedccl/kvfeed/kv_feed.go
Lines 563 to 570 in 1afd0d2
physicalCfg := rangeFeedConfig{
    Spans:         stps,
    Frontier:      resumeFrontier.Frontier(),
    WithDiff:      f.withDiff,
    WithFiltering: f.withFiltering,
    Knobs:         f.knobs,
    RangeObserver: f.rangeObserver,
}
The only input to the rangefeed is a list of spans, so the rangefeed code must be deciding which replicas to read from independently. Erik mentioned that the dist sender goes for the closest replica.
This means that we cannot influence which replicas we read from by changing the planning code. However, we can change the span partitions, which is what this change does. It's very subtle: it.Desc() in the code below points to a replica (chosen using the oracle). Then getSQLInstanceIDForKVNodeID() maps the span to a partition using the replica's node ID, but the replica itself does not end up mattering because it's not an input to the dist sender.
cockroach/pkg/sql/distsql_physical_planner.go
Lines 1343 to 1382 in 3b455a7
it := planCtx.spanIter
// rSpan is the span we are currently partitioning.
rSpan, err := keys.SpanAddr(span)
if err != nil {
    return nil, 0, err
}
var lastSQLInstanceID base.SQLInstanceID
// lastKey maintains the EndKey of the last piece of `span`.
lastKey := rSpan.Key
if log.V(1) {
    log.Infof(ctx, "partitioning span %s", span)
}
// We break up rSpan into its individual ranges (which may or may not be on
// separate nodes). We then create "partitioned spans" using the end keys of
// these individual ranges.
for it.Seek(ctx, span, kvcoord.Ascending); ; it.Next(ctx) {
    if !it.Valid() {
        return nil, 0, it.Error()
    }
    replDesc, ignore, err := it.ReplicaInfo(ctx)
    if err != nil {
        return nil, 0, err
    }
    *ignoreMisplannedRanges = *ignoreMisplannedRanges || ignore
    desc := it.Desc()
    if log.V(1) {
        descCpy := desc // don't let desc escape
        log.Infof(ctx, "lastKey: %s desc: %s", lastKey, &descCpy)
    }
    if !desc.ContainsKey(lastKey) {
        // This range must contain the last range's EndKey.
        log.Fatalf(
            ctx, "next range %v doesn't cover last end key %v. Partitions: %#v",
            desc.RSpan(), lastKey, partitions,
        )
    }
    sqlInstanceID, reason := getSQLInstanceIDForKVNodeID(replDesc.NodeID)
Looking at the implementation for getSQLInstanceIDForKVNodeID
below, there's a lot going on. It takes into account mixed process mode (I'm honestly not sure what that is), the gateway node, closest instances, etc. So even if the oracle chooses replicas in a way that uniformly distributes them across nodes, I feel that this mapping function might result in an imbalanced distribution.
cockroach/pkg/sql/distsql_physical_planner.go
Lines 1600 to 1737 in 3b455a7
// makeInstanceResolver returns a function that can choose the SQL instance ID
// for a provided KV node ID.
func (dsp *DistSQLPlanner) makeInstanceResolver(
    ctx context.Context, planCtx *PlanningCtx,
) (func(roachpb.NodeID) (base.SQLInstanceID, SpanPartitionReason), error) {
    _, mixedProcessMode := dsp.distSQLSrv.NodeID.OptionalNodeID()
    locFilter := planCtx.localityFilter
    var mixedProcessSameNodeResolver func(nodeID roachpb.NodeID) (base.SQLInstanceID, SpanPartitionReason)
    if mixedProcessMode {
        mixedProcessSameNodeResolver = dsp.healthySQLInstanceIDForKVNodeHostedInstanceResolver(ctx)
    }
    if mixedProcessMode && locFilter.Empty() {
        return mixedProcessSameNodeResolver, nil
    }
    // GetAllInstances only returns healthy instances.
    instances, err := dsp.sqlAddressResolver.GetAllInstances(ctx)
    if err != nil {
        return nil, err
    }
    if len(instances) == 0 {
        // For whatever reason, we think that we don't have any healthy
        // instances (one example is someone explicitly removing the rows from
        // the sql_instances table), but we always have the gateway pod to
        // execute on, so we'll use it (unless we have a locality filter).
        if locFilter.NonEmpty() {
            return nil, noInstancesMatchingLocalityFilterErr
        }
        log.Warningf(ctx, "no healthy sql instances available for planning, only using the gateway")
        return dsp.alwaysUseGatewayWithReason(SpanPartitionReason_GATEWAY_NO_HEALTHY_INSTANCES), nil
    }
    rng, _ := randutil.NewPseudoRand()
    instancesHaveLocality := false
    var gatewayIsEligible bool
    if locFilter.NonEmpty() {
        eligible := make([]sqlinstance.InstanceInfo, 0, len(instances))
        for i := range instances {
            if ok, _ := instances[i].Locality.Matches(locFilter); ok {
                eligible = append(eligible, instances[i])
                if instances[i].InstanceID == dsp.gatewaySQLInstanceID {
                    gatewayIsEligible = true
                }
            }
        }
        if len(eligible) == 0 {
            return nil, noInstancesMatchingLocalityFilterErr
        }
        instances = eligible
        instancesHaveLocality = true
    } else {
        for i := range instances {
            if instances[i].Locality.NonEmpty() {
                instancesHaveLocality = true
                break
            }
        }
        gatewayIsEligible = true
    }
    if log.ExpensiveLogEnabled(ctx, 2) {
        log.VEventf(ctx, 2, "healthy SQL instances available for distributed planning: %v", instances)
    }
    // If we were able to determine the locality information for at least some
    // instances, use the locality-aware resolver.
    if instancesHaveLocality {
        resolver := func(nodeID roachpb.NodeID) (base.SQLInstanceID, SpanPartitionReason) {
            // Lookup the node localities to compare to the instance localities.
            nodeDesc, err := dsp.nodeDescs.GetNodeDescriptor(nodeID)
            if err != nil {
                log.Eventf(ctx, "unable to get node descriptor for KV node %s", nodeID)
                return dsp.gatewaySQLInstanceID, SpanPartitionReason_GATEWAY_ON_ERROR
            }
            // If we're in mixed-mode, check if the picked node already matches the
            // locality filter in which case we can just use it.
            if mixedProcessMode {
                if ok, _ := nodeDesc.Locality.Matches(locFilter); ok {
                    return mixedProcessSameNodeResolver(nodeID)
                } else {
                    log.VEventf(ctx, 2,
                        "node %d locality %s does not match locality filter %s, finding alternative placement...",
                        nodeID, nodeDesc.Locality, locFilter,
                    )
                }
            }
            // TODO(dt): Pre-compute / cache this result, e.g. in the instance reader.
            if closest, _ := ClosestInstances(instances,
                nodeDesc.Locality); len(closest) > 0 {
                return closest[rng.Intn(len(closest))], SpanPartitionReason_CLOSEST_LOCALITY_MATCH
            }
            // No instances had any locality tiers in common with the node locality.
            // At this point we pick the gateway if it is eligible, otherwise we pick
            // a random instance from the eligible instances.
            if !gatewayIsEligible {
                return instances[rng.Intn(len(instances))].InstanceID, SpanPartitionReason_LOCALITY_FILTERED_RANDOM
            }
            if dsp.shouldPickGateway(planCtx, instances) {
                return dsp.gatewaySQLInstanceID, SpanPartitionReason_GATEWAY_NO_LOCALITY_MATCH
            } else {
                // If the gateway has a disproportionate number of partitions pick a
                // random instance that is not the gateway.
                if planCtx.spanPartitionState.testingOverrideRandomSelection != nil {
                    return planCtx.spanPartitionState.testingOverrideRandomSelection(),
                        SpanPartitionReason_LOCALITY_FILTERED_RANDOM_GATEWAY_OVERLOADED
                }
                // NB: This random selection may still pick the gateway but that is
                // alright as we are more interested in a uniform distribution rather
                // than avoiding the gateway.
                id := instances[rng.Intn(len(instances))].InstanceID
                return id, SpanPartitionReason_LOCALITY_FILTERED_RANDOM_GATEWAY_OVERLOADED
            }
        }
        return resolver, nil
    }
    // If no sql instances have locality information, fallback to a naive
    // round-robin strategy that is completely locality-ignorant. Randomize the
    // order in which we choose instances so that work is allocated fairly across
    // queries.
    rng.Shuffle(len(instances), func(i, j int) {
        instances[i], instances[j] = instances[j], instances[i]
    })
    var i int
    resolver := func(roachpb.NodeID) (base.SQLInstanceID, SpanPartitionReason) {
        id := instances[i%len(instances)].InstanceID
        i++
        return id, SpanPartitionReason_ROUND_ROBIN
    }
    return resolver, nil
}
Also, in some scenarios we use this other implementation for getSQLInstanceIDForKVNodeID here, which uses the gateway node as a backup:
cockroach/pkg/sql/distsql_physical_planner.go
Line 1502 in 3b455a7
func (dsp *DistSQLPlanner) deprecatedHealthySQLInstanceIDForKVNodeIDSystem(
Overall, I'm not confident about how uniformly the work is distributed after making this change because of the getSQLInstanceIDForKVNodeID mapping part. I think we would need to change that to simply use the node which houses the replica which the oracle chose for us. I think this is something which can go in this PR. Note that we have changefeed.default_range_distribution_strategy=balanced_simple, which calls rebalanceSpanPartitions and rebalances the partitions after distsql gives them to us (a simplified sketch of that kind of rebalancing follows below). I think that setting is very good already and is used by our customers. The two main problems are (1) doing the rebalancing after distsql isn't optimal (ideally distsql gives us a uniformly balanced plan), and (2) sometimes distsql gives us fewer partitions than there are available nodes, so we only rebalance on a smaller set of nodes than we could have used. I think these are problems which this PR can solve and would be helpful to solve. Using the bulk oracle plus a new getSQLInstanceIDForKVNodeID function probably solves them. I'm pretty sure those are the two things which cause non-uniform distributions. Btw, the tests in pkg/ccl/changefeedccl/changefeed_dist_test.go are a good way to play around with planning changes you make!
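For illustration, a post-planning rebalance of that flavor could look roughly like the simplified sketch below. The sketchPartition type and rebalanceSketch function are hypothetical stand-ins, not the actual rebalanceSpanPartitions implementation.

// rebalanceSketch greedily moves spans from partitions holding more than an
// even share to the currently least-loaded partition. Illustrative only.
type sketchPartition struct {
    sqlInstanceID int
    spans         []string // stand-in for roachpb.Spans
}

func rebalanceSketch(parts []sketchPartition) []sketchPartition {
    if len(parts) == 0 {
        return parts
    }
    total := 0
    for _, p := range parts {
        total += len(p.spans)
    }
    target := (total + len(parts) - 1) / len(parts) // ceiling of the even share
    for i := range parts {
        for len(parts[i].spans) > target {
            // Find the least-loaded partition; stop if nothing is under target.
            minIdx := 0
            for j := range parts {
                if len(parts[j].spans) < len(parts[minIdx].spans) {
                    minIdx = j
                }
            }
            if len(parts[minIdx].spans) >= target {
                break
            }
            // Hand one span from the overloaded partition to the least-loaded one.
            last := len(parts[i].spans) - 1
            parts[minIdx].spans = append(parts[minIdx].spans, parts[i].spans[last])
            parts[i].spans = parts[i].spans[:last]
        }
    }
    return parts
}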
Also, this PR asks the question: do we want to get rid of default_range_distribution_strategy? If we keep it, what would be its purpose?
One more thing - consider this: maybe if you have a small 10-range changefeed, you don't want to spread the work across 10 nodes. Adding a minimum number of ranges before you actually use these hyper-distributed oracles/planners might be a good idea (see the sketch below).
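For illustration, such a guard could be as simple as the following sketch. chooseOracleSketch and minRangesForBulkOracle are hypothetical names, not code or settings from the PR.

// chooseOracleSketch gates the bulk oracle behind a minimum range count, so
// small changefeeds stay on fewer nodes. Illustrative only.
func chooseOracleSketch(numRanges, minRangesForBulkOracle int) string {
    if numRanges < minRangesForBulkOracle {
        return "bin-packing" // keep a small changefeed on fewer nodes
    }
    return "bulk" // spread spans across all eligible replicas
}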
No opinions about consistency with backup, and it probably isn't something I'd worry about: ours is also undocumented and non-public. In fact, we will probably just remove the cluster setting and make this the unconditional behavior soon (after 24.1 is cut), since having too many knobs and differing versions of behavior has made it hard to reason about what is happening or how those knobs interact in unexpected ways. Just recently we found a cluster that had toggled a restore setting that was causing it to behave strangely. On the DR side we don't really care about optimizing for edge cases on the small end, e.g. we don't worry about a small 10-span backup getting planned as a single processor over 10 spans rather than 10 single-span processors. We need a 10-processor backup, cross-processor overhead and all, to perform well enough in the cases where being that distributed is the only option, and if it does, then we might as well use it all the time, even when we don't strictly need it.
Force-pushed from 3ee57b5 to 27cf935.
TFTRs!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andyyang890, @dt, and @jayshrivastava)
pkg/ccl/changefeedccl/changefeed_dist.go
line 360 at r1 (raw file):
Previously, jayshrivastava (Jayant) wrote…
I think we should call it something else because we used to have a setting changefeed.balance_range_distribution.enable, which is very similar.
I changed it.
pkg/ccl/changefeedccl/changefeed_dist.go
line 399 at r1 (raw file):
Previously, jayshrivastava (Jayant) wrote…
I think we need to add a little bit more for this change to be effective.
From what I understand, we get a list of spans from PartitionSpans, which does not contain replica info. Then we eventually pass this list of spans to the dist sender, and the dist sender makes its own decision as to which replica we end up reading from. The aggregator starts the kvfeed, which starts the rangefeed (pkg/ccl/changefeedccl/kvfeed/kv_feed.go, lines 563 to 570 in 1afd0d2). The only input to the rangefeed is a list of spans, so the rangefeed code must be deciding which replicas to read from independently. Erik mentioned that the dist sender goes for the closest replica.
This means that we cannot influence which replicas we read from by changing the planning code. However, we can change the span partitions, which is what this change does. It's very subtle: it.Desc() points to a replica (chosen using the oracle), and getSQLInstanceIDForKVNodeID() maps the span to a partition using the replica's node ID, but the replica itself does not end up mattering because it's not an input to the dist sender (pkg/sql/distsql_physical_planner.go, lines 1343 to 1382 in 3b455a7).
Looking at the implementation of getSQLInstanceIDForKVNodeID (makeInstanceResolver, pkg/sql/distsql_physical_planner.go, lines 1600 to 1737 in 3b455a7), there's a lot going on: mixed process mode, the gateway node, closest instances, etc. So even if the oracle chooses replicas in a way that uniformly distributes them across nodes, this mapping function might result in an imbalanced distribution. Also, in some scenarios we use deprecatedHealthySQLInstanceIDForKVNodeIDSystem (pkg/sql/distsql_physical_planner.go, line 1502 in 3b455a7), which uses the gateway node as a backup.
Overall, I'm not confident about how uniformly the work is distributed after making this change because of the getSQLInstanceIDForKVNodeID mapping part. I think we would need to change that to simply use the node which houses the replica which the oracle chose for us. Note that we have changefeed.default_range_distribution_strategy=balanced_simple, which calls rebalanceSpanPartitions and rebalances the partitions after distsql gives them to us. The two main problems are (1) doing the rebalancing after distsql isn't optimal, and (2) sometimes distsql gives us fewer partitions than there are available nodes. Using the bulk oracle plus a new getSQLInstanceIDForKVNodeID function probably solves them.
Also, this PR asks the question: do we want to get rid of default_range_distribution_strategy? If we keep it, what would be its purpose? One more thing: maybe if you have a small 10-range changefeed, you don't want to spread the work across 10 nodes. Adding a minimum number of ranges before you actually use these hyper-distributed oracles/planners might be a good idea.
We discussed this a bit offline. The high level summary is that we want to limit the number of changes in this PR for this release, so we're going to leave potential improvements for the future.
With the bulk oracle, we expect that the span partitions will be fairly evenly distributed, since they're randomly chosen among all replicas of the range fitting the locality filter (a rough sketch of that selection policy follows below). DistSQL then chooses a SQL node that is either the same node returned by the oracle or the closest node matching the locality filter, or it falls back to round-robin assignment. Therefore we expect that most of the time DistSQL assignments will not deviate too much from the bulk oracle, so most of the time we expect that we won't need to rebalance in the changefeed.
We may consider deprecating default_range_distribution_strategy in the future if it is no longer useful with the bulk oracle.
I could add a threshold at which we apply the bulk oracle as another safeguard, but I wanted to get your opinion on at what # of ranges (or spans) we think it would make a difference to distribute more or less. How common are changefeeds on very small tables? I glanced at some cloud metrics but couldn't tease out the smallest changefeeds (the lowest # of ranges on a cluster running a changefeed is 85, but that isn't a full picture). It seems like we could spend a bit more time discussing whether CDC behaves like what David said about backups, where we'd prefer to involve as many nodes as possible, spreading the spans among them as evenly as possible.
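For illustration, the selection policy described above amounts to something like the following sketch: filter a range's replicas down to those matching the locality filter, then pick one uniformly at random. The types are simplified stand-ins, not the real BulkOracle API; rng is a *rand.Rand from math/rand.

// replicaSketch is a simplified stand-in for a replica descriptor.
type replicaSketch struct {
    nodeID   int
    locality map[string]string
}

// pickReplicaSketch returns a random replica among those matching the filter,
// and false if none match. Illustrative only.
func pickReplicaSketch(replicas []replicaSketch, filter map[string]string, rng *rand.Rand) (replicaSketch, bool) {
    var eligible []replicaSketch
    for _, r := range replicas {
        matches := true
        for k, v := range filter {
            if r.locality[k] != v {
                matches = false
                break
            }
        }
        if matches {
            eligible = append(eligible, r)
        }
    }
    if len(eligible) == 0 {
        return replicaSketch{}, false
    }
    return eligible[rng.Intn(len(eligible))], true
}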
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andyyang890, @dt, and @rharding6373)
pkg/ccl/changefeedccl/changefeed_dist.go
line 399 at r1 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
We discussed this a bit offline. The high level summary is that we want to limit the number of changes in this PR for this release, so we're going to leave potential improvements for the future.
With the bulk oracle, we expect that the span partitions will be fairly evenly distributed, since they're randomly chosen among all replicas of the range fitting the locality filter. DistSQL then chooses a SQL node that is either the same node returned by the oracle or the closest node matching the locality filter, or it falls back to round-robin assignment. Therefore we expect that most of the time DistSQL assignments will not deviate too much from the bulk oracle, so most of the time we expect that we won't need to rebalance in the changefeed.
We may consider deprecating default_range_distribution_strategy in the future if it is no longer useful with the bulk oracle.
I could add a threshold at which we apply the bulk oracle as another safeguard, but I wanted to get your opinion on at what # of ranges (or spans) we think it would make a difference to distribute more or less. How common are changefeeds on very small tables? I glanced at some cloud metrics but couldn't tease out the smallest changefeeds (the lowest # of ranges on a cluster running a changefeed is 85, but that isn't a full picture). It seems like we could spend a bit more time discussing whether CDC behaves like what David said about backups, where we'd prefer to involve as many nodes as possible, spreading the spans among them as evenly as possible.
Spoke about this offline. We don't have a good answer for what the threshold should be. Also, for very small changefeeds, distributing as much as possible isn't significantly worse than assigning all the ranges to one node. In fact, it adds a problem where you could assign different changefeeds entirely to the same node, overloading it. You avoid this when you distribute as much as possible.
Force-pushed from c2ed57f to e59c65f.
This change uses the BulkOracle by default as part of changefeed planning, instead of the bin packing oracle. This will allow changefeeds to have plans that randomly assign spans to any replica, including followers if enabled, following locality filter constraints.
A new cluster setting, `changefeed.balanced_distribution.enabled`, protects this change. When enabled (the default), changefeeds will use the new BulkOracle for planning. If disabled, changefeeds will use the previous bin packing oracle.
Epic: none
Fixes: cockroachdb#119777
Fixes: cockroachdb#114611
Release note (enterprise change): Changefeeds now use the BulkOracle for planning, which distributes work evenly across all replicas in the locality filter, including followers if enabled. This is enabled by default with the cluster setting `changefeed.balanced_distribution.enabled`. If disabled, changefeed planning reverts to its previous bin packing oracle.
Force-pushed from e59c65f to 84e0e51.
TFTR! bors r+
blathers backport 23.1 23.2
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from 84e0e51 to blathers/backport-release-23.1-120077: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 23.1 failed. See errors above.
error creating merge commit from 84e0e51 to blathers/backport-release-23.2-120077: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 23.2 failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
This change uses the BulkOracle by default as part of changefeed planning, instead of the bin packing oracle. This will allow changefeeds to have plans that randomly assign spans to any replica, including followers if enabled, following locality filter constraints.
A new cluster setting, changefeed.random_replica_selection.enabled, protects this change. When enabled (the default), changefeeds will use the new BulkOracle for planning. If disabled, changefeeds will use the previous bin packing oracle.
Epic: none
Fixes: #119777
Fixes: #114611
Release note (enterprise change): Changefeeds now use the BulkOracle for planning, which distributes work evenly across all replicas in the locality filter, including followers if enabled. This is enabled by default with the cluster setting changefeed.random_replica_selection.enabled. If disabled, changefeed planning reverts to its previous bin packing oracle.
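As a usage note (a typical example, not text from the PR): operators who want the previous planning behavior would toggle the setting with the usual cluster-setting statement, e.g. SET CLUSTER SETTING changefeed.random_replica_selection.enabled = false; to revert to the bin packing oracle.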