Store explicit TaskList partition data #6591

Merged
1 commit merged into cadence-workflow:master on Jan 8, 2025

Conversation

natemort (Member) commented on Jan 3, 2025

What changed?
Replace num_read_partitions and num_write_partitions with an explicit map of partition ids to partition configuration. This enables assigning isolation groups to partitions in the future.

This change is backwards compatible: it populates both the legacy numeric fields and the new maps when writing data to/from persistence and when mapping to/from thrift/proto. When draining partitions we continue to maintain a contiguous block of partition ids.

Once this change has been deployed broadly we can remove the fields from IDL. We could also consider removing the fields from persistence in the future as well.
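
For orientation, the new configuration shape might look roughly like the following sketch. It is based only on fields referenced elsewhere in this PR (ReadPartitions, IsolationGroups); the exact names and types in the codebase may differ.

// TaskListPartition describes a single partition. An empty IsolationGroups
// slice is treated as "all isolation groups allowed".
type TaskListPartition struct {
    IsolationGroups []string
}

// TaskListPartitionConfig replaces the NumReadPartitions/NumWritePartitions
// counters with explicit maps keyed by partition id.
type TaskListPartitionConfig struct {
    ReadPartitions  map[int]*TaskListPartition
    WritePartitions map[int]*TaskListPartition
}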

Why?

  • To support assigning isolation groups to partitions in the future

How did you test it?

  • Unit / Integration tests

Potential risks

  • Bugs in this functionality could result in partition autoscaling not working correctly or potentially impact partitioned task lists.

Detailed Description

  • Changed the in-memory representation of a TaskListConfig from the number of partitions to a map[int]TaskListPartition.
  • Updated the proto/thrift mappers to convert the partition numbers to default values (all isolation groups in each partition) any time the numeric values don't match the maps (a sketch of this defaulting follows this list).
  • Updated the persistence logic in the same way: partition numbers are converted to default values whenever the numeric values don't match the maps.
  • Updated AutoScaler to use DescribeTaskList and operate in terms of partitions rather than just numbers of partitions.
  • Updated logic that depended on the numeric values to depend on the length of the maps. Subsequent PRs will actually use the values and populate the IsolationGroups.
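
As referenced in the mapper and persistence items above, the defaulting could look roughly like this sketch, reusing the types sketched earlier; the function names are illustrative, not the PR's actual helpers.

// defaultPartitions builds a contiguous block of partition ids 0..n-1, each
// with an empty IsolationGroups slice, i.e. all isolation groups allowed.
func defaultPartitions(n int) map[int]*TaskListPartition {
    result := make(map[int]*TaskListPartition, n)
    for i := 0; i < n; i++ {
        result[i] = &TaskListPartition{}
    }
    return result
}

// reconcileReadPartitions prefers the explicit map but falls back to defaults
// derived from the legacy counter whenever the two disagree.
func reconcileReadPartitions(numRead int32, read map[int]*TaskListPartition) map[int]*TaskListPartition {
    if int(numRead) != len(read) {
        return defaultPartitions(int(numRead))
    }
    return read
}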

Impact Analysis

  • Backward Compatibility: Since we're writing both the legacy fields and the new maps, we maintain compatibility with existing data and clients (see the sketch below).
  • Forward Compatibility: This new schema is flexible enough that we should be able to freely change it.
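
A sketch of what "writing both fields" means when mapping outward. WirePartitionConfig is a hypothetical stand-in for the generated thrift/proto struct, not the actual IDL type, and the real mappers in the PR differ in detail.

// WirePartitionConfig stands in for a generated type that still carries the
// deprecated numeric fields alongside the new maps.
type WirePartitionConfig struct {
    NumReadPartitions  int32
    NumWritePartitions int32
    ReadPartitions     map[int]*TaskListPartition
    WritePartitions    map[int]*TaskListPartition
}

// toWirePartitionConfig keeps the deprecated counters in sync with the explicit
// maps so that readers that predate this change keep working during rollout.
func toWirePartitionConfig(c *TaskListPartitionConfig) *WirePartitionConfig {
    if c == nil {
        return nil
    }
    return &WirePartitionConfig{
        NumReadPartitions:  int32(len(c.ReadPartitions)),
        NumWritePartitions: int32(len(c.WritePartitions)),
        ReadPartitions:     c.ReadPartitions,
        WritePartitions:    c.WritePartitions,
    }
}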

Testing Plan

  • Unit Tests: Fully covered by unit tests, including old data
  • Persistence Tests: Persistence tests round-trip the data.
  • Integration Tests: Partially covered by integration tests by virtue of partitioned task lists.
  • Compatibility Tests: No explicit compatibility tests

Rollout Plan

  • What is the rollout plan? This can be rolled out in any order. In order to enable actually storing partition data we need to copy it to the DB from the dynamic config.
  • Does the order of deployment matter? Order does not matter
  • Is it safe to rollback? Does the order of rollback matter? This can be safely rolled back as it writes the old fields.
  • Is there a kill switch to mitigate the impact immediately? This isn't behind a flag so rolling back is the only mechanism.

Release notes

Documentation Changes

return result, changed
}

func (a *adaptiveScalerImpl) collectPartitionMetrics(config *types.TaskListPartitionConfig) (*aggregatePartitionMetrics, error) {

Member:
This might increase the load significantly. Originally, describeTaskList was only called when the number of write partitions was less than the number of read partitions, but after this change it will be called periodically. Maybe we can add a metric to track how much extra load this causes.

Member:
Good point. The failure-case behavior is also not clear to me: if one partition fails to respond (e.g. timeout), what do we do?
Maybe we only activate this new logic if the task list has isolation groups enabled. This way we can iterate and optimize as needed without impacting the normal behavior of the adaptive scaler.

natemort (Member, Author):
The failure case is a no-op, which seems reasonable as a starting point; in the future we can explore more sophisticated approaches such as maintaining more state. I've updated this logic to only perform the RPC to all child partitions when isolation is enabled.
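
The resolution described here might look roughly like the following sketch. All identifiers are illustrative, not the PR's actual adaptiveScalerImpl code.

// partitionDescriber abstracts the per-partition DescribeTaskList call.
type partitionDescriber interface {
    describePartition(id int) (backlog int64, err error)
}

// collectBacklog fans out to child partitions only when isolation is enabled,
// leaving the default adaptive-scaler behavior unchanged. A failed call is a
// no-op for that partition.
func collectBacklog(d partitionDescriber, isolationEnabled bool, partitionIDs []int) int64 {
    if !isolationEnabled {
        return 0 // caller keeps using the existing count-based logic
    }
    var total int64
    for _, id := range partitionIDs {
        backlog, err := d.describePartition(id)
        if err != nil {
            continue // failure is a no-op for this partition
        }
        total += backlog
    }
    return total
}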

if values == nil {
    return nil
}
partitions := values.(map[int]map[string]any)

Member:
nit: to avoid future misuse causing panics, let's check the result of the cast:

partitions, ok := values.(map[int]map[string]any)
if !ok { return nil }

natemort (Member, Author):
Done.

Comment on lines +94 to +95
// If they're out of sync, go with the value of num_*_partitions. This is necessary only while support for
// read_partitions and write_partitions rolls out

Member:
Until read_partitions/write_partitions are written back to the DB we will be resetting the partition mappings. I guess it doesn't matter for now as the isolation group feature is disabled.

natemort (Member, Author):
Yeah, that should be fine for now. Once we start populating the mapping, the values will be persisted and we can eventually even remove this logic.

@@ -654,3 +630,69 @@ func lockTaskList(ctx context.Context, tx sqlplugin.Tx, shardID int, domainID se
func stickyTaskListExpiry() time.Time {
    return time.Now().Add(stickyTasksListsTTL)
}

func toSerializationTaskListPartitionConfig(c *persistence.TaskListPartitionConfig) *serialization.TaskListPartitionConfig {

Member:
thanks for moving these to helper funcs

if e != nil {
    a.logger.Warn("failed to get partition metrics", tag.WorkflowTaskListName(a.taskListID.GetPartition(partitionID)), tag.Error(e))
}
if result != nil {

Member:
If e is not nil, we shouldn't care about result.

natemort (Member, Author):
Done.
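
One way the resolved snippet could read, checking the error before ever looking at the result (illustrative; the actual change may differ):

if e != nil {
    // Log and skip this partition; its result is irrelevant when the call failed.
    a.logger.Warn("failed to get partition metrics", tag.WorkflowTaskListName(a.taskListID.GetPartition(partitionID)), tag.Error(e))
} else if result != nil {
    // aggregate the partition's metrics here
}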

@@ -240,7 +240,7 @@ func NewManager(
     partitionConfig := tlMgr.TaskListPartitionConfig()
     r := 1
     if partitionConfig != nil {
-        r = int(partitionConfig.NumReadPartitions)
+        r = len(partitionConfig.ReadPartitions)

Member:
This len() logic is reused in multiple places; let's consider creating a helper method.
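
A helper along those lines could centralize the fallback; a sketch using the config shape from earlier (the method name is an assumption):

// ReadPartitionCount returns the number of read partitions, defaulting to 1
// when no partition config has been stored for the task list yet.
func (c *TaskListPartitionConfig) ReadPartitionCount() int {
    if c == nil || len(c.ReadPartitions) == 0 {
        return 1
    }
    return len(c.ReadPartitions)
}

Call sites like the one above would then collapse to r := partitionConfig.ReadPartitionCount().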

taskListType := types.TaskListTypeDecision.Ptr()
if c.taskListID.GetType() == persistence.TaskListTypeActivity {
    taskListType = types.TaskListTypeActivity.Ptr()
}
// TODO: Do we want to notify partitions that were removed?

Member:
We need to notify partitions that were removed from the write set.

natemort (Member, Author):
Done.
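
Finding the partitions to notify is a set difference between the old and new write maps; a generic sketch (names are illustrative):

// removedWritePartitions returns the ids present in oldWrite but absent from
// newWrite; these partitions should be told they no longer receive writes.
func removedWritePartitions[T any](oldWrite, newWrite map[int]T) []int {
    var removed []int
    for id := range oldWrite {
        if _, ok := newWrite[id]; !ok {
            removed = append(removed, id)
        }
    }
    return removed
}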

Replace num_read_partitions and num_write_partitions with an explicit map of partition ids to partition configuration. This enables assigning isolation groups to partitions in the future.

This change is backwards compatible as it populates both values when writing data to Cassandra and when returning it via the API. When draining partitions we continue to maintain a contiguous block of partition ids.
natemort merged commit eebf656 into cadence-workflow:master on Jan 8, 2025
14 of 18 checks passed
Shaddoll added a commit to Shaddoll/cadence that referenced this pull request Jan 15, 2025
Shaddoll added a commit to Shaddoll/cadence that referenced this pull request Jan 15, 2025
Shaddoll added a commit that referenced this pull request Jan 15, 2025
natemort added a commit to natemort/cadence that referenced this pull request Jan 24, 2025
…" (cadence-workflow#6625)

Address issues in GRPC -> types mapper and add additional tests.

This reverts commit bf9f526.
natemort added a commit to natemort/cadence that referenced this pull request Jan 28, 2025
…" (cadence-workflow#6625)

Address issues in GRPC -> types mapper and add additional tests.

Additionally address issues in serialization <-> sqlblobs mapper and add tests.

This reverts commit bf9f526.

3 participants