
[RHCLOUD-37152] Add locks on Roles, CAR and groups in migrator #1441

Merged · 8 commits · Jan 22, 2025

Conversation

@lpichler (Contributor) commented Jan 16, 2025

Link(s) to Jira

Description of Intent of Change(s)

This PR adds locks on Roles, CrossAccountRequests (CARs), and Groups in the migrator, providing concurrency control during migration and dual writes.

Local Testing

..

Checklist

  • if API spec changes are required, is the spec updated?
  • are there any pre/post merge actions required? if so, document here.
  • are these changes covered by unit tests?
  • if warranted, are documentation changes accounted for?
  • does this require migration changes?
    • if yes, are they backwards compatible?
  • is there known, direct impact to dependent teams/components?
    • if yes, how will this be handled?

Secure Coding Practices Checklist Link

Secure Coding Practices Checklist

  • Input Validation
  • Output Encoding
  • Authentication and Password Management
  • Session Management
  • Access Control
  • Cryptographic Practices
  • Error Handling and Logging
  • Data Protection
  • Communication Security
  • System Configuration
  • Database Security
  • File Management
  • Memory Management
  • General Coding Practices

    with transaction.atomic():
        # Lock group
        Group.objects.select_for_update().get(pk=group.pk)
@alechenninger (Collaborator) commented Jan 16, 2025

I think we may want to use the group state as of this query, rather than relying on the group state prior to it (which is what the view does). I'm wondering, for example, about the case where principals are removed between the time the principals are queried and the time the group is locked: that removal replicates first, and then the add happens, which adds them back.

It's also overall maybe just easier to reason about.

To do that, it is probably as simple as moving the with transaction.atomic() to just inside the top of the for loop. It will potentially lock more groups, because it will end up locking groups that we don't have anything to replicate for, but that's probably fine.

e.g.

    for group in groups:
        with transaction.atomic():
            principals: list[Principal] = []
            system_roles: list[Role] = []
            if not group.platform_default:
                # ... etc

@alechenninger (Collaborator) commented Jan 17, 2025

Actually, sorry, that example is misleading, I think; I was multitasking. I didn't include re-querying the group with the lock, which was the point.

    for group in groups:  
        with transaction.atomic():
            # Requery the group with a lock
            group = Group.objects.select_for_update().get(pk=group.pk)
            principals: list[Principal] = []  
            system_roles: list[Role] = []  
            if not group.platform_default:  
                # ... etc

@alechenninger (Collaborator) commented

(You could optimize this a little bit by getting just the group PKs in the prior tenant.group_set query; a sketch follows.)
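
A minimal sketch of that optimization, reusing the names from the snippets in this thread; the full group row is read only under the lock:

    # Fetch only the PKs up front instead of full Group rows
    group_pks = list(tenant.group_set.values_list("pk", flat=True))
    for pk in group_pks:
        with transaction.atomic():
            # Requery with the lock so we always work with the latest group state
            group = Group.objects.select_for_update().get(pk=pk)
            # ... etc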

@alechenninger (Collaborator) commented

Also, I think we may need to modify the group queryset in the view to lock the group on any principal modification, not just adds.

The reason is that, in this case, the migrator assumes an "add principals" operation, but that's not actually happening. Normally, when an add and a remove interleave, it should be fine, because both also modify the database. They may interleave, but regardless the outbox is always consistent with the database, so if a principal is there (or not), the same will be reflected in relations.

In this case, the migrator is not actually modifying the group-principal data; it's just producing an outbox message. If it also did group.principals.add(...), then the outbox and relations would be consistent even if this interleaved with a removal, but that is probably inappropriate because it would "undo" the removal operation. (In the case of normal dual write, another user "undid" the operation, so it's not as big of a deal if they interleave.)

The upside is that making the lock more general makes the code a little easier to reason about, I think. The downside is that it can hurt performance, but that is not really a big concern here, since it is unlikely that there is contention over these specific records.
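
A hypothetical sketch of what that more general lock might look like in the view, assuming a DRF-style get_queryset and a "principals" action handling both adds (POST) and removals (DELETE); the actual view code and action names are assumptions, as they are not shown in this thread:

    def get_queryset(self):
        queryset = Group.objects.filter(tenant=self.request.tenant)
        # Lock the group for ANY principal modification (add or remove), not just
        # adds, so dual-write and migrator operations serialize against each other.
        if self.action == "principals" and self.request.method in ("POST", "DELETE"):
            queryset = queryset.select_for_update()
        return queryset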

@lpichler (Contributor, Author) commented

I added those suggestions.

Comment on lines 84 to 89
    role = Role.objects.select_for_update().get(pk=role.pk)
    dual_write_handler = RelationApiDualWriteHandler(
        role, ReplicationEventType.MIGRATE_CUSTOM_ROLE, replicator
    )

    dual_write_handler.replicate_new_or_updated_role(role)
@alechenninger (Collaborator) commented

Here I think we might need two adjustments:

  • We need dual_write_handler.prepare_for_update() before the replicate method (see the sketch after this list).
  • We might have a problem similar to the group principal removal above, with role deletion here. Normally, dual write shouldn't need to lock the role on deletion, because foreign key constraints on modifications to the role will prevent inconsistency. But here is another case where the migrator is not adding the data to the DB at the same time. So I think we could fix this one either by re-running .save() on a role serializer here, so that access FKs serialize concurrent removal, OR by also locking the role in the role view queryset on role deletion.
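
A sketch of the first adjustment applied to the block above, assuming prepare_for_update() takes no arguments here and that the lock is taken inside the migrator's transaction.atomic() block:

    with transaction.atomic():
        # Lock the role so concurrent dual-write operations serialize with the migrator
        role = Role.objects.select_for_update().get(pk=role.pk)
        dual_write_handler = RelationApiDualWriteHandler(
            role, ReplicationEventType.MIGRATE_CUSTOM_ROLE, replicator
        )
        # Capture the current state before replicating the new/updated role
        dual_write_handler.prepare_for_update()
        dual_write_handler.replicate_new_or_updated_role(role)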

@lpichler (Contributor, Author) commented Jan 20, 2025

  • I added prepare_for_update (I thought that removal of current relations is not needed for roles).
  • I think the lock on the role in the delete action is already here.

@lpichler force-pushed the locks_objects_in_migrator branch 2 times, most recently from eb580f7 to 265d5aa on January 20, 2025 16:12
@lpichler changed the title from "[WIP] [RHCLOUD-37152] Add locks on Roles, CAR and groups in migrator" to "[RHCLOUD-37152] Add locks on Roles, CAR and groups in migrator" on Jan 20, 2025
@lpichler force-pushed the locks_objects_in_migrator branch 2 times, most recently from 3f2f9b7 to 58fe7eb on January 21, 2025 07:56
@lpichler (Contributor, Author) commented

/retest

@lpichler force-pushed the locks_objects_in_migrator branch from 185a72b to 8e4ca33 on January 21, 2025 15:08
    # Before:
    # if we don't write relationships (testing out the migration and clean up the created bindingmappings)
    if not settings.READ_ONLY_API_MODE and write_relationships != "False":

    # After:
    # Run this if we don't write relationships (testing out the migration and clean up the created bindingmappings)
    if write_relationships != "False":
A contributor commented

If we get rid of the settings.READ_ONLY_API_MODE condition, we can never set write_relationships to True?

@lpichler (Contributor, Author) commented

Yes, good point. I think we don't need this condition; I am removing it.



    def migrate_roles_for_tenant(tenant, exclude_apps, replicator):
        """Migrate all roles for a given tenant."""
        default_workspace = Workspace.objects.get(type=Workspace.Types.DEFAULT, tenant=tenant)

        roles = tenant.role_set.all()
        if exclude_apps:
            roles = roles.exclude(access__permission__application__in=exclude_apps)

        for role in roles:
@alechenninger (Collaborator) commented

If you want, I guess we could also just get the Role PKs here, but it's not a big deal (same idea as the group PK sketch above; a short variant follows).
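
A short variant of the earlier group PK sketch, applied to roles; note the exclude_apps filter still applies to the PK-only queryset:

    # Fetch only the PKs, then lock and re-read each role inside its own transaction
    role_pks = list(roles.values_list("pk", flat=True))
    for pk in role_pks:
        with transaction.atomic():
            role = Role.objects.select_for_update().get(pk=pk)
            # ... etc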

    with transaction.atomic():
        # Lock cross account request
        cross_account_request = CrossAccountRequest.objects.select_for_update().get(pk=cross_account_request.pk)
@alechenninger (Collaborator) commented

This is good, I think; I can't think of a situation where this has a problem. When any CAR attribute changes (role, status, etc.), it is locked, so those changes can't happen concurrently. The CAR used for replication is always the one queried with the lock, as is done here, so by the time a select returns, it will always be the latest state. So if dual write and the migrator interleave, I don't think there should be a problem.

Looks good!

@lpichler force-pushed the locks_objects_in_migrator branch from 8e4ca33 to b5eee92 on January 22, 2025 11:39
@lpichler force-pushed the locks_objects_in_migrator branch from b5eee92 to 1bcb905 on January 22, 2025 11:39
@lpichler (Contributor, Author) commented

I am going to merge, as I have added the last suggestions and got an LGTM.

@lpichler merged commit 8262ac5 into master on Jan 22, 2025
11 checks passed
@lpichler deleted the locks_objects_in_migrator branch on January 22, 2025 11:59