
[RHCLOUD-37152] Add locks on Roles, CAR and groups in migrator #1441

Merged · 8 commits · Jan 22, 2025

Conversation

@lpichler (Contributor) commented Jan 16, 2025

Link(s) to Jira

Description of Intent of Change(s)

This PR adds locks on Roles, CrossAccountRequests (CARs), and Groups in the migrator, providing concurrency control during migration and dual writes.

Local Testing

..

Checklist

  • if API spec changes are required, is the spec updated?
  • are there any pre/post merge actions required? if so, document here.
  • are these changes covered by unit tests?
  • if warranted, are documentation changes accounted for?
  • does this require migration changes?
    • if yes, are they backwards compatible?
  • is there known, direct impact to dependent teams/components?
    • if yes, how will this be handled?

Secure Coding Practices Checklist Link

Secure Coding Practices Checklist

  • Input Validation
  • Output Encoding
  • Authentication and Password Management
  • Session Management
  • Access Control
  • Cryptographic Practices
  • Error Handling and Logging
  • Data Protection
  • Communication Security
  • System Configuration
  • Database Security
  • File Management
  • Memory Management
  • General Coding Practices

    with transaction.atomic():
        # Lock group
        Group.objects.select_for_update().get(pk=group.pk)
@alechenninger (Collaborator) commented Jan 16, 2025

I think we may want to use the group state as of this query, rather than relying on the group state prior to it (which is what the view does). I'm wondering, for example, about the case where principals are removed between the time the principals are queried and the time the group is locked: that removal replicates first, and then the add happens, which adds them back.

It's also overall maybe just easier to reason about.

To do that, it is probably as simple as moving the with transaction.atomic() to just inside the top of the for loop. It will potentially lock more groups, because it will end up locking groups that we don't have anything to replicate for, but that's probably fine.

e.g.

    for group in groups:
        with transaction.atomic():
            principals: list[Principal] = []
            system_roles: list[Role] = []
            if not group.platform_default:
                # ... etc

@alechenninger (Collaborator) commented Jan 17, 2025

Actually, sorry, that example is misleading, I think; I was multitasking. I didn't include re-querying the group with the lock, which was the point.

    for group in groups:  
        with transaction.atomic():
            # Requery the group with a lock
            group = Group.objects.select_for_update().get(pk=group.pk)
            principals: list[Principal] = []  
            system_roles: list[Role] = []  
            if not group.platform_default:  
                # ... etc

@alechenninger (Collaborator) commented

(You could optimize this a little bit by getting just the group PKs in the prior tenant.group_set query; a sketch follows.)
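
A minimal sketch of that optimization, reusing the names from the snippets in this thread; the full group row is read only under the lock:

    # Fetch only the PKs up front instead of full Group rows
    group_pks = list(tenant.group_set.values_list("pk", flat=True))
    for pk in group_pks:
        with transaction.atomic():
            # Requery with the lock so we always work with the latest group state
            group = Group.objects.select_for_update().get(pk=pk)
            # ... etc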

@alechenninger (Collaborator) commented

Also, I think we may need to modify the group queryset in the view to lock the group on any principal modification, not just adds.

The reason is that, in this case, the migrator assumes an "add principals" operation, but that's not actually happening. Normally, when an add and a remove interleave, it should be fine, because both also modify the database. They may interleave, but regardless the outbox is always consistent with the database, so if a principal is there (or not), the same will be reflected in relations.

In this case, the migrator is not actually modifying the group-principal data; it's just producing an outbox message. If it also did group.principals.add(...), then the outbox and relations would be consistent even if this interleaved with a removal, but that is probably inappropriate because it would "undo" the removal operation. (In the case of normal dual write, another user "undid" the operation, so it's not as big of a deal if they interleave.)

The upside is that making the lock more general makes the code a little easier to reason about, I think. The downside is that it can hurt performance, but that is not really a big concern here, since it is unlikely that there is contention over these specific records.
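
A hypothetical sketch of what that more general lock might look like in the view, assuming a DRF-style get_queryset and a "principals" action handling both adds (POST) and removals (DELETE); the actual view code and action names are assumptions, as they are not shown in this thread:

    def get_queryset(self):
        queryset = Group.objects.filter(tenant=self.request.tenant)
        # Lock the group for ANY principal modification (add or remove), not just
        # adds, so dual-write and migrator operations serialize against each other.
        if self.action == "principals" and self.request.method in ("POST", "DELETE"):
            queryset = queryset.select_for_update()
        return queryset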

@lpichler (Contributor, Author) commented

I added those suggestions.

Comment on lines 84 to 89
    role = Role.objects.select_for_update().get(pk=role.pk)
    dual_write_handler = RelationApiDualWriteHandler(
        role, ReplicationEventType.MIGRATE_CUSTOM_ROLE, replicator
    )

    dual_write_handler.replicate_new_or_updated_role(role)
@alechenninger (Collaborator) commented

Here I think we might need two adjustments:

  • We need dual_write_handler.prepare_for_update() before the replicate method (see the sketch after this list).
  • We might have a problem similar to the group principal removal above, with role deletion here. Normally, dual write shouldn't need to lock the role on deletion, because foreign key constraints on modifications to the role will prevent inconsistency. But here is another case where the migrator is not adding the data to the DB at the same time. So I think we could fix this one either by re-running .save() on a role serializer here, so that access FKs serialize concurrent removal, OR by also locking the role in the role view queryset on role deletion.
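
A sketch of the first adjustment applied to the block above, assuming prepare_for_update() takes no arguments here and that the lock is taken inside the migrator's transaction.atomic() block:

    with transaction.atomic():
        # Lock the role so concurrent dual-write operations serialize with the migrator
        role = Role.objects.select_for_update().get(pk=role.pk)
        dual_write_handler = RelationApiDualWriteHandler(
            role, ReplicationEventType.MIGRATE_CUSTOM_ROLE, replicator
        )
        # Capture the current state before replicating the new/updated role
        dual_write_handler.prepare_for_update()
        dual_write_handler.replicate_new_or_updated_role(role)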

@lpichler (Contributor, Author) commented Jan 20, 2025

  • I added prepare_for_update (I thought that removal of current relations is not needed for roles).
  • I think the lock on the role in the delete action is already here.

@lpichler force-pushed the locks_objects_in_migrator branch 2 times, most recently from eb580f7 to 265d5aa on January 20, 2025 16:12
@lpichler changed the title from "[WIP] [RHCLOUD-37152] Add locks on Roles, CAR and groups in migrator" to "[RHCLOUD-37152] Add locks on Roles, CAR and groups in migrator" on Jan 20, 2025
@lpichler force-pushed the locks_objects_in_migrator branch 2 times, most recently from 3f2f9b7 to 58fe7eb on January 21, 2025 07:56
@lpichler (Contributor, Author) commented

/retest

@lpichler force-pushed the locks_objects_in_migrator branch from 185a72b to 8e4ca33 on January 21, 2025 15:08
    # Before:
    # if we don't write relationships (testing out the migration and clean up the created bindingmappings)
    if not settings.READ_ONLY_API_MODE and write_relationships != "False":

    # After:
    # Run this if we don't write relationships (testing out the migration and clean up the created bindingmappings)
    if write_relationships != "False":
A contributor commented

If we get rid of the settings.READ_ONLY_API_MODE condition, we can never set write_relationships to True?

@lpichler (Contributor, Author) commented

Yes, good point. I think we don't need this condition; I am removing it.



    def migrate_roles_for_tenant(tenant, exclude_apps, replicator):
        """Migrate all roles for a given tenant."""
        default_workspace = Workspace.objects.get(type=Workspace.Types.DEFAULT, tenant=tenant)

        roles = tenant.role_set.all()
        if exclude_apps:
            roles = roles.exclude(access__permission__application__in=exclude_apps)

        for role in roles:
@alechenninger (Collaborator) commented

If you want, I guess we could also just get the Role PKs here, but it's not a big deal (same idea as the group PK sketch above; a short variant follows).
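
A short variant of the earlier group PK sketch, applied to roles; note the exclude_apps filter still applies to the PK-only queryset:

    # Fetch only the PKs, then lock and re-read each role inside its own transaction
    role_pks = list(roles.values_list("pk", flat=True))
    for pk in role_pks:
        with transaction.atomic():
            role = Role.objects.select_for_update().get(pk=pk)
            # ... etc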

    with transaction.atomic():
        # Lock cross account request
        cross_account_request = CrossAccountRequest.objects.select_for_update().get(pk=cross_account_request.pk)
@alechenninger (Collaborator) commented

This is good, I think; I can't think of a situation where this has a problem. When any CAR attribute changes (role, status, etc.), it is locked, so those changes can't happen concurrently. The CAR used for replication is always the one queried with the lock, as is done here, so by the time a select returns, it will always be the latest state. So if dual write and the migrator interleave, I don't think there should be a problem.

Looks good!

@lpichler force-pushed the locks_objects_in_migrator branch from 8e4ca33 to b5eee92 on January 22, 2025 11:39
@lpichler force-pushed the locks_objects_in_migrator branch from b5eee92 to 1bcb905 on January 22, 2025 11:39
@lpichler (Contributor, Author) commented

I am going to merge, as I have added the last suggestions and got an LGTM.

@lpichler merged commit 8262ac5 into master on Jan 22, 2025
11 checks passed
@lpichler deleted the locks_objects_in_migrator branch on January 22, 2025 11:59