Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: save cni state only during endpoint creation or deletion #3254

Merged
merged 2 commits into from
Dec 13, 2024

Conversation

QxBytes
Copy link
Contributor

@QxBytes QxBytes commented Dec 11, 2024

Reason for Change:

This PR removes saving to the azure cni statefile from the AddExternalInterface and CreateNetwork methods since we only want to commit to the statefile once we create the endpoint. Saving the state will save all in-memory state including all networks, external interfaces, and endpoints, so it's unecessary to call save state after each update-- we only need to call it once at the end (which is when the endpoint is created).

We also backport https://github.com/Azure/azure-container-networking/pull/2309/files here

Issue Fixed:

Previously, if we wrote the external interface to the state, and then crashed before we wrote the corresponding network, the subsequent DEL call would not clean up the ips (since azure cni does not see the network in the azure cni state file, leading to leaked ips), and we would not be able to auto recover by deleting the azure ipam statefile either since the program detects that the azure cni statefile exists (Autorecovery is here https://github.com/Azure/azure-container-networking/blob/release/v1.4/cni/network/invoker_azure.go#L58).

Now, if there is a panic or crash between adding the external interface and adding the network, no state file is written, and so the azure cni will auto recover by deleting the ipam state file (as it determines the azure cni statefile is not present) and continue as normal.

Requirements:

Notes:
Tested by forcing a panic between adding the external interface and creating the network-- no azure cni statefile is created, and when the forced panic is removed, we can successfully create an endpoint with no leaked ips.

Also tested forcing a panic during endpoint create-- the logs mention that we clean up the ip (issue ipam delete) and when we remove the forced panic, there are no leaked ips (# ips in azure ipam statefile match the azure cni statefile).

Confirmed that we will never have a scenario where the azure cni creates an external interface only (without a network)-- it's either the interface, network, and endpoint are saved to the statefile or none of them are.

@QxBytes QxBytes self-assigned this Dec 11, 2024
@QxBytes QxBytes added cni Related to CNI. fix Fixes something. release/1.4 Change affects v1.4 release train needs-backport Change needs to be backported to previous release trains release/1.5 Change affects v1.5 release train labels Dec 11, 2024
@QxBytes QxBytes marked this pull request as ready for review December 12, 2024 00:00
Copy link
Member

@tamilmani1989 tamilmani1989 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hope you tested and no regression with this change

@QxBytes QxBytes force-pushed the alew/single-save branch 2 times, most recently from ff712e5 to 553841b Compare December 12, 2024 23:55
@QxBytes
Copy link
Contributor Author

QxBytes commented Dec 12, 2024

/azp run Azure Container Networking PR

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@QxBytes QxBytes enabled auto-merge (squash) December 13, 2024 17:42
@QxBytes
Copy link
Contributor Author

QxBytes commented Dec 13, 2024

/azp run Azure Container Networking PR

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@QxBytes QxBytes merged commit 447f655 into release/v1.4 Dec 13, 2024
55 of 57 checks passed
@QxBytes QxBytes deleted the alew/single-save branch December 13, 2024 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cni Related to CNI. fix Fixes something. needs-backport Change needs to be backported to previous release trains release/1.4 Change affects v1.4 release train release/1.5 Change affects v1.5 release train
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants