fix: save cni state only during endpoint creation or deletion #3254
Reason for Change:
This PR removes saving to the azure cni statefile from the AddExternalInterface and CreateNetwork methods, since we only want to commit to the statefile once the endpoint is created. Saving the state writes all in-memory state, including all networks, external interfaces, and endpoints, so it's unnecessary to save after each update; we only need to save once at the end, which is when the endpoint is created.
We also backport https://github.com/Azure/azure-container-networking/pull/2309/files here
Issue Fixed:
Previously, if we wrote the external interface to the state and then crashed before writing the corresponding network, the subsequent DEL call would not clean up the IPs: azure cni does not see the network in the azure cni statefile, so the IPs are leaked. We could not auto recover by deleting the azure ipam statefile either, since the program detects that the azure cni statefile exists (autorecovery is here: https://github.com/Azure/azure-container-networking/blob/release/v1.4/cni/network/invoker_azure.go#L58).
Now, if there is a panic or crash between adding the external interface and adding the network, no statefile is written, so azure cni auto recovers by deleting the ipam statefile (since it determines that the azure cni statefile is not present) and continues as normal.
Requirements:
Notes:
Tested by forcing a panic between adding the external interface and creating the network: no azure cni statefile is created, and when the forced panic is removed, we can successfully create an endpoint with no leaked IPs.
Also tested forcing a panic during endpoint create: the logs mention that we clean up the IP (issue ipam delete), and when we remove the forced panic, there are no leaked IPs (the number of IPs in the azure ipam statefile matches the azure cni statefile).
Confirmed that we will never have a scenario where azure cni saves only an external interface (without a network): either the interface, network, and endpoint are all saved to the statefile, or none of them are.