Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: [WIN-NPM] process updatePods in fifo order #1856

Merged
merged 14 commits into from
Apr 12, 2023

Conversation

huntergregory
Copy link
Contributor

@huntergregory huntergregory commented Mar 17, 2023

Reason for Change:
Fully solve #1729. When the issue’s scenario occurs, ordering of Pods matters for UpdatePod() within ApplyDataPlane().

UTs have been failing for sequence 1 and 2 of this issue, and the chance of failure is higher when applying IPSets in background.

Issue Fixed:

Requirements:

Notes:

Solution

Handle updatePod objects in FIFO order, the same order as the control plane.

Results

Update Pod ACLs in same order as the control plane, e.g., the first Pod created is the first Pod to have proper connectivity.

@huntergregory huntergregory added npm Related to NPM. windows labels Mar 17, 2023
@huntergregory huntergregory requested a review from vakalapa March 17, 2023 18:02
@huntergregory huntergregory requested a review from a team as a code owner March 17, 2023 18:02
@huntergregory
Copy link
Contributor Author

Closing because this issue would be extremely rare. All of these rare events would have to coincide.

  1. Sequence 2 of Edge Case for memory-starved Windows Server 2022 Nodes with NPM #1729
  2. Pods in question are on the same node
  3. RefreshEndpoints failure

@huntergregory huntergregory reopened this Apr 10, 2023
@huntergregory
Copy link
Contributor Author

Closing because this issue would be extremely rare. All of these rare events would have to coincide.

  1. Sequence 2 of Edge Case for memory-starved Windows Server 2022 Nodes with NPM #1729
  2. Pods in question are on the same node
  3. RefreshEndpoints failure

The chance of this occurring is higher when applying ipsets in background (bullet point 3 is no longer required).

Have now seen sequence 1 fail too.

@@ -149,6 +174,9 @@ func (dp *DataPlane) AddToSets(setNames []*ipsets.IPSetMetadata, podMetadata *Po
klog.Infof("[DataPlane] {AddToSet} pod key %s not found in updatePodCache. creating a new obj", podMetadata.PodKey)
updatePod = newUpdateNPMPod(podMetadata)
dp.updatePodCache.cache[podMetadata.PodKey] = updatePod

// add to queue only if not in the cache/queue already
dp.updatePodCache.queue = append(dp.updatePodCache.queue, podMetadata.PodKey)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are we checking if the key is not in queue ? we will need to have a 2nd check here,

@@ -177,6 +205,9 @@ func (dp *DataPlane) RemoveFromSets(setNames []*ipsets.IPSetMetadata, podMetadat
klog.Infof("[DataPlane] {RemoveFromSet} pod key %s not found in updatePodCache. creating a new obj", podMetadata.PodKey)
updatePod = newUpdateNPMPod(podMetadata)
dp.updatePodCache.cache[podMetadata.PodKey] = updatePod

// add to queue only if not in the cache/queue already
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to add a secondary check here if the key is in queue or not,

defer dp.updatePodCache.Unlock()
defer func() {
dp.updatePodCache.removeDeletedItemsFromQueue()
dp.updatePodCache.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be create a datastructure called ordered Map ? and use that as the update pod cache ?

example: https://github.com/iancoleman/orderedmap/blob/master/orderedmap.go

// enqueue adds a pod to the queue if necessary and returns the pod object used
func (c *updatePodCache) enqueue(m *PodMetadata) *updateNPMPod {
pod, ok := c.cache[m.PodKey]
if !ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the nodename got updated ? we will need to check if the pod objs are same if not update the element in map and move the pod obj to the end ?

// Caller should ensure the queue is not empty.
// Otherwise, the following will occur: "panic: runtime error: index out of range [0] with length 0"
func (c *updatePodCache) dequeue() *updateNPMPod {
pod := c.cache[c.queue[0]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if 0th element is present before calling for it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed offline. caller must lock updatePodCache and check isEmpty() before calling dequeue()

@huntergregory huntergregory force-pushed the hgregory/npm-fifo-updatepod branch from 692aa51 to 3e8d187 Compare April 11, 2023 18:06
@vakalapa vakalapa merged commit 026c201 into master Apr 12, 2023
@vakalapa vakalapa deleted the hgregory/npm-fifo-updatepod branch April 12, 2023 16:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
npm Related to NPM. windows
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants