
setHeaderRoute error and memory leak #3276

Open
dtelaroli opened this issue Dec 27, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@dtelaroli

dtelaroli commented Dec 27, 2023

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

Describe the bug

Problem 1:
argo-rollouts is adding duplicated header routes, flooding the VirtualService with content larger than etcd supports.

Problem 2:
After Problem 1 occurs, the argo-rollouts pod leaks memory until it consumes all of the node's memory, then it restarts and the cycle starts again.
This happens whenever anything generates a very large manifest; I saw the same behavior using an AnalysisRun collecting metrics for 24h.

time="2023-12-27T15:31:07Z" level=warning msg="Request entity too large: limit is 3145728" event_reason=TrafficRoutingError namespace=psm-test rollout=clismo

To Reproduce

I don't know how to reproduce Problem 1.
Problem 2 can be reproduced by creating a VirtualService with this route duplicated:

    - match:
        - headers:
            x-version:
              exact: PR-132-b36d66a
      name: header-route-version
      route:
        - destination:
            host: clismo
            subset: canary
          weight: 100

The manifest needs more than 6k lines for the error to happen.
After that, make a change to the Rollout to start a new rollout version.
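
For reference, a minimal sketch (not from the original report) of the kind of canary configuration that produces such a header route; the managed route name, header value, and the VirtualService route name primary are assumptions based on the snippet above:

# Hypothetical Rollout excerpt: declares the managed route and the
# setHeaderRoute step that argo-rollouts turns into the match block above.
spec:
  strategy:
    canary:
      trafficRouting:
        managedRoutes:
          - name: header-route-version
        istio:
          virtualService:
            name: clismo
            routes:
              - primary   # assumed name of the primary HTTP route
      steps:
        - setHeaderRoute:
            name: header-route-version
            match:
              - headerName: x-version
                headerValue:
                  exact: PR-132-b36d66a
        - pause: {}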

Expected behavior

Screenshots


Version

v1.5.0

Logs

# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>

time="2023-12-27T15:38:47Z" level=info msg="Started syncing rollout" generation=359 namespace=psm-test resourceVersion=3287885544 rollout=clismo
time="2023-12-27T15:38:48Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=psm-test rollout=clismo
time="2023-12-27T15:38:48Z" level=info msg="Reconciling TrafficRouting with type 'Istio'" namespace=psm-test rollout=clismo
time="2023-12-27T15:38:50Z" level=warning msg="Request entity too large: limit is 3145728" event_reason=TrafficRoutingError namespace=psm-test rollout=clismo
time="2023-12-27T15:38:50Z" level=error msg="roCtx.reconcile err Request entity too large: limit is 3145728" generation=359 namespace=psm-test resourceVersion=3287885544 rollout=clismo
time="2023-12-27T15:38:50Z" level=info msg="Event(v1.ObjectReference{Kind:\"Rollout\", Namespace:\"psm-test\", Name:\"clismo\", UID:\"15051ab3-a968-4673-b1af-55ac0a8c525d\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"3287885544\", FieldPath:\"\"}): type: 'Warning' reason: 'TrafficRoutingError' Request entity too large: limit is 3145728"

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@dtelaroli dtelaroli added the bug Something isn't working label Dec 27, 2023
@zachaller
Collaborator

zachaller commented Dec 27, 2023

I think this is possibly fixed in 1.6, could you try 1.6.4?

#2887

@dtelaroli
Author

dtelaroli commented Dec 27, 2023

Hi @zachaller
I have another issue that blocks me from upgrading argo-rollouts.
#3223

@dtelaroli
Author

Anyway, PR #2887 fixes Problem 1, but it doesn't solve Problem 2.

@andyliuliming
Contributor

@dtelaroli did you have any findings on the memory footprint issue?
We observed a potential memory leak in our environment too (usually memory usage is around 200Mi, but after 15 days it grows to 600Mi, although we only have about 5 rollouts in our cluster).

@dtelaroli
Author

@andyliuliming I've discovered that the issue happens when the application syncs a very large manifest.
There is a size limit, and once the manifest exceeds it argo-rollouts raises an error on every sync cycle (Request entity too large: limit is 3145728), which generates the memory leak.
Fixing the oversized manifest makes the issue disappear.
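
As a quick way to confirm this (a sketch, not part of the original comment; the namespace and VirtualService name are taken from the logs above), the size of the rendered manifest can be compared against the 3145728-byte (3 MiB) request limit:

# Roughly how many bytes the VirtualService occupies when serialized as YAML
kubectl get virtualservice clismo -n psm-test -o yaml | wc -c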

Another issue I had is that the rollout adds an empty step during setHeaderRoute: - {}
This also breaks the rollout and generates a memory leak.
