Memory Threshold exceeded #591
As a preliminary result, I would increase our configured memory threshold. Let's confirm that the memory usage is valid by checking the monitoring for memory spikes (once either InfluxDB or the Telegraf Dashboard is accessible again).
In our Telegraf Dashboard, we see that the memory usage likely correlates with Poseidon's usage volume and includes spikes. However, the spikes do not exceed 2 GB and flatten without external intervention. When analyzing the two events of 13-05-2024, we notice that both […].

As mentioned before, I would go with increasing the configured memory threshold. An alternative might be to decrease Sentry's Sample Rate.
Thanks for analyzing. I am fine with increasing the threshold, but I still think we should also find a way to tackle one of the "root causes" to reach a permanent solution. With regard to Sentry, I think a memory usage of 96% seems way too high, close to being unacceptable. Do we have any information on why the memory usage is so high for Sentry? Are events not correctly forwarded to the relay? Is some data kept in memory even after being submitted successfully? Do we need to submit traces more often to reduce the memory footprint? Or shall we disable some information, such as the version information of all dependencies (these are usually included in a Sentry report, at least on Ruby)? What I am trying to say: I am not really convinced that the large memory footprint is legitimate given the expected details collected per request and the number of users we recently had, even when sampling all requests (as we do right now).
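For reference, the knobs mentioned above do exist in the sentry-go SDK. The following is only a minimal sketch of what adjusting them could look like, assuming an initialization roughly like Poseidon's; the DSN and the rates are placeholders, and the integration name should be verified against the SDK version in use:

```go
package main

import (
	"log"

	"github.com/getsentry/sentry-go"
)

func main() {
	// Sketch only: the DSN and the rates are placeholders, not Poseidon's actual values.
	err := sentry.Init(sentry.ClientOptions{
		Dsn: "https://examplePublicKey@o0.ingest.sentry.io/0",
		// SampleRate applies to error events; TracesSampleRate applies to
		// performance transactions. Lowering the latter reduces how much
		// tracing data is buffered and marshaled.
		SampleRate:       1.0,
		TracesSampleRate: 0.2,
		// Dropping the "Modules" integration removes the dependency/version
		// list from reports (double-check the integration name for the
		// sentry-go version in use).
		Integrations: func(integrations []sentry.Integration) []sentry.Integration {
			filtered := integrations[:0]
			for _, integration := range integrations {
				if integration.Name() != "Modules" {
					filtered = append(filtered, integration)
				}
			}
			return filtered
		},
	})
	if err != nil {
		log.Fatalf("sentry.Init: %v", err)
	}
}
```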
Good call spending more time on this 👍

To find the root cause here, we have a look at the memory profiles from the 28th of April and from the 7th of May and categorize the findings into Issue 1, Issue 2, and Issue 3 (split into Issue 3.1 and Issue 3.2).

While the Sentry-unrelated memory usage has increased more strongly, its magnitude is more susceptible to randomness and is by far not the driving factor of the increase. The increase in the Sentry memory usage can be explained by each marshaling requiring more memory.
Thanks, that's very useful! And I like the categorization as well as the visualization very much; really helpful!

Issue 1: […]

Issue 2: A regular response from the relay looks like {"id":"c7f60ff3cd4542cb841450effcf97c1b"}, while the 2-byte responses (which might also occur for web-based sessions) are just returning: {}. I haven't observed a response with code 413 during my short investigation, so that is left for you :). My recommendation: try to gather more insights on the data being sent and on the error case. You may adjust the nginx logging to include the request (body) size and/or do some Wireshark tracing; since our internal network traffic is not HTTPS encrypted, you'll be able to read it. Only once we have an example of a payload being too large would I further increase the corresponding limit (see also the sketch after this comment).

Issue 1 (No. 2): […]

Issue 3 / Issue 3.1 / Issue 3.2: […]
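To gather the "data being sent" insight from inside Poseidon as well, one option could be sentry-go's BeforeSend hook. This is a hedged sketch only: the DSN is a placeholder, and the JSON size merely approximates the envelope size that actually reaches the relay.

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/getsentry/sentry-go"
)

func main() {
	err := sentry.Init(sentry.ClientOptions{
		Dsn: "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder DSN
		// Log the approximate serialized size of each event before submission.
		BeforeSend: func(event *sentry.Event, hint *sentry.EventHint) *sentry.Event {
			if payload, jsonErr := json.Marshal(event); jsonErr == nil {
				log.Printf("sentry event %s: ~%d bytes", event.EventID, len(payload))
			}
			return event // forward the event unchanged
		},
	})
	if err != nil {
		log.Fatalf("sentry.Init: %v", err)
	}
}
```

Note that newer sentry-go releases route transactions through a separate BeforeSendTransaction hook, so the largest payloads (traces) would likely need that hook instead.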
Thank you! This question cleared up my misunderstanding: in contrast to goroutine dumps, the stack traces in the memory profile represent not the place where the data is currently being processed but where it was first allocated.
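For context, this is exactly how Go heap profiles work: the profiler records the call stack at allocation time, so the profile attributes live memory to allocation sites. A minimal sketch, assuming the standard net/http/pprof endpoint is (or could be) exposed; the port is arbitrary:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// A heap profile records the call stack at allocation time, so
	// "go tool pprof" attributes live memory to the code that first
	// allocated it, not to the goroutine currently holding it.
	//
	// Collect and compare two snapshots, for example:
	//   curl -o heap-a.prof http://localhost:6060/debug/pprof/heap
	//   curl -o heap-b.prof http://localhost:6060/debug/pprof/heap   (taken later)
	//   go tool pprof -base heap-a.prof heap-b.prof
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```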
Issue 2: However, as we are allocating […]

Issue 3: At some point, we might separate this topic into its own GitHub issue, as we have different priorities and the investigation seems to steer in a different direction.

Issue 3.1: The goroutines of our local storage are started on Poseidon startup or for each new environment and new runner (poseidon/pkg/storage/storage.go, line 85 at commit 48f3228; a simplified sketch follows after this comment). Due to our limited data, we have to mind the number of runners (and environments) at the time of the data (goroutine dump).

Issue 3.2: The […]
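The following is a hypothetical sketch of that pattern (function and parameter names are invented, not taken from poseidon/pkg/storage/storage.go): a goroutine per storage instance that only exits once its context is cancelled, which is why every environment and runner adds to the goroutine count until it is destroyed.

```go
package storage

import (
	"context"
	"time"
)

// startPeriodicTask is a hypothetical stand-in for the goroutine started per
// storage instance (i.e., per environment/runner). It exits only when ctx is
// cancelled; if the runner context is never stopped, the goroutine leaks.
func startPeriodicTask(ctx context.Context, interval time.Duration, task func()) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return // runner destroyed: the goroutine ends here
			case <-ticker.C:
				task()
			}
		}
	}()
}
```

Consequently, a goroutine dump counts one such goroutine per environment/runner alive at the time of the dump, which is the caveat above about minding the number of runners.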
Issue 1: Thanks for providing more details here.

Issue 2: Regarding the size: when I last checked, most HTTP requests were not in the range of 200 MB, but we might have singular traces that large.

Issue 3 / Issue 3.1 / Issue 3.2: […]
Issue 3.1: Next, we will analyze the log entries for some of these runners to identify why the runner (and monitoring) context has not been stopped, i.e., why the runner has not been destroyed (at the latest by the Inactivity Timer).
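To make the inactivity-timer mechanism concrete, here is a hypothetical sketch (type and method names are invented, not Poseidon's actual runner code) of a timer that cancels the runner's context so that all goroutines listening on it, such as monitoring and storage, terminate:

```go
package runner

import (
	"context"
	"time"
)

// InactiveRunner is a hypothetical wrapper: after inactivityTimeout without a
// ResetTimer call, the runner's context is cancelled, and every goroutine
// selecting on ctx.Done() (monitoring, storage, ...) returns.
type InactiveRunner struct {
	cancel context.CancelFunc
	timer  *time.Timer
}

func NewInactiveRunner(parent context.Context, inactivityTimeout time.Duration) (context.Context, *InactiveRunner) {
	ctx, cancel := context.WithCancel(parent)
	r := &InactiveRunner{cancel: cancel}
	r.timer = time.AfterFunc(inactivityTimeout, cancel) // destroy on inactivity
	return ctx, r
}

// ResetTimer postpones destruction while the runner is still in use.
func (r *InactiveRunner) ResetTimer(inactivityTimeout time.Duration) {
	r.timer.Reset(inactivityTimeout)
}
```

If the cancellation is never triggered for a runner, its context and every goroutine derived from it stay alive, which is exactly the leak pattern the log analysis is meant to surface.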
Issue 2: […]
We conclude that the interplay between Goroutine leaks within Poseidon and the usage of Sentry caused the issue. We identified two causes of Goroutine leaks (#601, #602). However, they were leaking so slowly that their direct impact did not cause any issues. Instead, Sentry caused high memory usage in response to the Goroutine leaks. To be precise, Sentry's Profiling feature collects the stack traces of all Goroutines at a fine sampling rate. Therefore, the memory usage of the profiling depends mostly on the sampling rate (which has no documented configuration option), the number of Goroutines, and the variability of the Goroutine stacks. With the two action points taken, we will reduce the number of Goroutines and the memory usage, and close this investigation issue.
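As a closing aside, a leak of this kind can be spotted early by watching the goroutine count itself; a minimal sketch, assuming a periodic log line is sufficient (in practice the value would rather be reported to Telegraf/InfluxDB):

```go
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	// Log the current number of goroutines every minute. A steadily growing
	// count (rather than spikes that flatten again) indicates a leak like the
	// ones addressed in #601 and #602, long before Sentry's profiling turns
	// the extra goroutine stacks into noticeable memory pressure.
	for range time.Tick(time.Minute) {
		log.Printf("goroutines: %d", runtime.NumGoroutine())
	}
}
```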
Sentry Issue: POSEIDON-4H
Last Memory Issue: #453