memory leak? #77
Hello @bbigras and thanks for reporting! Which collectors did you enable, the default ones? And which version of Windows?
I've spent a while looking at this today. It's easily reproduced with a load generator doing lots of scrapes, and our production installations have grown to a couple of hundred megabytes as well. I wrote a little test harness to try to isolate where the memory is spent; here is a gist. There is a bug reported on the WMI library we use, but it has been closed for several years. Short on ideas to investigate now.
Have you tried pprof? https://www.robustperception.io/analysing-prometheus-memory-usage/ has the Unix instructions; it may require adding the pprof import to the code.
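(For reference, the usual way to expose the profiling endpoints in a Go binary is a blank `net/http/pprof` import. A minimal sketch of what such a rebuild might look like is below; this is not the exporter's actual `main`, and `:9182` is used only because it is wmi_exporter's default port.)

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side-effect import: registers /debug/pprof/* on http.DefaultServeMux

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve metrics as usual; because the pprof handlers above hang off the
	// default mux, the heap profile becomes available at /debug/pprof/heap
	// on the same listener.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9182", nil))
}
```

With that in place, `go tool pprof http://localhost:9182/debug/pprof/heap` fetches a heap profile from the running process.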
@brian-brazil Yep, I remembered reading your blog post a few weeks back, so I re-read it (thanks!), and a few others, this morning. Didn't really get any results that made sense to me, though. I can see lots of allocations within client_golang, expfmt and net/http, but those are cleaned up nicely after GC.
If you can share the SVG I can take a quick look, preferably from a production instance that's gotten big.
Exposing my lack of pprof competence: you mean there is a way to get a dump without recompiling with the pprof endpoint? Otherwise I only have some profiles that I have generated locally, I'm afraid. It seems to take a couple of weeks/months for the leak to go into the hundreds of megabytes with "normal scraping".
I think there may be a native memory leak. I added output of Windows' own measure of used memory, and while Go's MemStats are more or less constant after a while, the OS's view of the world is different. Here's some output from the testing:
(That's quite a bit of memory churn... 10 GB for 4400 scrapes)
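(The gist itself isn't reproduced above. A rough sketch of that kind of harness, assuming the exporter is listening on its default `:9182`, could look like the following; `go_memstats_alloc_bytes` is the metric client_golang exposes for the process's live Go heap, and the PowerShell comparison in the comments is only a suggestion.)

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	const target = "http://localhost:9182/metrics" // assumed default wmi_exporter address

	for i := 1; i <= 100000; i++ {
		resp, err := http.Get(target)
		if err != nil {
			fmt.Println("scrape failed:", err)
			continue
		}

		// Pull the exporter's own view of its live Go heap out of the scrape.
		var heapAlloc string
		sc := bufio.NewScanner(resp.Body)
		for sc.Scan() {
			if line := sc.Text(); strings.HasPrefix(line, "go_memstats_alloc_bytes ") {
				heapAlloc = strings.TrimPrefix(line, "go_memstats_alloc_bytes ")
			}
		}
		resp.Body.Close()

		if i%1000 == 0 {
			// Compare this figure against the OS view of the exporter process,
			// e.g. (Get-Process wmi_exporter).WorkingSet64 in PowerShell. If the
			// Go heap stays flat while the working set keeps growing, the memory
			// is probably allocated outside the Go runtime (native/COM).
			fmt.Printf("scrapes=%d go_memstats_alloc_bytes=%s\n", i, heapAlloc)
		}
	}
}
```

If the Go heap number stays flat while the process's working set keeps climbing, that points at memory allocated outside the Go runtime, which matches the observation above.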
You need to recompile.
Ok, as I thought then. Since it takes quite a while for the leak to amass, I'll try to run a couple hundred thousand requests on my local machine and get the dump when it's done. Should hopefully be equivalent, but without the month of waiting.
@brian-brazil While we're waiting (21k reqs done...), what output do you want?
The default inuse_space.
After ~68k requests, the RSS is at ~32 MB. This would correspond to about 9.5 hours of scrapes in my production environment (30s interval), so there's still quite a way to go to simulate weeks or months... Turns out I don't have graphviz on this machine, but I've dumped the .dot output here. http://www.webgraphviz.com/ works decently as a web renderer.
Nothing stands out to me there. Maybe run it a bit longer and compare the two?
I've deployed the build with pprof to one of our instances, so let's see what it looks like after the weekend.
@brian-brazil See attached file. The output is even less understandable to me in this run. The last dump was taken a few moments ago, and the resident size is now reported as 96 MB by Windows.
That needs the binary to be useful. Can you load them up and do a "top10" in the CLI for each?
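(For anyone following along: loading a saved heap dump together with the exact binary that produced it and printing the top allocators looks roughly like this; the file names are illustrative.)

```
go tool pprof -inuse_space wmi_exporter.exe heap.pprof
(pprof) top10
```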
This seems strange, no? (EDIT: Removed an extra negation. It does seem strange, not not strange. Go sees 4 MB, the OS 100 MB)
That looks steady; the last one caught a scrape, I'd guess. Smells like a leak outside Go.
As an update, I've just deployed a version on a few of my machines with the fix suggested in the above-referenced issue on StackExchange/wmi. I'll let it run over the weekend, and if it looks good, I'll push my branch.
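(The patch itself isn't shown in the thread. One mitigation that has been suggested for the StackExchange/wmi leak is to set up a single SWbemServices connection once and reuse it for every query, rather than letting each `wmi.Query` call initialize and tear down COM on its own. A hypothetical sketch of that approach, with `Win32_OperatingSystem` used purely as an example class:)

```go
package main

import (
	"fmt"
	"log"

	"github.com/StackExchange/wmi"
)

// Win32_OperatingSystem declares only the field we query; the wmi package
// maps the struct name to the WMI class of the same name.
type Win32_OperatingSystem struct {
	FreePhysicalMemory uint64 // reported by WMI in kilobytes
}

func main() {
	// Initialize COM / SWbemServices once and reuse it for every scrape,
	// instead of paying the setup/teardown cost (and, per the linked issue,
	// the leaked native memory) on each individual query.
	svc, err := wmi.InitializeSWbemServices(wmi.DefaultClient)
	if err != nil {
		log.Fatal(err)
	}
	defer svc.Close()

	q := wmi.CreateQuery(&[]Win32_OperatingSystem{}, "")
	for i := 0; i < 3; i++ { // in the exporter this would run once per scrape
		var dst []Win32_OperatingSystem
		if err := svc.Query(q, &dst); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("free physical memory: %d KB\n", dst[0].FreePhysicalMemory)
	}
}
```

The point of the change is that COM setup and teardown happen once per process instead of once per query, which is where the native, non-Go memory appeared to be accumulating.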
So, I restarted the service on another node at the same time to get a comparison. The patched version is now at ~36 MB, while the unpatched one is at ~359 MB. So I'd say it is working :) Will submit a PR later today.
I use the latest version and it's using more RAM over time. At one point it was at about 500 megabytes. I'm not sure if it's a leak; maybe the memory is reserved but not really used.
Not sure which go_memstats_* metric I could check for that.