memory leak? #77

Closed
bbigras opened this issue May 22, 2017 · 21 comments

@bbigras
Contributor

bbigras commented May 22, 2017

I'm using the latest version and it's using more RAM over time. At one point it was at about 500 megabytes. I'm not sure if it's a leak; maybe the memory is reserved but not actually used.

I'm not sure which go_memstats_ metric I could check for that.
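For reference, the Go collector in client_golang exposes go_memstats_alloc_bytes (bytes of live heap objects) and go_memstats_sys_bytes (total memory obtained from the OS by the Go runtime); graphing both over time is a rough way to tell "reserved but unused" apart from "actually in use". A hypothetical pair of queries (the job label is only an example):

go_memstats_alloc_bytes{job="wmi_exporter"}
go_memstats_sys_bytes{job="wmi_exporter"}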

@martinlindhe
Collaborator

martinlindhe commented May 24, 2017

Hello @bbigras and thanks for reporting!
I'll try to reproduce this and look more into it.

Which collectors did you enable, the default ones? And which version of Windows?

@bbigras
Contributor Author

bbigras commented May 24, 2017

Only the default ones, I think. I just installed the MSI. Windows Server 2016 Standard (64-bit).

[image attached]

@carlpett
Collaborator

I've spent a while looking at this today. It's easily reproduced with a load-generator tool doing lots of scrapes, and our production installations have grown to a couple of hundred megabytes as well.
However, it is less easy to pinpoint what is actually leaking. I did a fair amount of experimenting with pprof, but couldn't make any particular sense of the data I got back.

I wrote a little test harness to try to isolate where the memory is spent, here is a gist.
Nothing really conclusive so far, though. Running a single collector, or even all of them, seems to stabilize at a handful of megabytes after around 1,500 calls. The largest test includes the promhttp handler in the call chain, and seems to stabilize around 15 MB.
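For illustration, a minimal sketch of such a harness (not the actual gist; it registers the stock Go collector as a stand-in for the exporter's WMI collectors and scrapes a promhttp handler in a loop while printing runtime.MemStats):

package main

import (
	"fmt"
	"io"
	"net/http/httptest"
	"runtime"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// The real harness would register the exporter's WMI collectors here.
	reg := prometheus.NewRegistry()
	reg.MustRegister(prometheus.NewGoCollector())

	srv := httptest.NewServer(promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	defer srv.Close()

	var ms runtime.MemStats
	for i := 1; i <= 5000; i++ {
		resp, err := srv.Client().Get(srv.URL)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body) // drain so connections are reused
		resp.Body.Close()

		if i%200 == 0 {
			runtime.ReadMemStats(&ms)
			fmt.Printf("Count: %5d  Sys: %6.1fMB  Alloc: %6.1fMB  TotalAlloc: %8.1fMB\n",
				i, float64(ms.Sys)/1e6, float64(ms.Alloc)/1e6, float64(ms.TotalAlloc)/1e6)
		}
	}
}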

There is a bug reported on the WMI library we use, but it has been closed for several years.

I'm short on ideas for what to investigate next.

@brian-brazil

Have you tried pprof? https://www.robustperception.io/analysing-prometheus-memory-usage/ has the Unix instructions; it may require adding the pprof import to the code.
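For completeness, a minimal sketch of what adding that import looks like (assuming the process serves HTTP via http.DefaultServeMux; otherwise the /debug/pprof/* handlers have to be mounted on the exporter's own mux):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side-effect import: registers /debug/pprof/* handlers
)

func main() {
	// Standalone example; in an existing exporter the blank import alone is
	// enough when it already serves from the default mux.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}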

@carlpett
Collaborator

@brian-brazil Yep, I remembered reading your blog post a few weeks back, so I re-read it (thanks!), along with a few others, this morning. I didn't really get any results that made sense to me, though. I can see lots of allocations within client_golang, expfmt and net/http, but those are cleaned up nicely after GC.

@brian-brazil

If you can share the svg I can take a quick look, preferably from a production instance that's gotten big.

@carlpett
Collaborator

Exposing my lack of pprof competence: do you mean there is a way to get a dump without recompiling with the pprof endpoint? Otherwise I only have some profiles that I have generated locally, I'm afraid.

It seems to take a couple of weeks or months for the leak to get into the hundreds of megabytes with "normal" scraping.

@carlpett
Collaborator

I think there may be a native memory leak. I added output from Windows' own measurement of used memory, and while Go's MemStats are more or less constant after a while, the OS's view of the world is different. Here's some output from the testing:

Time:    0s  Count:     0  Resident size:  6.0MB  MemStats.Sys:  8.0MB  MemStats.Alloc:  2.3MB  MemStats.TotalAlloc:   3.5MB
Time:   51s  Count:   200  Resident size:  8.8MB  MemStats.Sys:  9.1MB  MemStats.Alloc:  3.1MB  MemStats.TotalAlloc: 474.1MB
Time:  103s  Count:   400  Resident size: 11.2MB  MemStats.Sys:  9.6MB  MemStats.Alloc:  3.1MB  MemStats.TotalAlloc: 944.7MB
Time:  154s  Count:   600  Resident size: 11.3MB  MemStats.Sys: 10.6MB  MemStats.Alloc:  3.5MB  MemStats.TotalAlloc: 1413.8MB
Time:  205s  Count:   800  Resident size: 11.3MB  MemStats.Sys: 10.6MB  MemStats.Alloc:  3.1MB  MemStats.TotalAlloc: 1882.0MB
Time:  256s  Count:  1000  Resident size: 11.5MB  MemStats.Sys: 10.6MB  MemStats.Alloc:  3.1MB  MemStats.TotalAlloc: 2348.2MB
Time:  308s  Count:  1200  Resident size: 14.3MB  MemStats.Sys: 12.1MB  MemStats.Alloc:  3.1MB  MemStats.TotalAlloc: 2814.3MB
Time:  358s  Count:  1400  Resident size: 14.5MB  MemStats.Sys: 12.1MB  MemStats.Alloc:  3.2MB  MemStats.TotalAlloc: 3280.3MB
Time:  409s  Count:  1600  Resident size: 14.6MB  MemStats.Sys: 12.1MB  MemStats.Alloc:  3.2MB  MemStats.TotalAlloc: 3747.8MB
Time:  461s  Count:  1800  Resident size: 14.5MB  MemStats.Sys: 13.1MB  MemStats.Alloc:  3.5MB  MemStats.TotalAlloc: 4211.4MB
Time:  517s  Count:  2000  Resident size: 14.3MB  MemStats.Sys: 13.1MB  MemStats.Alloc:  3.7MB  MemStats.TotalAlloc: 4678.4MB
Time:  568s  Count:  2200  Resident size: 15.1MB  MemStats.Sys: 13.1MB  MemStats.Alloc:  3.1MB  MemStats.TotalAlloc: 5141.5MB
Time:  621s  Count:  2400  Resident size: 14.9MB  MemStats.Sys: 13.1MB  MemStats.Alloc:  3.6MB  MemStats.TotalAlloc: 5606.7MB
Time:  673s  Count:  2600  Resident size: 15.0MB  MemStats.Sys: 13.1MB  MemStats.Alloc:  3.2MB  MemStats.TotalAlloc: 6074.9MB
Time:  725s  Count:  2800  Resident size: 15.1MB  MemStats.Sys: 13.1MB  MemStats.Alloc:  3.6MB  MemStats.TotalAlloc: 6544.2MB
Time:  777s  Count:  3000  Resident size: 15.0MB  MemStats.Sys: 13.1MB  MemStats.Alloc:  3.5MB  MemStats.TotalAlloc: 7012.3MB
Time:  828s  Count:  3200  Resident size: 15.3MB  MemStats.Sys: 13.1MB  MemStats.Alloc:  3.5MB  MemStats.TotalAlloc: 7480.7MB
Time:  879s  Count:  3400  Resident size: 15.4MB  MemStats.Sys: 13.1MB  MemStats.Alloc:  3.2MB  MemStats.TotalAlloc: 7947.2MB
Time:  932s  Count:  3600  Resident size: 15.2MB  MemStats.Sys: 13.1MB  MemStats.Alloc:  3.6MB  MemStats.TotalAlloc: 8415.2MB
Time:  984s  Count:  3800  Resident size: 15.3MB  MemStats.Sys: 13.4MB  MemStats.Alloc:  3.6MB  MemStats.TotalAlloc: 8884.5MB
Time: 1038s  Count:  4000  Resident size: 15.5MB  MemStats.Sys: 13.4MB  MemStats.Alloc:  3.5MB  MemStats.TotalAlloc: 9354.6MB
Time: 1091s  Count:  4200  Resident size: 15.5MB  MemStats.Sys: 13.4MB  MemStats.Alloc:  3.6MB  MemStats.TotalAlloc: 9823.1MB
Time: 1144s  Count:  4400  Resident size: 15.6MB  MemStats.Sys: 13.4MB  MemStats.Alloc:  3.6MB  MemStats.TotalAlloc: 10291.6MB

(That's quite a bit of memory churn, by the way: 10 GB allocated over 4,400 scrapes.)
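As a sketch of how the "Resident size" column above can be obtained on Windows alongside Go's MemStats, here is a hypothetical helper that calls GetProcessMemoryInfo from psapi.dll directly (the actual harness may do this differently):

package main

import (
	"fmt"
	"runtime"
	"syscall"
	"unsafe"
)

// processMemoryCounters mirrors the Win32 PROCESS_MEMORY_COUNTERS struct.
type processMemoryCounters struct {
	cb                         uint32
	PageFaultCount             uint32
	PeakWorkingSetSize         uintptr
	WorkingSetSize             uintptr
	QuotaPeakPagedPoolUsage    uintptr
	QuotaPagedPoolUsage        uintptr
	QuotaPeakNonPagedPoolUsage uintptr
	QuotaNonPagedPoolUsage     uintptr
	PagefileUsage              uintptr
	PeakPagefileUsage          uintptr
}

var getProcessMemoryInfo = syscall.NewLazyDLL("psapi.dll").NewProc("GetProcessMemoryInfo")

// residentSize returns the working set size in bytes, as seen by Windows.
func residentSize() (uint64, error) {
	handle, err := syscall.GetCurrentProcess()
	if err != nil {
		return 0, err
	}
	var pmc processMemoryCounters
	pmc.cb = uint32(unsafe.Sizeof(pmc))
	ret, _, callErr := getProcessMemoryInfo.Call(
		uintptr(handle), uintptr(unsafe.Pointer(&pmc)), uintptr(pmc.cb))
	if ret == 0 {
		return 0, callErr
	}
	return uint64(pmc.WorkingSetSize), nil
}

func main() {
	rss, err := residentSize()
	if err != nil {
		panic(err)
	}
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("Resident size: %.1fMB  MemStats.Sys: %.1fMB  MemStats.Alloc: %.1fMB\n",
		float64(rss)/1e6, float64(ms.Sys)/1e6, float64(ms.Alloc)/1e6)
}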

@brian-brazil

> Exposing my lack of pprof competence: you mean there is a way to get a dump without recompiling with the pprof endpoint?

You need to recompile.

@carlpett
Collaborator

OK, as I thought, then. Since it takes quite a while for the memory to accumulate, I'll run a couple hundred thousand requests on my local machine and grab the dump when that's done. It should hopefully be equivalent, but without the month of waiting.
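A trivial load generator along those lines might look like this (hypothetical sketch; it assumes the exporter is running locally on its default port 9182):

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	const target = "http://localhost:9182/metrics"
	for i := 1; i <= 200000; i++ {
		resp, err := http.Get(target)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		if i%1000 == 0 {
			fmt.Println("scrapes done:", i)
		}
	}
}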

@carlpett
Collaborator

@brian-brazil While we're waiting (21k reqs done...), what output do you want? inuse_space?

@brian-brazil

The default, inuse_space.

@carlpett
Collaborator

After ~68k requests, the RSS is at ~32 MB. This would correspond to about 9.5 hours of scrapes in my production environment (30s interval), so still quite a way to simulate weeks or months...

Turns out I don't have graphviz on this machine, but I've dumped the .dot output here. http://www.webgraphviz.com/ works decently as a web renderer.
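For reference, the .dot output can be produced without graphviz installed by asking pprof for the dot report format directly (hypothetical file names, and assuming a heap profile saved from the rebuilt binary):

go tool pprof -dot wmi_exporter.exe heap.pprof > heap.dot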

@brian-brazil

Nothing stands out to me there. Maybe run it a bit longer and compare the two?
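One way to compare two heap dumps, assuming go tool pprof's -base flag and hypothetical file names, would be:

go tool pprof -base earlier.pprof wmi_exporter.exe later.pprof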

@carlpett
Collaborator

I've deployed the build with pprof to one of our instances, so let's have a look at what it looks like after the weekend.

@carlpett
Collaborator

@brian-brazil See attached file. The output is even less understandable to me in this run. The last dump was taken a few moments ago, and the resident size is now reported as 96 MB by Windows.

pprofs.zip

@brian-brazil

That needs the binary to be useful. Can you load them up and do a "top10" in the CLI for each?
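For example (binary name is a placeholder, profile names as in the dumps below):

go tool pprof wmi_exporter.exe 20170527-2135.pprof
(pprof) top10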

@carlpett
Collaborator

carlpett commented May 30, 2017

20170527-2135.pprof
1542.05kB of 1542.05kB total (  100%)
Dropped 195 nodes (cum <= 7.71kB)
Showing top 10 nodes out of 19 (cum >= 518.02kB)
      flat  flat%   sum%        cum   cum%
  518.02kB 33.59% 33.59%   518.02kB 33.59%  github.com/martinlindhe/wmi_exporter/vendor/github.com/beorn7/perks/quantile.NewTargeted
  512.02kB 33.20% 66.80%   512.02kB 33.20%  github.com/martinlindhe/wmi_exporter/collector.(*serviceCollector).collect
  512.02kB 33.20%   100%   512.02kB 33.20%  vendor/golang_org/x/net/http2/hpack.addDecoderNode
         0     0%   100%   512.02kB 33.20%  github.com/martinlindhe/wmi_exporter/collector.(*serviceCollector).Collect
         0     0%   100%   518.02kB 33.59%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).GetMetricWithLabelValues
         0     0%   100%   518.02kB 33.59%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).WithLabelValues
         0     0%   100%   518.02kB 33.59%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).getOrCreateMetricWithLabelValues
         0     0%   100%   518.02kB 33.59%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*SummaryVec).WithLabelValues
         0     0%   100%   518.02kB 33.59%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*summary).newStream
         0     0%   100%   518.02kB 33.59%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.NewSummaryVec.func1

20170528-1516.pprof
1542.08kB of 1542.08kB total (  100%)
Dropped 215 nodes (cum <= 7.71kB)
Showing top 10 nodes out of 25 (cum >= 518.02kB)
      flat  flat%   sum%        cum   cum%
  518.02kB 33.59% 33.59%   518.02kB 33.59%  github.com/martinlindhe/wmi_exporter/vendor/github.com/beorn7/perks/quantile.NewTargeted
  512.04kB 33.20% 66.80%   512.04kB 33.20%  runtime.acquireSudog
  512.02kB 33.20%   100%   512.02kB 33.20%  vendor/golang_org/x/net/http2/hpack.addDecoderNode
         0     0%   100%   512.04kB 33.20%  github.com/martinlindhe/wmi_exporter/collector.(*CSCollector).Collect
         0     0%   100%   512.04kB 33.20%  github.com/martinlindhe/wmi_exporter/collector.(*CSCollector).collect
         0     0%   100%   512.04kB 33.20%  github.com/martinlindhe/wmi_exporter/vendor/github.com/StackExchange/wmi.(*Client).Query
         0     0%   100%   512.04kB 33.20%  github.com/martinlindhe/wmi_exporter/vendor/github.com/StackExchange/wmi.Query
         0     0%   100%   518.02kB 33.59%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).GetMetricWithLabelValues
         0     0%   100%   518.02kB 33.59%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).WithLabelValues
         0     0%   100%   518.02kB 33.59%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).getOrCreateMetricWithLabelValues

20170529-0810.pprof
1545.36kB of 1545.36kB total (  100%)
Dropped 225 nodes (cum <= 7.73kB)
Showing top 10 nodes out of 32 (cum >= 518.02kB)
      flat  flat%   sum%        cum   cum%
  518.02kB 33.52% 33.52%   518.02kB 33.52%  github.com/martinlindhe/wmi_exporter/vendor/github.com/beorn7/perks/quantile.NewTargeted
  515.32kB 33.35% 66.87%   515.32kB 33.35%  bytes.makeSlice
  512.02kB 33.13%   100%   512.02kB 33.13%  vendor/golang_org/x/net/http2/hpack.addDecoderNode
         0     0%   100%   515.32kB 33.35%  bytes.(*Buffer).Write
         0     0%   100%   515.32kB 33.35%  bytes.(*Buffer).grow
         0     0%   100%   515.32kB 33.35%  fmt.Fprintf
         0     0%   100%   518.02kB 33.52%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).GetMetricWithLabelValues
         0     0%   100%   518.02kB 33.52%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).WithLabelValues
         0     0%   100%   518.02kB 33.52%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).getOrCreateMetricWithLabelValues
         0     0%   100%   518.02kB 33.52%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*SummaryVec).WithLabelValues

20170530-0852.pprof
3980.69kB of 3980.69kB total (  100%)
Dropped 239 nodes (cum <= 19.90kB)
Showing top 10 nodes out of 33 (cum >= 518.02kB)
      flat  flat%   sum%        cum   cum%
 1536.02kB 38.59% 38.59%  1536.02kB 38.59%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.makeLabelPairs
  902.59kB 22.67% 61.26%   902.59kB 22.67%  compress/flate.NewWriter
  518.02kB 13.01% 74.27%   518.02kB 13.01%  github.com/martinlindhe/wmi_exporter/vendor/github.com/beorn7/perks/quantile.NewTargeted
  512.05kB 12.86% 87.14%   512.05kB 12.86%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*Registry).Gather
  512.02kB 12.86%   100%   512.02kB 12.86%  vendor/golang_org/x/net/http2/hpack.addDecoderNode
         0     0%   100%   902.59kB 22.67%  compress/gzip.(*Writer).Write
         0     0%   100%  1536.02kB 38.59%  github.com/martinlindhe/wmi_exporter/collector.(*serviceCollector).Collect
         0     0%   100%  1536.02kB 38.59%  github.com/martinlindhe/wmi_exporter/collector.(*serviceCollector).collect
         0     0%   100%   902.59kB 22.67%  github.com/martinlindhe/wmi_exporter/vendor/github.com/matttproud/golang_protobuf_extensions/pbutil.WriteDelimited
         0     0%   100%   518.02kB 13.01%  github.com/martinlindhe/wmi_exporter/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).GetMetricWithLabelValues

This seems strange, no? (EDIT: removed an extra negation; it does seem strange. Go sees 4 MB, while the OS sees 100 MB.)

@brian-brazil

That looks steady; the last one caught a scrape in flight, I'd guess. Smells like a leak outside of Go.

@carlpett
Collaborator

As an update, I've just deployed a version with the fix suggested in the above-referenced StackExchange/wmi issue on a few of my machines. I'll let it run over the weekend, and if it looks good, I'll push my branch.

@carlpett
Collaborator

So, I restarted the service on another node at the same time to get a comparison. The patched version is now at ~36 MB, while the unpatched one is at ~359 MB, so I'd say it's working :)

I'll submit a PR later today.
