
Bandwhich appears to have a serious memory leak #284

Closed
popey opened this issue Sep 9, 2023 · 39 comments
Labels
bug Something isn't working

Comments

@popey commented Sep 9, 2023

I built and ran bandwhich on a few machines. After a while of running, it dies. I see this in my logs

[Sat Sep  9 14:11:34 2023] [ 910813]     0 910813 25425140  6316768 76255232  3166464           200 bandwhich
[Sat Sep  9 14:11:34 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=zerotier-one.service,mems_allowed=0,global_oom,task_memcg=/system.slice/snap.bandwhich.bandwhich-92de3b50-fb9a-430c-9645-eac2cde77fe0.scope,task=bandwhich,pid=910813,uid=0
[Sat Sep  9 14:11:34 2023] Out of memory: Killed process 910813 (bandwhich) total-vm:101700560kB, anon-rss:25267072kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:74468kB oom_score_adj:200
[Sat Sep  9 14:11:37 2023] oom_reaper: reaped process 910813 (bandwhich), now anon-rss:88kB, file-rss:0kB, shmem-rss:0kB

While I've seen this on small remote servers, I have also seen it on my workstation, which has 32GB RAM and 22GB swap.

@popey (Author) commented Sep 9, 2023

[image]

Caught a screenshot from bottom.

Note how the purple line shoots up (that's bandwhich eating RAM), then the yellow line climbs as it eats all the swap, and then it gets killed by the OOM killer.

@cyqsimon (Collaborator) commented Sep 11, 2023

Is this problem happening on main?

Bandwhich itself does not use any unsafe code, so it's likely that the memory leak was caused by a dependency. We've had lots of dependency bumps since the last release, so if the last release is indeed the version you are seeing problems on, this may already have been solved.

@popey (Author) commented Sep 11, 2023

I was using whatever was in main when I filed the issue, yes. Happy to test again after the changes that landed today, but that all looks like CD pipeline stuff?

I found it quite easily reproducible on multiple machines. Just leave it running for a while and eventually it blows up. It might take many minutes or hours, perhaps depending on traffic volume.

I built it with rust 1.70.0 if that makes any difference?

@cyqsimon (Collaborator)

> but that all looks like CD pipeline stuff?

Yes that's true. You don't need to test again =).

> I found it quite easily reproducible on multiple machines for me. Just leave it running for a while and eventually, it blows up. Might take many minutes or hours, depending on traffic volume, maybe.

Curious. I'm no expert in debugging memory issues, but I'll give it a try.

> I built it with rust 1.70.0 if that makes any difference?

I doubt it.

@cyqsimon cyqsimon added bug Something isn't working help wanted Extra attention is needed labels Sep 11, 2023
@sigmaSd (Collaborator) commented Sep 15, 2023

I can't reproduce on my Linux machine. Can you run it in debug mode with flamegraph (https://github.com/flamegraph-rs/flamegraph)? It can give some hints.

@cyqsimon (Collaborator)

Cannot reproduce on my box (Linux 6.5.3-arch1-1 x86_64) either. I kept bandwhich running for an hour while I worked, and the memory usage stayed constant at around 9.7M.

@cyqsimon cyqsimon added needs reproduction and removed help wanted Extra attention is needed labels Sep 20, 2023
@cyqsimon (Collaborator)

Out of curiosity, was all of your testing done on the same distro? @popey

As unlikely as it is, maybe there's something wrong with the allocator or even the kernel?

@gms8994 commented Sep 21, 2023

I tested this on a box specifically trying to reproduce this and could: memory allocation of 6442450944 bytes failed.

Sep 21 06:00:39 kernel: [31561.482912] [   1291]  1000  1291   907564   394590  3289088        0             0 bandwhich
Sep 21 06:00:39 kernel: [31561.483047] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1.scope,task=bandwhich,pid=1291,uid=1000
Sep 21 06:00:39 kernel: [31561.483086] Out of memory: Killed process 1291 (bandwhich) total-vm:3630256kB, anon-rss:1578360kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:3212kB oom_score_adj:0

This is on Ubuntu 22.04.3 LTS with v0.21.0 from the releases tab. I started at around 21:15 on Sep 20, so it ran for ~9 hours before being killed. I'll try and get a flamegraph version today.

@terminaldweller

Same issue here with v0.21.0 on Void. I currently can't figure out how to run flamegraph due to flamegraph-rs/flamegraph#181. I can post a flamegraph when I manage to get one.

@terminaldweller

Here's a flamegraph for when bandwhich ends up taking too much memory:

[flamegraph image]

@sigmaSd (Collaborator) commented Oct 7, 2023

I can't seem to zoom in on the individual SVG elements. Anyhow, the big view shows mostly read/iterate/openat, so maybe there is a leak when reading procfs; maybe it takes a big number of connections.

@sigmaSd (Collaborator) commented Oct 7, 2023

I tried https://github.com/KDE/heaptrack and it seems very simple and works really well; it shows a graph and the exact functions, and which ones leaked.

Note: heaptrack bandwhich didn't work for me; I had to start bandwhich first, then use heaptrack -p with the pid of bandwhich.

@sigmaSd (Collaborator) commented Oct 7, 2023

It shows 0 leaks for me.

@sigmaSd (Collaborator) commented Oct 7, 2023

I kept the program running longer, and I do have some small leaks.

heaptrack is making me suspect https://github.com/imsnif/bandwhich/blob/main/src/display/ui.rs#L184: it seems we keep updating the state, which keeps updating the utilization map (https://github.com/imsnif/bandwhich/blob/main/src/network/utilization.rs#L27) without ever removing entries from it.

Maybe when there is a big number of connections, this causes the leak to show.

@sigmaSd (Collaborator) commented Oct 7, 2023

Ah I see, to make the leaks show up you need to attach heaptrack to bandwhich, and then close heaptrack, not bandwhich.

It seems RAII is working correctly and everything gets cleaned up at exit; but if we disconnect early, it shows us the places that accumulate memory over the lifetime of the program, like the state updating above.

@sigmaSd (Collaborator) commented Oct 7, 2023

We do have this function, which is indeed called: https://github.com/imsnif/bandwhich/blob/main/src/network/utilization.rs#L24

But it's using HashMap::clear, whose documentation does say it clears the entries but keeps the allocated memory. Maybe that's what we're looking for.
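The behavior the std docs describe can be demonstrated directly; this is a minimal standalone sketch (not bandwhich code) showing that `clear` retains capacity while a fresh map holds no allocation:

```rust
use std::collections::HashMap;

/// Fill a map, clear it, and report (len after clear, whether capacity was retained).
fn clear_behavior() -> (usize, bool) {
    let mut map: HashMap<u64, u64> = HashMap::new();
    for i in 0..10_000 {
        map.insert(i, i);
    }
    let cap_before = map.capacity();
    map.clear(); // drops all entries but keeps the backing allocation
    (map.len(), map.capacity() >= cap_before)
}

fn main() {
    let (len, capacity_retained) = clear_behavior();
    assert_eq!(len, 0);
    assert!(capacity_retained);

    // A freshly constructed map, by contrast, holds no allocation at all.
    let fresh: HashMap<u64, u64> = HashMap::new();
    assert_eq!(fresh.capacity(), 0);
}
```

So a map that once held entries for a huge burst of connections keeps that peak-sized allocation around even after every entry is cleared.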

@sigmaSd (Collaborator) commented Oct 7, 2023

> ah I see to make the leaks show up you need to attach heaptrack to bandwhich, and then close heaptrack not bandwhich

Note that this has a lot of false positives, though.

@cyqsimon (Collaborator) commented Oct 8, 2023

So if I understand correctly, what you are saying, @sigmaSd, is that this is not so much a leak as it is simply keeping a value long past its usefulness?

@cyqsimon (Collaborator) commented Oct 8, 2023

@popey @gms8994 @terminaldweller Can you please reproduce the severe memory usage issue again while tracing using heaptrack, and upload the output file? This will help us verify @sigmaSd's hypothesis.

sudo heaptrack bandwhich

Thanks in advance.

@sigmaSd (Collaborator) commented Oct 8, 2023

> So if I understand correctly, what you are saying @sigmaSd is that this is not so much a leak as it is simply keeping a value long past its usefulness?

Yes, but it's more like keeping the memory past its usefulness.

Also, maybe someone who reproduces the issue can try changing https://github.com/imsnif/bandwhich/blob/main/src/network/utilization.rs#L24 to self.connections = HashMap::new();. If that's indeed where the biggest issue is, the effect should be felt, since it forces the hashmap to release its memory.
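The suggested experiment might look like the sketch below. The struct and field names are simplified stand-ins for bandwhich's real `Utilization` type, not the actual code:

```rust
use std::collections::HashMap;

// Illustrative stand-in for bandwhich's utilization state; the real map
// is keyed on connections, not strings.
struct Utilization {
    connections: HashMap<String, u64>,
}

impl Utilization {
    fn reset(&mut self) {
        // Before: self.connections.clear(); // keeps the backing allocation
        // After: replace the map entirely so the old allocation is dropped.
        self.connections = HashMap::new();
    }
}

fn main() {
    let mut util = Utilization { connections: HashMap::new() };
    for i in 0..1_000u64 {
        util.connections.insert(format!("conn-{i}"), i);
    }
    util.reset();
    assert_eq!(util.connections.len(), 0);
    assert_eq!(util.connections.capacity(), 0); // memory actually released
}
```

The trade-off is re-allocating on every refresh cycle; calling `shrink_to_fit` after `clear` would be an alternative with the same memory effect.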

@sigmaSd (Collaborator) commented Oct 8, 2023

Though this alone doesn't explain why it would fail to allocate memory.

@popey (Author) commented Oct 8, 2023

@cyqsimon sure thing. I've left bandwhich running under heaptrack on my desktop. It has 64GB RAM, and this OOM usually happens unexpectedly after some hours, so I'll report back if/when it happens.

@sigmaSd (Collaborator) commented Oct 8, 2023

You don't need to wait for the OOM; just stop heaptrack after you see unusual memory usage (100 MB is probably enough).

@popey (Author) commented Oct 8, 2023

Oh, too late. By the time I saw this reply it had already run for a while.
It was too big for GitHub file upload, so I put it here: https://popey.com/~alan/bandwhich-heaptrack.zip

@cyqsimon (Collaborator) commented Oct 8, 2023

Incorrect diagnosis. See my newer comment.

Good good. From the backtrace I was able to track the allocations back to a single line.

inode_to_procname.insert(inode, procname.clone());

It seems like your hypothesis is correct @imsnif, albeit of a different hashmap. I'll take a look at how this stored data is used and how to best implement some sort of cleanup.


Edit:

Perhaps I shouldn't rule out the possibility that it's caused by file handles not being closed. I'll make sure that isn't the case first.
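One shape the cleanup mentioned above could take is a periodic prune pass over the cache. This is a hypothetical sketch, not bandwhich's implementation; `prune_stale` and `live_inodes` are illustrative names:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical cleanup pass: drop cached procname entries whose inode
// no longer corresponds to a live socket.
fn prune_stale(inode_to_procname: &mut HashMap<u64, String>, live_inodes: &HashSet<u64>) {
    inode_to_procname.retain(|inode, _| live_inodes.contains(inode));
}

fn main() {
    let mut cache = HashMap::from([
        (101, String::from("curl")),
        (202, String::from("ssh")),
    ]);
    // Suppose only inode 202 is still present in the current procfs scan.
    let live = HashSet::from([202]);
    prune_stale(&mut cache, &live);
    assert_eq!(cache.len(), 1);
    assert_eq!(cache.get(&202).map(String::as_str), Some("ssh"));
}
```

Note that `retain` keeps the map's allocated capacity, so a full fix might also need to release memory as discussed earlier in the thread.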

@popey (Author) commented Oct 11, 2023

Nice work everyone!

cyqsimon added a commit that referenced this issue Oct 16, 2023
@cyqsimon (Collaborator)

Good news, rustix has backported this fix to all minor versions up to 0.35, which is the first minor version affected. This means we do not block on an update from procfs (I have notified them nonetheless).

Since I am pretty confident that we've fixed the exact problem, I will close this issue as completed. Please reopen and/or comment in case you are still seeing it on main, thanks.

TheRealLorenz added a commit to TheRealLorenz/dotlink that referenced this issue Oct 20, 2023
Includes dependency update to fix security issue: imsnif/bandwhich#284 (comment)
a-kenji added a commit to a-kenji/nix-uri that referenced this issue Oct 29, 2024