-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Take a XPRT ref when hooking events #204
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will test it out and let you know.
The fix helped to avoid the crash (avoiding use after free of xprt). In /var/log/messages: Following CRIT logs noticed: [mynode] # lsof -p $(pidof gpfs.ganesha.nfsd) | wc -l |
3f4d585
to
80572c0
Compare
@malahal Okay, this one seems to work properly. There's still a leak, but I think it's in Ganesha (I think the backchannel clients have a refcount bug). There's an associated Ganesha patch that needs to be merged along with this, or the listening sockets also leak: |
I read all documentation I could regarding epoll and threads! Looks like the API itself is broken. The best way to guaranty REF over epoll is taking REF while hooking and re-arming. If we only take REF while hooking and RELEASE while unhooking, unfortunately events CAN happen after unhook. There is no way to prevent execution of event code after unhook. This could lead to accessing freed memory! The issue with REF/RELEASE while hooking/re-arming and event handing is that the "refcount" count can go to zero only after handing an "event". This seems to be the issue with leaking sockets with rpc.statd. A work around here could be to have an extra refcount in the back channel to avoid creating too many sockets. |
Workaround for rpc.statd did work, but it leaks connections to portmapper and other sockets with remote clients with NLM. Maybe, unhook should have some fuzzy logic to wait for any additional events that could happen. Maybe, a sleep of one second (or socket pair based trick) may work. |
Hmm... This is an interesting problem. Let me think about it for a bit. |
Okay, this should close that hole. It works for me, and tests correctly (modulo the client leak mentioned above, which I need to look at next). |
Found 2/3 issues with this code: https://github.com/malahal/ntirpc/commits/hook has fixes for #1, #2 and #3 (based on 2.7 code though). The top commit is just a clean up patch, but the following 3 commits should fix the issues reported here. |
I have 1 and 2 in my branch now. I'm not seeing how 3 can matter, since the hash table has a ref, and releases it at the end of DESTROY. If the RELEASE is freeing the xprt before the DESTROY, that's a bug. |
The 3rd one is about not having SVC_XPRT_FLAG_CLOSE flag. The xprt memory gets freed but the fd would be still open without this flag, see "svc_vc_destroy_task()" for details. If you are talking about my TOP commit that has re-order of destroy and release, then that is an issue if some other threads can call DESTROY on the xprt. I don't know if that can happen though. the idea seems to be that multiple threads could call SVC_DESTROY(). |
Okay, so that branch doesn't have a fix for #3? I'm only seeing 3 commits there. |
There is a race when unhooking events from epoll, where the event could be ready for delivery (or even delivered, but the thread not scheduled) and so the event is processed after the unhook, and therefore after the XPRT has been freed. To close this, stop putting a pointer to the rec in the event data, and instead put the FD in there and use it to look up the XPRT. This ensures that, if we got the XPRT from lookup, it's valid and ref'd for the duration of the event. Once we're no longer storing a XPRT pointer in the epoll event, we don't need a refcount across the hook/event/unhook series. Remove these refcounts, allowing a destroyed XPRT to just be freed. Signed-off-by: Daniel Gryniewicz <[email protected]>
Sorry, I must have rebased the branch. You suggested to fix the #3 in ganesha repo to avoid API change. The patch here in out ibm2.7 branch (ganesha repo though): commit 5b67cd1
|
Okay, that looks good. I'll merge this, then. |
Thank you. |
The epoll loop needs a ref on the XPRT, so that it stays alive while in
epoll(). svc_rqst_rearm_events_locked() takes a ref, but the initial
svc_rqst_hook_events() does not. This means that, if the XPRT times
out, the cleanup will drop a ref, and the epoll loop will drop a ref,
causing one to many refs to be dropped. This leaves the client
potentially with a dangling ref.
Take the ref for the epoll loop when initially hooking the events.
Signed-off-by: Daniel Gryniewicz [email protected]