Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crash when deleting session #205

Closed
ntadas opened this issue Jun 16, 2016 · 17 comments
Closed

crash when deleting session #205

ntadas opened this issue Jun 16, 2016 · 17 comments

Comments

@ntadas
Copy link
Contributor

ntadas commented Jun 16, 2016

Hi,

I have sometimes a crash, when some network problems occur and I have several sessions connected and monitored at the same time.
this is not so easy to reproduce, so also not easy to analyse. Here is what I already found until now and my suspicions:

  • I know that the crash is exactly in API void nc_session_free(struct nc_session* session) line 1418
  • I know when the crash happens line 1397 always returns false, so nc_session_monitor_remove is not called and session->stats is not set to null

with this evidences it seems that the lib detects the failure before my server and deletes the session.
then I try to delete the same session and I get this crash (at least this is my theory for now).

In the beginning of nc_session_free we can see the check if (session == NULL) but session is never set to NULL inside the lib (or at least I didn't find where).

The struct nc_session* that I give to the nc_session_free is obtained when the session is establish with the method nc_session_accept_inout

Since the session is never set to NULL inside the lib, should I check some flag inside nc_session struct before I call the nc_session_free?

thanks
Best Regards

@rkrejci
Copy link
Contributor

rkrejci commented Jun 16, 2016

Are you on current master (59f5ba5)? The mentioned lines do not match what you describe (e.g. 1418 is comment). Ideally, try to test it with deadlockfix branch which changes nc_session_free() function.

The session == NULL check is just a parameter check to avoid referencing NULL pointer. The session is freed at the end of the function, but since the app holds the variable, it is up to it to set NULL to the variable (or do not use the variable after calling nc_session_free()). Internal libnetconf functions actually never call nc_session_free() on the sessions avavilable to the application. In case of error, the session is invalidated by nc_session_close(), but the structure is still available to the application which is supposed to call nc_session_free() to finnish the cleanup. Valgrind's stack at the moment of crash would be very helpful, but I understand that it is challenging to reproduce the issue.

@ntadas
Copy link
Contributor Author

ntadas commented Jun 16, 2016

yes I'm in the master branch, sorry I've added some debug logs and forgot about them when writting the issue.
session.c line 1418 is in fact 1423 :
free(session->stats);

valgrind is not possible right now because I'm only able to reproduce this in target (ppc) until now I was not able to reproduce it in x86.

regarding setting the variable to NULL sure, I'm doing it.
but how can I check if I can call the nc_session_free method?
is it possible that the lib is closing/deleting the session and I'm trying to delete a session that doesn't exist? from the debug I've done, when this happens this if:
if (strcmp(litem->session_id, session->session_id) == 0)
is never true, so when this one is:
if (litem->offset_next == 0)
we get the crash here:
free(session->stats);

@rkrejci
Copy link
Contributor

rkrejci commented Jun 16, 2016

Actually you can call nc_session_free() anytime you want. When an error is detected by some internal function, the session's transport session and some other stuff is destroyed and session's status is set to NC_SESSION_STATUS_CLOSED (you can check this via nc_session_get_status()). From this time, libnetconf will not able to use the session for anything meaningfull (getter functions should work, but you won't be able to send anything via the session, it is closed).

I have one idea about the issue, just please answer me the following questions:

  1. how do you call nc_init()? I mean, what flags do you pass to the function
  2. that PPC target, what OS is there? Is there /proc/ filesystem?

@ntadas
Copy link
Contributor Author

ntadas commented Jun 16, 2016

1- I call nc_init with (NC_INIT_ALL & ~NC_INIT_NACM) | NC_INIT_MULTILAYER
2- Windriver Linux 32 bits, yes we have a /proc/ filesystem

@rkrejci
Copy link
Contributor

rkrejci commented Jun 16, 2016

hmm, then it is weird that the session is not found in the session_list (session.c:1399-1402). There are only 2 situations when the session is removed from the list:

  1. nc_session_free(), but that can be called just once on the session
  2. nc_session_alive_monitor(), but it should happen only if the process accepted the session is not alive (checked by PID and records in /proc/ - session.c:497)

Calling free(session->stats) when all items in session_list were checked and the session wasn't found there should be correct. Without session monitoring, the session structure has some allocated memory in stats, so it is supposed to be free. But in your case you shouldn't get there. According to nc_init(), the session monitoring is enabled, so the stats points to the shared memory block and the session should be always in the session_list - as I mentioned, the only other way to remove the session from the list is that the process that accepted the session is no more running and it was detected by nc_session_alive_monitor() (which runs whenever you get status data by <get> operation)..

@ntadas
Copy link
Contributor Author

ntadas commented Jun 16, 2016

or could it be that during the delete/create of other sessions the prev and next offset are not correctly linked so the session that I'm trying to delete is not found while iterating in the list but in fact its there (just not linked to the others)?

@rkrejci
Copy link
Contributor

rkrejci commented Jun 16, 2016

yes, bug in session_list manipulation would explain that

@ntadas
Copy link
Contributor Author

ntadas commented Jun 16, 2016

after adding some logs I found something strange.
it seems that this some session are being closed by nc_session_monitor_alive_check
but I really don't understand why, since the process that launches all the sessions is the same.

@rkrejci
Copy link
Contributor

rkrejci commented Jun 16, 2016

Do you know which part of the nc_session_monitor_alive_check() causes it? I mean, which of those 3 nc_session_monitor_remove() calls is used.

@ntadas
Copy link
Contributor Author

ntadas commented Jun 16, 2016

unfortunately not, I add prints in nc_session_monitor_remove and in nc_session_free only.

I notice that nc_session_monitor_remove is called not from the nc_session_free and when session free is being called (afterwards) I get negative session_list->count

also this is only happening in ppc.

@ntadas
Copy link
Contributor Author

ntadas commented Jun 16, 2016

its the last one (pfd == NULL)

this happens everytime I do a get all on target (ppc)
and the number of sessions is decreased.

@rkrejci
Copy link
Contributor

rkrejci commented Jun 16, 2016

can you provide the output of the following command on the target system? The PID is PID number of your server process accepting the NETCONF sessions.

sudo ls -l /proc/PID/fd

There should be a symlink to libnetconf_sessions.bin file which should correspond to the aux string in the nc_session_monitor_alive_check() function.

@rkrejci
Copy link
Contributor

rkrejci commented Jun 16, 2016

If not, then, during the libnetconf init, there should be some error message from libnetconf about "sessions monitoring file" or the "session list".

@ntadas
Copy link
Contributor Author

ntadas commented Jun 16, 2016

I think I got the problem:

lrwx------ 1 root root 64 May 3 00:41 20 -> /var/volatile/tmp/netconf/libnetconf_sessions.bin
lrwx------ 1 root root 64 May 3 00:41 21 -> /var/volatile/tmp/netconf/libnetconf_sessions.bin

but the files aux is /tmp/netconf/libnetconf_sessions.bin
and the file exists in /tmp/netconf/libnetconf_sessions.bin (no error in init)

if I do a "readlink -f /tmp/mf_communication_log.log"
returns: /var/volatile/tmp/mf_communication_log.log

/tmp in my system is a link to volatile/tmp

I think in the lib code you should have a readlink of the aux to get the canonical file name

@ntadas
Copy link
Contributor Author

ntadas commented Jun 16, 2016

or I can change it to /var/volatile/tmp/netconf/libnetconf_sessions.bin

@rkrejci
Copy link
Contributor

rkrejci commented Jun 16, 2016

I'm going to fix it

@ntadas
Copy link
Contributor Author

ntadas commented Jun 16, 2016

thanks a lot :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants