PRRTE indirect-multi co-launch testcase fails with node connection error #978
Are you running in a managed environment? Or strictly using a hostfile? You can always try putting `PRTE_MCA_prte_do_not_resolve` in the environment.
I am using only a hostfile. There is no resource manager like LSF. I tried setting PRTE_MCA_prte_do_not_resolve=0 and PRTE_MCA_prte_do_not_resolve=1. In both cases I still had the same problem.
What if you use `PRTE_MCA_oob_if_include` to restrict the OOB to the internal subnet?
I tried `export PRTE_MCA_oob_if_include=10.128.7.0/16`, since that is the IP address range for the f7n01 thru f7n04 hostnames that the hostname command returns. I also added the export to both my .profile and .bashrc, since I don't know whether it ends up set on the other nodes otherwise. I still got the same error. @jjhursey suggested I try `export PRTE_MCA_prte_do_not_resolve=0` and `export PRTE_MCA_prte_do_not_resolve=1`, which I also set in .bashrc and .profile, and that did not fix the problem either.
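One way to sidestep the question of whether the exports reach the other nodes is to pass the parameters on the prterun command line, which forwards them to the remote daemons. A minimal sketch, reusing the parameter names and the `--prtemca` form that appear elsewhere in this thread:

```shell
# Pass the MCA parameters directly on the command line instead of relying
# on .bashrc/.profile being sourced on every node.
./indirect-multi --daemon-colocate-per-node 1 \
    prterun --prtemca oob_if_include 10.128.7.0/16 \
            --prtemca prte_do_not_resolve 1 \
            --hostfile ./hostfile_4_slots --np 12 ./hello
```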
Try adding the `oob_base_verbose` option so we can see what connections are being attempted.
I tried running with f7n01 as my local/launch node and f7n02 as my remote node with 2 application tasks. I ran the test 15 times and it never failed. I ran with f7n01 as my local node and f7n02,f7n04 as my two remote nodes, where the hostfile specified those two nodes with 2 slots each. I still ran with only 2 application tasks and those tasks went to f7n02. I ran that test about 6 times and it only failed once. If I run with 3 remote nodes, 4 tasks per node, then it always fails. I looked at the debug output but don't understand what I am looking at. The log is attached.
This output is with export PRTE_MCA_oob_if_include=10.128.7.0/16 |
Well, it all looks just fine. It appears to establish all connections and perform a number of operations. At the end, though, it looks like your compute node daemons believe they have been told to "shutdown" and do so, but the HNP (prterun in this case) still thinks it has a message to send to them. So the HNP tries to re-establish the connections and fails because nobody is listening any more. You might try replacing the `oob_base_verbose` setting with `state_base_verbose` so we can see what the state machine is doing at shutdown.
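A sketch of that swap, assuming the standard `<framework>_base_verbose` naming for the verbosity parameters (the verbosity level here is illustrative):

```shell
# Trace the state machine instead of the OOB connections, capturing
# the output to a file for later comparison.
unset PRTE_MCA_oob_base_verbose
export PRTE_MCA_state_base_verbose=5
./indirect-multi --daemon-colocate-per-node 1 \
    prterun --hostfile ./hostfile_4_slots --np 12 ./hello > state.log 2>&1
```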
I changed PRTE_MCA_oob_base_verbose to PRTE_MCA_state_base_verbose and it does appear that processes terminate normally. At the end of the log, though, it looks like prterun tries to activate the PRTE_JOB_STATE_TERMINATED state at pmix_server_gen.c:418 and then fails with a lost connection at oob_tcp_component.c:1016. Maybe I'm reading the log wrong, but it also looks like prterun is trying to activate the PRTE_JOB_TERMINATED state twice for the same process, at lines 293 and 304 of the attached log. While this is going on, the indirect-multi code is waiting for the lost-connection event, which is apparently received after the error messages are issued.
It appears to be "ringing" - I see multiple (more than two) activations of "job_terminated" for the debugger daemon job. The application job itself looks fine. Not sure I understand how anything we changed last week would affect this, but you are welcome to back down the commit tree to see if/where it breaks. There weren't that many of them - you might want to start by verifying that this really was working at the last point you thought it was, just to be sure.
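A hypothetical way to automate backing down the commit tree, where run_indirect_multi.sh is an assumed wrapper script that rebuilds PRRTE and runs the failing co-launch test, exiting non-zero on failure:

```shell
git bisect start
git bisect bad HEAD                     # current master fails
git bisect good <last-known-good-sha>   # placeholder: the commit that worked last week
git bisect run ./run_indirect_multi.sh  # rebuild + run the co-launch test
git bisect reset
```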
One possibility comes to mind. In accc32a I made some changes to ensure that the "job end" event got out of …
Commit accc32a causes the problem. I ran my test with the preceding commit eabeb2c and it worked. I checked out commit accc32a, rebuilt, and it failed as above. I repeated the process, verified eabeb2c worked and accc32a failed. I also got a second failure with accc32a where my test displayed these messages instead of the above failure.
I believe #981 will solve the primary reported problem - i.e., the DVM will only order termination once. This probably won't deal with the "ringing" going on in reporting job termination of the daemons. I think that is a slightly subtler problem centering around the question of "monitoring" daemon jobs. We originally had decided not to monitor such jobs - i.e., even if someone asked ORTE to start a set of daemons for them, we wouldn't shut down the DVM if the daemons failed nor would we provide alerts/notifications of daemon termination. We basically just started them and then ignored them. At that time, we weren't really envisioning support for debugger tools. Now that we have transitioned the code to PRRTE and are supporting debuggers, we probably need to revisit that decision. I suspect we do need to monitor the daemon procs and treat them just like a regular job for notification purposes.
I'm not sure if I'm being premature in updating my source to test this or if #981 wasn't intended to fix this issue. I did update the source and verified that the source file state_dvm.c was updated. I still see the problem happening, although it doesn't seem to be a solid failure now: my test ran successfully 2 times out of 6 or 7 this time, where it had seemed to be a solid failure with 3 nodes/12 application tasks before.
It was hopefully going to solve the problem of the DVM controller issuing those termination orders more than once.
Yes, I still get these error messages, in this case once for each of the 3 remote nodes f7n02 thru f7n04.
I'm just checking on the status of the solution for this issue. I'm also wondering, from the perspective of stability of the PMIx tools interfaces, whether this is a serious enough issue that it needs to be fixed before the upcoming release.
@drwootton Please give it a try again with head of PMIx and PRRTE. As for your question: I'm not sure. Sounds like this is just a node identification problem like the others we've been working, so hopefully we'll get it fixed. |
I updated the source for PMIx and PRRTE and then ran my complete set of CI tests. I still see this failure with my 3 multi-node tests, but not with any other tests.
Try putting …
I set the environment variable and it still fails. |
And it fails because two daemons are unable to connect to each other - is that still the failure mode? If so, you might try setting the oob_base_verbose option to see what connections are being attempted and work from there.
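For example, a sketch reusing the command line and verbosity level reported later in this thread:

```shell
# Capture the OOB connection trace so any failed connection attempts
# between the daemons are visible.
./indirect-multi --daemon-colocate-per-node 1 \
    prterun --prtemca oob_base_verbose 99 \
            --hostfile ./hostfile_4_slots --np 12 ./hello 10 > oob.log 2>&1
```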
There's something, most likely timing-related, going on with this failure. If I run the command `./indirect-multi --daemon-colocate-per-node 1 prterun --hostfile ./hostfile_4_slots --np 12 ./hello 10`, where the hostfile is the same 3-node, 4-slots-per-node file used throughout this issue, and let stdout/stderr go to the terminal session, I get the error messages noted above something like 4 out of 5 times; once in a while the command runs successfully. If I redirect stdout and stderr to a file, the test completes successfully more frequently, but it still fails a few times. If I add the `--prtemca oob_base_verbose 99` option to the prterun command and redirect stdout and stderr to a file, I get about 2800 lines of output in the file but the test never fails. I also tried running the same command without redirecting stdio but inside a console session started with the script command, so stdio went to the console but was also written to a file: five tries out of five were successful. I don't know where to go from here. Maybe I'm chasing a ghost and there's something weird about my setup, but I am getting close to a 100% failure rate when I run the command without debug flags.
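To quantify the failure rate in each output mode, a loop along these lines could help, assuming the test exits non-zero when it fails (filenames and the run count are illustrative):

```shell
# Run the test 20 times with output redirected and count the failures.
fails=0
for i in $(seq 1 20); do
    ./indirect-multi --daemon-colocate-per-node 1 \
        prterun --hostfile ./hostfile_4_slots --np 12 ./hello 10 \
        > run_$i.log 2>&1 || fails=$((fails + 1))
done
echo "$fails failures out of 20 runs"
```

Note that, per the observation above, redirecting the output may itself mask the race, so a variant that leaves stdio on the terminal would be needed for comparison.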
I am not seeing this problem any more today. The problems I was seeing after approximately Jun 7 may have been caused by backlevel code in my repo. I have run the indirect multi-node testcases a number of times and no longer see the error.
Thank you for taking the time to submit an issue!
Background information
What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)
Built from master source as of 5/25 8:00AM
What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)
Built from master source as of 5/25 8:00AM
Please describe the system on which you are running
Details of the problem
I'm trying to run the PRRTE debug example indirect-multi. If I run it to co-launch 1 daemon per application process, or to launch 1 daemon per node in non-co-launch mode, it works. If I try to run it to co-launch 1 daemon per node, then it fails with errors stating that a connection between two nodes cannot be completed.
The system I am running on has 4 nodes, each with 2 hostnames that resolve to two separate Ethernet adapters.
There is a public network where the hostnames are c656f7n01 thru c656f7n04 and an internal network, where the hostnames are f7n01 thru f7n04. If I log onto one of these nodes to run my test, the hostname command reports the short hostname, f7n01, etc.
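A hypothetical /etc/hosts layout for this kind of dual-homed setup; the public addresses below are placeholders from the TEST-NET-1 documentation range, and the internal addresses are illustrative picks from the 10.128.7.x range mentioned earlier in this thread:

```
# public network
192.0.2.11   c656f7n01
192.0.2.12   c656f7n02
192.0.2.13   c656f7n03
192.0.2.14   c656f7n04
# internal network - the names the hostname command returns
10.128.7.1   f7n01
10.128.7.2   f7n02
10.128.7.3   f7n03
10.128.7.4   f7n04
```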
I originally tried to run my test with a hostfile that specified the public hostnames c656f7n02 thru c656f7n04, with 4 slots for each node, running my test from f7n01: `./indirect-multi --daemon-colocate-per-node 1 prterun --hostfile ./hostfile_4_slots --np 12 ./hello`
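The hostfile contents are not shown here, but based on that description (3 remote nodes, 4 slots each), hostfile_4_slots presumably looked something like:

```
c656f7n02 slots=4
c656f7n03 slots=4
c656f7n04 slots=4
```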
This fails with messages stating that a connection between two nodes cannot be completed.
I also tried the test after modifying the hostfile to use the internal hostnames f7n02 thru f7n04, with 4 slots each, to avoid any problems with using 2 networks, and got a similar error, even though all hostnames respond to ping and work with ssh.
Both `./indirect-multi --daemon-colocate-per-proc 1 prterun --hostfile ./hostfile_4_slots --np 12 ./hello` (co-launching 1 daemon per application process) and `./indirect-multi prterun --hostfile ./hostfile_4_slots --np 12 ./hello` (no co-launch) work fine.
All of these tests worked early last week.
@jjhursey wondered if PRRTE issue #974 might have something to do with this.