
PRRTE indirect-multi co-launch testcase fails with node connection error #978

Closed

drwootton opened this issue May 24, 2021 · 24 comments

@drwootton
Contributor

Thank you for taking the time to submit an issue!

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

Built from master source as of 5/25 8:00AM

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

Built from master source as of 5/25 8:00AM

Please describe the system on which you are running

  • Operating system/version: RHEL 7.7
  • Computer hardware: 4 POWER9 nodes, 20 cores per node, 8 threads per core
  • Network type: Ethernet

Details of the problem

I'm trying to run the PRRTE debug example indirect-multi. If I run it to co-launch one daemon per application process, or to launch one daemon per node in non-co-launch mode, it works. If I try to run it to co-launch one daemon per node, it fails with errors stating that a connection between two nodes cannot be completed.

The system I am running on has 4 nodes, each with 2 hostnames that resolve to two separate Ethernet adapters.

There is a public network where the hostnames are c656f7n01 thru c656f7n04 and an internal network, where the hostnames are f7n01 thru f7n04. If I log onto one of these nodes to run my test, the hostname command reports the short hostname, f7n01, etc.

I originally tried to run my test with a hostfile that specified the public hostnames c656f7n02 through c656f7n04, 4 slots for each node, where I ran my test from f7n01:

    ./indirect-multi --daemon-colocate-per-node 1 prterun --hostfile ./hostfile_4_slots --np 12 ./hello

This fails with messages:

A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    f7n01
  Remote host:   c656f7n02
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
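
For reference, the hostfile in question would look roughly like this (a hypothetical reconstruction of ./hostfile_4_slots, assuming the usual PRRTE hostfile syntax of one host per line with a slot count):

    c656f7n02 slots=4
    c656f7n03 slots=4
    c656f7n04 slots=4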

I also tried the test after modifying the hostfile to use hostnames f7n02 through f7n04, with 4 slots each, to avoid any problems with using 2 networks, and got a similar error:

A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    f7n01
  Remote host:   f7n02
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.

even though all hostnames respond to ping and work with ssh.

Both of these commands work fine:

    ./indirect-multi --daemon-colocate-per-node 1 prterun --hostfile ./hostfile_4_slots --np 12 ./hello
    ./indirect-multi prterun --hostfile ./hostfile_4_slots --np 12 ./hello

All of these tests worked early last week.
@jjhursey wondered if PRRTE issue #974 might have something to do with this.

@rhc54
Contributor

rhc54 commented May 24, 2021

Are you running in a managed environment? Or strictly using "hostfile"?

You can always try putting PRTE_MCA_prte_do_not_resolve=0 (or try setting it to 1) in your environment to see if it helps.
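
For example, a minimal sketch of setting this before launching (reusing the command from the report above):

    export PRTE_MCA_prte_do_not_resolve=0    # or try =1
    ./indirect-multi --daemon-colocate-per-node 1 prterun --hostfile ./hostfile_4_slots --np 12 ./hello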

@drwootton
Contributor Author

I am using only a hostfile. There is no resource manager like LSF.

I tried setting PRTE_MCA_prte_do_not_resolve=0 and PRTE_MCA_prte_do_not_resolve=1. In both cases I still had the same problem.

@rhc54
Contributor

rhc54 commented May 25, 2021

What if you use PRTE_MCA_oob_if_include to specify the network to use? Regardless of hostname, the messaging system will grab an arbitrary interface - sounds like different daemons may be grabbing different interfaces.

@drwootton
Contributor Author

I tried export PRTE_MCA_oob_if_include=10.128.7.0/16 since that is the IP address range for the f7n01 thru f7n04 hostnames that the hostname command returns. I also added the export to both my .profile and .bashrc since I don't know if it ends up set on the other nodes otherwise.

I still got the same error.

@jjhursey suggested I try export PRTE_MCA_prte_do_not_resolve=0 and export PRTE_MCA_prte_do_not_resolve=1, which I also set in .bashrc and .profile, and that did not fix the problem either.
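
One way to confirm whether the variables actually reach the remote nodes' environments is something like the following (a hypothetical check; it assumes non-interactive ssh picks up .bashrc on these nodes):

    for node in f7n02 f7n03 f7n04; do
        ssh $node 'env | grep ^PRTE_MCA_'
    done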

@rhc54
Contributor

rhc54 commented May 25, 2021

Try adding PRTE_MCA_oob_base_verbose=5 and see what the output tells you about connection attempts. You might want to only launch one other node to keep the volume down.
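
A sketch of that suggestion, assuming a reduced hostfile (hypothetical name ./hostfile_1_node) listing a single remote node:

    export PRTE_MCA_oob_base_verbose=5
    ./indirect-multi --daemon-colocate-per-node 1 prterun --hostfile ./hostfile_1_node --np 2 ./hello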

@drwootton
Contributor Author

I tried running with f7n01 as my local/launch node and f7n02 as my remote node with 2 application tasks. I ran the test 15 times and it never failed.

I ran with f7n01 as my local node and f7n02,f7n04 as my two remote nodes, where the hostfile specified those two nodes with 2 slots each. I still ran with only 2 application tasks and those tasks went to f7n02. I ran that test about 6 times and it only failed once.

If I run with 3 remote nodes, 4 tasks per node, then it always fails.

I looked at the debug output but don't understand what I am looking at. The log is attached.

indirect-fail.txt

@drwootton
Contributor Author

This output is with export PRTE_MCA_oob_if_include=10.128.7.0/16

@rhc54
Contributor

rhc54 commented May 25, 2021

Well, it all looks just fine. It appears to establish all connections and perform a number of operations. At the end, though, it looks like your compute node daemons believe they have been told to "shutdown" and do so, but the HNP (prterun in this case) still thinks it has a message to send to them. So the HNP tries to re-establish the connections and fails because nobody is listening any more.

You might try replacing the oob_base_verbose with state_base_verbose and see what is happening.
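
In terms of environment variables, that would be roughly (a sketch; the oob setting is removed so the state output is not buried):

    unset PRTE_MCA_oob_base_verbose
    export PRTE_MCA_state_base_verbose=5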

@drwootton
Contributor Author

I changed PRTE_MCA_oob_base_verbose to PRTE_MCA_state_base_verbose and it does appear that processes terminate normally, but at the end of the log it looks like prterun tries to activate the PRTE_JOB_STATE_TERMINATED state at pmix_server_gen.c:418 and then fails with a lost connection at oob_tcp_component.c:1016.

Maybe I'm reading the log wrong, but it also looks like prterun is trying to activate the PRTE_JOB_TERMINATED state twice for the same process, lines 293 and 304 of the attached log.

While this is going on, the indirect-multi code is waiting for the lost connection event which is apparently received after the error messages are issued.
indirect-failed-state.txt

@rhc54
Contributor

rhc54 commented May 25, 2021

It appears to be "ringing" - I see multiple (more than two) activations of "job_terminated" for the debugger daemon job. The application job itself looks fine. Not sure I understand how anything we changed last week would affect this, but you are welcome to back down the commit tree to see if/where it breaks. There weren't that many of them - you might want to start by verifying that this really was working at the last point you thought it was, just to be sure.

@rhc54
Contributor

rhc54 commented May 25, 2021

One possibility comes to mind. In accc32a I made some changes to ensure that the "job end" event got out of prterun prior to its termination. It could be that this "ringing" for daemon jobs was always there, but you weren't seeing it due to prterun terminating early. You might want to go back before that commit and see if this worked there, and then check again after the commit.
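
One way to bracket that commit is something like the following (a hypothetical sequence; rebuild and rerun the failing test at each step):

    git checkout accc32a^   # the commit just before the change
    # rebuild, run the test
    git checkout accc32a    # the suspect commit
    # rebuild, run the test again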

@drwootton
Contributor Author

Commit accc32a causes the problem. I ran my test with the preceding commit eabeb2c and it worked.

I checked out commit accc32a, rebuilt, and it failed as above.

I repeated the process, verified eabeb2c worked and accc32a failed.

I also got a second failure with accc32a where my test displayed these messages instead of the above failure.

> [f7n01:120288] [f7n00284:1,0]-[@NS<0>,1] prte_oob_tcp_peer_send_handler: unable to send message ON SOCKET 24
> [f7n01:120288] [f7n00284:1,0]-[@NS<0>,2] prte_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22
> [f7n01:120288] oob:tsend_msg: write failed: Broken pipe (32) [sd = 22]
> [f7n01:120288] oob:tsend_msg: write failed: Broken pipe (32) [sd = 24]

@rhc54
Contributor

rhc54 commented May 26, 2021

I believe #981 will solve the primary reported problem - i.e., the DVM will only order termination once. This probably won't deal with the "ringing" going on in reporting job termination of the daemons. I think that is a slightly subtler problem centering around the question of "monitoring" daemon jobs.

We originally had decided not to monitor such jobs - i.e., even if someone asked ORTE to start a set of daemons for them, we wouldn't shut down the DVM if the daemons failed nor would we provide alerts/notifications of daemon termination. We basically just started them and then ignored them. At that time, we weren't really envisioning support for debugger tools.

Now that we have transitioned the code to PRRTE and are supporting debuggers, we probably need to revisit that decision. I suspect we do need to monitor the daemon procs and treat them just like a regular job for notification purposes.

@drwootton
Contributor Author

I'm not sure if I'm being premature in updating my source to test this or if #981 wasn't intended to fix this issue. I did update source and verified the source file state_dvm.c was updated.

I still see the problem happening, although it doesn't seem to be a solid failure now. This time the test ran successfully 2 times out of 6 or 7, where before it seemed to be a solid failure with 3 nodes/12 application tasks.

@rhc54
Contributor

rhc54 commented May 26, 2021

It was hopefully going to solve the problem of the DVM controller issuing those "A process or daemon was unable to complete a TCP connection" errors. Is that what you still see?

@drwootton
Contributor Author

drwootton commented May 26, 2021

Yes, I still get these error messages, in this case once for each of the 3 remote nodes f7n02 thru f7n04

------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    f7n01
  Remote host:   f7n02
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.

@drwootton
Contributor Author

I'm just checking on the status of the solution for this issue. I'm also wondering, from the perspective of stability of the PMIx tools interfaces, whether this is a serious enough issue that it needs to be fixed before the upcoming release.

@rhc54
Contributor

rhc54 commented Jun 15, 2021

@drwootton Please give it a try again with the head of PMIx and PRRTE. As for your question: I'm not sure. Sounds like this is just a node identification problem like the others we've been working on, so hopefully we'll get it fixed.

@drwootton
Contributor Author

I updated source for PMIx and PRRTE then I ran my complete set of CI tests. I still see this failure with my 3 multi-node tests but not with any other tests.

@rhc54
Contributor

rhc54 commented Jun 16, 2021

Try putting PRTE_MCA_prte_if_include=10.128.7.0/16 in your environment. If it still fails, I can point you to the relevant code areas and advise on debugging, but that is probably the limit of what I can do to help, as I cannot replicate this anywhere.
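
For example (a sketch; this sets the interface restriction alongside the oob_if_include setting tried earlier):

    export PRTE_MCA_prte_if_include=10.128.7.0/16
    export PRTE_MCA_oob_if_include=10.128.7.0/16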

@drwootton
Contributor Author

I set the environment variable and it still fails.
I also changed the code to handle PMIX_ERR_LOST_CONNECTION instead of PMIX_EVENT_JOB_END and it still fails with these messages.
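
For context, a minimal sketch of what such a registration might look like in the tool code, assuming the standard PMIx event-registration API; the handler and function names here are hypothetical and not the actual indirect-multi code:

    #include <pmix_tool.h>

    /* hypothetical handler invoked when the connection to the job is lost */
    static void lost_connection_hdlr(size_t evhdlr_registration_id, pmix_status_t status,
                                     const pmix_proc_t *source,
                                     pmix_info_t info[], size_t ninfo,
                                     pmix_info_t results[], size_t nresults,
                                     pmix_event_notification_cbfunc_fn_t cbfunc, void *cbdata)
    {
        /* ... signal the tool's main loop that the job/connection is gone ... */
        if (NULL != cbfunc) {
            cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
        }
    }

    /* hypothetical registration callback reporting whether registration succeeded */
    static void reg_cbfunc(pmix_status_t status, size_t refid, void *cbdata)
    {
        if (PMIX_SUCCESS != status) {
            /* handle registration failure */
        }
    }

    static void register_for_lost_connection(void)
    {
        pmix_status_t code = PMIX_ERR_LOST_CONNECTION;  /* instead of PMIX_EVENT_JOB_END */
        PMIx_Register_event_handler(&code, 1, NULL, 0,
                                    lost_connection_hdlr, reg_cbfunc, NULL);
    }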

@rhc54
Contributor

rhc54 commented Jun 16, 2021

And it fails due to two daemons being unable to connect to each other - is that still the failure mode? If so, you might try setting the oob_base_verbose option to see what connections are being attempted and work from there.

@drwootton
Contributor Author

drwootton commented Jun 21, 2021

There's something, most likely timing related, going on with this failure.

If I run the command

    ./indirect-multi --daemon-colocate-per-node 1 prterun --hostfile ./hostfile_4_slots --np 12 ./hello 10

where the hostfile is

c656f7n02
c656f7n03
c656f7n04
c656f7n01

then, with stdio output going to the terminal session where I run the command, it fails something like 4 out of 5 times with the error messages noted above. Once in a while the command runs successfully.

If I redirect stdout and stderr to a file and run the same command, the test completes successfully more frequently, but it still fails a few times.

If I add the --prtemca oob_base_verbose 99 option to the prterun command and redirect stdout and stderr to a file then I get about 2800 lines of output to the file but the test never fails.

I tried running the same command without redirecting stdio, but inside a console session started with the script command, so stdio went to the console but was also written to a file. Five tries out of 5 were successful.
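
The capture session amounted to something like this (a sketch; the log file name is hypothetical):

    script indirect-run.log     # start a session captured by the util-linux script command
    ./indirect-multi --daemon-colocate-per-node 1 prterun --hostfile ./hostfile_4_slots --np 12 ./hello 10
    exit                        # end the capture session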

I don't know where to go from here. Maybe I'm chasing a ghost and there's something weird about my setup, but I am getting a close to 100% failure rate when I run the command without debug flags.

@drwootton
Contributor Author

I am not seeing this problem any more today. The problems I was seeing after approximately Jun 7th may have been caused by backlevel code in my repo. I have run the indirect multi-node testcases a bunch of times and no longer see the error.
