
PRRTE indirect-multi co-launch testcase fails with node connection error #978

Closed

drwootton opened this issue May 24, 2021 · 24 comments

@drwootton
Contributor

Thank you for taking the time to submit an issue!

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

Built from master source as of 5/25 8:00AM

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

Built from master source as of 5/25 8:00AM

Please describe the system on which you are running

  • Operating system/version: RHEL 7.7
  • Computer hardware: 4 POWER9 nodes, 20 cores per node, 8 threads per core
  • Network type: Ethernet

Details of the problem

I'm trying to run the PRRTE debug example indirect-multi. If I run it to co-launch one daemon per application process, or to launch one daemon per node in non-co-launch mode, it works. If I try to run it to co-launch one daemon per node, it fails with errors stating that a connection between two nodes cannot be completed.

The system I am running on has 4 nodes, each with 2 hostnames that resolve to two separate Ethernet adapters.

There is a public network where the hostnames are c656f7n01 thru c656f7n04 and an internal network, where the hostnames are f7n01 thru f7n04. If I log onto one of these nodes to run my test, the hostname command reports the short hostname, f7n01, etc.

I originally tried to run my test with a hostfile that specified the public hostnames c656f7n02 through c656f7n04, 4 slots for each node, where I ran my test from f7n01:

    ./indirect-multi --daemon-colocate-per-node 1 prterun --hostfile ./hostfile_4_slots --np 12 ./hello

This fails with messages:

A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    f7n01
  Remote host:   c656f7n02
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
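
For reference, the hostfile in question would look roughly like this (a hypothetical reconstruction of ./hostfile_4_slots, assuming the usual PRRTE hostfile syntax of one host per line with a slot count):

    c656f7n02 slots=4
    c656f7n03 slots=4
    c656f7n04 slots=4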

I also tried the test after modifying the hostfile to use hostnames f7n02 through f7n04, with 4 slots each, to avoid any problems with using 2 networks, and got a similar error:

A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    f7n01
  Remote host:   f7n02
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.

even though all hostnames respond to ping and work with ssh.

Both of these commands work fine:

    ./indirect-multi --daemon-colocate-per-node 1 prterun --hostfile ./hostfile_4_slots --np 12 ./hello
    ./indirect-multi prterun --hostfile ./hostfile_4_slots --np 12 ./hello

All of these tests worked early last week.
@jjhursey wondered if PRRTE issue #974 might have something to do with this.

@rhc54
Contributor

rhc54 commented May 24, 2021

Are you running in a managed environment? Or strictly using "hostfile"?

You can always try putting PRTE_MCA_prte_do_not_resolve=0 (or try setting it to 1) in your environment to see if it helps.
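
For example, a minimal sketch of setting this before launching (reusing the command from the report above):

    export PRTE_MCA_prte_do_not_resolve=0    # or try =1
    ./indirect-multi --daemon-colocate-per-node 1 prterun --hostfile ./hostfile_4_slots --np 12 ./hello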

@drwootton
Contributor Author

I am using only a hostfile. There is no resource manager like LSF.

I tried setting PRTE_MCA_prte_do_not_resolve=0 and PRTE_MCA_prte_do_not_resolve=1. In both cases I still had the same problem.

@rhc54
Contributor

rhc54 commented May 25, 2021

What if you use PRTE_MCA_oob_if_include to specify the network to use? Regardless of hostname, the messaging system will grab an arbitrary interface - sounds like different daemons may be grabbing different interfaces.

@drwootton
Contributor Author

I tried export PRTE_MCA_oob_if_include=10.128.7.0/16 since that is the IP address range for the f7n01 thru f7n04 hostnames that the hostname command returns. I also added the export to both my .profile and .bashrc since I don't know if it ends up set on the other nodes otherwise.

I still got the same error.

@jjhursey suggested I try export PRTE_MCA_prte_do_not_resolve=0 and export PRTE_MCA_prte_do_not_resolve=1, which I also set in .bashrc and .profile, and that did not fix the problem either.
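
One way to confirm whether the variables actually reach the remote nodes' environments is something like the following (a hypothetical check; it assumes non-interactive ssh picks up .bashrc on these nodes):

    for node in f7n02 f7n03 f7n04; do
        ssh $node 'env | grep ^PRTE_MCA_'
    done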

@rhc54
Contributor

rhc54 commented May 25, 2021

Try adding PRTE_MCA_oob_base_verbose=5 and see what the output tells you about connection attempts. You might want to only launch one other node to keep the volume down.
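
A sketch of that suggestion, assuming a reduced hostfile (hypothetical name ./hostfile_1_node) listing a single remote node:

    export PRTE_MCA_oob_base_verbose=5
    ./indirect-multi --daemon-colocate-per-node 1 prterun --hostfile ./hostfile_1_node --np 2 ./hello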

@drwootton
Contributor Author

I tried running with f7n01 as my local/launch node and f7n02 as my remote node with 2 application tasks. I ran the test 15 times and it never failed.

I ran with f7n01 as my local node and f7n02,f7n04 as my two remote nodes, where the hostfile specified those two nodes with 2 slots each. I still ran with only 2 application tasks and those tasks went to f7n02. I ran that test about 6 times and it only failed once.

If I run with 3 remote nodes, 4 tasks per node, then it always fails.

I looked at the debug output but don't understand what I am looking at. The log is attached.

indirect-fail.txt

@drwootton
Contributor Author

This output is with export PRTE_MCA_oob_if_include=10.128.7.0/16

@rhc54
Contributor

rhc54 commented May 25, 2021

Well, it all looks just fine. It appears to establish all connections and perform a number of operations. At the end, though, it looks like your compute node daemons believe they have been told to "shutdown" and do so, but the HNP (prterun in this case) still thinks it has a message to send to them. So the HNP tries to re-establish the connections and fails because nobody is listening any more.

You might try replacing the oob_base_verbose with state_base_verbose and see what is happening.
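
In terms of environment variables, that would be roughly (a sketch; the oob setting is removed so the state output is not buried):

    unset PRTE_MCA_oob_base_verbose
    export PRTE_MCA_state_base_verbose=5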

@drwootton
Contributor Author

I changed PRTE_MCA_oob_base_verbose to PRTE_MCA_state_base_verbose and it does appear that processes terminate normally, but at the end of the log it looks like prterun tries to activate the PRTE_JOB_STATE_TERMINATED state at pmix_server_gen.c:418 and then fails with a lost connection at oob_tcp_component.c:1016.

Maybe I'm reading the log wrong, but it also looks like prterun is trying to activate the PRTE_JOB_TERMINATED state twice for the same process, lines 293 and 304 of the attached log.

While this is going on, the indirect-multi code is waiting for the lost connection event which is apparently received after the error messages are issued.
indirect-failed-state.txt

@rhc54
Contributor

rhc54 commented May 25, 2021

It appears to be "ringing" - I see multiple (more than two) activations of "job_terminated" for the debugger daemon job. The application job itself looks fine. Not sure I understand how anything we changed last week would affect this, but you are welcome to back down the commit tree to see if/where it breaks. There weren't that many of them - you might want to start by verifying that this really was working at the last point you thought it was, just to be sure.

@rhc54
Contributor

rhc54 commented May 25, 2021

One possibility comes to mind. In accc32a I made some changes to ensure that the "job end" event got out of prterun prior to its termination. It could be that this "ringing" for daemon jobs was always there, but you weren't seeing it due to prterun terminating early. You might want to go back before that commit and see if this worked there, and then check again after the commit.
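
One way to bracket that commit is something like the following (a hypothetical sequence; rebuild and rerun the failing test at each step):

    git checkout accc32a^   # the commit just before the change
    # rebuild, run the test
    git checkout accc32a    # the suspect commit
    # rebuild, run the test again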

@drwootton
Contributor Author

Commit accc32a causes the problem. I ran my test with the preceding commit eabeb2c and it worked.

I checked out commit accc32a, rebuilt, and it failed as above.

I repeated the process, verified eabeb2c worked and accc32a failed.

I also got a second failure with accc32a where my test displayed these messages instead of the above failure.

> [f7n01:120288] [f7n00284:1,0]-[@NS<0>,1] prte_oob_tcp_peer_send_handler: unable to send message ON SOCKET 24
> [f7n01:120288] [f7n00284:1,0]-[@NS<0>,2] prte_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22
> [f7n01:120288] oob:tsend_msg: write failed: Broken pipe (32) [sd = 22]
> [f7n01:120288] oob:tsend_msg: write failed: Broken pipe (32) [sd = 24]

@rhc54
Contributor

rhc54 commented May 26, 2021

I believe #981 will solve the primary reported problem - i.e., the DVM will only order termination once. This probably won't deal with the "ringing" going on in reporting job termination of the daemons. I think that is a slightly subtler problem centering around the question of "monitoring" daemon jobs.

We originally had decided not to monitor such jobs - i.e., even if someone asked ORTE to start a set of daemons for them, we wouldn't shut down the DVM if the daemons failed nor would we provide alerts/notifications of daemon termination. We basically just started them and then ignored them. At that time, we weren't really envisioning support for debugger tools.

Now that we have transitioned the code to PRRTE and are supporting debuggers, we probably need to revisit that decision. I suspect we do need to monitor the daemon procs and treat them just like a regular job for notification purposes.

@drwootton
Contributor Author

I'm not sure if I'm being premature in updating my source to test this or if #981 wasn't intended to fix this issue. I did update source and verified the source file state_dvm.c was updated.

I still see the problem happening, although it doesn't seem to be a solid failure now. This time the test ran successfully 2 times out of 6 or 7, where before it seemed to be a solid failure with 3 nodes/12 application tasks.

@rhc54
Contributor

rhc54 commented May 26, 2021

It was hopefully going to solve the problem of the DVM controller issuing those "A process or daemon was unable to complete a TCP connection" errors. Is that what you still see?

@drwootton
Contributor Author

drwootton commented May 26, 2021

Yes, I still get these error messages, in this case once for each of the 3 remote nodes f7n02 thru f7n04

------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    f7n01
  Remote host:   f7n02
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.

@drwootton
Contributor Author

I'm just checking on the status of the solution for this issue. I'm also wondering, from the perspective of stability of the PMIx tools interfaces, whether this is a serious enough issue that it needs to be fixed before the upcoming release.

@rhc54
Contributor

rhc54 commented Jun 15, 2021

@drwootton Please give it a try again with the head of PMIx and PRRTE. As for your question: I'm not sure. Sounds like this is just a node identification problem like the others we've been working on, so hopefully we'll get it fixed.

@drwootton
Contributor Author

I updated source for PMIx and PRRTE then I ran my complete set of CI tests. I still see this failure with my 3 multi-node tests but not with any other tests.

@rhc54
Contributor

rhc54 commented Jun 16, 2021

Try putting PRTE_MCA_prte_if_include=10.128.7.0/16 in your environment. If it still fails, I can point you to the relevant code areas and advise on debugging, but that is probably the limit of what I can do to help, as I cannot replicate this anywhere.
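
For example (a sketch; this sets the interface restriction alongside the oob_if_include setting tried earlier):

    export PRTE_MCA_prte_if_include=10.128.7.0/16
    export PRTE_MCA_oob_if_include=10.128.7.0/16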

@drwootton
Contributor Author

I set the environment variable and it still fails.
I also changed the code to handle PMIX_ERR_LOST_CONNECTION instead of PMIX_EVENT_JOB_END and it still fails with these messages.
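
For context, a minimal sketch of what such a registration might look like in the tool code, assuming the standard PMIx event-registration API; the handler and function names here are hypothetical and not the actual indirect-multi code:

    #include <pmix_tool.h>

    /* hypothetical handler invoked when the connection to the job is lost */
    static void lost_connection_hdlr(size_t evhdlr_registration_id, pmix_status_t status,
                                     const pmix_proc_t *source,
                                     pmix_info_t info[], size_t ninfo,
                                     pmix_info_t results[], size_t nresults,
                                     pmix_event_notification_cbfunc_fn_t cbfunc, void *cbdata)
    {
        /* ... signal the tool's main loop that the job/connection is gone ... */
        if (NULL != cbfunc) {
            cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
        }
    }

    /* hypothetical registration callback reporting whether registration succeeded */
    static void reg_cbfunc(pmix_status_t status, size_t refid, void *cbdata)
    {
        if (PMIX_SUCCESS != status) {
            /* handle registration failure */
        }
    }

    static void register_for_lost_connection(void)
    {
        pmix_status_t code = PMIX_ERR_LOST_CONNECTION;  /* instead of PMIX_EVENT_JOB_END */
        PMIx_Register_event_handler(&code, 1, NULL, 0,
                                    lost_connection_hdlr, reg_cbfunc, NULL);
    }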

@rhc54
Contributor

rhc54 commented Jun 16, 2021

And it fails due to two daemons being unable to connect to each other - is that still the failure mode? If so, you might try setting the oob_base_verbose option to see what connections are being attempted and work from there.

@drwootton
Contributor Author

drwootton commented Jun 21, 2021

There's something, most likely timing related, going on with this failure.

If I run the command

    ./indirect-multi --daemon-colocate-per-node 1 prterun --hostfile ./hostfile_4_slots --np 12 ./hello 10

where the hostfile is

c656f7n02
c656f7n03
c656f7n04
c656f7n01

then, with stdio output going to the terminal session where I run the command, it fails something like 4 out of 5 times with the error messages noted above. Once in a while the command runs successfully.

If I redirect stdout and stderr to a file and run the same command, the test completes successfully more frequently, but it still fails a few times.

If I add the --prtemca oob_base_verbose 99 option to the prterun command and redirect stdout and stderr to a file then I get about 2800 lines of output to the file but the test never fails.

I tried running the same command without redirecting stdio, but inside a console session started with the script command, so stdio went to the console but was also written to a file. Five tries out of 5 were successful.
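
The capture session amounted to something like this (a sketch; the log file name is hypothetical):

    script indirect-run.log     # start a session captured by the util-linux script command
    ./indirect-multi --daemon-colocate-per-node 1 prterun --hostfile ./hostfile_4_slots --np 12 ./hello 10
    exit                        # end the capture session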

I don't know where to go from here. Maybe I'm chasing a ghost and there's something weird about my setup, but I am getting a close to 100% failure rate when I run the command without debug flags.

@drwootton
Contributor Author

I am not seeing this problem any more today. The problems I was seeing after approximately Jun 7th may have been caused by backlevel code in my repo. I have run the indirect multi-node testcases a bunch of times and no longer see the error.
