Re-enable the debugger PRRTE tests #88

Closed
2 tasks done
jjhursey opened this issue Apr 5, 2021 · 11 comments · Fixed by #102

@jjhursey
Member

jjhursey commented Apr 5, 2021

We need to get the PRRTE debugger tests running again. They hit a signature issue last week and had to be disabled.

To do:

  • Fix baseline signatures in the pmix-tests branch
  • Fix tests in PRRTE examples/debugger, as needed

Discussion context:

@jjhursey jjhursey added the bug Something isn't working label Apr 5, 2021
@jjhursey
Member Author

jjhursey commented Apr 5, 2021

A note about Jenkins and branches.

  • In testing, we need three repos to run the tests: prrte, openpmix, and pmix-tests
  • For a PR in any one of those three repos CI uses the candidate branch of that PR plus the master branch of the other two repos.
    • For example, if I have a PR branch foo against PRRTE then CI tests with: prrte repo with foo branch, openpmix repo with master branch, and pmix-tests repo with master branch.
  • The above works as long as the changes needed to update or fix CI are in only one of those repos.
  • It falls short when changes in two or more repos must work together, for example a fix branch in the prrte repo and a corresponding fix-test branch in pmix-tests.
  • I took a note to add a hook that lets you specify the repo and branch for all three repos, so that this kind of coordinated change can be tested. I'll work on that separately from this ticket.
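
The branch selection described above can be sketched roughly as follows. This is only an illustration, not the actual Jenkins configuration: the function name `select_branches` and the `overrides` argument are hypothetical, with `overrides` standing in for the proposed hook.

```python
from typing import Dict, Optional

# The three repositories CI needs, as listed above.
REPOS = ("prrte", "openpmix", "pmix-tests")
DEFAULT_BRANCH = "master"


def select_branches(pr_repo: str, pr_branch: str,
                    overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return the branch to check out for each repo.

    pr_repo / pr_branch: the repo the PR targets and its candidate branch.
    overrides: hypothetical hook that pins branches of the other repos.
    """
    if pr_repo not in REPOS:
        raise ValueError(f"unknown repo: {pr_repo}")

    # Start with master everywhere, then swap in the PR's candidate branch.
    branches = {repo: DEFAULT_BRANCH for repo in REPOS}
    branches[pr_repo] = pr_branch

    # Apply any explicit repo -> branch pins (the proposed hook).
    for repo, branch in (overrides or {}).items():
        if repo not in REPOS:
            raise ValueError(f"unknown repo in overrides: {repo}")
        branches[repo] = branch
    return branches


if __name__ == "__main__":
    # Current behavior: PR branch foo against prrte, master for the rest.
    print(select_branches("prrte", "foo"))
    # Coordinated change: fix in prrte plus fix-test in pmix-tests.
    print(select_branches("prrte", "fix", {"pmix-tests": "fix-test"}))
```

Without `overrides`, the second case would test the prrte fix branch against pmix-tests master, which is exactly the situation where CI falls short today.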

@jjhursey
Member Author

jjhursey commented Apr 6, 2021

A few PRRTE repairs are needed before we are back at 100%

@drwootton
Contributor

drwootton commented Jun 14, 2021

There are a number of CI tests defined in the run.py script. The tests and their current status are tracked in this issue, as follows:

direct - working: (start prte system daemon then run ./direct)

attach - fails (start prterun -n 2 --report-uri + hello 60 then run attach <prterun-ns>)
I think this is a testcase problem. If I register for the PMIX_EVENT_JOB_ENDED event and try to qualify the registration
with the proc of the debug daemon, attach hangs and never sees the event. If I don't qualify the registration, then attach
sees the event and does not hang, but there are other problems with the attach colaunch cases. Need to investigate.

indirect-prterun - fails (run indirect prterun -n 2 hello 10)
I need to resolve the handling of PMIX_EVENT_JOB_ENDED vs PMIX_ERR_LOST_CONNECTION.
I also notice with this testcase that any debug daemon printf output after the call to PMIx_tool_finalize is lost.
I think there are changes in termination and cleanup processing such that I cannot rely on any output from the daemon
or application processes after PMIx_tool_finalize is called. I need to update the daemon and hello programs accordingly.

indirect-prun - working (start prte system daemon then run indirect prun -n 2 hello 10)

direct-multi - working (start prte system daemon then run direct-multi --app-pernode 2 --app-np 6 --hostfile ./hostfile)

direct-colaunch1 - working (start prte system daemon then run direct-multi --daemon-colocate-per-node --app-pernode 2 --app-np 6 --hostfile ./hostfile)

direct-colaunch2 - working (start prte system daemon then run direct-multi --daemon-colocate-per-proc --app-pernode 2 --app-np 6 --hostfile ./hostfile)

indirect-multi - fails (run indirect-multi --num-nodes 3 --np 12 --hostfile ./hostfile hello)
fails due to PRRTE issue #978

indirect-colaunch1 - fails (run indirect-multi --colocate-per-node 1 --num-nodes 3 --np 12 --hostfile ./hostfile hello)
fails due to PRRTE issue #978

indirect-colaunch2 - fails (run indirect-multi --colocate-per-proc 1 --num-nodes 3 --np 12 --hostfile ./hostfile hello)
fails due to PRRTE issue #978

attach-colaunch1 - working (start multi-node prterun --report-uri + command then
run attach --daemon-colocate-per-node 1 <prterun_ns>)

attach-colaunch2 - working (start multi-node prterun --report-uri + command then
run attach --daemon-colocate-per-proc 1 <prterun_ns>)
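
For quick reference, the status list above can be captured as data and grouped by blocker. This is just a sketch and not part of run.py: the `STATUS` table and `failing_by_blocker` helper are hypothetical names, and the `--colocate-per-proc` indirect case is listed under the name indirect-colaunch2 used later in this thread.

```python
from collections import defaultdict

# (test name, status, blocker) as reported in this comment.
# blocker is None when no issue number was given for the failure.
STATUS = [
    ("direct", "working", None),
    ("attach", "fails", None),
    ("indirect-prterun", "fails", None),
    ("indirect-prun", "working", None),
    ("direct-multi", "working", None),
    ("direct-colaunch1", "working", None),
    ("direct-colaunch2", "working", None),
    ("indirect-multi", "fails", "PRRTE #978"),
    ("indirect-colaunch1", "fails", "PRRTE #978"),
    ("indirect-colaunch2", "fails", "PRRTE #978"),
    ("attach-colaunch1", "working", None),
    ("attach-colaunch2", "working", None),
]


def failing_by_blocker(status):
    """Group the failing test names by the issue that blocks them."""
    groups = defaultdict(list)
    for name, state, blocker in status:
        if state == "fails":
            groups[blocker or "needs investigation"].append(name)
    return dict(groups)


if __name__ == "__main__":
    for blocker, names in failing_by_blocker(STATUS).items():
        print(f"{blocker}: {', '.join(names)}")
```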

@jjhursey
Member Author

Ref: openpmix/prrte#978

@jjhursey
Member Author

Ref: openpmix/prrte#1004

@drwootton
Contributor

drwootton commented Jun 28, 2021

The problems with the attach and indirect testcases are fixed. This required CI testcase baseline file updates as well as updates to the testcases.

The currently failing testcases are indirect-multi, indirect-colaunch1, and indirect-colaunch2, all due to openpmix/prrte#978.

The pull requests that resolve the other CI test failures are openpmix/prrte#1010 and #93.

@drwootton
Contributor

As of today, all PRRTE debugger examples are running successfully. Pull request openpmix/prrte#1030 contains the debugger example changes that resolve the outstanding problems. Because those changes altered the examples' output, the baselines had to be updated, which was done in pull request #97.

@rhc54
Contributor

rhc54 commented Jul 14, 2021

I'll be back on Fri and can take a look at all the parts then - I expect it will be fine and we are now ready to bring these back online.

@rhc54
Contributor

rhc54 commented Jul 16, 2021

Should be good-to-go now.

@rhc54 rhc54 closed this as completed Jul 16, 2021
@jjhursey
Member Author

These are still disabled in CI by default:

@jjhursey jjhursey reopened this Jul 19, 2021
@rhc54
Contributor

rhc54 commented Jul 19, 2021

Worth giving it a try - sure, please do so and let's see how it goes.
