Dragon server enhancement #582

Merged: 7 commits, May 14, 2024

al-rigazzi
Collaborator

The Dragon server could fail and dump a core file if it was shut down before all spawned Process Groups had completed. This PR fixes that behavior: when the `immediate` flag is set on a `DragonShutdownRequest`, every non-terminated job is now requested to stop.
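
For readers unfamiliar with the pattern, here is a minimal sketch of an "immediate" shutdown that stops every non-terminated job before the server goes down. It is not the code in this PR: `ToyBackend`, `Status`, `stop_job`, and `shutdown` are hypothetical names invented for illustration, and a real backend would signal the underlying process groups rather than flip a status.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    CANCELLED = "cancelled"


@dataclass
class ToyBackend:
    """Hypothetical stand-in for a server that tracks spawned jobs."""

    jobs: dict[str, Status] = field(default_factory=dict)
    accepting: bool = True

    def stop_job(self, job_id: str) -> None:
        # Assumption: a real backend would stop the process group here.
        self.jobs[job_id] = Status.CANCELLED

    def shutdown(self, immediate: bool) -> list[str]:
        """Stop accepting work; if immediate, stop every non-terminated job."""
        self.accepting = False
        stopped: list[str] = []
        if immediate:
            # Iterate over a snapshot so stop_job can update the dict safely.
            for job_id, status in list(self.jobs.items()):
                if status is Status.RUNNING:
                    self.stop_job(job_id)
                    stopped.append(job_id)
        return stopped


backend = ToyBackend(
    jobs={"a": Status.RUNNING, "b": Status.COMPLETED, "c": Status.RUNNING}
)
stopped = backend.shutdown(immediate=True)
```

In this sketch only the jobs still running are touched; jobs that already completed keep their final status.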

@al-rigazzi al-rigazzi requested a review from ashao May 14, 2024 16:17
Collaborator

@ashao ashao left a comment

One quick question, but otherwise it looks good. Thanks for tracking this down

@@ -130,7 +130,10 @@ def redir_worker(io_conn: dragon_connection.Connection, file_path: str) -> None:
     except Exception as e:
         print(e)
     finally:
-        io_conn.close()
+        try:
+            io_conn.close()
Collaborator

Can you explain why this is a universal catch and print? Is it that io_conn can fail in a number of ways? Would re-raising the error result in too much crashing down?

Collaborator Author

Yeah, I'm a bit scared of raising an exception and taking down all the dragon infra. I'll put up a ticket to investigate better ways of handling this.
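
A minimal sketch of the guarded-close pattern discussed in this thread, for reference only. It is not the code in this PR: `FlakyConnection` and `safe_close` are hypothetical names, and the sketch logs the failure with the standard `logging` module instead of printing it, which is one possible alternative rather than the approach chosen here.

```python
import logging

logger = logging.getLogger("redir_worker_sketch")


class FlakyConnection:
    """Hypothetical connection whose close() may raise."""

    def __init__(self, fail_on_close: bool) -> None:
        self.fail_on_close = fail_on_close
        self.closed = False

    def close(self) -> None:
        if self.fail_on_close:
            raise RuntimeError("connection already torn down")
        self.closed = True


def safe_close(conn) -> bool:
    """Close conn; log any failure instead of letting it propagate."""
    try:
        conn.close()
    except Exception:
        # logger.exception records the message plus the active traceback.
        logger.exception("Could not close connection")
        return False
    return True


ok = safe_close(FlakyConnection(fail_on_close=False))
failed = safe_close(FlakyConnection(fail_on_close=True))
```

The trade-off is the one raised above: swallowing the exception keeps the worker from taking other components down with it, at the cost of hiding the failure from callers, so returning a status (as assumed here) leaves room for them to react.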

codecov bot commented May 14, 2024

Codecov Report

Attention: Patch coverage is 0%, with 15 lines in your changes missing coverage. Please review.

Project coverage is 60.63%. Comparing base (781d4b6) to head (bdbc2fd).

@@             Coverage Diff              @@
##           develop     #582       +/-   ##
============================================
- Coverage    79.12%   60.63%   -18.50%     
============================================
  Files           78       78               
  Lines         5988     6000       +12     
============================================
- Hits          4738     3638     -1100     
- Misses        1250     2362     +1112     

| Files | Coverage Δ |
|---|---|
| smartsim/_core/launcher/dragon/dragonLauncher.py | 26.92% <0.00%> (ø) |
| smartsim/_core/launcher/dragon/dragonBackend.py | 2.39% <0.00%> (-76.13%) ⬇️ |

... and 42 files with indirect coverage changes

@al-rigazzi al-rigazzi merged commit 4e7302e into CrayLabs:develop May 14, 2024
36 checks passed
@al-rigazzi al-rigazzi deleted the drg_prerelease branch May 14, 2024 18:55