-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Experiment Can Stop Launched Jobs #677
Conversation
launchers how to stop the jobs that they manage.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## smartsim-refactor #677 +/- ##
=====================================================
+ Coverage 40.45% 43.59% +3.13%
=====================================================
Files 110 110
Lines 7326 7076 -250
=====================================================
+ Hits 2964 3085 +121
+ Misses 4362 3991 -371
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple of questons/comments
raise ValueError("No job ids provided") | ||
by_launcher = self._launch_history.group_by_launcher(set(ids), unknown_ok=True) | ||
id_to_stop_stat = ( | ||
launcher.stop_jobs(*launched).items() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I am thinking about here is stop_jobs may raise a smartsim.error.errors.LauncherJobNotFound
, and if it does, it seems like the process only partially completes. Should we complete all of the ones that can complete and then (re)raise the error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory this cannot happen if the _launch_history
is never corrupted (e.g. an id is never mapped to a launcher that never issued it), but I do agree that this could become a problem, especially if users start to make their own launchers.
I moved this to a new issue where it can be discussed further because I think this may be a problem with other methods (e.g. get_status
, wait
) as well as potentially other errors that can arise when stopping over a collection of ids.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Add an
Experiment.stop
method to stop jobs prematurely. Teach launchers how to stop the jobs that they manage.