SkyBenchmark: fix job_status is None for failed candidates. #2767
Thanks for fixing this @concretevitamin! Left a question.
sky/benchmark/benchmark_utils.py
Outdated
if job_status is None:
    benchmark_status = benchmark_state.BenchmarkStatus.TERMINATED
elif (cluster_status == status_lib.ClusterStatus.INIT or
      job_status < job_lib.JobStatus.RUNNING):
Suggested change:
- if job_status is None:
-     benchmark_status = benchmark_state.BenchmarkStatus.TERMINATED
- elif (cluster_status == status_lib.ClusterStatus.INIT or
-       job_status < job_lib.JobStatus.RUNNING):
+ if (cluster_status == status_lib.ClusterStatus.INIT or job_status is None or
+         job_status < job_lib.JobStatus.RUNNING):
Reason: when the cluster is being provisioned, the job_status will be None and the benchmark task should be initializing.
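To illustrate why the order of the checks in the suggested condition matters, here is a minimal sketch. The JobStatus enum below is a hypothetical stand-in, not the real job_lib.JobStatus, and is_initializing is an illustrative helper name that does not appear in the PR. It shows that testing `job_status is None` before the `<` comparison avoids the TypeError from #2765.

```python
import enum
from typing import Optional


class JobStatus(enum.Enum):
    """Hypothetical stand-in for job_lib.JobStatus (illustration only)."""
    INIT = 0
    PENDING = 1
    RUNNING = 2
    SUCCEEDED = 3

    def __lt__(self, other: object) -> bool:
        # Order statuses by their position in the lifecycle.
        if not isinstance(other, JobStatus):
            return NotImplemented
        return self.value < other.value


def is_initializing(cluster_is_init: bool,
                    job_status: Optional[JobStatus]) -> bool:
    # `or` short-circuits: the `<` comparison is only evaluated when
    # job_status is a real JobStatus, never when it is None.
    return (cluster_is_init or job_status is None or
            job_status < JobStatus.RUNNING)


def unguarded_comparison_fails() -> bool:
    # Without the None check, the comparison raises TypeError.
    try:
        _ = None < JobStatus.RUNNING
    except TypeError:
        return True
    return False
```

With this stand-in, `is_initializing(False, None)` is true, while the unguarded `None < JobStatus.RUNNING` raises a TypeError.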
There seem to be a few cases:
# 'record' is not None:
# cluster_status None (e.g., preempted/never launched), job_status None
# --> BenchmarkStatus.TERMINATED or INIT
# cluster_status INIT (e.g., something's wrong), job_status None
# --> BenchmarkStatus ??
# cluster_status STOPPED (e.g., manually stopped or auto-stopped), job_status None
# --> BenchmarkStatus ??
# cluster_status UP (e.g., manually stopped or auto-stopped), job_status None
# --> BenchmarkStatus.INIT
# cluster_status UP (e.g., manually stopped or auto-stopped), job_status not-None
# --> handled below
The problem is that the definition of BenchmarkStatus doesn't seem very clear (e.g., for the first case, should we set it to TERMINATED or INIT, and does it matter?). Wdyt?
IMO, it would be better to set the BenchmarkStatus to INIT for all the cases mentioned, as that aligns with the semantics of INIT, i.e., the status is UNKNOWN, abnormal, or initializing. I would not set it to TERMINATED, as that may cause a transition from TERMINATED to a non-terminated state when cluster_status is None or cluster_status is UP while job_status is None, which can be quite surprising. The TERMINATED state should be a sink state.
Maybe we can specially handle the case where the cluster_status is STOPPED, but I think it should be fine to set all of them to BenchmarkStatus.INIT.
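The sink-state concern above can be sketched as a small transition check. This is only an illustration: the BenchmarkStatus enum here is a hypothetical stand-in whose members are assumed, and is_valid_transition is not a function from the PR.

```python
import enum


class BenchmarkStatus(enum.Enum):
    """Hypothetical stand-in for benchmark_state.BenchmarkStatus."""
    INIT = 'INIT'
    RUNNING = 'RUNNING'
    TERMINATED = 'TERMINATED'
    FINISHED = 'FINISHED'


# Only TERMINATED is treated as a sink here, since that is the only
# state the comment above calls a sink state.
SINK_STATES = frozenset({BenchmarkStatus.TERMINATED})


def is_valid_transition(old: BenchmarkStatus, new: BenchmarkStatus) -> bool:
    # Once a benchmark reaches a sink state it must stay there.
    if old in SINK_STATES:
        return new == old
    return True
```

Under this check, a record that was marked TERMINATED while job_status was None and later observed as INIT or RUNNING would be flagged as an invalid transition, which is the surprising behavior the comment warns about.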
Updated logic and PR description, PTAL.
For the first case, it's ok to send it to the sink state TERMINATED, since there's no way to "rerun" the same benchmark name to retry the launch. Wdyt?
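One possible reading of the position reached in this thread, for the cases where job_status is None, is sketched below. It is an assumption pieced together from the comments, not necessarily the merged code; both enums are hypothetical stand-ins and status_without_job is an illustrative name.

```python
import enum
from typing import Optional


class ClusterStatus(enum.Enum):
    """Hypothetical stand-in for status_lib.ClusterStatus."""
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


class BenchmarkStatus(enum.Enum):
    """Hypothetical stand-in for benchmark_state.BenchmarkStatus."""
    INIT = 'INIT'
    RUNNING = 'RUNNING'
    TERMINATED = 'TERMINATED'
    FINISHED = 'FINISHED'


def status_without_job(
        cluster_status: Optional[ClusterStatus]) -> BenchmarkStatus:
    """Benchmark status when job_status is None (assumed mapping)."""
    if cluster_status is None:
        # First case: cluster preempted or never launched. The same
        # benchmark name cannot be relaunched, so a sink state is safe.
        return BenchmarkStatus.TERMINATED
    # INIT, STOPPED, or UP without a job status: treated as initializing,
    # following the reviewer's suggestion.
    return BenchmarkStatus.INIT
```

The STOPPED case could be special-cased later, as the reviewer notes, without changing the shape of this function.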
Thanks for the fix @concretevitamin! LGTM.
sky/benchmark/benchmark_utils.py
Outdated
if end_time is not None:
    # The job has terminated with zero exit code.
    benchmark_status = benchmark_state.BenchmarkStatus.FINISHED
    return end_time, benchmark_state.BenchmarkStatus.FINISHED
It seems this will never happen, since we have already checked end_time in L348?
Good catch! Simplified the logic now.
Simplified the logic + added a basic smoke test. PTAL.
…-org#2767) * SkyBenchmark: fix job_status is None for failed candidates. * Fix --gpus parsing * Logic updates * Simplify and add smoke test.
Attempt to fix #2765.
Repro:
Before: failed to parse --gpus; after fixing the parsing and waiting for the bench to finish launching (A100:8), sky bench show mybench reproduces the error in #2765 ([Benchmark] TypeError: '<' not supported between instances of 'NoneType' and 'JobStatus').
With this PR: no error occurs.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_sky_bench --generic-cloud gcp: passed on this PR; failed on master
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh