-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Transition mila code
to RemoteV2 [MT-80]
#105
Transition mila code
to RemoteV2 [MT-80]
#105
Conversation
cc145f6
to
9762a66
Compare
mila code
mila code
[MT-80]
6dd086f
to
3884d69
Compare
11fff5c
to
4af4758
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #105 +/- ##
==========================================
+ Coverage 72.93% 74.55% +1.61%
==========================================
Files 15 17 +2
Lines 2176 2283 +107
==========================================
+ Hits 1587 1702 +115
+ Misses 589 581 -8 ☔ View full report in Codecov by Sentry. |
4af4758
to
959d798
Compare
c1e55e9
to
2126dfa
Compare
91c67b3
to
762f96d
Compare
mila code
[MT-80]mila code
to use RemoteV2 [MT-80]
92463f4
to
8f5eaae
Compare
mila code
to use RemoteV2 [MT-80]mila code
to RemoteV2 [MT-80]
1c02eaa
to
6cd29bd
Compare
ab8f74e
to
69aac5e
Compare
Signed-off-by: Fabrice Normandin <[email protected]>
e4e7e7f
to
188c254
Compare
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
70bf903
to
efa3eda
Compare
Signed-off-by: Fabrice Normandin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review done, will do the tests review
milatools/cli/code.py
Outdated
""" | ||
# Check that the `code` command is in the $PATH so that we can use just `code` as | ||
# the command. | ||
code_command = command |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need code_command
? Can we just use command
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Fixed in 4feafe0
# todo: use the mila or the local machine as the reference for vscode | ||
# extensions? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would say mila but I'm not sure if this would change much as users would probably use the sync feature of vscode it self? I wonder if there could be incompatibility between local and cluster if local is not linux?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure either! I guess we'll have to test it out.
milatools/cli/code.py
Outdated
# ideally, so we can collect some simple stats about the use of `milatools` on | ||
# the clusters. | ||
if any(flag == "-J" or "-J=" in flag or "--job-name" in flag for flag in alloc): | ||
# todo: Get the job name from the flags instead? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A warning could be a good idea with something like this :
# todo: Get the job name from the flags instead? | |
import argparse | |
parser = argparse.ArgumentParser() | |
parser.add_argument("-J", "--job-name") | |
args, alloc = parser.parse_known_args(alloc) | |
if args.job_name: | |
#warning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not parsing the job name at the moment, but yes I'd definitely do something like this if we decide to do so.
# TODO: BUG: This now requires two Ctrl+C's instead of one! | ||
console.print( | ||
"The editor was closed. Reopen it with <Enter> or terminate the " | ||
"process with <Ctrl+C> (maybe twice)." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This had to do with how asyncio deals with KeyboardInterrupt, but I might have fixed it. I'll double check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it still seems necessary to do Ctrl+C twice. I'm not sure why exactly.
milatools/utils/parallel_progress.py
Outdated
if not currently_in_a_test(): | ||
columns.append(TimeElapsedColumn()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be mocked to output a fixed string instead of mixing test code to real code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 31b55dd
) | ||
# iterate over the jobs we need to run | ||
for task_description, async_task_fn in zip(task_descriptions, async_task_fns): | ||
# NOTE: Could set visible=false so we don't have a lot of bars all at once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for the idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applied in 94f690d
|
||
for result in parallel_progress_bar([_example_task_fn, _example_task_fn]): | ||
print(result) | ||
class AsyncTaskFn(Protocol[OutT_co]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for typing help cues or does this have a practical utility in the async progress bars?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the signature of a function that is used by the async progress bar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
your IDE is then able to show you if your function is valid by checking against this protocol
milatools/utils/vscode_utils.py
Outdated
# Connect to the remotes in parallel. | ||
dest_runners_and_hostnames = await asyncio.gather( | ||
*(_get_runner_and_hostname(dest) for dest in destinations) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not seam necessary to have this bit of coda and the _get_runner_and_hostname
util as _install_vscode_extensions_task_function
will connect in parallel anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 94f690d
milatools/utils/vscode_utils.py
Outdated
if dest_hostname == "localhost": | ||
remote = LocalV1() | ||
remote = LocalV2() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure to understand how and why we would be syncing to localhost? Wouldn't it be simpler to manage the extensions locally through the vscode interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to get the same vscode extensions locally as you had on the Mila cluster. Sure, if you have "settings sync" setup in vscode, it could probably sync it better than this would, but I don't see why we shouldn't support doing this.
], | ||
) | ||
def test_internet_on_compute_nodes(cluster: str, expected: bool): | ||
assert internet_on_compute_nodes(cluster) == expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to make a real internet connection test when we'll move away from hardcoded list of clusters
if login_node_v2.hostname == "localhost": | ||
pytest.skip( | ||
"TODO: This test doesn't yet work with the slurm cluster spun up in the GitHub CI." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we extracte the code specific to the environment (GH CI here) and rather parameterize the environment to have env agnostic tests? Maybe here we could use something like :
pytest --deselect tests/integration/test_code.py::test_code
in the workflow if it doesn't work in the GH CI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how we could do this, perhaps we can chat about this in-person?
doesnt_create_new_jobs = pytest.mark.usefixtures( | ||
doesnt_create_new_jobs_fixture.__name__ | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the same as having @pytest.fixture(autouse=True)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite, it creates a marker that can be applied to a test to make it use a given fixture.
# hostname! | ||
node = removesuffix(hostname, ".server.mila.quebec") | ||
|
||
if not use_v1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need to test v1? If it's deprecated code and we're note touching / supporting it maybe we could also clean-up the test code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're still supporting it though, at least for now.
When I make a PR to remove the v1, I'll also remove anything related to v1 from here, but in the meantime this makes it so we have decent coverage of code_v1 also
compute_node_or_job_id = await code( | ||
path=relative_path, | ||
command="echo", # replace the usual `code` with `echo` for testing. | ||
# idea: Could probably also return the process ID of the `code` editor? | ||
persist=persist, | ||
job=None, | ||
node=None, | ||
alloc=allocation_flags, | ||
cluster=login_node_v2.hostname, # type: ignore | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's the case but just to be sure, are we using the job name test-milatools
or something like that to identify test jobs or something like that and is that why we are sure to delete the remaining jobs even when we encounter errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite. When using mila code
, we're preventing users from setting the job name flag, which also (at the moment) forces us to use the "real" mila-code
job-name. I filter out my own jobs when scraping the recent jobs on the cluster. This is not ideal, and I'd be interested if you have suggestions on what we could do differently here, for sure.
As for the "we are sure to delete the remaining jobs even when we encounter errors", are you referring to the cancel_new_jobs_on_interrupt
?
tests/integration/test_code.py
Outdated
"""Gets a compute node connecting to a running job on the cluster. | ||
|
||
This avoids making an allocation if possible, by reusing an already-running job | ||
with the name `persistent_job_name` or `job_name`, or any running job if found. | ||
|
||
This will try to use either persistent or temporary jobs (small jobs that are used | ||
across test runs, or jobs that are spawned just for a single test run) to avoid | ||
making a new allocation on the cluster. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a change a non-persistent job could die before the end of the tests? Otherwise, what would be the difference between the persistent and non-persistent job?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simplified this slightly in 27fbd5a, removing the idea of "persistent" jobs.
This will now try to reuse a running test job if one is found. If not, then it will spawn a new job with salloc
.
If we wanted to, we could also just make a new job allocation per test run and not try to reuse test jobs at all, but this seems a bit wasteful to me.
assert "CANCELLED" in job_info["State"] or job_info["State"] in [ | ||
"RUNNING", | ||
# fixme: Not sure why this is the case, but the function is being | ||
# deprecated anyway. | ||
"COMPLETED", | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code is being deprecated anyway but I was wondering why it was "CANCELLED" in job_info["State"]
instead of "CANCELLED" in job_info["State"]
? Can there be a suffix or a prefix to CANCELLED
in job_info["State"]
and what would it be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sometimes saw it be "CANCELLED by (...)", cant recall what exactly.
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Based off of #112
mila code
(thanks toRemoteV2
)code
to useRemoteV2
.code
tocode_v1
,code
-->code_v1
for Windows and add a newcode
implementation for WSL/MacOS/Linux.mila code
command for Windows users, printing a warning encouraging to switch to WSL.