
Transition mila code to RemoteV2 [MT-80] #105

Merged

Conversation

@lebrice (Collaborator) commented Mar 21, 2024

Based on #112

  • Support SSH keys with passphrases in mila code (thanks to RemoteV2)
  • Adjust code to use RemoteV2.
  • Rename code to code_v1 (on Windows, code still maps to code_v1), and add a new code implementation for WSL/macOS/Linux.
    • For now, Windows users keep the current mila code command, which prints a warning encouraging them to switch to WSL.

@lebrice force-pushed the support_ssh_keys_with_passphrase_mila_code branch 2 times, most recently from cc145f6 to 9762a66 on March 25, 2024 20:49
@lebrice changed the title from "Support ssh keys with passphrase in mila code" to "Support ssh keys with passphrase in mila code [MT-80]" on Mar 25, 2024
@lebrice force-pushed the support_ssh_keys_with_passphrase_mila_code branch 3 times, most recently from 6dd086f to 3884d69 on March 26, 2024 16:32
@lebrice mentioned this pull request on Mar 26, 2024
@lebrice force-pushed the support_ssh_keys_with_passphrase_mila_code branch 3 times, most recently from 11fff5c to 4af4758 on March 26, 2024 17:38
codecov-commenter commented Mar 26, 2024

Codecov Report

Attention: Patch coverage is 85.67335%, with 50 lines in your changes missing coverage. Please review.

Project coverage is 74.55%. Comparing base (9b4a7cd) to head (1972cf8).

Files                                  Patch %   Lines
milatools/utils/vscode_utils.py        80.21%    18 Missing ⚠️
milatools/cli/code.py                  82.43%    13 Missing ⚠️
milatools/utils/disk_quota.py          86.95%     9 Missing ⚠️
milatools/utils/parallel_progress.py   88.13%     7 Missing ⚠️
milatools/cli/commands.py              92.50%     3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #105      +/-   ##
==========================================
+ Coverage   72.93%   74.55%   +1.61%     
==========================================
  Files          15       17       +2     
  Lines        2176     2283     +107     
==========================================
+ Hits         1587     1702     +115     
+ Misses        589      581       -8     

☔ View full report in Codecov by Sentry.

@lebrice force-pushed the support_ssh_keys_with_passphrase_mila_code branch from 4af4758 to 959d798 on March 26, 2024 19:45
@lebrice marked this pull request as ready for review on April 2, 2024 14:56
@lebrice force-pushed the support_ssh_keys_with_passphrase_mila_code branch 3 times, most recently from c1e55e9 to 2126dfa on April 11, 2024 19:05
@lebrice marked this pull request as draft on April 11, 2024 19:06
@lebrice force-pushed the support_ssh_keys_with_passphrase_mila_code branch 3 times, most recently from 91c67b3 to 762f96d on April 11, 2024 20:13
@lebrice changed the title from "Support ssh keys with passphrase in mila code [MT-80]" to "Transition mila code to use RemoteV2 [MT-80]" on Apr 11, 2024
@lebrice force-pushed the support_ssh_keys_with_passphrase_mila_code branch 3 times, most recently from 92463f4 to 8f5eaae on April 11, 2024 20:35
@lebrice changed the title from "Transition mila code to use RemoteV2 [MT-80]" to "Transition mila code to RemoteV2 [MT-80]" on Apr 11, 2024
@lebrice force-pushed the support_ssh_keys_with_passphrase_mila_code branch 3 times, most recently from 1c02eaa to 6cd29bd on April 12, 2024 19:01
@lebrice force-pushed the support_ssh_keys_with_passphrase_mila_code branch from ab8f74e to 69aac5e on April 18, 2024 16:13
@lebrice force-pushed the support_ssh_keys_with_passphrase_mila_code branch from e4e7e7f to 188c254 on May 24, 2024 17:10
@lebrice force-pushed the support_ssh_keys_with_passphrase_mila_code branch from 70bf903 to efa3eda on May 24, 2024 17:16
@lebrice requested review from satyaog, breuleux and obilaniu on May 28, 2024 13:35

@satyaog (Member) left a comment:

Code review done, will do the tests review

"""
# Check that the `code` command is in the $PATH so that we can use just `code` as
# the command.
code_command = command
Member:

Do we need `code_command`? Can we just use `command`?

@lebrice (Author) commented Jun 3, 2024:

Sure. Fixed in 4feafe0

Comment on lines +79 to +80
# todo: use the mila or the local machine as the reference for vscode
# extensions?
Member:

I think I would say mila, but I'm not sure this would change much, since users would probably use the sync feature of vscode itself? I wonder if there could be incompatibilities between local and cluster if local is not Linux?

@lebrice (Author):

Not sure either! I guess we'll have to test it out.

milatools/cli/code.py (outdated thread, resolved)
# ideally, so we can collect some simple stats about the use of `milatools` on
# the clusters.
if any(flag == "-J" or "-J=" in flag or "--job-name" in flag for flag in alloc):
    # todo: Get the job name from the flags instead?
Member:

A warning could be a good idea, with something like this:

Suggested change
# todo: Get the job name from the flags instead?
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("-J", "--job-name")
args, alloc = parser.parse_known_args(alloc)
if args.job_name:
    ...  # emit a warning here

@lebrice (Author):

I'm not parsing the job name at the moment, but yes I'd definitely do something like this if we decide to do so.

Comment on lines +181 to +185
# TODO: BUG: This now requires two Ctrl+C's instead of one!
console.print(
    "The editor was closed. Reopen it with <Enter> or terminate the "
    "process with <Ctrl+C> (maybe twice)."
)
Member:

Do we know why?

@lebrice (Author):

This had to do with how asyncio deals with KeyboardInterrupt, but I might have fixed it. I'll double check.

@lebrice (Author):

Yeah it still seems necessary to do Ctrl+C twice. I'm not sure why exactly.
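For context, here is a minimal standalone sketch (not milatools code; the sleeps are stand-ins) of one way asyncio programs end up needing two Ctrl+C's: the first one cancels the main task, and if the cancellation handler itself awaits, only a second KeyboardInterrupt escapes asyncio.run():

import asyncio

async def main():
    try:
        await asyncio.sleep(3600)  # stands in for waiting on the remote process
    except asyncio.CancelledError:
        # The first Ctrl+C lands here via task cancellation; while this
        # cleanup awaits, only a second Ctrl+C breaks out of asyncio.run().
        await asyncio.sleep(5)  # stands in for async cleanup (e.g. cancelling the job)
        raise

try:
    asyncio.run(main())
except KeyboardInterrupt:
    print("interrupted")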

Comment on lines 122 to 123
if not currently_in_a_test():
    columns.append(TimeElapsedColumn())
Member:

This should probably be mocked to output a fixed string instead of mixing test code into production code.

@lebrice (Author) commented Jun 3, 2024:

Fixed in 31b55dd
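For reference, a minimal sketch of the suggested mocking approach (the test name and assertion are hypothetical): patch rich's TimeElapsedColumn to render a fixed string, so the production code no longer needs a currently_in_a_test() branch:

from rich.text import Text
import rich.progress

def test_progress_bar_output(monkeypatch, capsys):
    # Render a constant elapsed time so the captured output is deterministic.
    monkeypatch.setattr(
        rich.progress.TimeElapsedColumn,
        "render",
        lambda self, task: Text("0:00:00"),
    )
    # ... run the code that draws the progress bar, then assert on capsys output.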

)
# iterate over the jobs we need to run
for task_description, async_task_fn in zip(task_descriptions, async_task_fns):
    # NOTE: Could set visible=False so we don't have a lot of bars all at once.
Member:

+1 for the idea

@lebrice (Author):

Applied in 94f690d
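For reference, a minimal sketch of the idea with rich (job names and counts are made up): register every bar hidden up front, then reveal each one only when it starts:

import time
from rich.progress import Progress

with Progress() as progress:
    # Register all bars up front, but keep them hidden.
    task_ids = [
        progress.add_task(f"job {i}", total=10, visible=False) for i in range(5)
    ]
    for task_id in task_ids:
        progress.update(task_id, visible=True)  # reveal the bar once it starts
        for _ in range(10):
            time.sleep(0.01)
            progress.advance(task_id)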


for result in parallel_progress_bar([_example_task_fn, _example_task_fn]):
    print(result)

class AsyncTaskFn(Protocol[OutT_co]):
Member:

Is this for typing help cues or does this have a practical utility in the async progress bars?

@lebrice (Author):

This is the signature of a function that is used by the async progress bar.

@lebrice (Author):

Your IDE is then able to tell you whether your function is valid by checking it against this protocol.
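For illustration, a self-contained sketch of how such a callback protocol works; the real AsyncTaskFn takes progress-reporting arguments that are omitted here:

import asyncio
from typing import Protocol, TypeVar

OutT_co = TypeVar("OutT_co", covariant=True)

class AsyncTaskFn(Protocol[OutT_co]):
    async def __call__(self) -> OutT_co: ...

async def fetch_greeting() -> str:
    await asyncio.sleep(0.1)
    return "hello"

def run_one(fn: AsyncTaskFn[str]) -> str:
    # A type checker verifies that `fn` structurally matches the protocol.
    return asyncio.run(fn())

print(run_one(fetch_greeting))  # prints "hello"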

Comment on lines 105 to 108
# Connect to the remotes in parallel.
dest_runners_and_hostnames = await asyncio.gather(
    *(_get_runner_and_hostname(dest) for dest in destinations)
)
Member:

It does not seem necessary to have this bit of code and the _get_runner_and_hostname util, as _install_vscode_extensions_task_function will connect in parallel anyway?

@lebrice (Author):

Fixed in 94f690d

Comment on lines 203 to 204
if dest_hostname == "localhost":
-    remote = LocalV1()
+    remote = LocalV2()
Member:

I'm not sure I understand how and why we would be syncing to localhost? Wouldn't it be simpler to manage the extensions locally through the vscode interface?

@lebrice (Author):

You might want to get the same vscode extensions locally as you had on the Mila cluster. Sure, if you have "settings sync" set up in vscode, it could probably sync them better than this would, but I don't see why we shouldn't support doing this.

],
)
def test_internet_on_compute_nodes(cluster: str, expected: bool):
    assert internet_on_compute_nodes(cluster) == expected
Member:

We might want to do a real internet connection test once we move away from the hardcoded list of clusters.
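A rough sketch of what such a check could look like (entirely hypothetical; a faithful version would run the probe inside a job allocation rather than over a plain SSH to the cluster):

import subprocess

def internet_on_compute_nodes(cluster: str) -> bool:
    """Best-effort probe: can a node on `cluster` reach the outside world?"""
    result = subprocess.run(
        ["ssh", cluster, "curl", "--max-time", "5", "-sI", "https://example.com"],
        capture_output=True,
    )
    return result.returncode == 0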

Comment on lines +83 to +86
if login_node_v2.hostname == "localhost":
    pytest.skip(
        "TODO: This test doesn't yet work with the slurm cluster spun up in the GitHub CI."
    )
Member:

Could we extract the code specific to the environment (the GH CI here) and instead parameterize the environment, to have environment-agnostic tests? Maybe here we could use something like `pytest --deselect tests/integration/test_code.py::test_code` in the workflow if it doesn't work in the GH CI?

@lebrice (Author):

I'm not sure how we could do this; perhaps we can chat about this in person?

Comment on lines +296 to +298
doesnt_create_new_jobs = pytest.mark.usefixtures(
    doesnt_create_new_jobs_fixture.__name__
)
Member:

is this the same as having @pytest.fixture(autouse=True)?

@lebrice (Author):

Not quite, it creates a marker that can be applied to a test to make it use a given fixture.
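For illustration, a sketch of the pattern (the fixture body is hypothetical):

import pytest

@pytest.fixture
def doesnt_create_new_jobs_fixture(monkeypatch):
    ...  # e.g. patch out anything that would submit a new Slurm job

# A reusable marker: only tests decorated with it use the fixture, unlike
# @pytest.fixture(autouse=True), which would apply it to every test in scope.
doesnt_create_new_jobs = pytest.mark.usefixtures("doesnt_create_new_jobs_fixture")

@doesnt_create_new_jobs
def test_connects_to_existing_job():
    ...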

# hostname!
node = removesuffix(hostname, ".server.mila.quebec")

if not use_v1:
Member:

Do we still need to test v1? If it's deprecated code and we're not touching / supporting it, maybe we could also clean up the test code?

@lebrice (Author):

We're still supporting it, though, at least for now.
When I make a PR to remove v1, I'll also remove anything related to v1 from here, but in the meantime this gives us decent coverage of code_v1 as well.

Comment on lines +105 to +114
compute_node_or_job_id = await code(
    path=relative_path,
    command="echo",  # replace the usual `code` with `echo` for testing.
    # idea: Could probably also return the process ID of the `code` editor?
    persist=persist,
    job=None,
    node=None,
    alloc=allocation_flags,
    cluster=login_node_v2.hostname,  # type: ignore
)
Member:

I think it's the case, but just to be sure: are we using a job name like test-milatools to identify test jobs, and is that why we are sure to delete the remaining jobs even when we encounter errors?

@lebrice (Author):

Not quite. When using mila code, we're preventing users from setting the job name flag, which also (at the moment) forces us to use the "real" mila-code job-name. I filter out my own jobs when scraping the recent jobs on the cluster. This is not ideal, and I'd be interested if you have suggestions on what we could do differently here, for sure.

As for the "we are sure to delete the remaining jobs even when we encounter errors", are you referring to the cancel_new_jobs_on_interrupt?

Comment on lines 249 to 257
"""Gets a compute node connecting to a running job on the cluster.

This avoids making an allocation if possible, by reusing an already-running job
with the name `persistent_job_name` or `job_name`, or any running job if found.

This will try to use either persistent or temporary jobs (small jobs that are used
across test runs, or jobs that are spawned just for a single test run) to avoid
making a new allocation on the cluster.
"""
Member:

Is there a chance a non-persistent job could die before the end of the tests? Otherwise, what would be the difference between the persistent and non-persistent jobs?

@lebrice (Author):

I simplified this slightly in 27fbd5a, removing the idea of "persistent" jobs.
This will now try to reuse a running test job if one is found. If not, then it will spawn a new job with salloc.
If we wanted to, we could also just make a new job allocation per test run and not try to reuse test jobs at all, but this seems a bit wasteful to me.
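A sketch of the reuse-then-allocate strategy described above (the job name and command-line details are illustrative, not the actual test code):

import getpass
import subprocess

def find_reusable_test_node(job_name: str = "mila-code-tests") -> str | None:
    """Return the node of an already-running test job, or None if there is none."""
    running_nodes = subprocess.run(
        ["squeue", "-u", getpass.getuser(), "--name", job_name,
         "--states", "RUNNING", "--noheader", "--format", "%N"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    if running_nodes:
        return running_nodes[0]  # reuse the existing job's node
    return None  # caller falls back to `salloc` to spawn a fresh job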

Comment on lines +99 to +104
assert "CANCELLED" in job_info["State"] or job_info["State"] in [
"RUNNING",
# fixme: Not sure why this is the case, but the function is being
# deprecated anyway.
"COMPLETED",
]
Member:

The code is being deprecated anyway, but I was wondering why it was `"CANCELLED" in job_info["State"]` instead of `job_info["State"] == "CANCELLED"`? Can there be a suffix or a prefix around CANCELLED in job_info["State"], and what would it be?

@lebrice (Author):

I sometimes saw it be "CANCELLED by (...)"; I can't recall exactly what.

lebrice added 15 commits June 3, 2024 09:51
@lebrice requested a review from satyaog on June 5, 2024 18:59
@lebrice merged commit 57a7b2c into mila-iqia:master on July 11, 2024 (19 checks passed)
@lebrice deleted the support_ssh_keys_with_passphrase_mila_code branch on July 11, 2024 15:40