Enable specification of target hostname for a dragon task (#660)
## Description

This PR adds two features:

1. Ability to specify hostnames that tasks should run on
2. Ability to colocate tasks on the same node

### Specifying Hostnames

The existing `DragonRunRequest` supported the ability to specify a
hostname when creating a policy used to run a task. However, the
hostnames were not exposed to clients.

This PR lets clients pass a list of hosts that is used in place of the
default "first available host" behavior.
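
As a rough illustration of how a client-supplied host list might be turned into individual hostnames, here is a minimal sketch. The `host-list` run-argument key comes from the diff in this commit; the comma-separated format and the helper name `parse_host_list` are assumptions made for this example, not confirmed parts of the SmartSim API.

```python
import typing as t


def parse_host_list(run_args: t.Mapping[str, t.Any]) -> t.List[str]:
    """Hypothetical helper: split a "host-list" run argument into hostnames.

    Assumes the value is a comma-separated string, which this commit
    does not confirm.
    """
    # Treat a missing or None value as "no hosts requested".
    raw = str(run_args.get("host-list") or "")
    # Drop surrounding whitespace and skip empty entries.
    return [host.strip() for host in raw.split(",") if host.strip()]


requested = parse_host_list({"host-list": "node1, node2"})
unspecified = parse_host_list({})
```

An empty result would correspond to the default "first available host" behavior described above.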

### Task Colocation

The prior system for finding nodes to execute a task worked only
with unassigned nodes. Any node assigned a task could not be assigned
another task.

This ticket adds a more capable prioritizer class that enables clients
using hostnames to colocate tasks. It also retains the capability to
return open nodes when no hostname is specified.
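
The behavior described above can be sketched with per-node reference counts. This is only an illustration under stated assumptions: the class `NodePrioritizerSketch` and its methods are hypothetical names, and the actual prioritizer added in `dragonBackend.py` is not shown on this page and may work differently.

```python
import typing as t


class NodePrioritizerSketch:
    """Hypothetical illustration; not the class added by this commit."""

    def __init__(self, hostnames: t.Iterable[str]) -> None:
        # Number of tasks currently assigned to each known host.
        self._ref_counts: t.Dict[str, int] = {h: 0 for h in hostnames}

    def next(self, hosts: t.Optional[t.Sequence[str]] = None) -> t.Optional[str]:
        """Pick a node for one task, or return None if none qualifies."""
        if hosts:
            # Hostnames given: busy nodes are allowed, enabling colocation.
            candidates = [h for h in hosts if h in self._ref_counts]
        else:
            # No hostnames given: only open (unassigned) nodes qualify.
            candidates = [h for h, n in self._ref_counts.items() if n == 0]
        if not candidates:
            return None
        # Prefer the least-loaded candidate; ties go to the earliest one.
        chosen = min(candidates, key=lambda h: self._ref_counts[h])
        self._ref_counts[chosen] += 1
        return chosen

    def release(self, host: str) -> None:
        """Record that one task on the given host has finished."""
        if self._ref_counts.get(host, 0) > 0:
            self._ref_counts[host] -= 1


prioritizer = NodePrioritizerSketch(["node1", "node2"])
first = prioritizer.next()            # open node
second = prioritizer.next()           # the other open node
none_left = prioritizer.next()        # no open nodes remain
colocated = prioritizer.next(["node1"])  # explicit host allows colocation
```

Tracking a count rather than a busy flag is what lets one node accept several tasks while the no-hostname path still returns only open nodes.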
ankona authored Aug 26, 2024
1 parent f7ef49b commit ef034d5
Showing 10 changed files with 1,542 additions and 203 deletions.
1 change: 1 addition & 0 deletions doc/changelog.md
@@ -13,6 +13,7 @@ Jump to:

Description

- Enable hostname selection for dragon tasks
- Remove pydantic dependency from MLI code
- Update MLI environment variables using new naming convention
- Reduce a copy by using torch.from_numpy instead of torch.tensor
225 changes: 173 additions & 52 deletions smartsim/_core/launcher/dragon/dragonBackend.py

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions smartsim/_core/launcher/dragon/dragonLauncher.py
@@ -170,6 +170,7 @@ def run(self, step: Step) -> t.Optional[str]:
merged_env = self._connector.merge_persisted_env(os.environ.copy())
nodes = int(run_args.get("nodes", None) or 1)
tasks_per_node = int(run_args.get("tasks-per-node", None) or 1)
hosts = str(run_args.get("host-list", ""))

policy = DragonRunPolicy.from_run_args(run_args)

@@ -187,6 +188,7 @@ def run(self, step: Step) -> t.Optional[str]:
output_file=out,
error_file=err,
policy=policy,
hostlist=hosts,
)
),
DragonRunResponse,