Section | Description |
---|---|
Overview | Overview of the CLI. |
Usage | Usage Instructions. |
The AXLearn CLI provides utilities for launching and managing resources across different cloud environments.
Taking inspiration from Python virtual environments, the CLI provides functionality to "activate" different cloud environments, allowing users to manage resources in these environments seamlessly.
As a motivating example, suppose that users have access to two different TPU clusters tpu-cluster1
and tpu-cluster2
, possibly across different availability zones.
A common usage pattern may look like:
# Authenticate to the cloud provider.
axlearn gcp auth
# "Activate" `tpu-cluster1` configuration.
axlearn gcp config activate --label=tpu-cluster1
# Launch jobs to `tpu-cluster1`.
axlearn gcp launch ...
# "Activate" `tpu-cluster2` configuration.
axlearn gcp config activate --label=tpu-cluster2
# Launch jobs to `tpu-cluster2` with the same launch command.
axlearn gcp launch ...
Users therefore do not need to worry about the underlying configuration details of each cluster.
While the above example uses the gcp
commands, the CLI is agnostic to the underlying cloud provider and can support arbitrary configurations.
To setup the CLI, we'll need to first create a config file under at least one of the following paths:
.axlearn/axlearn.default.config
in the current working directory, or.axlearn/.axlearn.config
in the current working directory, or~/.axlearn.config
in your home directory.
To create the config, you can copy from the template config from the root of the repo:
cp .axlearn/axlearn.default.config .axlearn/.axlearn.config
Tip: In an organization setting, you can directly modify
.axlearn/axlearn.default.config
, which can be stored in the root of your git repository. This way, developers will automatically pick up the latest CLI configurations on each git pull.
Tip: You can always run
axlearn gcp config cleanup
to delete all AXLearn config files from your system.
Here's a sample config file for launching v4-tpu
s in us-central2-b
, under the project my-gcp-project
.
You may recognize it as a toml
file:
[gcp."my-gcp-project:us-central2-b"]
# Basic project configs.
project = "my-gcp-project"
zone = "us-central2-b"
network = "projects/my-gcp-project/global/networks/default"
subnetwork = "projects/my-gcp-project/regions/us-central2/subnetworks/default"
# Used when launching VMs and TPUs.
service_account_email = "[email protected]"
# Used for permanent artifacts like checkpoints. Should be writable by users who intend to launch jobs.
permanent_bucket = "public-permanent-us-central2"
# Used for private artifacts, like quota files. Should be readable by users who intend to launch jobs.
private_bucket = "private-permanent-us-central2"
# Used for temporary artifacts, like logs. Should be writable by users who intend to launch jobs.
ttl_bucket = "ttl-30d-us-central2"
# (Optional) Used by the AXLearn CLI.
labels = "v4-tpu"
# (Optional) Used for pushing docker images.
docker_repo = "us-docker.pkg.dev/my-gcp-project/axlearn"
# (Optional) Configure whether to use on-demand or reserved TPUs.
reserved_tpu = true
# (Optional) Configure a default Dockerfile to use when launching jobs with docker.
default_dockerfile = "Dockerfile"
# (Optional) Enable VertexAI Tensorboard support during training.
vertexai_tensorboard = "1231231231231231231"
vertexai_region = "us-central1"
To confirm that the CLI can locate your config file, run:
# Lists all environments that the CLI is aware of.
$ axlearn gcp config list
No GCP project has been activated; please run `axlearn gcp config activate`.
Found default config at /path/to/axlearn/.axlearn/axlearn.default.config
Found user config at /path/to/axlearn/.axlearn/.axlearn.config
[ ] my-gcp-project:us-central2-b [v4-tpu]
As the output indicates, we have not yet activated a project. To do so, run:
# Activate a specific environment.
$ axlearn gcp config activate
...
Setting my-gcp-project:us-central2-b to active.
Configs written to /path/to/axlearn/.axlearn/.axlearn.config
You can also directly target a config by specifying --label
:
# Activate the environment with label "v4-tpu".
axlearn gcp config activate --label=v4-tpu
In this case we only have one config, so --label
is redundant.
The CLI is structured as a tree and is intended to be self-documenting.
The tree can be traversed simply by invoking the CLI with --help
.
For instance, running the root command with --help
prints available sub-commands:
$ axlearn --help
usage: axlearn [-h] [--helpfull] {gcp} ...
AXLearn: An Extensible Deep Learning Library.
positional arguments:
{gcp}
options:
-h, --help show this help message and exit
--helpfull show full help message and exit
We can traverse the tree by running a subcommand with --help
.
$ axlearn gcp --help
usage: axlearn gcp [-h] [--helpfull] [--project PROJECT] [--zone ZONE]
{config,sshvm,sshtpu,bundle,launch,tpu,vm,bastion,dataflow,auth} ...
positional arguments:
{config,sshvm,sshtpu,bundle,launch,tpu,vm,bastion,dataflow,auth}
config Configure GCP settings
sshvm SSH into a VM
sshtpu SSH into a TPU-VM
bundle Bundle the local directory
launch Launch arbitrary commands on remote compute
tpu Create a TPU-VM and execute the given command on it
vm Create a VM and execute the given command on it
bastion Launch jobs through Bastion orchestrator
dataflow Run Dataflow jobs locally or on GCP
auth Authenticate to GCP
options:
-h, --help show this help message and exit
--helpfull show full help message and exit
--project PROJECT
--zone ZONE
The leaves of the command tree may be implementation dependent. Typically, they correspond to an abseil-py module.1
For example, axlearn gcp launch
simply maps to:
https://github.com/apple/axlearn/blob/204f3de2650c098410749d078195e4f8db96d6f6/axlearn/cli/gcp.py#L56-L60
In some cases, they can map to shell commands. For example, axlearn gcp auth
simply maps to:
https://github.com/apple/axlearn/blob/204f3de2650c098410749d078195e4f8db96d6f6/axlearn/cli/gcp.py#L86-L96
In general, when in doubt, run --help
.
Footnotes
-
You can learn more about how abseil flags work from their Python Devguide. ↩