Testing bso examples on CloudLab and EKS (#54)
Co-authored-by: Jae-Won Chung <[email protected]>
show981111 and jaywonchung authored Apr 29, 2024
1 parent 244fb61 commit a25a533
Showing 11 changed files with 65 additions and 31 deletions.
6 changes: 3 additions & 3 deletions docker/batch_size_optimizer/migration.Dockerfile
@@ -7,6 +7,6 @@ ADD . /workspace
# For sqlite
# RUN pip install --no-cache-dir aiosqlite

-# For mysql
-RUN pip install --no-cache-dir asyncmy
-RUN pip install --no-cache-dir '.[migration]'
+# For mysql, we need asyncmy and cryptography (for sha256_password)
+RUN pip install --no-cache-dir asyncmy cryptography
+RUN pip install --no-cache-dir '.[migration]'
6 changes: 5 additions & 1 deletion docker/batch_size_optimizer/server-docker-compose.yaml
@@ -51,6 +51,9 @@ services:
retries: 10
start_period: 2s
start_interval: 1s
labels:
kompose.image-pull-policy: IfNotPresent
kompose.volume.type: "emptyDir" # For testing and debugging. In production, use the cloud provider's block storage, e.g., an Amazon Elastic Block Store volume.

migration:
image: bso-migration
@@ -74,9 +77,10 @@ services:
- servernet
volumes:
# mount version scripts we generated.
-  - ./zeus/optimizer/batch_size/migrations/versions:/workspace/zeus/optimizer/batch_size/migrations/versions
+  - ../../zeus/optimizer/batch_size/migrations/versions:/workspace/zeus/optimizer/batch_size/migrations/versions
labels:
kompose.image-pull-policy: IfNotPresent
kompose.volume.type: "emptyDir"


networks:
4 changes: 2 additions & 2 deletions docker/batch_size_optimizer/server.Dockerfile
@@ -8,7 +8,7 @@ ADD . /workspace
# RUN pip install --no-cache-dir aiosqlite

# For mysql
-RUN pip install --no-cache-dir asyncmy
-RUN pip install --no-cache-dir '.[bso-server]'
+RUN pip install --no-cache-dir asyncmy cryptography
+RUN pip install --no-cache-dir '.[bso-server]'

CMD ["uvicorn", "zeus.optimizer.batch_size.server.router:app", "--host", "0.0.0.0", "--port", "80"]
15 changes: 9 additions & 6 deletions docs/batch_size_optimizer/index.md
@@ -41,7 +41,7 @@ sequenceDiagram;
git clone https://github.com/ml-energy/zeus.git
```

-2. Create `.env` under `/docker`. An example of `.env` is provided below.
+2. Create `.env` under `/docker/batch_size_optimizer`. An example of `.env` is provided below.

By default, we use MySQL for the database.

@@ -62,8 +62,8 @@ sequenceDiagram;
- Using docker-compose

```Shell
-cd docker
-docker-compose -f ./docker/docker-compose.yaml up
+cd docker/batch_size_optimizer
+docker-compose -f ./server-docker-compose.yaml up
```

This will build an image for each container (db, migration, and the server) and then spin up those containers.
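Once the containers are up, a quick sanity check (a sketch; the host port is a placeholder — use whatever port `server-docker-compose.yaml` publishes for the server, and `/docs` assumes the server exposes FastAPI's interactive docs) is:

```Shell
# Confirm the db, migration, and server containers are up.
docker-compose -f ./server-docker-compose.yaml ps
# Hit the server; replace the port with the published host port if it differs.
curl -i http://localhost:80/docs
```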
@@ -78,14 +78,17 @@ sequenceDiagram;
docker build -f ./docker/batch_size_optimizer/migration.Dockerfile -t bso-migration .
```

If you are using `minikube`, you do not need to change anything in `server-docker-compose.yaml`. However, if you are using a cloud provider such as AWS, push the images to a registry and modify the image paths in `server-docker-compose.yaml` so the images are pulled from the registry instead of the local machine.
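For example, with AWS you might push the images to ECR along these lines (a sketch; the account ID, region, and repository names are placeholders):

```Shell
# Log in to ECR (account ID and region below are placeholders).
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
# Tag and push the images built above.
docker tag bso-server 123456789012.dkr.ecr.us-east-1.amazonaws.com/bso-server:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/bso-server:latest
docker tag bso-migration 123456789012.dkr.ecr.us-east-1.amazonaws.com/bso-migration:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/bso-migration:latest
```

Afterwards, point the `image:` fields in `server-docker-compose.yaml` at the pushed image URIs.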

2. Create Kubernetes yaml files using Kompose. Kompose is a tool that converts docker-compose files into Kubernetes files. For more information, visit [Kompose Reference](#kompose-references).

```Shell
-cd docker
-docker-compose config > docker-compose-resolved.yaml && kompose convert -f docker-compose-resolved.yaml -o ./kube/ && rm docker-compose-resolved.yaml
+cd docker/batch_size_optimizer
+docker-compose -f server-docker-compose.yaml config > server-docker-compose-resolved.yaml && kompose convert -f server-docker-compose-resolved.yaml -o ./kube/ && rm server-docker-compose-resolved.yaml
```

-It first resolves env files using docker-compose, then creates Kubernetes yaml files under `docker/kube/`
+It first resolves the env files using docker-compose, then creates Kubernetes yaml files under `./kube/`.
+Note that the generated Kubernetes yaml files use `emptyDir` for the persistent volume. In production, configure the volume type you actually want to use.
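To see which generated manifests you would need to edit for production storage, you can search the Kompose output (a sketch):

```Shell
# List every generated manifest that declares an emptyDir volume.
grep -rn "emptyDir" ./kube/
```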

3. Run on Kubernetes.

14 changes: 11 additions & 3 deletions examples/batch_size_optimizer/README.md
@@ -11,20 +11,28 @@ Refer to the `examples/batch_size_optimizer/mnist_dp.py` for the use case.

Kubeflow is a tool for easily deploying your ML workflows to Kubernetes. We provide some examples of using Kubeflow with Zeus. In order to run your training in Kubeflow with Zeus, follow `docs/batch_size_optimizer/server.md` to deploy the batch size optimizer to Kubernetes. After that, you can deploy your training script using Kubeflow.

-1. Install kubeflow training operator.
+1. Set up Kubernetes and install the Kubeflow training operator.

Refer to [minikube](https://minikube.sigs.k8s.io/docs/start/) for local Kubernetes development.
Refer to [Kubeflow training operator](https://github.com/kubeflow/training-operator) for how to install Kubeflow.
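For example, one common way to install the training operator is from its standalone kustomize overlay (a sketch; the release tag is illustrative, so check the training-operator repository for the version you want):

```Shell
# Install the Kubeflow training operator (the ref tag is a placeholder).
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
# The training-operator pod should become Ready.
kubectl get pods -n kubeflow
```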

-2. Build mnist example docker image.
+2. Run the batch size optimizer server on Kubernetes.

Refer to the [Quick start](../../docs/batch_size_optimizer/index.md) guide to start the server.

3. Build mnist example docker image.

```Shell
# From project root directory
docker build -f ./examples/batch_size_optimizer/mnist.Dockerfile -t mnist-example .
```

-3. Deploy training script.
If you are using a cloud provider such as AWS, modify `image` and `imagePullPolicy` in `mnist_dp.yaml` so the image is pulled from the corresponding registry.

+4. Deploy training script.

```Shell
cd examples/batch_size_optimizer
kubectl apply -f mnist_dp.yaml # For distributed training example
kubectl apply -f mnist_single_gpu.yaml # For single gpu training example
```
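After applying, you can watch the training with standard kubectl commands (a sketch; the pod name is a placeholder, and `pytorchjobs` assumes the Kubeflow training operator CRDs are installed):

```Shell
kubectl get pytorchjobs        # check job status
kubectl get pods -w            # wait for the worker pods to start
kubectl logs -f <worker-pod>   # follow a worker's training log (placeholder name)
```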
14 changes: 11 additions & 3 deletions examples/batch_size_optimizer/mnist_dp.py
@@ -168,6 +168,12 @@ def main():
choices=[dist.Backend.GLOO, dist.Backend.NCCL, dist.Backend.MPI],
default=dist.Backend.GLOO,
)
parser.add_argument(
"--target-accuracy",
type=float,
default=0.5,
help="Target accuracy (default: 0.5)",
)

args = parser.parse_args()
use_cuda = not args.no_cuda and torch.cuda.is_available()
@@ -226,7 +232,8 @@ def main():
job_id_prefix="mnist-dev",
default_batch_size=256,
batch_sizes=[32, 64, 256, 512, 1024, 4096, 2048],
-max_epochs=5,
+max_epochs=args.epochs,
+target_metric=args.target_accuracy,
),
rank=rank,
)
@@ -259,10 +266,11 @@ def main():
print("Rank", dist.get_rank())

dist.broadcast(bs_trial_tensor, src=0)
-batch_size = bs_trial_tensor[0].item() // world_size
+bso.current_batch_size = bs_trial_tensor[0].item()
+bso.trial_number = bs_trial_tensor[1].item()
+batch_size = bso.current_batch_size // world_size

print(f"Batach_size to use for gpu[{rank}]: {batch_size}")
print(f"Batach_size to use for gpu[{rank}]: {batch_size}, trial number: {bso.trial_number}")
callbacks = CallbackSet(callback_set)

########################### ZEUS INIT END ###########################
8 changes: 6 additions & 2 deletions examples/batch_size_optimizer/mnist_dp.yaml
@@ -19,11 +19,13 @@ spec:
- "/workspace/examples/batch_size_optimizer/mnist_dp.py"
- "--epochs=5"
- "--backend=nccl"
# Adjust target accuracy if you want to test failure
- "--target-accuracy=0.2"
env:
- name: ZEUS_SERVER_URL
value: "http://server:80"
- name: ZEUS_JOB_ID
value: "mnist-dev-dp-2"
value: "mnist-dev-dp-1"
securityContext:
capabilities:
add: ["SYS_ADMIN"]
@@ -41,11 +43,13 @@ spec:
- "/workspace/examples/batch_size_optimizer/mnist_dp.py"
- "--epochs=5"
- "--backend=nccl"
# Adjust target accuracy if you want to test failure
- "--target-accuracy=0.2"
env:
- name: ZEUS_SERVER_URL
value: "http://server:80"
- name: ZEUS_JOB_ID
value: "mnist-dev-dp-2"
value: "mnist-dev-dp-1"
securityContext:
capabilities:
add: ["SYS_ADMIN"]
13 changes: 10 additions & 3 deletions examples/batch_size_optimizer/mnist_single_gpu.py
@@ -162,6 +162,12 @@ def main():
metavar="L",
help="directory where summary logs are stored",
)
parser.add_argument(
"--target-accuracy",
type=float,
default=0.5,
help="Target accuracy (default: 0.5)",
)
if dist.is_available():
parser.add_argument(
"--backend",
@@ -199,12 +205,13 @@ def main():
job_id_prefix="mnist-dev",
default_batch_size=256,
batch_sizes=[32, 64, 256, 512, 1024, 4096, 2048],
-max_epochs=5
+max_epochs=args.epochs,
+target_metric=args.target_accuracy,
),
)
# Get batch size from bso
batch_size = bso.get_batch_size()
print("Chosen batach_size:", batch_size)
print("Chosen batach_size:", batch_size, "Trial number:", bso.trial_number)

##################### ZEUS INIT END ##########################
train_loader = torch.utils.data.DataLoader(
@@ -252,7 +259,7 @@ def main():
plo.on_epoch_begin()
train(args, model, device, train_loader, optimizer, epoch, writer, plo)
plo.on_epoch_end()
-acc = test(args, model, device, test_loader, writer, epoch, bso)
+acc = test(args, model, device, test_loader, writer, epoch)
bso.on_evaluate(acc)

################### ZEUS OPTIMIZER USAGE END #########################
2 changes: 1 addition & 1 deletion zeus/optimizer/batch_size/client.py
@@ -196,7 +196,7 @@ def on_evaluate(
self.training_finished = True
if not parsed_response.converged:
raise ZeusBSOTrainFailError(
f"Train failed: {parsed_response.message}. This batch size will not be selected again. Please re-launch the training"
f"Train failed: {parsed_response.message} This batch size will not be selected again. Please re-launch the training"
)

def _handle_response(self, res: httpx.Response) -> None:
8 changes: 4 additions & 4 deletions zeus/optimizer/batch_size/server/explorer.py
@@ -118,9 +118,9 @@ async def next_batch_size(
batch_sizes,
)

-    if len(batch_sizes) == 0:
-        raise ZeusBSOServerRuntimeError(
-            "No converged batch sizes has observed. Reconfigure batch_sizes and re-launch the job."
-        )
+if len(batch_sizes) == 0:
+    raise ZeusBSOServerRuntimeError(
+        "No converged batch sizes has observed. Reconfigure batch_sizes and re-launch the job."
+    )
# After going through pruning rounds, we couldn't find the bs. Should go to MAB stage, so return good batch_sizes.
return sorted(batch_sizes)
6 changes: 3 additions & 3 deletions zeus/optimizer/batch_size/server/optimizer.py
@@ -165,6 +165,9 @@ async def report(self, result: TrainingResult) -> ReportResponse:
)
)

+if trial is None:
+    raise ZeusBSOServiceBadOperationError(f"Unknown trial {result}")

if trial.status != TrialStatus.Dispatched:
# result is already reported
return ReportResponse(
@@ -173,9 +176,6 @@
message=f"Result for this trial({trial.trial_number}) is already reported.",
)

-if trial is None:
-    raise ZeusBSOServiceBadOperationError(f"Unknown trial {result}")

if job.beta_knob is not None and job.min_cost is not None: # Early stop enabled
cost_ub = job.beta_knob * job.min_cost

