Testing bso examples on CloudLab and EKS (#54)
Co-authored-by: Jae-Won Chung <[email protected]>
show981111 and jaywonchung authored Apr 29, 2024
1 parent 244fb61 commit a25a533
Showing 11 changed files with 65 additions and 31 deletions.
6 changes: 3 additions & 3 deletions docker/batch_size_optimizer/migration.Dockerfile
@@ -7,6 +7,6 @@ ADD . /workspace
# For sqlite
# RUN pip install --no-cache-dir aiosqlite

-# For mysql
-RUN pip install --no-cache-dir asyncmy
-RUN pip install --no-cache-dir '.[migration]'
+# For mysql, we need asyncmy and cryptography (for sha256_password)
+RUN pip install --no-cache-dir asyncmy cryptography
+RUN pip install --no-cache-dir '.[migration]'
6 changes: 5 additions & 1 deletion docker/batch_size_optimizer/server-docker-compose.yaml
@@ -51,6 +51,9 @@ services:
retries: 10
start_period: 2s
start_interval: 1s
labels:
kompose.image-pull-policy: IfNotPresent
kompose.volume.type: "emptyDir" # For testing and debugging. In production, use the cloud provider's block storage, e.g., an Amazon Elastic Block Store volume.

migration:
image: bso-migration
@@ -74,9 +77,10 @@ services:
- servernet
volumes:
# mount version scripts we generated.
-  - ./zeus/optimizer/batch_size/migrations/versions:/workspace/zeus/optimizer/batch_size/migrations/versions
+  - ../../zeus/optimizer/batch_size/migrations/versions:/workspace/zeus/optimizer/batch_size/migrations/versions
labels:
kompose.image-pull-policy: IfNotPresent
kompose.volume.type: "emptyDir"


networks:
4 changes: 2 additions & 2 deletions docker/batch_size_optimizer/server.Dockerfile
@@ -8,7 +8,7 @@ ADD . /workspace
# RUN pip install --no-cache-dir aiosqlite

# For mysql
-RUN pip install --no-cache-dir asyncmy
-RUN pip install --no-cache-dir '.[bso-server]'
+RUN pip install --no-cache-dir asyncmy cryptography
+RUN pip install --no-cache-dir '.[bso-server]'

CMD ["uvicorn", "zeus.optimizer.batch_size.server.router:app", "--host", "0.0.0.0", "--port", "80"]
15 changes: 9 additions & 6 deletions docs/batch_size_optimizer/index.md
@@ -41,7 +41,7 @@ sequenceDiagram;
git clone https://github.com/ml-energy/zeus.git
```

-2. Create `.env` under `/docker`. An example of `.env` is provided below.
+2. Create `.env` under `/docker/batch_size_optimizer`. An example of `.env` is provided below.

By default, we use MySQL for the database.

@@ -62,8 +62,8 @@ sequenceDiagram;
- Using docker-compose

```Shell
-cd docker
-docker-compose -f ./docker/docker-compose.yaml up
+cd docker/batch_size_optimizer
+docker-compose -f ./server-docker-compose.yaml up
```

This will build an image for each container (db, migration, and the server) and then spin up those containers.
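Once the containers are up, a quick sanity check (a sketch; the host port is a placeholder — use whatever port `server-docker-compose.yaml` publishes for the server, and `/docs` assumes the server exposes FastAPI's interactive docs) is:

```Shell
# Confirm the db, migration, and server containers are up.
docker-compose -f ./server-docker-compose.yaml ps
# Hit the server; replace the port with the published host port if it differs.
curl -i http://localhost:80/docs
```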
@@ -78,14 +78,17 @@ sequenceDiagram;
docker build -f ./docker/batch_size_optimizer/migration.Dockerfile -t bso-migration .
```

If you are using `minikube`, you do not need to change anything in `server-docker-compose.yaml`. However, if you are using a cloud provider such as AWS, push the images to a registry and modify the image paths in `server-docker-compose.yaml` so the images are pulled from the registry instead of the local machine.
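For example, with AWS you might push the images to ECR along these lines (a sketch; the account ID, region, and repository names are placeholders):

```Shell
# Log in to ECR (account ID and region below are placeholders).
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
# Tag and push the images built above.
docker tag bso-server 123456789012.dkr.ecr.us-east-1.amazonaws.com/bso-server:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/bso-server:latest
docker tag bso-migration 123456789012.dkr.ecr.us-east-1.amazonaws.com/bso-migration:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/bso-migration:latest
```

Afterwards, point the `image:` fields in `server-docker-compose.yaml` at the pushed image URIs.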

2. Create Kubernetes yaml files using Kompose. Kompose is a tool that converts docker-compose files into Kubernetes files. For more information, visit [Kompose Reference](#kompose-references).

```Shell
-cd docker
-docker-compose config > docker-compose-resolved.yaml && kompose convert -f docker-compose-resolved.yaml -o ./kube/ && rm docker-compose-resolved.yaml
+cd docker/batch_size_optimizer
+docker-compose -f server-docker-compose.yaml config > server-docker-compose-resolved.yaml && kompose convert -f server-docker-compose-resolved.yaml -o ./kube/ && rm server-docker-compose-resolved.yaml
```

-It first resolves env files using docker-compose, then creates Kubernetes yaml files under `docker/kube/`
+It first resolves the env files using docker-compose, then creates Kubernetes yaml files under `./kube/`.
+Note that the generated Kubernetes yaml files use `emptyDir` for the persistent volume. In production, configure the volume type you actually want to use.
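To see which generated manifests you would need to edit for production storage, you can search the Kompose output (a sketch):

```Shell
# List every generated manifest that declares an emptyDir volume.
grep -rn "emptyDir" ./kube/
```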

3. Run on Kubernetes.

14 changes: 11 additions & 3 deletions examples/batch_size_optimizer/README.md
@@ -11,20 +11,28 @@ Refer to the `examples/batch_size_optimizer/mnist_dp.py` for the use case.

Kubeflow is a tool for easily deploying your ML workflows to Kubernetes. We provide some examples of using Kubeflow with Zeus. In order to run your training in Kubeflow with Zeus, follow `docs/batch_size_optimizer/server.md` to deploy the batch size optimizer to Kubernetes. After that, you can deploy your training script using Kubeflow.

-1. Install kubeflow training operator.
+1. Set up Kubernetes and install the Kubeflow training operator.

Refer to [minikube](https://minikube.sigs.k8s.io/docs/start/) for local Kubernetes development.
Refer to [Kubeflow training operator](https://github.com/kubeflow/training-operator) for how to install Kubeflow.
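For example, one common way to install the training operator is from its standalone kustomize overlay (a sketch; the release tag is illustrative, so check the training-operator repository for the version you want):

```Shell
# Install the Kubeflow training operator (the ref tag is a placeholder).
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
# The training-operator pod should become Ready.
kubectl get pods -n kubeflow
```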

-2. Build mnist example docker image.
+2. Run the batch size optimizer server on Kubernetes.

Refer to the [Quick start](../../docs/batch_size_optimizer/index.md) guide to start the server.

3. Build mnist example docker image.

```Shell
# From project root directory
docker build -f ./examples/batch_size_optimizer/mnist.Dockerfile -t mnist-example .
```

-3. Deploy training script.
If you are using a cloud provider such as AWS, modify `image` and `imagePullPolicy` in `mnist_dp.yaml` so the image is pulled from the corresponding registry.

+4. Deploy training script.

```Shell
cd examples/batch_size_optimizer
kubectl apply -f mnist_dp.yaml # For distributed training example
kubectl apply -f mnist_single_gpu.yaml # For single gpu training example
```
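After applying, you can watch the training with standard kubectl commands (a sketch; the pod name is a placeholder, and `pytorchjobs` assumes the Kubeflow training operator CRDs are installed):

```Shell
kubectl get pytorchjobs        # check job status
kubectl get pods -w            # wait for the worker pods to start
kubectl logs -f <worker-pod>   # follow a worker's training log (placeholder name)
```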
14 changes: 11 additions & 3 deletions examples/batch_size_optimizer/mnist_dp.py
@@ -168,6 +168,12 @@ def main():
choices=[dist.Backend.GLOO, dist.Backend.NCCL, dist.Backend.MPI],
default=dist.Backend.GLOO,
)
parser.add_argument(
"--target-accuracy",
type=float,
default=0.5,
help="Target accuracy (default: 0.5)",
)

args = parser.parse_args()
use_cuda = not args.no_cuda and torch.cuda.is_available()
@@ -226,7 +232,8 @@ def main():
job_id_prefix="mnist-dev",
default_batch_size=256,
batch_sizes=[32, 64, 256, 512, 1024, 4096, 2048],
-max_epochs=5,
+max_epochs=args.epochs,
+target_metric=args.target_accuracy,
),
rank=rank,
)
@@ -259,10 +266,11 @@ def main():
print("Rank", dist.get_rank())

dist.broadcast(bs_trial_tensor, src=0)
-batch_size = bs_trial_tensor[0].item() // world_size
+bso.current_batch_size = bs_trial_tensor[0].item()
+bso.trial_number = bs_trial_tensor[1].item()
+batch_size = bso.current_batch_size // world_size

print(f"Batach_size to use for gpu[{rank}]: {batch_size}")
print(f"Batach_size to use for gpu[{rank}]: {batch_size}, trial number: {bso.trial_number}")
callbacks = CallbackSet(callback_set)

########################### ZEUS INIT END ###########################
8 changes: 6 additions & 2 deletions examples/batch_size_optimizer/mnist_dp.yaml
@@ -19,11 +19,13 @@ spec:
- "/workspace/examples/batch_size_optimizer/mnist_dp.py"
- "--epochs=5"
- "--backend=nccl"
# Adjust target accuracy if you want to test failure
- "--target-accuracy=0.2"
env:
- name: ZEUS_SERVER_URL
value: "http://server:80"
- name: ZEUS_JOB_ID
value: "mnist-dev-dp-2"
value: "mnist-dev-dp-1"
securityContext:
capabilities:
add: ["SYS_ADMIN"]
@@ -41,11 +43,13 @@ spec:
- "/workspace/examples/batch_size_optimizer/mnist_dp.py"
- "--epochs=5"
- "--backend=nccl"
# Adjust target accuracy if you want to test failure
- "--target-accuracy=0.2"
env:
- name: ZEUS_SERVER_URL
value: "http://server:80"
- name: ZEUS_JOB_ID
value: "mnist-dev-dp-2"
value: "mnist-dev-dp-1"
securityContext:
capabilities:
add: ["SYS_ADMIN"]
13 changes: 10 additions & 3 deletions examples/batch_size_optimizer/mnist_single_gpu.py
@@ -162,6 +162,12 @@ def main():
metavar="L",
help="directory where summary logs are stored",
)
parser.add_argument(
"--target-accuracy",
type=float,
default=0.5,
help="Target accuracy (default: 0.5)",
)
if dist.is_available():
parser.add_argument(
"--backend",
@@ -199,12 +205,13 @@ def main():
job_id_prefix="mnist-dev",
default_batch_size=256,
batch_sizes=[32, 64, 256, 512, 1024, 4096, 2048],
-max_epochs=5
+max_epochs=args.epochs,
+target_metric=args.target_accuracy,
),
)
# Get batch size from bso
batch_size = bso.get_batch_size()
print("Chosen batach_size:", batch_size)
print("Chosen batach_size:", batch_size, "Trial number:", bso.trial_number)

##################### ZEUS INIT END ##########################
train_loader = torch.utils.data.DataLoader(
@@ -252,7 +259,7 @@ def main():
plo.on_epoch_begin()
train(args, model, device, train_loader, optimizer, epoch, writer, plo)
plo.on_epoch_end()
-acc = test(args, model, device, test_loader, writer, epoch, bso)
+acc = test(args, model, device, test_loader, writer, epoch)
bso.on_evaluate(acc)

################### ZEUS OPTIMIZER USAGE END #########################
2 changes: 1 addition & 1 deletion zeus/optimizer/batch_size/client.py
@@ -196,7 +196,7 @@ def on_evaluate(
self.training_finished = True
if not parsed_response.converged:
raise ZeusBSOTrainFailError(
f"Train failed: {parsed_response.message}. This batch size will not be selected again. Please re-launch the training"
f"Train failed: {parsed_response.message} This batch size will not be selected again. Please re-launch the training"
)

def _handle_response(self, res: httpx.Response) -> None:
8 changes: 4 additions & 4 deletions zeus/optimizer/batch_size/server/explorer.py
@@ -118,9 +118,9 @@ async def next_batch_size(
batch_sizes,
)

-    if len(batch_sizes) == 0:
-        raise ZeusBSOServerRuntimeError(
-            "No converged batch sizes has observed. Reconfigure batch_sizes and re-launch the job."
-        )
+if len(batch_sizes) == 0:
+    raise ZeusBSOServerRuntimeError(
+        "No converged batch sizes has observed. Reconfigure batch_sizes and re-launch the job."
+    )
# After going through pruning rounds, we couldn't find the bs. Should go to MAB stage, so return good batch_sizes.
return sorted(batch_sizes)
6 changes: 3 additions & 3 deletions zeus/optimizer/batch_size/server/optimizer.py
@@ -165,6 +165,9 @@ async def report(self, result: TrainingResult) -> ReportResponse:
)
)

+if trial is None:
+    raise ZeusBSOServiceBadOperationError(f"Unknown trial {result}")

if trial.status != TrialStatus.Dispatched:
# result is already reported
return ReportResponse(
@@ -173,9 +176,6 @@
message=f"Result for this trial({trial.trial_number}) is already reported.",
)

-if trial is None:
-    raise ZeusBSOServiceBadOperationError(f"Unknown trial {result}")

if job.beta_knob is not None and job.min_cost is not None: # Early stop enabled
cost_ub = job.beta_knob * job.min_cost

