This is the setup for an experiment comparing fluence against the default scheduler. We will use the instance type that we previously got working for LAMMPS.
Note that for testing, you can create a cluster with kind. This will give you a control plane and three worker nodes (four nodes total).
kind create cluster --config ./crd/kind-config.yaml
kubectl get nodes
NAME STATUS ROLES AGE VERSION
kind-control-plane Ready control-plane 78s v1.27.3
kind-worker Ready <none> 54s v1.27.3
kind-worker2 Ready <none> 54s v1.27.3
kind-worker3 Ready <none> 55s v1.27.3
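For reference, a kind config along these lines gives you one control plane and three workers (a sketch only; the actual ./crd/kind-config.yaml may differ):
# Sketch of a kind config with one control plane and three workers
# (the real ./crd/kind-config.yaml may differ).
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
  - role: worker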
I found that once I needed to do multiple real runs, I needed a real cluster. After prototyping, you can create a cluster of c2d-standard-8 nodes at size 4. Note that I'm leaving out the network optimization; we will follow these best practices.
GOOGLE_PROJECT=myproject
gcloud container clusters create test-cluster \
--threads-per-core=1 \
--placement-type=COMPACT \
--num-nodes=5 \
--region=us-central1-a \
--project=${GOOGLE_PROJECT} \
--machine-type=c2d-standard-8
We are going to follow the instructions from the branch here to build Fluence (I'll push to my own registry for now) and then install Fluence.
git clone -b modular-fluence-build https://github.com/flux-framework/flux-k8s.git
cd ./flux-k8s
# Build the custom images
make prepare
make build REGISTRY=vanessa
make build-sidecar REGISTRY=vanessa
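# Push the images to your own registry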
docker push vanessa/fluence
docker push vanessa/fluence-sidecar
cd upstream/manifests/install/charts
helm install \
--set scheduler.image=vanessa/fluence:latest \
--set scheduler.sidecarimage=vanessa/fluence-sidecar:latest \
scheduler-plugins as-a-second-scheduler/
Ensure both pods are running:
kubectl get pods
NAME READY STATUS RESTARTS AGE
fluence-757fdcd854-cbqn2 2/2 Running 0 24s
scheduler-plugins-controller-9f778469-c5wg9 1/1 Running 0 24s
You can check the logs for the fluence pod to see the sidecar container (which runs the fluxion gRPC service for fluence) and the main scheduler-plugins container (which should primarily show health checks).
kubectl logs fluence-757fdcd854-cbqn2
Defaulted container "sidecar" out of: sidecar, scheduler-plugins-scheduler
This is the fluxion grpc server
Created cli context &{}
&{}
Number nodes 4
node in flux group gke-test-cluster-default-pool-92dbd7e9-bgv5
Node gke-test-cluster-default-pool-92dbd7e9-bgv5 flux cpu 3
Node gke-test-cluster-default-pool-92dbd7e9-bgv5 total mem 29207936768
node in flux group gke-test-cluster-default-pool-92dbd7e9-h4sc
Node gke-test-cluster-default-pool-92dbd7e9-h4sc flux cpu 3
Node gke-test-cluster-default-pool-92dbd7e9-h4sc total mem 29202693888
node in flux group gke-test-cluster-default-pool-92dbd7e9-ln89
Node gke-test-cluster-default-pool-92dbd7e9-ln89 flux cpu 3
Node gke-test-cluster-default-pool-92dbd7e9-ln89 total mem 29298037248
node in flux group gke-test-cluster-default-pool-92dbd7e9-xzb4
Node gke-test-cluster-default-pool-92dbd7e9-xzb4 flux cpu 3
Node gke-test-cluster-default-pool-92dbd7e9-xzb4 total mem 29020379008
Can request at most 12 exclusive cpu
Match policy: {"matcher_policy": "lonode"}
[GRPCServer] gRPC Listening on [::]:4242
And you should see health checks here:
kubectl logs fluence-757fdcd854-cbqn2 -c scheduler-plugins-scheduler
Now let's install the Flux Operator from the development branch.
kubectl apply -f https://raw.githubusercontent.com/flux-framework/flux-operator/test-refactor-modular/examples/dist/flux-operator-refactor.yaml
- TODO: check whether a PodGroup needs to be unique to a specific set of pods (it seems so).
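As a reminder of how a workload ends up on fluence instead of the default scheduler: the scheduler is selected per pod via spec.schedulerName. A minimal, hypothetical pod spec (not from this repository, and assuming the scheduler is registered under the chart's default name, fluence) looks like this:
# Hypothetical example: a pod that sets spec.schedulerName to fluence is
# scheduled by fluence; pods that omit the field use the default scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: fluence-scheduled-example
spec:
  schedulerName: fluence
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]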
We've already tested that the examples work here in run2, so let's go right into running experiments with Fluence. We will use a modified run_experiments.py that templates a lammps.yaml file. First, create a Python virtual environment (however you prefer) and install dependencies:
pip install -r requirements.txt
The templates in crd will be used to run experiments; specifically, we expect to find a lammps.yaml there. Let's run experiments. Here are a few examples. Note that memory is in GiB, and we set --outdir to keep experiment results separate. Also note that fluence doesn't seem to be working yet; we need to merge in the current work and come back here to test again.
# Prototype with default scheduler
python run_experiments.py --cpus 24 --memory 192 --outdir ./results/test-six --config-name lammps-six
▶️ Output directory: /home/vanessa/Desktop/Code/operator-experiments/google/scheduler/run3/results/test-six
▶️ Memory per node: 192
▶️ CPUs per node: 24
▶️ Using Fluence: False
▶️ Config name: lammps-six
▶️ Iterations: 10
Would you like to continue? (yes/no)? yes
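For context, the templating step in run_experiments.py boils down to filling the requested CPU and memory values into the lammps.yaml template before applying it. Here is a minimal sketch, assuming a Jinja2 template in crd with cpus and memory placeholders (the real script and template may differ):
# Hypothetical sketch of templating lammps.yaml; the actual run_experiments.py
# may differ. Assumes crd/lammps.yaml contains Jinja2 placeholders such as
# {{ cpus }} and {{ memory }}.
from pathlib import Path
from jinja2 import Template

def render_lammps(cpus, memory_gb, template_path="crd/lammps.yaml"):
    """Fill per-node CPU count and memory (GiB) into the lammps.yaml template."""
    template = Template(Path(template_path).read_text())
    return template.render(cpus=cpus, memory=memory_gb)

# Example: render the CRD for 24 vCPU and 192 GiB of memory per node
print(render_lammps(cpus=24, memory_gb=192))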
When you are done:
gcloud container clusters delete test-cluster --region=us-central1-a
- Merge in the current PRs (two are essential) for Fluence
- Debug the nil value issue here
- Test with Fluence here
- Discuss the output timings we need (right now I am not calculating or saving any timings)
- Run experiments at a slightly larger scale as a test run
- Discuss the larger run / strategy