
Memory prediction and task scaling #10

Merged

Conversation

friederici (Contributor)

This PR includes all changes required for memory prediction and task scaling.

friederici (Contributor Author)

This change is not required, but it is very useful during development: it allows different settings when running inside the development environment. Please advise whether this should be removed or kept.

Lehmann-Fabian (Member)

I like it!

@Lehmann-Fabian (Member) left a comment

Thanks a lot for this great feature. I have now reviewed it in detail and commented on things that we should discuss or that need adaptation. Please get back to me if something is unclear.

if (Boolean.TRUE.equals(o.success)) {
    // set model to peakRss + 10%
    if (model.containsKey(o.task)) {
        model.replace(o.task, o.peakRss.multiply(new BigDecimal("1.1")).setScale(0, RoundingMode.CEILING));
Lehmann-Fabian (Member)

Shouldn't const always use the highest value ever seen? This would replace it with the most recent one.

friederici (Contributor Author)

The current behaviour is:

 * - In case the task was successful:
 *   - let the next prediction be 10% higher than the peakRss was
 *
 * - In case the task has failed:
 *   - reset to initial value

If the scheduler provides the tasks in order, with the task that has the biggest input first, this would cause the predictions to follow and always shrink.

But I agree that different "constant" strategies could be taken, e.g.:

  • constant, biggest value seen
  • constant, lowest value seen
  • constant, latest value seen <- current approach
    ...

Lehmann-Fabian (Member)

Ordering by size descending is only one of many possible scheduling strategies; ordering could also be ascending, random, or FIFO. We should use the maximum with an x% offset.
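
A minimal sketch of that idea, reusing the model map and observation fields (o.task, o.peakRss) from the snippet above; the 10% offset is only illustrative, not a value fixed in this PR:

// keep the largest peakRss ever seen per task (plus an offset), instead of the most recent value
BigDecimal offset = new BigDecimal("1.1"); // illustrative x% offset
if (Boolean.TRUE.equals(o.success)) {
    BigDecimal candidate = o.peakRss.multiply(offset).setScale(0, RoundingMode.CEILING);
    model.merge(o.task, candidate, BigDecimal::max);
}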

} else {
    log.debug("overprovisioning value will increase due to task failure");
    Double old = overprovisioning.get(o.task);
    overprovisioning.put(o.task, old + 0.05);
Lehmann-Fabian (Member)

I assume this will increase at the beginning, when we might make a few wrong predictions. However, the overprovisioning is never decreased once we have more observations and possibly better predictions.
We can also leave this for the future, as this is a very cautious approach.

friederici (Contributor Author)

The overprovisioning was my first attempt at solving the problem of too-low estimates.
Later I learned that errorStrategy is set to terminate by default and maxRetries is very low (I believe this is 1 by default), so in practice this value will rarely grow very much.

Lehmann-Fabian (Member)

How about a strategy that checks the predictions for every task in the training data and determines the highest offset needed to fit all/95%/99% of the values? The percentage could be a user-defined hyperparameter.
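
A hypothetical sketch of that strategy, assuming the observationList and simpleRegression used by the predictor below; the 95% quantile is just one possible value for the user-defined hyperparameter:

// for every training observation, compute the relative offset that would have been
// needed for the prediction to cover the observed peak, then take a quantile of those
List<Double> requiredOffsets = new ArrayList<>();
for (Pair<Double, Double> o : observationList) {
    double predicted = simpleRegression.predict(o.getLeft());
    if (predicted > 0) {
        requiredOffsets.add(o.getRight() / predicted);
    }
}
double offset = 1.0;
if (!requiredOffsets.isEmpty()) {
    Collections.sort(requiredOffsets);
    double quantile = 0.95; // user-defined hyperparameter
    int index = Math.max(0, (int) Math.ceil(quantile * requiredOffsets.size()) - 1);
    offset = Math.max(1.0, requiredOffsets.get(index));
}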

if ("default".equalsIgnoreCase(predictor)) {
predictor = System.getenv("MEMORY_PREDICTOR_DEFAULT");
}
if (predictor == null) {
Lehmann-Fabian (Member)

This does not match the README, which says that no predictor is used if nothing is set.

Comment on lines +157 to +170
for (Pair<Double, Double> o : observationList) {
    double p = simpleRegression.predict(o.getLeft());
    double op = overprovisioning.get(taskName);
    if ((p * op) < o.getRight()) {
        // The model's predicted value would have been smaller than the
        // observed value. Our model is not (yet) appropriate.
        // Increase overprovisioning
        log.debug("overprovisioning value will increase due to model mismatch");
        Double old = overprovisioning.get(taskName);
        overprovisioning.put(taskName, old + 0.05);
        // Don't make a prediction this time
        return null;
    }
}
Lehmann-Fabian (Member)

Isn't this kind of logic also appropriate for reducing the overprovisioning later?
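
A sketch of the symmetric case, assuming the same observationList, simpleRegression, and overprovisioning map as above; the 0.05 step and the 1.0 lower bound are assumptions mirroring the increase step:

// if every observation would still fit with a smaller overprovisioning, lower it by one step
double op = overprovisioning.get(taskName);
double reduced = Math.max(1.0, op - 0.05);
boolean allStillFit = true;
for (Pair<Double, Double> o : observationList) {
    if (simpleRegression.predict(o.getLeft()) * reduced < o.getRight()) {
        allStillFit = false;
        break;
    }
}
if (allStillFit) {
    overprovisioning.put(taskName, reduced);
}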

Comment on lines 184 to 187
if (prediction > initialValue.get(taskName).doubleValue()) {
    log.warn("prediction would exceed initial value");
    return initialValue.get(taskName).toPlainString();
}
Lehmann-Fabian (Member)

Again, compare against the actual task's value.


if (taskScaler != null) {
    // change memory resource requests and limits here
    taskScaler.beforeTasksScheduled(unscheduledTasks);
Lehmann-Fabian (Member)

This loop is called frequently.
Can we predict the task's values only when new observations become available? E.g., save the version of the predictor (simply the number of observations) into the task when a prediction is made; then we only recalculate if the predictor has more observations.
Can we patch the task only when it is actually scheduled? I assume this is an expensive operation.
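
A hypothetical sketch of the version-gating idea; the names used here (predictionVersion, getObservationCount, predict, setPlannedMemoryRequest) are illustrative and not part of this PR:

for (Task task : unscheduledTasks) {
    int currentVersion = predictor.getObservationCount(task.getName());
    if (task.getPredictionVersion() == currentVersion) {
        continue; // no new observations since the last prediction, keep the cached value
    }
    BigDecimal newRequest = predictor.predict(task);
    if (newRequest != null) {
        task.setPlannedMemoryRequest(newRequest);
        task.setPredictionVersion(currentVersion);
    }
    // patching the pod spec itself is deferred until the task is actually scheduled
}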

Lehmann-Fabian (Member)

I like it!

@Lehmann-Fabian changed the base branch from master to memoryPrediction on March 4, 2024 at 15:35
@Lehmann-Fabian merged commit fb7ebac into CommonWorkflowScheduler:memoryPrediction on March 4, 2024
3 checks passed