Use 'parallel' policy for workspace pod rollouts to avoid stalls #802

EronWright · 2025-02-04T19:15:54Z

Proposed changes

This PR addresses a known Kubernetes StatefulSet issue ("Forced rollback") that occurs when the workspace pod is in a crash loop:

When using Rolling Updates with the default Pod Management Policy (OrderedReady), it's possible to get into a broken state that requires manual intervention to repair.

The Parallel policy appears to let the StatefulSet controller forcibly remove a pod when a new revision is available. The controller appears to honor the termination grace period, which is important, and I can't think of any other downsides. However, the Kubernetes community has raised a concern about this approach: kubernetes/website#47085

Note that a workspace consists of a single replica. It behaves much like a singleton: it works well with Pulumi state locking and is compatible with persistent volumes.
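For illustration, here is a minimal sketch of the StatefulSet spec fields this change concerns. The struct types are hypothetical, trimmed-down stand-ins for the ones in k8s.io/api/apps/v1, so the example compiles with only the standard library. The `RollingUpdate` strategy is assumed from the quoted documentation, and `workspaceSpec` is not the operator's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// updateStrategy and statefulSetSpec are hypothetical stand-ins for the
// corresponding k8s.io/api/apps/v1 types; only the JSON field names
// (replicas, podManagementPolicy, updateStrategy.type) mirror the real API.
type updateStrategy struct {
	Type string `json:"type"`
}

type statefulSetSpec struct {
	Replicas            int32          `json:"replicas"`
	PodManagementPolicy string         `json:"podManagementPolicy"`
	UpdateStrategy      updateStrategy `json:"updateStrategy"`
}

// workspaceSpec returns the fields relevant to this PR: a single replica
// managed with the "Parallel" policy instead of the default "OrderedReady".
func workspaceSpec() statefulSetSpec {
	return statefulSetSpec{
		Replicas:            1,
		PodManagementPolicy: "Parallel",
		// Assumption: the workspace uses the default rolling update strategy.
		UpdateStrategy: updateStrategy{Type: "RollingUpdate"},
	}
}

func main() {
	out, err := json.MarshalIndent(workspaceSpec(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With real client types, the equivalent setting would presumably be the `PodManagementPolicy` field on the StatefulSet spec, set to the Parallel constant.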

Related issues (optional)

Closes #801

codecov · 2025-02-04T19:20:01Z

Codecov Report

Attention: Patch coverage is 12.50000% with 7 lines in your changes missing coverage. Please review.

Project coverage is 51.16%. Comparing base (fc48798) to head (4e33702).
Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
...r/internal/controller/auto/workspace_controller.go	12.50%	7 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #802      +/-   ##
==========================================
- Coverage   51.22%   51.16%   -0.06%     
==========================================
  Files          31       31              
  Lines        4318     4325       +7     
==========================================
+ Hits         2212     2213       +1     
- Misses       1917     1923       +6     
  Partials      189      189


rquitales · 2025-02-06T19:45:38Z

Per our offline discussion, the workaround makes sense, but we should exercise it in an e2e test. It relies on an implementation detail of how the Parallel policy affects StatefulSet updates, and the test would confirm that this behavior carries through to future Kubernetes releases.

Reference from the docs:

This option only affects the behaviour for scaling operations. Updates are not affected.
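One way such an e2e check could be shaped is a polling assertion that waits for the workspace pod to move to the new revision within a timeout. The sketch below is hypothetical: `waitFor`, `fakeCluster`, and the revision names are illustrative only, and the fake stands in for a real cluster query so the example runs with just the standard library. It is not the test added in this PR.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitFor polls cond at the given interval until it reports true,
// returns an error, or the context is cancelled or times out.
func waitFor(ctx context.Context, interval time.Duration, cond func() (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		ok, err := cond()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// fakeCluster is a hypothetical stand-in for querying the workspace pod.
// It pretends the crashlooping pod is replaced on the third poll.
type fakeCluster struct {
	polls int
}

func (f *fakeCluster) podRevision() string {
	f.polls++
	if f.polls >= 3 {
		return "rev-2"
	}
	return "rev-1"
}

// demoRollout waits up to one second for the pod to reach the new revision.
func demoRollout() error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	cluster := &fakeCluster{}
	return waitFor(ctx, 10*time.Millisecond, func() (bool, error) {
		return cluster.podRevision() == "rev-2", nil
	})
}

func main() {
	err := demoRollout()
	fmt.Println("rolled out:", err == nil)
}
```

In a real test the condition would query the cluster for the pod's current revision, and a timeout would indicate that the rollout stalled.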


EronWright requested a review from rquitales February 4, 2025 19:16

EronWright changed the title from "Use 'parallel' strategy for workspace pod rollouts to avoid stalls" to "Use 'parallel' policy for workspace pod rollouts to avoid stalls" Feb 4, 2025

EronWright mentioned this pull request Feb 5, 2025

Reduce volatility of the workspace due to ordering and caching issues #803

Merged

EronWright force-pushed the issue-801 branch from a71e9f9 to c834f0e Compare February 7, 2025 21:08

EronWright commented Feb 8, 2025

operator/e2e/e2e_test.go (resolved)

rquitales approved these changes Feb 8, 2025

operator/internal/controller/auto/workspace_controller.go (outdated, resolved)

operator/e2e/e2e_test.go (resolved)

EronWright force-pushed the issue-801 branch from 72f80c8 to 24bde62 Compare February 10, 2025 17:48

EronWright added 9 commits February 10, 2025 11:06

use parallel mode

2b6e194

changelog

b69c1ca

e2e test

7c913a0

auto-upgrade from beta3

16514c6

e2etest

7c1726e

e2e

ba022cd

cleanup

178ee13

vscode action

14ead8e

wip

028c916

EronWright force-pushed the issue-801 branch from 632a578 to 028c916 Compare February 10, 2025 19:08

EronWright added 3 commits February 10, 2025 12:06

no cleanup

fc1db7b

debug

b8e74ea

pinned the image to save storage space in kind, and disable destroyOn…

4e33702

…Finalize

EronWright merged commit 1795411 into master Feb 10, 2025
7 checks passed

EronWright deleted the issue-801 branch February 10, 2025 21:17
