Commit f811169

scheduler: recover from panic (#12009)
If processing a specific evaluation causes the scheduler (and therefore the entire server) to panic, that evaluation will never get a chance to be nack'd and cleared from the state store. It will get dequeued by another scheduler, causing that server to panic, and so forth until all servers are in a panic loop. This prevents the operator from intervening to remove the evaluation or update the state.

Recover the goroutine from the top-level `Process` methods for each scheduler so that this condition can be detected without panicking the server process. This will lead to a loop of recovering the scheduler goroutine until the eval can be removed or nack'd, but that's much better than taking downtime.
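The recovery technique described above can be sketched in isolation. This is a minimal illustration, not the scheduler code itself: `Evaluation` and `process` are hypothetical stand-ins, and the `work` callback exists only to trigger a panic on demand. The point it shows is that the result must be a named return value (`err`) so that the deferred function can overwrite it after the panic unwinds the stack.

```go
package main

import (
	"errors"
	"fmt"
)

// Evaluation is a hypothetical stand-in for structs.Evaluation.
type Evaluation struct {
	ID string
}

// process mimics the shape of a Process method that recovers from panics.
// The named result err lets the deferred function replace the return value
// after a panic has unwound the stack.
func process(eval *Evaluation, work func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("processing eval %q panicked scheduler: %v", eval.ID, r)
		}
	}()
	work()
	return nil
}

func main() {
	// A panicking evaluation is converted into an ordinary error.
	err := process(&Evaluation{ID: "abc123"}, func() { panic(errors.New("boom")) })
	fmt.Println(err) // processing eval "abc123" panicked scheduler: boom

	// A well-behaved evaluation still returns nil.
	fmt.Println(process(&Evaluation{ID: "ok"}, func() {})) // <nil>
}
```

With an unnamed result (`error` rather than `(err error)`), the deferred function would have no variable to assign to, and the function would return the zero value `nil` after recovering, which would hide the failure from the caller.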
1 parent 0263650 commit f811169

3 files changed: +18 -2 lines changed

.changelog/12009.txt

+3
@@ -0,0 +1,3 @@
+```release-note:improvement
+scheduler: recover scheduler goroutines on panic
+```

scheduler/generic_sched.go

+8 -1
@@ -125,7 +125,14 @@ func NewBatchScheduler(logger log.Logger, eventsCh chan<- interface{}, state Sta
 }
 
 // Process is used to handle a single evaluation
-func (s *GenericScheduler) Process(eval *structs.Evaluation) error {
+func (s *GenericScheduler) Process(eval *structs.Evaluation) (err error) {
+
+	defer func() {
+		if r := recover(); r != nil {
+			err = fmt.Errorf("processing eval %q panicked scheduler - please report this as a bug! - %v", eval.ID, r)
+		}
+	}()
+
 	// Store the evaluation
 	s.eval = eval
 

scheduler/scheduler_system.go

+7 -1
@@ -72,7 +72,13 @@ func NewSysBatchScheduler(logger log.Logger, eventsCh chan<- interface{}, state
 }
 
 // Process is used to handle a single evaluation.
-func (s *SystemScheduler) Process(eval *structs.Evaluation) error {
+func (s *SystemScheduler) Process(eval *structs.Evaluation) (err error) {
+
+	defer func() {
+		if r := recover(); r != nil {
+			err = fmt.Errorf("processing eval %q panicked scheduler - please report this as a bug! - %v", eval.ID, r)
+		}
+	}()
 
 	// Store the evaluation
 	s.eval = eval
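
To illustrate why returning an error is preferable to crashing, the following sketch shows a caller that nacks an evaluation when processing fails and then carries on. Everything here is hypothetical and simplified: `Eval`, `Broker`, `runWorker`, and the `"poison"` ID are invented for the example and do not reflect the actual worker or eval broker code. In particular, this toy broker does not redeliver nacked evaluations, whereas the commit message notes that the real system will keep retrying until the eval is removed.

```go
package main

import "fmt"

// Eval and Broker are hypothetical stand-ins, not the project's real types.
type Eval struct{ ID string }

type Broker struct {
	queue  []*Eval
	Acked  []string
	Nacked []string
}

// Dequeue pops the next evaluation, reporting false when the queue is empty.
func (b *Broker) Dequeue() (*Eval, bool) {
	if len(b.queue) == 0 {
		return nil, false
	}
	e := b.queue[0]
	b.queue = b.queue[1:]
	return e, true
}

func (b *Broker) Ack(e *Eval)  { b.Acked = append(b.Acked, e.ID) }
func (b *Broker) Nack(e *Eval) { b.Nacked = append(b.Nacked, e.ID) }

// process converts a panic into an error via a deferred recover.
func process(e *Eval) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("processing eval %q panicked: %v", e.ID, r)
		}
	}()
	if e.ID == "poison" {
		panic("simulated scheduler bug")
	}
	return nil
}

// runWorker drains the broker. A failing evaluation is nacked and the loop
// continues, so one bad evaluation cannot take the whole process down.
func runWorker(b *Broker) {
	for {
		e, ok := b.Dequeue()
		if !ok {
			return
		}
		if err := process(e); err != nil {
			fmt.Println("nack:", err)
			b.Nack(e)
			continue
		}
		b.Ack(e)
	}
}

func main() {
	b := &Broker{queue: []*Eval{{ID: "a"}, {ID: "poison"}, {ID: "b"}}}
	runWorker(b)
	fmt.Println("acked:", b.Acked, "nacked:", b.Nacked)
}
```

Without the deferred `recover` in `process`, the panic on the second evaluation would terminate the program before `"b"` was ever dequeued.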
