Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

control-service: Classify OOM as User Errors #479

Merged
merged 1 commit into from
Nov 9, 2021

Conversation

doks5
Copy link
Contributor

@doks5 doks5 commented Nov 4, 2021

Currently, if a data job execution exceeds the allowed memory
quota, the pod of the execution is immediately killed by
Kubernetes and restarted. Which in turn usually results in a
similar failure. This causes a Platform Error to be raised, whereas
it should be classified as a User Error.

This change re-classifies such errors as User Errors.

Testing Done: Added unit test.

Signed-off-by: Andon Andonov [email protected]

@doks5 doks5 linked an issue Nov 4, 2021 that may be closed by this pull request
Copy link
Collaborator

@antoniivanov antoniivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Let's merge this. But we need to make sure we can report to the user what is going on correctly (see my other comment). We can handle this in separate PR.

@tpalashki
Copy link
Contributor

Are you sure that the reason field of the Kubernetes job, in this case, will be "OOMKilled"? I think the job will have a status.condition[0].reason: BackoffLimitExceeded and the OOMKilled will be set as a status of the pod itself:

status:
  containerStatuses:
  - ...
    lastState:
      terminated:
        containerID: ...
        exitCode: 137
        finishedAt: "2021-10-31T09:55:22Z"
        reason: OOMKilled
        startedAt: "2021-10-31T09:47:17Z"

@doks5 doks5 self-assigned this Nov 5, 2021
@doks5 doks5 force-pushed the person/andonova/fix-oom-error branch 3 times, most recently from 95bf58e to ec81690 Compare November 8, 2021 08:58
@doks5 doks5 force-pushed the person/andonova/fix-oom-error branch from c54d9ed to 5535add Compare November 8, 2021 16:31
@doks5 doks5 requested a review from tpalashki November 9, 2021 05:47
@doks5 doks5 force-pushed the person/andonova/fix-oom-error branch from 5535add to e8891c2 Compare November 9, 2021 08:15
Currently, if a data job execution exceeds the allowed memory
quota, the pod of the execution is immediately killed by
Kubernetes and restarted. Which in turn usually results in a
similar failure. This causes a Platform Error to be raised, whereas
it should be classified as a User Error.

This change re-classifies such errors as User Errors.

Testing Done: Added unit test.

Signed-off-by: Andon Andonov <[email protected]>
@doks5 doks5 force-pushed the person/andonova/fix-oom-error branch from e8891c2 to 4c1359d Compare November 9, 2021 08:48
@doks5 doks5 merged commit 13d9e37 into main Nov 9, 2021
@doks5 doks5 deleted the person/andonova/fix-oom-error branch November 9, 2021 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Classify job failures due to OOM as User Error
5 participants