control-service: killed job was shown as successful #2116

Merged

mivanov1988
Collaborator

Why
We recently received the following feedback from an internal client: a data job was listed as successful even though it hit the 12-hour limit and was killed. The logs do not show this either: the last entry is just the last object that was sent for ingestion, and there is no summary of the data job.

The problem was introduced by the fix in #1586.

When a job hits the 12-hour limit, the K8s Pod is terminated and we construct a partial JobExecutionStatus. That status enters the following if statement, which returns Optional.empty() rather than the constructed object.

https://github.com/vmware/versatile-data-kit/blob/4763ba877f43b270fbd4770bc1533216f7c5d618/projects/control-service/projects/pipelines_control_service/src/main/java/com/vmware/taurus/service/KubernetesService.java#L1656

As a result, the job execution stays stuck in the Running status until the emergency logic detects it. That logic marks such executions as successful because no Pods are associated with them.
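The failure path above can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `KilledJobSketch`, `PartialStatus`, `resolve`, `emergencySync`, the `dropPartial` flag, and the `FAILED` status are illustrative stand-ins and not the actual control-service classes, guard condition, or status values. It only shows how dropping the constructed status lets the emergency logic turn a killed job into a successful one.

```java
import java.util.Optional;

// Hypothetical, simplified model of the flow described above.
// None of these names are the actual control-service classes or methods.
class KilledJobSketch {

    // Assumed status set; FAILED stands in for whatever terminal status a killed job should get.
    enum Status { RUNNING, SUCCEEDED, FAILED }

    // Stand-in for the partial status built after the Pod has been terminated.
    static final class PartialStatus {
        final String executionId;
        final Status status;

        PartialStatus(String executionId, Status status) {
            this.executionId = executionId;
            this.status = status;
        }
    }

    // Builds the status for a terminated Pod. When dropPartial is true, the guard
    // discards the constructed object, mirroring the Optional.empty() return.
    static Optional<PartialStatus> resolve(String executionId, boolean dropPartial) {
        PartialStatus partial = new PartialStatus(executionId, Status.FAILED);
        if (dropPartial) {
            return Optional.empty();
        }
        return Optional.of(partial);
    }

    // Emergency logic as described: a RUNNING execution with no Pod is marked successful.
    static Status emergencySync(Status recorded, boolean podExists) {
        if (recorded == Status.RUNNING && !podExists) {
            return Status.SUCCEEDED;
        }
        return recorded;
    }

    // Full path for a job killed at the time limit.
    static Status finalStatus(boolean dropPartial) {
        Status recorded = Status.RUNNING;
        Optional<PartialStatus> update = resolve("exec-1", dropPartial);
        if (update.isPresent()) {
            recorded = update.get().status;
        }
        // The Pod no longer exists once it has been terminated.
        return emergencySync(recorded, false);
    }

    public static void main(String[] args) {
        System.out.println("status dropped:  " + finalStatus(true));
        System.out.println("status returned: " + finalStatus(false));
    }
}
```

In this sketch, the execution ends as `SUCCEEDED` when the partial status is dropped and as `FAILED` when it is returned, which is the difference the reported job ran into.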

What
Added validation for an already completed job in a more appropriate place.

Testing Done
Added an integration test.

Signed-off-by: Miroslav Ivanov [email protected]
mivanov1988 enabled auto-merge (squash) May 25, 2023 12:04
mivanov1988 merged commit 67e739c into main May 25, 2023
mivanov1988 deleted the person/miroslavi/killed-job-was-shown-as-successful2 branch May 25, 2023 12:04