control-service: killed job was shown as successful #2103
Closed
mivanov1988 wants to merge 16 commits into main from person/miroslavi/killed-job-was-shown-as-successful
Why

We recently received the following feedback from an internal client: a data job was listed as successful even though it hit the 12-hour limit and was killed. The logs do not show this either: the last entry just shows the last object that was sent for ingestion, and there is no summary of the data job.

The problem was introduced by the fix in #1586. When a job hits the 12-hour limit, the K8S Pod is terminated and we construct a partial JobExecutionStatus, which enters the following if statement and returns Optional.empty() rather than the constructed object:

https://github.com/vmware/versatile-data-kit/blob/4763ba877f43b270fbd4770bc1533216f7c5d618/projects/control-service/projects/pipelines_control_service/src/main/java/com/vmware/taurus/service/KubernetesService.java#L1656

As a result, the job execution stays stuck in the Running status until it is detected by the emergency logic, which marks such executions as successful because no Pods are associated with them.

What

Added validation for an already completed job in a more appropriate place.

Testing Done

Added an integration test.

Signed-off-by: Miroslav Ivanov [email protected]
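The failure mode described above can be sketched in a few lines of Java. This is a reduced, hypothetical model and not the actual KubernetesService code: the class KilledJobSketch, the Status values, the `partial` flag, and every method name are assumptions made for illustration, and the real guard condition at line 1656 is not reproduced here. It only shows how a builder that returns Optional.empty() leaves an execution in Running, and how moving the "already completed" check to the caller lets the failure through.

```java
import java.util.Optional;

public class KilledJobSketch {

    // Hypothetical status values; the real service defines its own set.
    enum Status { RUNNING, SUCCEEDED, FAILED }

    /** Hypothetical, reduced stand-in for a job execution status object. */
    static final class ExecutionStatus {
        final Status status;
        final boolean partial; // assumed flag: true when the Pod was terminated early

        ExecutionStatus(Status status, boolean partial) {
            this.status = status;
            this.partial = partial;
        }
    }

    /** Before the fix (assumed shape): a guard in the builder discards the partial status. */
    static Optional<ExecutionStatus> buildStatusBefore(boolean podKilledByTimeLimit) {
        ExecutionStatus built = new ExecutionStatus(
                podKilledByTimeLimit ? Status.FAILED : Status.SUCCEEDED,
                podKilledByTimeLimit);
        if (built.partial) {
            return Optional.empty(); // the constructed object never reaches the caller
        }
        return Optional.of(built);
    }

    /** After the fix (assumed shape): the builder returns whatever it constructed. */
    static Optional<ExecutionStatus> buildStatusAfter(boolean podKilledByTimeLimit) {
        return Optional.of(new ExecutionStatus(
                podKilledByTimeLimit ? Status.FAILED : Status.SUCCEEDED,
                podKilledByTimeLimit));
    }

    /** Caller: the "already completed" validation lives here instead of in the builder. */
    static Status applyUpdate(Status stored, Optional<ExecutionStatus> update) {
        if (stored != Status.RUNNING) {
            return stored; // execution already completed, ignore late updates
        }
        return update.map(u -> u.status).orElse(stored);
    }

    /** Emergency logic as described: a Running execution without Pods is marked successful. */
    static Status emergencySweep(Status stored, boolean hasPods) {
        return (stored == Status.RUNNING && !hasPods) ? Status.SUCCEEDED : stored;
    }

    public static void main(String[] args) {
        Status before = emergencySweep(
                applyUpdate(Status.RUNNING, buildStatusBefore(true)), false);
        Status after = emergencySweep(
                applyUpdate(Status.RUNNING, buildStatusAfter(true)), false);
        System.out.println("before fix: " + before);
        System.out.println("after fix:  " + after);
    }
}
```

In this sketch the "before" path ends in SUCCEEDED because the empty Optional leaves the stored status at RUNNING until the sweep runs, while the "after" path records FAILED before the sweep can touch it.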
updates:
- https://github.com/asottile/reorder_python_imports → https://github.com/asottile/reorder-python-imports
- [github.com/asottile/pyupgrade: v3.3.1 → v3.4.0](asottile/pyupgrade@v3.3.1...v3.4.0)
- [github.com/pre-commit/mirrors-prettier: v3.0.0-alpha.6 → v3.0.0-alpha.9-for-vscode](pre-commit/mirrors-prettier@v3.0.0-alpha.6...v3.0.0-alpha.9-for-vscode) <- simply republished package; no need

---------

Signed-off-by: ivakoleva <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ivakoleva <[email protected]>
dakodakov approved these changes May 22, 2023
## Why

A recent CI test run failed with an error related to the staticmethod decorator:
https://gitlab.com/vmware-analytics/versatile-data-kit/-/jobs/4325250349

The latest heartbeat changes were tested on Python 3.11 and turned out not to be compatible with Python 3.7.

## What

- Remove the @staticmethod decorator from the test instantiation method (we're not using this method outside the runner).
- Rename the config member field of the test classes to _config.

## How was this tested

Locally, in a virtual environment using Python 3.7.

## What kind of change is this

Bugfix

Signed-off-by: Dilyan Marinov <[email protected]>
…2.472 in /projects/control-service/projects/pipelines_control_service (#2111)

Bumps [com.amazonaws:aws-java-sdk-sts](https://github.com/aws/aws-sdk-java) from 1.12.468 to 1.12.472. See the full diff in the [compare view](https://github.com/aws/aws-sdk-java/compare/1.12.468...1.12.472).

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
… github.com:vmware/versatile-data-kit into person/miroslavi/killed-job-was-shown-as-successful