Improve Hive deployment to fail fast on fatal error, and fix deprovision logic #1078

Merged: 4 commits, Dec 17, 2023

manosnoam

  • Fail fast on a fatal OCP install error instead of waiting for the 50-minute timeout.
    This is implemented by watching the completion status of the Hive pods.

  • Fix deprovisioning, which always ran in Teardown when provisioning failed,
    even when the failure was legitimate because the cluster name already existed.
    This caused an existing cluster to be deprovisioned unintentionally.

  • 'Wait For Cluster To Be Ready' now checks not only that the install has completed,
    but also that the Hive cluster deployment includes an API endpoint.
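
The fail-fast behaviour in the first bullet can be sketched roughly as follows. This is a Python illustration, not the Robot Framework code from this PR: `install_state`, `wait_for_install`, and the `get_phases` callback are hypothetical names, and treating any pod in the Kubernetes `Failed` phase as fatal is an assumption, since the PR only says that pod completion status is watched.

```python
import time
from typing import Callable, Iterable


def install_state(pod_phases: Iterable[str]) -> str:
    """Reduce install pod phases to 'failed', 'succeeded', or 'running'."""
    phases = list(pod_phases)
    if "Failed" in phases:
        return "failed"  # assumption: any failed pod is treated as fatal
    if phases and all(p == "Succeeded" for p in phases):
        return "succeeded"
    return "running"  # no pods yet, or some are still Pending/Running


def wait_for_install(
    get_phases: Callable[[], Iterable[str]],  # hypothetical: returns current pod phases
    timeout_s: float = 50 * 60,               # the 50-minute timeout from the description
    interval_s: float = 30,                   # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the install reaches a terminal state or the timeout expires."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        state = install_state(get_phases())
        if state != "running":
            return state  # exit early instead of waiting out the timeout
        sleep(interval_s)
    return "timeout"
```

In practice `get_phases` might wrap a call such as `oc get pods -o json` and read each pod's `status.phase`; injecting `sleep` and `clock` keeps the loop testable without real waiting.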

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

@@ -32,7 +32,16 @@
${clustername_exists} = Does ClusterName Exists
${template} = Select Provisioner Template ${provider_type}
IF ${clustername_exists}
... FAIL Cluster name '${cluster_name}' already exists. Please choose a different name.
Log Cluster name '${cluster_name}' already exists in Hive pool '${pool_name}' - Checking if it has a valid web-console console=True

Check warning

Code scanning / Robocop

Line is too long (147/120)
... FAIL Cluster name '${cluster_name}' already exists. Please choose a different name.
Log Cluster name '${cluster_name}' already exists in Hive pool '${pool_name}' - Checking if it has a valid web-console console=True
${pool_namespace} = Get Cluster Pool Namespace ${pool_name}
${result} = Run Process oc -n ${pool_namespace} get cd ${pool_namespace} -o json | jq -r '.status.webConsoleURL' --exit-status shell=yes

Check warning

Code scanning / Robocop

Line is too long (151/120)
Log Cluster name '${cluster_name}' already exists in Hive pool '${pool_name}' - Checking if it has a valid web-console console=True
${pool_namespace} = Get Cluster Pool Namespace ${pool_name}
${result} = Run Process oc -n ${pool_namespace} get cd ${pool_namespace} -o json | jq -r '.status.webConsoleURL' --exit-status shell=yes
IF ${result.rc} != 0

Check notice

Code scanning / Robocop

'IF' condition can be simplified
${pool_namespace} = Get Cluster Pool Namespace ${pool_name}
${result} = Run Process oc -n ${pool_namespace} get cd ${pool_namespace} -o json | jq -r '.status.webConsoleURL' --exit-status shell=yes
IF ${result.rc} != 0
Log Cluster '${cluster_name}' has previously failed to be provisioned - Cleaning Hive resources console=True

Check warning

Code scanning / Robocop

Line is too long (126/120)
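
The handling of an already existing cluster name in the hunk above can be summarised as a small decision function. This is a Python sketch under assumptions, not the Robot Framework implementation: `Action`, `decide`, and `deprovision_on_failure` are hypothetical names, the excerpt does not show what happens after the cleanup step, and the teardown rule is inferred from the PR description.

```python
import json
from enum import Enum
from typing import Optional


class Action(Enum):
    PROVISION = "provision"          # name is free: provision normally
    CLEAN_STALE = "clean_stale"      # name exists, no web console: clean Hive resources
    REJECT = "reject"                # name exists with a web console: fail, leave it alone


def web_console_url(cd_json: str) -> Optional[str]:
    """Return .status.webConsoleURL from ClusterDeployment JSON, or None if absent."""
    status = json.loads(cd_json).get("status") or {}
    url = status.get("webConsoleURL")
    return url if isinstance(url, str) else None


def decide(name_exists: bool, cd_json: Optional[str]) -> Action:
    """Mirror the IF branches: a missing URL corresponds to a non-zero jq exit status."""
    if not name_exists:
        return Action.PROVISION
    if cd_json is None or web_console_url(cd_json) is None:
        return Action.CLEAN_STALE
    return Action.REJECT


def deprovision_on_failure(action: Action) -> bool:
    """Assumed teardown rule: never deprovision a cluster that was rejected as pre-existing."""
    return action is not Action.REJECT
```

Keeping the decision separate from the `oc` calls makes the "do not deprovision an existing cluster" rule easy to check in isolation.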
@{new_lines} = Split To Lines ${install_log_data} ${last_line_index}
FOR ${line} IN @{new_lines}
Log To Console ${line}
Watch Hive Install Log

Check warning

Code scanning / Robocop

Missing documentation in 'Watch Hive Install Log' keyword
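
The `Split To Lines ${install_log_data} ${last_line_index}` loop above prints only the log lines that appeared since the previous poll. A minimal Python equivalent of that idea, with the hypothetical helper name `new_log_lines`, might look like this; how the PR itself tracks the index between polls is not shown in the excerpt.

```python
from typing import List, Tuple


def new_log_lines(log_data: str, last_line_index: int) -> Tuple[List[str], int]:
    """Return the lines added since last_line_index and the index to use next time."""
    lines = log_data.splitlines()
    fresh = lines[last_line_index:]  # empty when nothing new was written
    return fresh, len(lines)
```

A caller would keep the returned index and pass it back on the next poll, so each log line is written to the console once.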
${result} = Run Process oc -n ${pool_namespace} get cd ${pool_namespace} -o json | jq -r '.status.webConsoleURL' --exit-status shell=yes
IF ${result.rc} != 0
${result} = Run Process oc -n ${pool_namespace} get cd ${pool_namespace} -o json shell=yes
Log Cluster '${cluster_name}' install completed, but it is not accesible - Cleaning Hive resources console=True

Check warning

Code scanning / Robocop

Line is too long (125/120)
FAIL Cluster '${cluster_name}' provisioning failed. Please look into the logs for more details.
END
Log Cluster ${cluster_name} created successfully. Web Console: ${result.stdout} console=True

Check warning

Code scanning / Robocop

Trailing whitespace at the end of line
${result} = Run Process oc -n ${pool_namespace} get cd ${pool_namespace} -o jsonpath\='{ .status.webConsoleURL }' shell=yes
Log Cluster ${cluster_name} Web Console: ${result.stdout} console=True
Should Be True ${result.rc} == 0
${result} = Run Process oc -n ${pool_namespace} get cd ${pool_namespace} -o json | jq -r '.status.apiURL' --exit-status shell=yes

Check warning

Code scanning / Robocop

Line is too long (140/120)
Log Cluster ${cluster_name} Web Console: ${result.stdout} console=True
Should Be True ${result.rc} == 0
${result} = Run Process oc -n ${pool_namespace} get cd ${pool_namespace} -o json | jq -r '.status.apiURL' --exit-status shell=yes
Should Be True ${result.rc} == 0 Hive Cluster deployment '${pool_namespace}' does not have a valid API access

Check notice

Code scanning / Robocop

'Should Be True' condition can be simplified
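
The readiness check above requires both `.status.webConsoleURL` and `.status.apiURL` on the cluster deployment. As a rough Python sketch (the helper name `missing_endpoints` is hypothetical, and the PR itself does this with `oc ... | jq --exit-status`), the same check on the JSON output could be written as:

```python
import json
from typing import List

# Status fields the PR's keyword checks on the Hive ClusterDeployment.
REQUIRED_STATUS_FIELDS = ("webConsoleURL", "apiURL")


def missing_endpoints(cd_json: str) -> List[str]:
    """List the required status fields that are absent or empty."""
    status = json.loads(cd_json).get("status") or {}
    return [field for field in REQUIRED_STATUS_FIELDS if not status.get(field)]
```

An empty result means both endpoints are present; a non-empty result names exactly which field is missing, which gives a more specific failure message than a bare return-code check.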
${result} = Run Process KUBECONFIG\=${cluster_kubeconf} oc login --username\=${username} --password\=${password} ${api} --insecure-skip-tls-verify
... shell=yes
# Test the extracted credentials
${result} = Run Process KUBECONFIG\=${cluster_kubeconf} oc login --username\=${username} --password\=${password} ${api} --insecure-skip-tls-verify shell=yes

Check warning

Code scanning / Robocop

Line is too long (169/120)

Robot Results

✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass %
389 | 0 | 0 | 389 | 100

@manosnoam

This has been tested on PSI, GCP, and AWS.

Example output of the provision process exiting right after a fatal error (without waiting for the 50-minute timeout):

image

@manosnoam changed the title from "Improove Hive deployment to fail fast on fatal error, and fix deprovision logic" to "Improve Hive deployment to fail fast on fatal error, and fix deprovision logic" on Dec 13, 2023
@manosnoam requested a review from jstourac on December 14, 2023 07:32

@jstourac left a comment

I don't know the code well enough, but in general LGTM. Thank you. Hopefully we won't leave any resources behind 🙂

@manosnoam requested a review from bdattoma on December 14, 2023 11:31
@manosnoam merged commit ea4ddeb into red-hat-data-services:master on Dec 17, 2023
7 checks passed
@manosnoam added the enhancements label (Bugfixes, enhancements, refactoring, ... in tests or libraries; PR will be listed in release-notes) on Jan 7, 2024