Fix broken sanity tests + Add skip wait option in test teardown in model serving #1157

bdattoma · 2024-01-25T15:33:00Z

Fixing:

Verify RHODS Admins Can Import A Custom Serving Runtime Template For Each Serving Platform
Verify RHODS Users Can Deploy A Model Using A Custom Serving Runtime
Verify Model Upgrade Using Canaray Rollout
Verify User Can Autoscale Using Concurrency

In addition, this PR adds an option to skip waiting for the project to be deleted in test teardown. The reason is that project deletion takes a long time in model serving tests (OCP takes 3 to 6 minutes to delete a project).
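The skip-wait behaviour can be sketched outside Robot Framework as follows. This is a minimal Python illustration, not the keyword from this PR: the function name, the `run` parameter, and the error handling are assumptions made for the example; only the two `oc` commands and the `wait_prj_deletion` flag come from the diff.

```python
import subprocess
from typing import Callable, List


def clean_up_test_project(
    test_ns: str,
    wait_prj_deletion: bool = True,
    # Hypothetical injection point so the sketch can be exercised without a cluster.
    run: Callable[[List[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
) -> None:
    """Delete an OCP project and optionally block until the namespace is gone."""
    rc = run(["oc", "delete", "project", test_ns])
    if rc != 0:
        raise RuntimeError(f"deleting project {test_ns} failed (rc={rc})")

    if wait_prj_deletion:
        # This is the slow step that the new option lets a teardown skip.
        rc = run(["oc", "wait", "--for=delete", "namespace", test_ns, "--timeout=300s"])
        if rc != 0:
            raise RuntimeError(f"namespace {test_ns} still present after 300s (rc={rc})")
```

Skipping the wait is only safe when later tests do not reuse the same namespace name, since the project may still be terminating when the next test starts.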

ods_ci/tests/Tests/400__ods_dashboard/420__model_serving/LLMs/422__model_serving_llm.robot


 Verify User Can Autoscale Using Concurrency
    [Documentation]    Checks if model successfully scale up based on concurrency metrics (KPA)
    [Tags]    Sanity    Tier1    ODS-2377
    [Setup]    Set Project And Runtime    namespace=autoscale-con
    ${test_namespace}=    Set Variable    autoscale-con
    ${flan_model_name}=    Set Variable    flan-t5-small-caikit
-    ${model_name}=    Create List    ${flan_model_name}
+    ${models_names}=    Create List    ${flan_model_name}


ods_ci/tests/Resources/CLI/ModelServing/llm.resource

@@ -292,6 +292,7 @@
    [Documentation]    Group together the test steps for preparing, deploying
    ...                and querying a model
    [Arguments]    ${model_storage_uri}    ${model_name}    ${isvc_name}=${model_name}
+    ...            ${runtime}=caikit-tgis-runtime    ${protocol}=grpc    ${inference_type}=all-tokens


bdattoma · 2024-01-25T16:01:05Z

PR validation:

CustomServingRuntime tests + some of Model Serving Sanity: rhods-ci-pr-test/2386 - PASS
Model Serving Sanity: rhods-ci-pr-test/2394 - 2 partial failures*. I think they happen because the model is not fully loaded in memory yet. This needs further investigation (out of scope for this PR)
TrustyAI: rhods-ci-pr-test/2399 - 2 failures related to a product build issue. The code patched by this PR ran OK in the second failed test

Dry-run failures are not due to changes in this PR

* Partial failures to investigate:

Verify User Can Deploy Multiple Models In The Same Namespace Using The UI
Verify User Can Deploy Multiple Models In Different Namespaces Using The UI
These might be solved at product level with readiness probes: add readiness probe on TGIS container (caikit+tgis) opendatahub-io/caikit-tgis-serving#156

jgarciao · 2024-01-26T09:53:16Z

ods_ci/tests/Resources/CLI/ModelServing/llm.resource

@@ -580,5 +579,9 @@ Clean Up Test Project
    ...    namespace=${test_ns}
    ${rc}    ${out}=    Run And Return Rc And Output    oc delete project ${test_ns}
    Should Be Equal As Integers    ${rc}    ${0}
-    ${rc}    ${out}=    Run And Return Rc And Output    oc wait --for=delete namespace ${test_ns} --timeout=300s
-    Should Be Equal As Integers    ${rc}    ${0}
+    IF    ${wait_prj_deletion}


There are a lot of keywords deleting projects, for example Delete Data Science Project From CLI in ods_ci/tests/Resources/Page/ODH/ODHDashboard/ODHDataScienceProject/Projects.resource

Could you consider enhancing the existing one for your purposes in another PR?

I thought about that while reviewing the code. I think the main difference is that the one you mention handles DS Projects, while in this case we're handling basic OCP projects.

I haven't applied any enhancements for now because I need to think a bit more about how to implement it, but I agree.

ods_ci/tests/Resources/Page/ODH/ODHDashboard/ODHModelServing.resource

@@ -276,6 +274,24 @@
        Fail    msg=comparison between expected and actual failed, ${list}
    END

+Verify Model Inference With Retries


ods_ci/tests/Resources/Page/ODH/ODHDashboard/ODHModelServing.resource

+    ...                timing: model not ready to reply yet, despite the pod is up and running and the
+    ...                endpoint exposed.
+    ...                This is a temporary mitigation meanwhile we find a better way to check the model
+    [Arguments]    ${model_name}    ${inference_input}    ${expected_inference_output}


github-actions · 2024-01-26T11:23:33Z

Robot Results

✅ Passed	❌ Failed	⏭️ Skipped	Total	Pass %
405	0	0	405	100

ods_ci/tests/Resources/Page/ODH/ODHDashboard/ODHModelServing.resource

+    ...                endpoint exposed.
+    ...                This is a temporary mitigation meanwhile we find a better way to check the model
+    [Arguments]    ${model_name}    ${inference_input}    ${expected_inference_output}
+    ...            ${token_auth}=${FALSE}    ${project_title}=${NONE}    ${retries}=${5}
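The retry mitigation described in the keyword documentation above can be illustrated with a short Python sketch. It is not the Robot Framework keyword itself: `query_model`, `retry_interval_s`, and the injectable `sleep` are hypothetical names for this example, and only the default of 5 retries is taken from the diff.

```python
import time
from typing import Callable, Optional


def verify_inference_with_retries(
    query_model: Callable[[], str],
    expected_output: str,
    retries: int = 5,
    retry_interval_s: float = 5.0,  # assumed pause; the PR only says a sleep was added
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Query the model until its output matches; return the attempt that succeeded."""
    last_result: Optional[str] = None
    for attempt in range(1, retries + 1):
        try:
            last_result = query_model()
        except Exception as err:  # endpoint exposed but model not ready to reply yet
            last_result = f"error: {err}"
        if last_result == expected_output:
            return attempt
        if attempt < retries:
            sleep(retry_interval_s)  # give the model time to finish loading
    raise AssertionError(
        f"inference did not match after {retries} attempts, last result: {last_result!r}"
    )
```

A fixed number of retries keeps the teardown time bounded, which fits the stated intent that this is a temporary mitigation until model readiness can be checked directly.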


jstourac

Thanks, there are some small things to be fixed. Otherwise LGTM from my restricted knowledge point of view.

jstourac

One more comment, please let me know if you want to touch it (note that it needs a change a few lines above too). Otherwise I'll approve.

ods_ci/tests/Tests/400__ods_dashboard/420__model_serving/LLMs/422__model_serving_llm.robot


 Verify User Can Validate Scale To Zero
    [Documentation]    Checks if model successfully scale down to 0 if there's no traffic
    [Tags]    Sanity    Tier1    ODS-2379
    [Setup]    Set Project And Runtime    namespace=autoscale-zero
    ${flan_model_name}=    Set Variable    flan-t5-small-caikit
-    ${model_name}=    Create List    ${flan_model_name}
+    ${models_names}=    Create List    ${flan_model_name}


sonarqubecloud · 2024-01-26T16:17:38Z

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

github-advanced-security bot found potential problems Jan 25, 2024

bdattoma force-pushed the fix_serving_sanity branch from 3f4de28 to dec4caa January 25, 2024 16:04

bdattoma requested review from tarukumar, FedeAlonso, jgarciao, jiridanek, jstourac and lugi0 January 25, 2024 16:10

bdattoma self-assigned this Jan 25, 2024

bdattoma added the needs testing and enhancements labels Jan 25, 2024

github-advanced-security bot found potential problems Jan 26, 2024

bdattoma added 10 commits January 26, 2024 09:29

fix canary test

1276323

fix concurrency test clean up

8953db2

skip project delation wait

e56ad4c

update serving platform names

0dcc905

remove unused browser session

b72d1b7

update sr platform labels

7630cb5

fix deploy form wait

afe1124

try fix mm inference tests

957e4b4

fix metrics test regression from other PRs

87827a8

add note about intermittent bug

c50f4f6

bdattoma force-pushed the fix_serving_sanity branch from 16373e9 to c50f4f6 January 26, 2024 08:29

fix montioring enablement due to other PRs

63183e2

jgarciao previously approved these changes Jan 26, 2024

jgarciao and others added 2 commits January 26, 2024 10:56

Merge branch 'master' into fix_serving_sanity

0a1cafa

apply suggested changes

aa88d0a

bdattoma dismissed jgarciao’s stale review via aa88d0a January 26, 2024 10:18

tentative workaround for failing MM sanity

a6da1b6

github-advanced-security bot found potential problems Jan 26, 2024

fix inference kw with retries

60530cb

github-advanced-security bot found potential problems Jan 26, 2024

restore @ before inference input filepath

58d7276

github-advanced-security bot found potential problems Jan 26, 2024

ods_ci/tests/Tests/400__ods_dashboard/420__model_serving/424_model_serving_bias_metrics.robot Resolved

jstourac reviewed Jan 26, 2024

ods_ci/tests/Tests/400__ods_dashboard/420__model_serving/420__model_serving.robot Outdated

jstourac requested changes Jan 26, 2024

bdattoma added 2 commits January 26, 2024 14:23

fix missing args

4cf3ca3

add sleep btw retries

8705048

github-advanced-security bot found potential problems Jan 26, 2024

ods_ci/tests/Resources/Page/ODH/ODHDashboard/ODHModelServing.resource Fixed

ods_ci/tests/Resources/Page/ODH/ODHDashboard/ODHModelServing.resource Fixed

fix robocop alerts

1443179

github-advanced-security bot found potential problems Jan 26, 2024

ods_ci/tests/Tests/400__ods_dashboard/420__model_serving/420__model_serving.robot Fixed

bdattoma requested review from jstourac and jgarciao January 26, 2024 15:16

jstourac reviewed Jan 26, 2024

ods_ci/tests/Tests/400__ods_dashboard/420__model_serving/LLMs/422__model_serving_llm.robot Outdated

jstourac reviewed Jan 26, 2024

bdattoma added the verified label and removed the needs testing label Jan 26, 2024

bdattoma requested a review from jstourac January 26, 2024 15:23

fix variable name for consistency in ods-2379

aa06ca5

github-advanced-security bot found potential problems Jan 26, 2024

fix robocop alert

ebbca32

jstourac approved these changes Jan 26, 2024

bdattoma and others added 2 commits January 26, 2024 16:31

Merge branch 'master' into fix_serving_sanity

821c020

Merge branch 'master' into fix_serving_sanity

17c983a

jgarciao approved these changes Jan 26, 2024

jgarciao merged commit 29b646f into red-hat-data-services:master Jan 26, 2024
11 checks passed
