test: Add delay to L0_lifecycle test_load_new_model_version after each model file update #7735
Conversation
Force-pushed from 95b1c55 to 1af39e8
Just to check my understanding, is the flakiness coming from checks for model versions expected to be "not ready"?
I.e., the calls to "load" in some test cases are actually triggering a "reload" that may "unload" some instances (versions removed in the diff of the new config), and the server doesn't wait for each unload to completely finish before returning to the client, so the checks for "not ready" models may fail if the model hasn't fully unloaded on the server side yet?
I would expect "loads" to block on the server side until the model is actually loaded and considered ready before returning to the client. "Unloads", on the other hand, may not block until the model is considered "not ready", so I want to double check whether the flakiness is coming from one of these "unload" cases via a version-policy-change-only "reload".
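Since unloads may return before the model is actually "not ready" on the server, one robust option (a minimal sketch, not the test's actual helper; `is_ready_fn` and the parameter values are illustrative) is to poll the readiness state with a timeout instead of asserting immediately:

```python
import time


def wait_for_ready_state(is_ready_fn, expected, timeout=10.0, interval=0.1):
    """Poll until the model's readiness matches `expected`, or time out.

    Useful when the server may return from an unload request before the
    model has fully transitioned to "not ready".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready_fn() == expected:
            return True
        time.sleep(interval)
    return False
```

A test would then call `wait_for_ready_state(lambda: client.is_model_ready(name, version), False)` rather than checking readiness once.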
I looked deeper into the failed server and client logs. The last load operation was successful according to the client log, but according to the server log the load operation never commenced. Subsequently, the check that model version 1 is loaded failed. If I remove the model config update from the test, I can reproduce the same client and server logs locally. Based on these findings, my hypothesis is that the model config update had been written to disk according to the Python script, at which point the server should have seen the updated file, but for some reason the server was still seeing the old version by the time the load operation arrived. I will update the test accordingly.
The test script on the container has been updated to the latest and re-run; see job 118303410.
Moving to thread.
Based on these findings, my hypothesis is that the model config update had been written to disk according to the Python script, at which point the server should have seen the updated file, but for some reason the server was still seeing the old version by the time the load operation arrived. I will update the test accordingly.
Is there a bug in checking file modification time or something?
I think this is unique to the CI, because I ran the test from the failing container locally 10 times and it passed every time. I also ran the test ~560 times last night on the CI without any time.sleep(), and they all passed; see job 118186442.
Edit: The ~560-run CI job had --log-verbose=2, which could change some timing.
I think another possibility is that the timestamp is the same before and after the file update, which would break our assumption that a ns-precision timestamp is sufficient for determining the order of events.
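To illustrate the concern: st_mtime_ns reports nanoseconds, but the underlying filesystem may record timestamps at a much coarser granularity, so two rapid writes can end up with identical mtimes and the order of events becomes ambiguous. A minimal sketch (file contents are illustrative, not the actual model config):

```python
import os
import tempfile


def mtime_ns(path):
    return os.stat(path).st_mtime_ns


# Write a file, then immediately overwrite it, and compare mtimes.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".pbtxt") as f:
    path = f.name
    f.write("version_policy: { specific: { versions: [1] } }\n")
t1 = mtime_ns(path)

with open(path, "w") as f:
    f.write("version_policy: { specific: { versions: [1, 2] } }\n")
t2 = mtime_ns(path)

# If t2 == t1, a poller comparing timestamps cannot tell the file changed.
ambiguous = (t2 == t1)
os.unlink(path)
```

Whether `ambiguous` is True depends on the filesystem's timestamp resolution and how quickly the writes happen, which is exactly why a timestamp-only ordering assumption can be fragile.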
If it starts failing again even with this change, we should re-evaluate the methodology here.
What does the PR do?
Adds a small delay to the L0_lifecycle test_load_new_model_version test after each model file update, to prevent flaky results caused by the server not picking up the update during the model load request.
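The workaround can be sketched as follows (the helper name, delay value, and fsync step are illustrative assumptions, not the test's actual code): after rewriting a model file, flush it to disk and pause briefly so the server's subsequent load request sees the new contents.

```python
import os
import time

# Assumed value for illustration; the actual delay is chosen in the test.
FILE_UPDATE_DELAY_SEC = 0.2


def update_model_file(path, contents, delay=FILE_UPDATE_DELAY_SEC):
    """Rewrite a model file, then wait so the server observes the change."""
    with open(path, "w") as f:
        f.write(contents)
        f.flush()
        os.fsync(f.fileno())  # ensure the write reaches disk
    time.sleep(delay)  # give the server time to pick up the update
```

The sleep trades a small amount of test runtime for determinism; if the flakiness is truly a file-visibility race, this makes the race window negligible without changing server behavior.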
Checklist
<commit_type>: <Title>
Commit Type: check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
N/A
Where should the reviewer start?
Start by looking into the L0_lifecycle failure of pipeline 19628457
Test plan:
L0_lifecycle pass after the patch.
Caveats:
N/A
Background
See #7730 and job 117883843
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
N/A