
test: Refactor cpu metrics tests to make L0_metrics more stable #7476

Merged 12 commits into main, Jul 29, 2024

Conversation

@rmccorm4 (Contributor) commented Jul 26, 2024

The previous implementation of the tests for the CPU/RAM-related metrics was:

  1. Unstable on the check for nv_cpu_utilization, because no work was going on in the background, so duplicate values appeared fairly frequently.
  2. Written in bash.

This PR:

  1. Adds server inference requests in the background to keep the server busy and show some CPU utilization variation during metric observation.
  2. Moves to Python for more flexibility and ease of testing.
  3. Updates the newer-style tests (pinned_memory_metrics, cpu_metrics) to use the new pytest style that generates nice reports on the GitLab side for easier regression detection at the unit-test level. I also noticed that the pinned memory metrics section called check_test_results without using the tu.ResultCollector, so it wasn't actually checking the right test results file anyway, since none was being generated. I moved this test to the pytest style to be consistent with the similar CPU metrics test.
  4. Makes some Windows-friendly changes to the L0_metrics tests based on Francesco's review comments, such as using kill_server and TRITONSERVER_IPADDR.
  5. Reduces the 4-GPU CI runner requirement to a 2-GPU requirement by checking only 2 GPUs instead of 3 in the GPU metrics section. I think the test is equivalent regardless, and there are significantly more 2-GPU runners than 4-GPU runners available in CI, so this cuts down on job queue times. I'll push a GitLab-side PR to update accordingly, but the test will still work as-is with these changes.

In testing so far, the PR typically makes about 20 observations (1 per second over a 20-second test) and finds no duplicate values. However, keeping the duplicate tolerance will help stabilize the test in the event we occasionally find some:
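The stabilization approach described above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: names like `observe_metric`, `busy_work`, and the interval/tolerance constants are hypothetical stand-ins for the real Triton client requests and metric scraping.

```python
# Hypothetical sketch of the test strategy described in this PR:
# run background work while sampling a metric once per second, then
# tolerate a small number of duplicate consecutive readings.
import threading
import time

OBSERVATION_INTERVAL_SECS = 1   # hypothetical: 1 observation per second
NUM_OBSERVATIONS = 20           # hypothetical: ~20-second test window
MAX_ALLOWED_DUPLICATES = 2      # hypothetical duplicate tolerance


def count_consecutive_duplicates(observations):
    """Count adjacent pairs of identical metric readings."""
    return sum(1 for a, b in zip(observations, observations[1:]) if a == b)


def busy_work(stop_event):
    """Stand-in for background inference requests keeping the server busy."""
    while not stop_event.is_set():
        sum(i * i for i in range(10_000))


def observe_metric(sample_fn, num_observations, interval_secs):
    """Collect periodic samples of a metric while background load runs."""
    stop = threading.Event()
    worker = threading.Thread(target=busy_work, args=(stop,), daemon=True)
    worker.start()
    try:
        samples = []
        for _ in range(num_observations):
            samples.append(sample_fn())
            time.sleep(interval_secs)
        return samples
    finally:
        stop.set()
        worker.join()
```

In a pytest-style test, the assertion would then be something like `assert count_consecutive_duplicates(samples) <= MAX_ALLOWED_DUPLICATES`, so an occasional repeated reading does not fail the run while a flat-lined metric still does.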

@rmccorm4 added the PR: test (Adding missing tests or correcting existing test) label Jul 26, 2024
@fpetrini15 fpetrini15 self-requested a review July 27, 2024 01:23
@fpetrini15 (Contributor) previously approved these changes Jul 27, 2024 and left a comment:

LGTM! Great work Ryan! Looking forward to seeing the greener pipelines 🚀

And thanks for fixing things up to be Windows CI friendly!

@krishung5 (Contributor) left a comment:

LGTM!

@fpetrini15 (Contributor) left a comment:

LGTM, great work!

@rmccorm4 merged commit fb056b1 into main Jul 29, 2024
3 checks passed
@rmccorm4 deleted the rmccormick-L0_metrics-stability branch July 29, 2024 19:06