
test: Refactor cpu metrics tests to make L0_metrics more stable #7476

Merged 12 commits into main, Jul 29, 2024

Conversation

@rmccorm4 (Contributor) commented Jul 26, 2024

The previous implementation of the tests for the CPU/RAM-related metrics was:

  1. Unstable on the check for nv_cpu_utilization, because no work was going on in the background, so duplicate values appeared fairly frequently.
  2. Written in bash.

This PR:

  1. Adds server inference requests in the background to keep the server busy and show some CPU utilization variation during metric observation.
  2. Moves to Python for more flexibility and ease of testing.
  3. Updates the newer-style tests (pinned_memory_metrics, cpu_metrics) to use the new pytest style that generates nice reports on the GitLab side for easier regression detection at the unit-test level. I also noticed that the pinned memory metrics section called check_test_results without using the tu.ResultCollector, so it wasn't actually checking the right test results file anyway, since none was being generated. I moved this test to the pytest style to be consistent with the similar CPU metrics test.
  4. Makes some Windows-friendly changes to the L0_metrics tests based on Francesco's review comments, such as using kill_server and TRITONSERVER_IPADDR.
  5. Reduces the 4-GPU CI runner requirement to a 2-GPU requirement by checking only 2 GPUs instead of 3 in the GPU metrics section. I think the test is equivalent regardless, and there are significantly more 2-GPU runners than 4-GPU runners available in CI, so this cuts down on job queue times. I'll push a GitLab-side PR to update accordingly, but the test will still work as-is with these changes.

In testing so far, the PR typically makes about 20 observations (1 per second over a 20-second test) and finds no duplicate values. However, keeping the duplicate tolerance will help stabilize the test in the event we occasionally find some:
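The stabilization approach described above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: names like `observe_metric`, `busy_work`, and the interval/tolerance constants are hypothetical stand-ins for the real Triton client requests and metric scraping.

```python
# Hypothetical sketch of the test strategy described in this PR:
# run background work while sampling a metric once per second, then
# tolerate a small number of duplicate consecutive readings.
import threading
import time

OBSERVATION_INTERVAL_SECS = 1   # hypothetical: 1 observation per second
NUM_OBSERVATIONS = 20           # hypothetical: ~20-second test window
MAX_ALLOWED_DUPLICATES = 2      # hypothetical duplicate tolerance


def count_consecutive_duplicates(observations):
    """Count adjacent pairs of identical metric readings."""
    return sum(1 for a, b in zip(observations, observations[1:]) if a == b)


def busy_work(stop_event):
    """Stand-in for background inference requests keeping the server busy."""
    while not stop_event.is_set():
        sum(i * i for i in range(10_000))


def observe_metric(sample_fn, num_observations, interval_secs):
    """Collect periodic samples of a metric while background load runs."""
    stop = threading.Event()
    worker = threading.Thread(target=busy_work, args=(stop,), daemon=True)
    worker.start()
    try:
        samples = []
        for _ in range(num_observations):
            samples.append(sample_fn())
            time.sleep(interval_secs)
        return samples
    finally:
        stop.set()
        worker.join()
```

In a pytest-style test, the assertion would then be something like `assert count_consecutive_duplicates(samples) <= MAX_ALLOWED_DUPLICATES`, so an occasional repeated reading does not fail the run while a flat-lined metric still does.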

@rmccorm4 added the PR: test (Adding missing tests or correcting existing test) label Jul 26, 2024
@fpetrini15 fpetrini15 self-requested a review July 27, 2024 01:23
@fpetrini15 (Contributor) previously approved these changes Jul 27, 2024 and left a comment:

LGTM! Great work Ryan! Looking forward to seeing the greener pipelines 🚀

And thanks for fixing things up to be Windows CI friendly!

@krishung5 (Contributor) left a comment:

LGTM!

@fpetrini15 (Contributor) left a comment:

LGTM, great work!

@rmccorm4 merged commit fb056b1 into main Jul 29, 2024
3 checks passed
@rmccorm4 deleted the rmccormick-L0_metrics-stability branch July 29, 2024 19:06