
DAOS-17006 cart: Publish Mercury counters as metrics (#15870) #15963

Merged: 3 commits into master on Feb 27, 2025

Conversation

@mjmac (Contributor) commented Feb 24, 2025

When Mercury has been built with diagnostic RPC counters
enabled, CaRT will periodically republish the counters
as DAOS telemetry for consumption by monitoring
infrastructure. NB: Requires Mercury > 2.4.0.

Change-Id: I0232d0da8007374fd1d28d395c65544c7fa57bc1
Signed-off-by: Michael MacDonald <[email protected]>
Co-authored-by: Jeff Olivier <[email protected]>
Co-authored-by: Nicholas Murphy <[email protected]>


Ticket title is 'Expose Mercury perf counters as DAOS metrics'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-17006

@mjmac (Contributor, Author) commented Feb 24, 2025

@soumagne: FYI, this is a PR suitable for master... I have chosen not to include the Mercury patch, but I can adjust that if desired. Do you anticipate making a 2.4.1 release prior to DAOS 2.8.0? If so, then the new metrics will "just work" when DAOS is built against it.

@soumagne (Collaborator) replied:

OK, that works, thanks! Yes, I will be making a new release for DAOS 2.8.

@mjmac mjmac force-pushed the mjmac/DAOS-17006 branch 2 times, most recently from e24eacd to 24ad6ce Compare February 24, 2025 16:48
@mjmac mjmac marked this pull request as ready for review February 25, 2025 13:39
@mjmac mjmac requested review from a team as code owners February 25, 2025 13:39
@daltonbohning (Contributor) left a comment:
ftest LGTM but does this need Features: telemetry?

@mjmac (Contributor, Author) commented Feb 25, 2025

> ftest LGTM but does this need Features: telemetry?

The patch should basically be a no-op until we start building against the newer version of Mercury. Can you see if any of the telemetry tests were run as part of the usual PR set?

@daltonbohning (Contributor) replied:

> ftest LGTM but does this need Features: telemetry?
>
> The patch should basically be a no-op until we start building against the newer version of Mercury. Can you see if any of the telemetry tests were run as part of the usual PR set?

These three are tagged pr,telemetry. Maybe sufficient?

  • control/dmg_telemetry_basic.py
  • control/dmg_telemetry_io_basic.py
  • control/dmg_telemetry_nvme.py

@mjmac (Contributor, Author) commented Feb 25, 2025

> These three are tagged pr,telemetry. Maybe sufficient?
>
>   • control/dmg_telemetry_basic.py

Yeah, that one runs this sub-test: https://github.com/daos-stack/daos/blob/master/src/tests/ftest/control/dmg_telemetry_basic.py#L73 ... Pretty sure that's the one that would fail if I forgot to update the metrics list.

@mjmac mjmac requested a review from kjacque February 25, 2025 18:20
@jolivier23 (Contributor) commented:

Need to fix the merge conflict

@mjmac (Contributor, Author) commented Feb 25, 2025

> Need to fix the merge conflict

Yeah, just waiting to see if I get any other review feedback before kicking off another CI run.

Comment on lines +863 to +872
CLIENT_NET_METRICS = [
"client_net_hg_bulks",
"client_net_hg_req_recv",
"client_net_hg_extra_bulk_resp",
"client_net_hg_extra_bulk_req",
"client_net_hg_resp_sent",
"client_net_hg_resp_recv",
"client_net_hg_mr_copies",
"client_net_hg_req_sent",
"client_net_hg_active_rpcs"]
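As a hedged illustration (hypothetical helper, not part of the PR's ftest code), a minimal sanity check over this list could simply verify that every published client metric name carries the expected `client_net_hg_` prefix and that there are no duplicates:

```python
# Hypothetical sketch, not actual DAOS test code: sanity-check the metric
# name list from the PR for a consistent prefix and for duplicates.
CLIENT_NET_METRICS = [
    "client_net_hg_bulks",
    "client_net_hg_req_recv",
    "client_net_hg_extra_bulk_resp",
    "client_net_hg_extra_bulk_req",
    "client_net_hg_resp_sent",
    "client_net_hg_resp_recv",
    "client_net_hg_mr_copies",
    "client_net_hg_req_sent",
    "client_net_hg_active_rpcs"]


def find_bad_names(names, prefix="client_net_hg_"):
    """Return the names that do not start with the expected prefix."""
    return [name for name in names if not name.startswith(prefix)]


# All names use the prefix, and none are repeated.
assert not find_bad_names(CLIENT_NET_METRICS)
assert len(set(CLIENT_NET_METRICS)) == len(CLIENT_NET_METRICS)
```

A check like this would catch a typo in the list without depending on any runtime counter values.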
A reviewer (Contributor) commented:

Isn't there supposed to be a test that collects the metrics and verifies the values against a certain workload?

@mjmac (Contributor, Author) replied:
IMO that's kind of a low-value test in that it's really testing Mercury instead of DAOS code. Trying to determine "correct" counter values at this level also seems pretty error-prone and likely to introduce test flakiness.

The reviewer (Contributor) replied:

Sure, I agree this is probably an internal test for those counters. I don't know if such a test exists. @soumagne, do you know? If those counters are being exposed, it would be a good idea to verify they are correct.

What I don't get is what flakiness has to do with counters. If it's a deterministic workload, the results should have no variance and should not be error-prone, so I'm not sure I get what you mean there.

@soumagne (Collaborator) replied:

We currently do not have a test for these.

@mjmac (Contributor, Author) replied Feb 26, 2025:

> What I don't get is what flakiness has to do with counters. If it's a deterministic workload, the results should have no variance and should not be error-prone, so I'm not sure I get what you mean there.

I suppose that what I meant was more "brittle" than "flaky". Say we add a test today that writes 1MB of data to a container using IOR. We run the test, and then read the counters for the IOR client process. At the end of it we should see 0 active RPCs, 3 bulks, 13 req sent and 13 resp received (I'm just making numbers up for illustration).

The values of these counters are determined by a number of factors, including: object class, protocol query, collective RPCs enabled/disabled, etc. Now imagine that we change some of the client startup logic to somehow reduce or eliminate the protocol query stuff, or some other change is made to improve collective RPC efficiency, etc. And now the RPC count is off and the test will fail whenever it's run (probably not on the PR that made the change because that one wasn't directly modifying telemetry).

There are a lot of variables that will affect the final counts, and my concern is that this hypothetical test would wind up in the pile of perpetually-broken weekly tests that just get ignored and don't actually improve product quality.

We already have unit tests for the telemetry library itself. IMO, a better approach would be to implement some kind of unit test in Mercury itself to verify that counters are incremented as expected for a deterministic synthetic workload.
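To illustrate the distinction being argued here, a less brittle alternative to exact-count assertions is to check invariants that should hold for any quiesced client, regardless of object class or protocol-query behavior. This is a hypothetical sketch with invented values, not DAOS test code, and it assumes a workload consisting only of completed two-way RPCs (so one response is received per request sent):

```python
# Hypothetical helper with invented counter values, for illustration only.
# Rather than asserting exact counts (which shift with object class, protocol
# queries, collective RPCs, etc.), assert invariants of a quiesced client.
def check_counter_invariants(counters):
    """Raise AssertionError if post-workload counters look inconsistent."""
    # No RPCs should still be in flight once the workload has completed.
    assert counters["client_net_hg_active_rpcs"] == 0, "RPCs still in flight"
    # Assuming only two-way RPCs: one response received per request sent.
    assert (counters["client_net_hg_req_sent"]
            == counters["client_net_hg_resp_recv"]), "requests without responses"
    # Counters only ever increment, so none should be negative.
    assert all(v >= 0 for v in counters.values()), "negative counter value"


# Invented example values, mirroring the illustration above:
check_counter_invariants({
    "client_net_hg_active_rpcs": 0,
    "client_net_hg_req_sent": 13,
    "client_net_hg_resp_recv": 13,
    "client_net_hg_bulks": 3,
})
```

Such invariants survive changes to startup logic or RPC efficiency, whereas hard-coded counts would not.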

The reviewer (Contributor) replied:

> I suppose that what I meant was more "brittle" than "flaky". […]

Well, yes, it could be a PR test in the unit test stage, even with NLT, because the test can be something small. Anyway, I don't disagree with you. But I am concerned that it sounds like we are adding an untested feature and exposing it to users without confirming the metrics are accurate.

Another contributor replied:

That's a fair concern. Do we have sanity tests for any of our metrics outside of the basic tests that they are produced?

@kjacque previously approved these changes Feb 26, 2025
Feature: telemetry
Signed-off-by: Michael MacDonald <[email protected]>
@mjmac mjmac merged commit 6f8676f into master Feb 27, 2025
57 checks passed
@mjmac mjmac deleted the mjmac/DAOS-17006 branch February 27, 2025 19:08