
DAOS-17006 cart: Publish Mercury counters as metrics (#15870) #15963

Merged: 3 commits into master on Feb 27, 2025

Conversation

@mjmac (Contributor) commented Feb 24, 2025

When Mercury has been built with diagnostic RPC counters
enabled, CaRT will periodically republish the counters
as DAOS telemetry for consumption by monitoring
infrastructure. NB: Requires Mercury > 2.4.0.

Change-Id: I0232d0da8007374fd1d28d395c65544c7fa57bc1
Signed-off-by: Michael MacDonald <[email protected]>
Co-authored-by: Jeff Olivier <[email protected]>
Co-authored-by: Nicholas Murphy <[email protected]>


Ticket title is 'Expose Mercury perf counters as DAOS metrics'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-17006

@mjmac (Contributor, Author) commented Feb 24, 2025

@soumagne: FYI, this is a PR suitable for master... I have chosen not to include the Mercury patch, but I can adjust that if desired. Do you anticipate making a 2.4.1 release prior to DAOS 2.8.0? If so, then the new metrics will "just work" when DAOS is built against it.

@soumagne (Collaborator) replied:

OK, that works, thanks! Yes, I will be making a new release for DAOS 2.8.

@mjmac mjmac force-pushed the mjmac/DAOS-17006 branch 2 times, most recently from e24eacd to 24ad6ce Compare February 24, 2025 16:48
@mjmac mjmac marked this pull request as ready for review February 25, 2025 13:39
@mjmac mjmac requested review from a team as code owners February 25, 2025 13:39
@daltonbohning (Contributor) left a comment:
ftest LGTM but does this need Features: telemetry?

@mjmac (Contributor, Author) commented Feb 25, 2025

> ftest LGTM but does this need Features: telemetry?

The patch should basically be a no-op until we start building against the newer version of Mercury. Can you see if any of the telemetry tests were run as part of the usual PR set?

@daltonbohning (Contributor) replied:

> ftest LGTM but does this need Features: telemetry?
>
> The patch should basically be a no-op until we start building against the newer version of Mercury. Can you see if any of the telemetry tests were run as part of the usual PR set?

These three are tagged pr,telemetry. Maybe sufficient?

  • control/dmg_telemetry_basic.py
  • control/dmg_telemetry_io_basic.py
  • control/dmg_telemetry_nvme.py

@mjmac (Contributor, Author) commented Feb 25, 2025

> These three are tagged pr,telemetry. Maybe sufficient?
>
>   • control/dmg_telemetry_basic.py

Yeah, that one runs this sub-test: https://github.com/daos-stack/daos/blob/master/src/tests/ftest/control/dmg_telemetry_basic.py#L73 ... Pretty sure that's the one that would fail if I forgot to update the metrics list.

@mjmac mjmac requested a review from kjacque February 25, 2025 18:20
@jolivier23 (Contributor) commented:

Need to fix the merge conflict

@mjmac (Contributor, Author) commented Feb 25, 2025

> Need to fix the merge conflict

Yeah, just waiting to see if I get any other review feedback before kicking off another CI run.

Comment on lines +863 to +872
CLIENT_NET_METRICS = [
"client_net_hg_bulks",
"client_net_hg_req_recv",
"client_net_hg_extra_bulk_resp",
"client_net_hg_extra_bulk_req",
"client_net_hg_resp_sent",
"client_net_hg_resp_recv",
"client_net_hg_mr_copies",
"client_net_hg_req_sent",
"client_net_hg_active_rpcs"]
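As a hedged illustration (hypothetical helper, not part of the PR's ftest code), a minimal sanity check over this list could simply verify that every published client metric name carries the expected `client_net_hg_` prefix and that there are no duplicates:

```python
# Hypothetical sketch, not actual DAOS test code: sanity-check the metric
# name list from the PR for a consistent prefix and for duplicates.
CLIENT_NET_METRICS = [
    "client_net_hg_bulks",
    "client_net_hg_req_recv",
    "client_net_hg_extra_bulk_resp",
    "client_net_hg_extra_bulk_req",
    "client_net_hg_resp_sent",
    "client_net_hg_resp_recv",
    "client_net_hg_mr_copies",
    "client_net_hg_req_sent",
    "client_net_hg_active_rpcs"]


def find_bad_names(names, prefix="client_net_hg_"):
    """Return the names that do not start with the expected prefix."""
    return [name for name in names if not name.startswith(prefix)]


# All names use the prefix, and none are repeated.
assert not find_bad_names(CLIENT_NET_METRICS)
assert len(set(CLIENT_NET_METRICS)) == len(CLIENT_NET_METRICS)
```

A check like this would catch a typo in the list without depending on any runtime counter values.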
A reviewer (Contributor) commented:

Isn't there supposed to be a test that collects the metrics and verifies the values against a certain workload?

@mjmac (Contributor, Author) replied:
IMO that's kind of a low-value test in that it's really testing Mercury instead of DAOS code. Trying to determine "correct" counter values at this level also seems pretty error-prone and likely to introduce test flakiness.

The reviewer (Contributor) replied:

Sure, I agree this is probably an internal test for those counters. I don't know if such a test exists. @soumagne, do you know? If those counters are being exposed, it would be a good idea to verify they are correct.

What I don't get is what flakiness has to do with counters. If it's a deterministic workload, the results should have no variance and should not be error-prone, so I'm not sure I get what you mean there.

@soumagne (Collaborator) replied:

We currently do not have a test for these.

@mjmac (Contributor, Author) replied Feb 26, 2025:

> What I don't get is what flakiness has to do with counters. If it's a deterministic workload, the results should have no variance and should not be error-prone, so I'm not sure I get what you mean there.

I suppose that what I meant was more "brittle" than "flaky". Say we add a test today that writes 1MB of data to a container using IOR. We run the test, and then read the counters for the IOR client process. At the end of it we should see 0 active RPCs, 3 bulks, 13 req sent and 13 resp received (I'm just making numbers up for illustration).

The values of these counters are determined by a number of factors, including: object class, protocol query, collective RPCs enabled/disabled, etc. Now imagine that we change some of the client startup logic to somehow reduce or eliminate the protocol query stuff, or some other change is made to improve collective RPC efficiency, etc. And now the RPC count is off and the test will fail whenever it's run (probably not on the PR that made the change because that one wasn't directly modifying telemetry).

There are a lot of variables that will affect the final counts, and my concern is that this hypothetical test would wind up in the pile of perpetually-broken weekly tests that just get ignored and don't actually improve product quality.

We already have unit tests for the telemetry library itself. IMO, a better approach would be to implement some kind of unit test in Mercury itself to verify that counters are incremented as expected for a deterministic synthetic workload.
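To illustrate the distinction being argued here, a less brittle alternative to exact-count assertions is to check invariants that should hold for any quiesced client, regardless of object class or protocol-query behavior. This is a hypothetical sketch with invented values, not DAOS test code, and it assumes a workload consisting only of completed two-way RPCs (so one response is received per request sent):

```python
# Hypothetical helper with invented counter values, for illustration only.
# Rather than asserting exact counts (which shift with object class, protocol
# queries, collective RPCs, etc.), assert invariants of a quiesced client.
def check_counter_invariants(counters):
    """Raise AssertionError if post-workload counters look inconsistent."""
    # No RPCs should still be in flight once the workload has completed.
    assert counters["client_net_hg_active_rpcs"] == 0, "RPCs still in flight"
    # Assuming only two-way RPCs: one response received per request sent.
    assert (counters["client_net_hg_req_sent"]
            == counters["client_net_hg_resp_recv"]), "requests without responses"
    # Counters only ever increment, so none should be negative.
    assert all(v >= 0 for v in counters.values()), "negative counter value"


# Invented example values, mirroring the illustration above:
check_counter_invariants({
    "client_net_hg_active_rpcs": 0,
    "client_net_hg_req_sent": 13,
    "client_net_hg_resp_recv": 13,
    "client_net_hg_bulks": 3,
})
```

Such invariants survive changes to startup logic or RPC efficiency, whereas hard-coded counts would not.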

The reviewer (Contributor) replied:

> I suppose that what I meant was more "brittle" than "flaky". […]

Well, yes, it could be a PR test in the unit test stage, even with NLT, because the test can be something small. Anyway, I don't disagree with you. But I am concerned that it sounds like we are adding an untested feature and exposing it to users without confirming the metrics are accurate.

Another contributor replied:

That's a fair concern. Do we have sanity tests for any of our metrics outside of the basic tests that they are produced?

@kjacque previously approved these changes Feb 26, 2025
Feature: telemetry
Signed-off-by: Michael MacDonald <[email protected]>
@mjmac mjmac merged commit 6f8676f into master Feb 27, 2025
57 checks passed
@mjmac mjmac deleted the mjmac/DAOS-17006 branch February 27, 2025 19:08