DAOS-17006 cart: Publish Mercury counters as metrics (#15870) #15963
Conversation
Ticket title is 'Expose Mercury perf counters as DAOS metrics'
@soumagne: FYI, this is a PR suitable for master... I have chosen not to include the Mercury patch, but I can adjust that if desired. Do you anticipate making a 2.4.1 release prior to DAOS 2.8.0? If so, then the new metrics will "just work" when DAOS is built against it.
OK, that works, thanks! Yes, I will be making a new release for DAOS 2.8.
Force-pushed e24eacd to 24ad6ce
When Mercury has been built with diagnostic RPC counters enabled, CaRT will periodically republish the counters as DAOS telemetry for consumption by monitoring infrastructure. NB: Requires Mercury > 2.4.0.

Change-Id: I0232d0da8007374fd1d28d395c65544c7fa57bc1
Signed-off-by: Michael MacDonald <[email protected]>
Co-authored-by: Jeff Olivier <[email protected]>
Co-authored-by: Nicholas Murphy <[email protected]>
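As a rough illustration of the pattern the commit message describes (the real implementation lives in CaRT's C code; all names here are hypothetical stand-ins, not the actual DAOS or Mercury APIs), a periodic counter republisher can be sketched as:

```python
import threading


def republish_counters(read_counters, publish_metric, interval_s=10.0, stop_event=None):
    """Poll transport-level counters and republish each one as a metric.

    read_counters:  callable returning {counter_name: int}; stands in for
                    Mercury's diagnostic RPC counters (hypothetical hook).
    publish_metric: callable(name, value) recording the value; stands in
                    for the DAOS telemetry API (hypothetical hook).
    """
    stop_event = stop_event or threading.Event()
    while not stop_event.is_set():
        for name, value in read_counters().items():
            publish_metric(name, value)
        # wait() doubles as an interruptible sleep, so shutdown is prompt
        stop_event.wait(interval_s)


# Demo: run a single polling pass with fake counter values
published = {}
stop = threading.Event()


def fake_counters():
    stop.set()  # stop after this pass so the demo terminates
    return {"hg_req_sent": 13, "hg_active_rpcs": 0}


republish_counters(fake_counters, published.__setitem__,
                   interval_s=0.01, stop_event=stop)
print(published)
```

The stop event rather than a bare `time.sleep()` keeps shutdown responsive, which matters for a background publisher thread.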
Force-pushed 24ad6ce to 76bf462
ftest LGTM, but does this need Features: telemetry?
The patch should basically be a no-op until we start building against the newer version of Mercury. Can you see if any of the telemetry tests were run as part of the usual PR set?
These three are pr,telemetry. Maybe sufficient?
Yeah, that one runs this sub-test: https://github.com/daos-stack/daos/blob/master/src/tests/ftest/control/dmg_telemetry_basic.py#L73 ... Pretty sure that's the one that would fail if I forgot to update the metrics list.
Need to fix the merge conflict
Yeah, just waiting to see if I get any other review feedback before kicking off another CI run.
CLIENT_NET_METRICS = [
    "client_net_hg_bulks",
    "client_net_hg_req_recv",
    "client_net_hg_extra_bulk_resp",
    "client_net_hg_extra_bulk_req",
    "client_net_hg_resp_sent",
    "client_net_hg_resp_recv",
    "client_net_hg_mr_copies",
    "client_net_hg_req_sent",
    "client_net_hg_active_rpcs"]
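For context, the existence-style check the ftest referenced above performs can be sketched standalone like this (a minimal illustration, not the actual ftest code; the helper name is invented):

```python
CLIENT_NET_METRICS = [
    "client_net_hg_bulks",
    "client_net_hg_req_recv",
    "client_net_hg_extra_bulk_resp",
    "client_net_hg_extra_bulk_req",
    "client_net_hg_resp_sent",
    "client_net_hg_resp_recv",
    "client_net_hg_mr_copies",
    "client_net_hg_req_sent",
    "client_net_hg_active_rpcs",
]


def missing_metrics(reported, expected=CLIENT_NET_METRICS):
    """Return the expected metric names absent from what telemetry reported.

    A test like dmg_telemetry_basic fails if this list is non-empty, which
    is exactly the failure mode mentioned below when the metrics list in
    the test is not kept in sync with the code.
    """
    reported = set(reported)
    return [name for name in expected if name not in reported]
```

This only verifies that the counters are published, not that their values are correct, which is the distinction debated in the following comments.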
Isn't there supposed to be a test that collects the metrics and verifies the values against a certain workload?
IMO that's kind of a low-value test in that it's really testing Mercury instead of DAOS code. Trying to determine "correct" counter values at this level also seems pretty error-prone and likely to introduce test flakiness.
Sure, I agree this is probably an internal test for those counters. I don't know if such a test exists. @soumagne do you know? If those are being exposed, it would be a good idea to verify they are correct.
What I don't get is what flakiness has to do with counters. If it's a deterministic workload, the results should have no variance and should not be error-prone, so I'm not sure I get what you mean there.
We currently do not have a test for these.
> What I don't get is what flakiness has to do with counters. If it's a deterministic workload, the results should have no variance and should not be error-prone, so I'm not sure I get what you mean there.
I suppose that what I meant was more "brittle" than "flaky". Say we add a test today that writes 1MB of data to a container using IOR. We run the test, and then read the counters for the IOR client process. At the end of it we should see 0 active RPCs, 3 bulks, 13 req sent and 13 resp received (I'm just making numbers up for illustration).
The values of these counters are determined by a number of factors, including: object class, protocol query, collective RPCs enabled/disabled, etc. Now imagine that we change some of the client startup logic to somehow reduce or eliminate the protocol query stuff, or some other change is made to improve collective RPC efficiency, etc. And now the RPC count is off and the test will fail whenever it's run (probably not on the PR that made the change because that one wasn't directly modifying telemetry).
There are a lot of variables that will affect the final counts, and my concern is that this hypothetical test would wind up in the pile of perpetually-broken weekly tests that just get ignored and don't actually improve product quality.
We already have unit tests for the telemetry library itself. IMO, a better approach would be to implement some kind of unit test in Mercury itself to verify that counters are incremented as expected for a deterministic synthetic workload.
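One middle ground between exact-value checks and no check at all, offered purely as a sketch (counter names are illustrative, not an exact match for the real metric names), is to assert invariants that should hold for any drained workload regardless of protocol-level changes:

```python
def counter_invariant_violations(counters):
    """Sanity-check Mercury-style RPC counters with invariants rather than
    exact values, so unrelated protocol changes don't break the test.

    counters: dict of counter name -> int, read after the workload finished.
    Returns a list of human-readable problem descriptions (empty = healthy).
    """
    # No counter should ever go negative
    problems = [f"{name} is negative ({value})"
                for name, value in counters.items() if value < 0]

    # After the client has drained, no RPCs should still be in flight
    if counters.get("hg_active_rpcs", 0) != 0:
        problems.append("RPCs still active after the workload drained")

    # A client cannot receive more responses than requests it sent
    if counters.get("hg_resp_recv", 0) > counters.get("hg_req_sent", 0):
        problems.append("more responses received than requests sent")

    return problems
```

Checks like these stay green when the absolute counts shift (protocol query changes, collective RPC tuning, object class differences) while still catching leaks and sign bugs.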
Well, yeah, it will be a PR test in the unit test stage, even with NLT, because the test can be something small.
Anyway, I do not disagree with you. But I am concerned that it sounds like we are adding an untested feature and exposing it to users without confirming the metrics are accurate.
That's a fair concern. Do we have sanity tests for any of our metrics outside of the basic tests that they are produced?
Signed-off-by: Michael MacDonald <[email protected]>
Feature: telemetry
Signed-off-by: Michael MacDonald <[email protected]>