Breakdown metrics #1271
Conversation
What's the status of this PR?
[float]
[[metrics-application]]
=== Built-in application metrics
100% stolen from the java docs.
Ready for review.

Pushed a few small changes:
Left some initial comments to discuss
docs/metrics.asciidoc (outdated)
Fields:

* `sum.us`: The sum of all transaction durations in ms since the last report (the delta)
More of an observation: the field used is `sum.us` - is the `u` in `.us` intended to be `μ` (mu)? If so, `μs` is the unit for microseconds, but the description says ms, milliseconds. Is that correct?
Hahh - I looked into this and indeed there was an issue here.

> is the `u` in `.us` intended to be `μ` (mu)?

Yes!

> but the description says ms, milliseconds. Is that correct?

No, it wasn't. This was a copy from the java agent docs, and in fact we sent the data in milliseconds in the PR originally. I looked into more docs and this clearly should be microseconds (e.g. the Go agent docs have this correct). I fixed the code, so now we send it in microseconds, and I also fixed this part of the doc - all microseconds now.

Side-note: since the breakdown metrics are used to calculate the time spent in different span types by percentage, I think the unit will not break the feature as long as all the data is in the same unit (which was the case), so likely that's why it was not an issue before - it'd be nice to see if all agents send it in the correct unit :)
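For illustration only (hypothetical values, not the agent's actual code), the conversion the fix boils down to, and why a consistent unit keeps the percentages intact:

```csharp
using System;

// The agent measures durations in milliseconds, but the `sum.us` field expects microseconds.
double transactionDurationMs = 12.5;
double sumUs = transactionDurationMs * 1_000; // convert ms -> µs before reporting

// Breakdown is rendered as "time spent by span type" in percent, so as long as every
// value uses the same unit, the ratio (and therefore the chart) is unchanged.
double dbSelfTimeMs = 5.0;
double percentage = dbSelfTimeMs / transactionDurationMs * 100; // 40% whether both are ms or µs

Console.WriteLine($"sum.us = {sumUs}, db self-time = {percentage}%");
```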
docs/metrics.asciidoc (outdated)
Fields:

* `sum.us`: The sum of all span self-times in ms since the last report (the delta)
Same observation as above: the field used is `sum.us` - is the `u` in `.us` intended to be `μ` (mu)? If so, `μs` is the unit for microseconds, but the description says ms, milliseconds. Is that correct?
Same as above.
Rebased, ready for another review round.
I've gone through in some more detail.

The gathering of breakdown metrics looks like it could be performed in `PayloadSenderV2`, when the transaction is queued. `PayloadSenderV2` would take `BreakdownMetricsProvider` in its ctor. I know I previously suggested hooking into the `.Ended` event as has been done here, but in retrospect, performing it in `PayloadSenderV2` would actually be cleaner.

An aside: `PayloadSenderV2` is a little bit of a misnomer, as it's performing a few different functions:

- holding a queue of items to send to APM server
- filtering items that should be sent
- serializing items to be sent
- actually sending items to APM server

Might be useful to refactor this in the next major version.
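A rough sketch of that suggestion, purely to make the shape concrete - the constructor and the `CaptureTransaction` call here are assumptions, not the agent's actual code:

```csharp
// Sketch only: "PayloadSenderV2 takes BreakdownMetricsProvider in its ctor and
// gathers breakdown metrics when the transaction is queued".
internal class PayloadSenderV2 : IPayloadSender
{
	private readonly BreakdownMetricsProvider _breakdownMetricsProvider;

	public PayloadSenderV2(BreakdownMetricsProvider breakdownMetricsProvider) =>
		_breakdownMetricsProvider = breakdownMetricsProvider;

	public void QueueTransaction(ITransaction transaction)
	{
		// Gather the breakdown metrics for this transaction at queue time...
		_breakdownMetricsProvider.CaptureTransaction(transaction);
		// ...then hand the transaction to the existing queue -> serialize -> send pipeline.
	}

	public void QueueError(IError error) { /* unchanged */ }
	public void QueueMetrics(IMetricSet metrics) { /* unchanged */ }
	public void QueueSpan(ISpan span) { /* unchanged */ }
}
```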
src/Elastic.Apm/Model/Transaction.cs (outdated)
internal readonly TraceState _traceState;

internal readonly ConcurrentDictionary<(string, string), SpanTimer> SpanTimings = new();
Introducing a type for the key would make it clear what each part of the tuple relates to
Done
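For readers following along, a minimal sketch of what such a key type could look like - the name mirrors the rename suggested further down in the review, and the members are assumptions rather than the PR's actual struct:

```csharp
using System;

// Illustrative only: encapsulates the (type, subtype) tuple so the dictionary key is self-describing.
internal readonly struct SpanTimerKey : IEquatable<SpanTimerKey>
{
	public SpanTimerKey(string type, string subType)
	{
		Type = type;
		SubType = subType;
	}

	public string Type { get; }
	public string SubType { get; }

	public bool Equals(SpanTimerKey other) => Type == other.Type && SubType == other.SubType;
	public override bool Equals(object obj) => obj is SpanTimerKey other && Equals(other);
	public override int GetHashCode() => (Type, SubType).GetHashCode();
}

// The field then becomes:
// internal readonly ConcurrentDictionary<SpanTimerKey, SpanTimer> SpanTimings = new();
```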
{
	if (item != null)
	{
		if (!WildcardMatcher.IsAnyMatch(currentConfigSnapshot.DisableMetrics, BreakdownMetricsProvider.SpanSelfTime))
This makes the assumption that any added metrics providers should be controlled by the span self-time metric. Whilst this holds true now, as the breakdown provider is the only one added in the ctor, maybe the string(s) to match should be a property on an `IMetricsProvider`, to allow each individual provider to determine if it should be added.
Yeah, I agree that this isn't ok.

> maybe the string(s) to match should be a property on an IMetricsProvider to allow each individual provider to determine if it should be added.

Problem is we'd need to address 2 scenarios with this:

1. The `BreakdownMetricsProvider` and everything passed to the ctor is already created - so yes, in that case we could just query the string(s) from `IMetricsProvider` and it'd indeed help.
2. Every other `IMetricsProvider` is not instantiated at this point, so we can't query the string(s) list for those. In that case what we do is check the `DisableMetrics` list, and if one of those is disabled we don't even instantiate it.

What I did is that I moved the check to `AgentComponents` - that's where we instantiate the `BreakdownMetricsProvider`.

Long term we could come up with an infrastructure to better handle this - I think it'd be fine to introduce the string list you suggest, but that'd also mean we'd need to instantiate all the `IMetricsProvider` implementations, and if they are not enabled based on the string(s) list, we'd just not add them to the list and let them be GCed. Nevertheless, that'd be a longer discussion, so maybe we could do that in a separate PR.
Update: I decided to move on and implement it nevertheless. I added `IMetricsProvider.IsEnabled` - so we instantiate each `IMetricsProvider` implementation, but only add them if they are enabled. This changed a few things around metrics collection, but in my opinion this is now much cleaner - checking whether a specific `IMetricsProvider` is enabled is now the same for all `IMetricsProvider` implementations, and we also got rid of the huge checks in `MetricsCollector.ctor` - instead each `IMetricsProvider` now does the check for itself.
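A sketch of the shape this gives the metrics collection - `IsEnabled`'s exact signature and the surrounding names are assumptions, only the idea comes from the comment above:

```csharp
// Sketch: every provider is instantiated, but decides for itself whether it is enabled,
// e.g. by matching its metric name(s) against the DisableMetrics configuration.
internal interface IMetricsProvider
{
	bool IsEnabled(IReadOnlyList<WildcardMatcher> disabledMetrics);
	IEnumerable<MetricSet> GetSamples();
}

// In the collector, the per-provider checks replace the big checks that used to live in its ctor:
// _metricsProviders = allProviders.Where(p => p.IsEnabled(config.DisableMetrics)).ToList();
```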
++ it'd be possible to build on this for the custom metrics scenarios.
There are multiple places where we can capture the metrics - basically anything works after the transaction ends and before it goes out to APM server. In what sense do you think it would be cleaner? I think it's worth looking at how the data flows. In the PR currently: [...]. If we move it to `PayloadSenderV2`: [...] - which to me doesn't sound ok. And some other issues: [...]
I opened #1341 which moves [...]. Otherwise it's ready for another round.
The `PayloadSenderV2` is doing a lot more than sending payloads; it's more of a processing pipeline for payloads, which could also include calculating breakdown metrics for a particular payload, transactions.

If someone replaces the payload sender [...] - but the public interface doesn't define a contract for this behaviour, only that payloads are queued:

```csharp
public interface IPayloadSender
{
	void QueueError(IError error);
	void QueueMetrics(IMetricSet metrics);
	void QueueSpan(ISpan span);
	void QueueTransaction(ITransaction transaction);
}
```

I'm trying to think about the architecture of the agent in terms of how it is evolving and where we might want to go with it. Happy to discuss here, though it might be easier to chat over zoom 😃
src/Elastic.Apm/Api/Tracer.cs (outdated)
_logger.Debug()?.Log("Starting {TransactionValue}", retVal);
return retVal;

void RetValOnEnded(object sender, EventArgs e)
minor nitpick: naming - suggest renaming `RetValOnEnded(object sender, EventArgs e)` to `CaptureBreakdownMetrics(object sender, EventArgs e)`.
This part got removed.
public IEnumerable<MetricSet> GetSamples()
{
	var retVal = new List<MetricSet>();
could avoid resizing the internal array by allocating a capacity based on the count of `_itemsToSend` (or 1000)
Yes, makes sense - done.
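Presumably the change looks roughly like this - a sketch built from the snippets quoted in this thread, not the exact commit:

```csharp
public IEnumerable<MetricSet> GetSamples()
{
	List<MetricSet> retVal;
	lock (_lock)
	{
		// Pre-size the list so AddRange never has to grow the backing array;
		// capped at 1000 because that's the most we take below anyway.
		retVal = new List<MetricSet>(Math.Min(_itemsToSend.Count, 1000));
		retVal.AddRange(_itemsToSend.Take(1000));
		_itemsToSend.Clear();
	}
	return retVal;
}
```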
lock (_lock)
{
	retVal.AddRange(_itemsToSend.Take(1000));
	_itemsToSend.Clear();
should `_itemsToSend` be cleared? What if it still contains items to send after taking 1000?
> should _itemsToSend be cleared? What if it still contains items to send after taking 1000?

The spec says: "...agents must cap the total number of metricsets. A reasonable default seems to be [1000]. When the limit is reached, agents should log a warning...". So my understanding is that everything above 1K should be thrown away, so it's fine to clear those. Also worth noting that in this case we'd try to show time spent as a percentage across 1K groups, so at that point I'd say the visualization can also be challenging, depending on the distribution.

On the other hand, thinking about this: we should not even collect those in `_itemsToSend`, plus the warning was also missing. So I moved the handling of this into `CaptureTransaction(Transaction)`: we only collect up to 1K items and if 1K is reached we log.

I also included a test to cover this.
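A sketch of the cap-and-warn behaviour described above - the field names and the logger call shape are assumptions (modelled on the `Debug()?.Log` style quoted elsewhere in this PR); only the 1K cap and the log-once intent come from the comments and commits:

```csharp
// Sketch only: cap the collected metricsets at 1000 and warn once instead of per dropped item.
public void CaptureTransaction(Transaction transaction)
{
	lock (_lock)
	{
		if (_itemsToSend.Count >= 1000)
		{
			if (!_limitWarningLogged)
			{
				_logger.Warning()?.Log("Breakdown metricset limit of 1000 reached, further metricsets are dropped");
				_limitWarningLogged = true;
			}
			return;
		}

		// ...aggregate the transaction's span self-times into _itemsToSend as before...
	}
}
```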
/// <summary>
/// Encapsulates type and subtype
/// </summary>
internal struct SpanTypeAndSubtype
minor nitpick: naming - suggest renaming `SpanTypeAndSubtype` to `SpanTimerKey`.
Done.
That's true and everyone (rightfully) criticises it, yet we end up putting more stuff into it. So this is again a reason why I'd not put it in there - but the main reason I would not put it there is that I think breakdown metrics creation is really not the job of the payload sender.

That's true and unfortunate, but why do we need to make yet another component part of this list? I still don't see how it's related to the payload sender. Breakdown is a metric created by spans and transactions.

Ok, let's zoom - maybe I just don't get the core of your motivation. I think the cleanest and simplest way was to just call [...]
Zoom summary: we agreed to move the SpanTiming calculation before `_payloadSender.QueueSpan`.
@@ -20,7 +20,10 @@ internal class MockPayloadSender : IPayloadSender
{
	private readonly List<IError> _errors = new List<IError>();
	private readonly List<Func<IError, IError>> _errorFilters = new List<Func<IError, IError>>();
	private readonly object _lock = new object();
	private readonly object _spanLock = new object();
I saw tests failing with such logs - this specific test was a new one which I introduced in the PR. I wasn't able to repro this locally:

```
System.ArgumentException : Destination array was not long enough. Check the destination index, length, and the array's lower bounds.
Parameter name: destinationArray

Stack Trace:
   at System.Array.Copy(Array sourceArray, Int32 sourceIndex, Array destinationArray, Int32 destinationIndex, Int32 length, Boolean reliable)
   at System.Collections.Generic.List`1.CopyTo(T[] array, Int32 arrayIndex)
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at Elastic.Apm.Tests.BreakdownTests.TransactionWithSpans() in C:\Users\jenkins\workspace\net_apm-agent-dotnet-mbp_PR-1271\apm-agent-dotnet\test\Elastic.Apm.Tests\BreakdownTests.cs:line 62
```

This exception isn't new - I already saw it on failing tests. I think the root cause of those failing tests is that the agent still puts an event into the `MockPayloadSender` while it makes the copy of the list. To address this I introduced locks in the `MockPayloadSender`.
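In spirit, the locking added to `MockPayloadSender` is something like this - member names are assumptions, and the real class has more queues than just transactions:

```csharp
// Sketch: writers take the lock to add, readers take the lock to copy, so a concurrent
// Add can no longer invalidate the copy (the List(IEnumerable) ctor that threw above).
private readonly object _transactionLock = new object();
private readonly List<ITransaction> _transactions = new List<ITransaction>();

public void QueueTransaction(ITransaction transaction)
{
	lock (_transactionLock) _transactions.Add(transaction);
}

public IReadOnlyList<ITransaction> Transactions
{
	get { lock (_transactionLock) return new List<ITransaction>(_transactions); }
}
```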
Left a comment around locking, rest looks OK.
}
else
{
	var stopWatch = Stopwatch.StartNew();
	while (_transactions.Count < count && signalled)
		lock (_transactionLock)
If I understand this correctly, the lock around waiting for `x` count of transactions can't succeed because adding transactions is also within a lock, meaning no transactions can be added whilst waiting for the count to be reached.

I think the locks around the `WaitForX()` count can be removed, and that locking is needed only on adding items or copying to an immutable collection.
Oh, there is an `AutoResetEvent.WaitOne()` within the lock 🤦. I really wonder why no test failed, everything was green both locally and in CI. Yeah, I also think this should fail if the first thread calls `WaitForX()` and then another one tries to add to the queue.

I think locking on the reading of `_transactions.Count` and `spans.Count` is needed - otherwise the number which the thread calling `WaitForX()` sees can be incorrect if another thread adds to the queue - but the lock should be limited to the call to `.Count`, and that finishes immediately.

I updated the PR accordingly.
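Roughly, the fixed waiting loop then looks like this - a sketch, where the wait handle and the helper are assumptions rather than the test's actual members:

```csharp
// Sketch: WaitOne happens outside any lock, and the lock is held only for the Count read,
// so a producer thread can always add to the queue while another thread is waiting.
var signalled = true;
while (CurrentTransactionCount() < count && signalled)
	signalled = _waitHandle.WaitOne(TimeSpan.FromMilliseconds(100));

int CurrentTransactionCount()
{
	lock (_transactionLock) return _transactions.Count;
}
```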
Add breakdown docs
Also review feedback: some naming changes; handling the 1K breakdown metricset limit (including test)
Only log once
Adapting assets to the changed metrics API
Co-authored-by: Brandon Morelli <[email protected]>
Remove lock around `WaitHandle.WaitOne()`
Move SpanTiming calculation before `_payloadSender.QueueSpan`.
Solves #227
TODOs:

- Test with non-sampled transactions - update: not needed; non-sampled transactions and spans also generate these metrics - according to the spec that's preferred.