`event.success_count` metric should use representative count #388

lahsivjar · 2024-10-22T16:39:20Z

event.success_count attribute is populated using ingest pipeline defined in apm-data plugin. The attribute currently is populated using 1 for success spans and 0 for failed spans, however, when we aggregate this metric, we use representative count (ref) which should be more apt for what we do with this metric.

The issue is about fixing the ingest pipeline to use representative count if defined, 1 otherwise in the ingest pipeline for cases when the event.outcome is success.

Related to: elastic/opentelemetry-lib#112

The text was updated successfully, but these errors were encountered:

raultorrecilla · 2025-01-10T17:59:58Z

moving this to it106

raultorrecilla · 2025-02-05T16:35:26Z

was moved to it-107

rubvs · 2025-02-19T15:31:29Z

I initially thought that we can make this change in two different places:

In the ES pipeline, as outlined in this PR.
In apm-data directly, after we add the representative count for events (Span, Transaction), see Code.

However, after some investigation, event.success_count is not really a concern of the producer of the data:

It's an optimisation for consumers of the data.
Having it in ES moves the logic closer to the consumer.

So making the change in the ES pipeline, rather than apm-data, seems like the correct thing to do.

rubvs · 2025-02-20T19:08:13Z

Okay. So this PR is almost merged, finally.

There is a few mental headaches that I had to fully understand all the components at play. Maybe my thinking here is overcomplicated, but I'll outline them here for completeness anyways:

As outlined in the PR desc.
- The representative count is calc in apm-data, see Code
- From this code the representative count will always be set. It defaults to 1.
Carson's comment, stating that we should check that the representative count is set, was an indication to me that there is a divid between data being manipulated in apm-data vs in the ES pipeline.
- To make things more confusing, the pipeline is in the apm-data plugin.
- This indicates (to me at least) an abstraction on the apm-data level. Which is all about data manipulation. Makes sense.
- So then why do we calc the representative count in the library, then later do a "hard copy" of this value into another field.
  - In another codebase.
  - With a different programming language.
  - With a different name completely.
- Our code is literally saying that the amount of successful events are equal to the representative count.
- While doing this, we have to do extra logical checks. In the pipeline, which has already been done in apm-data.
  - This is redundant and feels like a leak. Ideas are crossing boundaries.
  - I'm a huge fan of the Principle of Locality in Physics and Computer Science.
This finally led to me having a deeper look at making the change in apm-data directly, which lead to the comment above.

In summary, things would have been much easier if I understood the "ownership" of fields in our data models. Since ownership can been seen as "maintaining locality". An idea Rust leverages to the extreme. In this case, event.Transaction.RepresentationCount and event.Span.RepresentationCount are owned by the producer. They are calc by apm-data and used throughout our services, such as aggregation. While event.Event.SuccessCount is only really used in Kibana, and is owned by the client so to speak.

I am still not sure when, where, and why exactly event.success_count is used?
In fact, the purpose of the entire event.Event data-structure has me confused now. To my understanding this Event inside and APMEvent is only for ES?

With all this being said, it becomes very clear why this bug originated to start with. This was not as simple as I expected. Moreover, once this PR is merged into ES. I have to go make another PR in APM Server to compensate for this, see TestIngestPipelineEventSuccessCount, before closing this issue. So there is a 3rd codebase involved.

simitt · 2025-02-21T11:09:47Z

@rubvs there are probably also some historical reasons; but one reason I would highlight for the split between logic implemented in the apm-server (through apm-data) vs. ingest pipelines in Elasticsearch is backwards compatibility and the option to potentially derive field values from older apm-server instances, where relevant for the UI.
The product compatibility matrix shows that a newer Elasticsearch version is expected to work with an older APM Server version of the same major version (e.g. ES 8.17 to be compatibel with APM Server 8.15). A major version spans over several years, and during this lifetime we might introduce new fields in the elastic apm protocol, have the APM Server process the fields, respect them in the ES templates and then the UI leverages them. However, maybe an older APM Server is sending data to a newer ES. In such a case, if possible, additional processing can be done in the ingest pipelines as they are shipped with ES. This comes with benefits and complications, but is one way of solving problems that might come into existence over the lifetime of a supported product.

In general, I highly encourage you to trigger conversations around why and how and where is something used during the implementation phase to ensure you have all the necessary knowledge and information to take ownership over the introduced code changes.

rubvs · 2025-02-24T02:59:39Z

Created a new issue to update the unit test in APM Server after the next release, see elastic/apm-server#15838

lahsivjar added the bug Something isn't working label Oct 22, 2024

lahsivjar mentioned this issue Oct 22, 2024

[enrichments] event.success_count should use representative count/adjusted count for the span elastic/opentelemetry-lib#112

Closed

simitt assigned rubvs Dec 2, 2024

rubvs mentioned this issue Jan 10, 2025

apm-data: Use representative count as event.success_count if available elastic/elasticsearch#119995

Merged

rubvs closed this as completed in elastic/elasticsearch#119995 Feb 20, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`event.success_count` metric should use representative count #388

`event.success_count` metric should use representative count #388

lahsivjar commented Oct 22, 2024 •

edited

Loading

raultorrecilla commented Jan 10, 2025

raultorrecilla commented Feb 5, 2025

rubvs commented Feb 19, 2025 •

edited

Loading

rubvs commented Feb 20, 2025

simitt commented Feb 21, 2025

rubvs commented Feb 24, 2025 •

edited

Loading

event.success_count metric should use representative count #388

event.success_count metric should use representative count #388

Comments

lahsivjar commented Oct 22, 2024 • edited Loading

raultorrecilla commented Jan 10, 2025

raultorrecilla commented Feb 5, 2025

rubvs commented Feb 19, 2025 • edited Loading

rubvs commented Feb 20, 2025

simitt commented Feb 21, 2025

rubvs commented Feb 24, 2025 • edited Loading

`event.success_count` metric should use representative count #388

`event.success_count` metric should use representative count #388

lahsivjar commented Oct 22, 2024 •

edited

Loading

rubvs commented Feb 19, 2025 •

edited

Loading

rubvs commented Feb 24, 2025 •

edited

Loading