Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

event.success_count metric should use representative count #388

Closed
lahsivjar opened this issue Oct 22, 2024 · 6 comments · Fixed by elastic/elasticsearch#119995
Closed
Assignees
Labels
bug Something isn't working

Comments

@lahsivjar
Copy link
Contributor

lahsivjar commented Oct 22, 2024

event.success_count attribute is populated using ingest pipeline defined in apm-data plugin. The attribute currently is populated using 1 for success spans and 0 for failed spans, however, when we aggregate this metric, we use representative count (ref) which should be more apt for what we do with this metric.

The issue is about fixing the ingest pipeline to use representative count if defined, 1 otherwise in the ingest pipeline for cases when the event.outcome is success.

Related to: elastic/opentelemetry-lib#112

@raultorrecilla
Copy link

moving this to it106

@raultorrecilla
Copy link

was moved to it-107

@rubvs
Copy link
Contributor

rubvs commented Feb 19, 2025

I initially thought that we can make this change in two different places:

  • In the ES pipeline, as outlined in this PR.
  • In apm-data directly, after we add the representative count for events (Span, Transaction), see Code.

However, after some investigation, event.success_count is not really a concern of the producer of the data:

  • It's an optimisation for consumers of the data.
  • Having it in ES moves the logic closer to the consumer.

So making the change in the ES pipeline, rather than apm-data, seems like the correct thing to do.

@rubvs
Copy link
Contributor

rubvs commented Feb 20, 2025

Okay. So this PR is almost merged, finally.

There is a few mental headaches that I had to fully understand all the components at play. Maybe my thinking here is overcomplicated, but I'll outline them here for completeness anyways:

  • As outlined in the PR desc.
    • The representative count is calc in apm-data, see Code
    • From this code the representative count will always be set. It defaults to 1.
  • Carson's comment, stating that we should check that the representative count is set, was an indication to me that there is a divid between data being manipulated in apm-data vs in the ES pipeline.
    • To make things more confusing, the pipeline is in the apm-data plugin.
    • This indicates (to me at least) an abstraction on the apm-data level. Which is all about data manipulation. Makes sense.
    • So then why do we calc the representative count in the library, then later do a "hard copy" of this value into another field.
      • In another codebase.
      • With a different programming language.
      • With a different name completely.
    • Our code is literally saying that the amount of successful events are equal to the representative count.
    • While doing this, we have to do extra logical checks. In the pipeline, which has already been done in apm-data.
      • This is redundant and feels like a leak. Ideas are crossing boundaries.
      • I'm a huge fan of the Principle of Locality in Physics and Computer Science.
  • This finally led to me having a deeper look at making the change in apm-data directly, which lead to the comment above.

In summary, things would have been much easier if I understood the "ownership" of fields in our data models. Since ownership can been seen as "maintaining locality". An idea Rust leverages to the extreme. In this case, event.Transaction.RepresentationCount and event.Span.RepresentationCount are owned by the producer. They are calc by apm-data and used throughout our services, such as aggregation. While event.Event.SuccessCount is only really used in Kibana, and is owned by the client so to speak.

  • I am still not sure when, where, and why exactly event.success_count is used?
  • In fact, the purpose of the entire event.Event data-structure has me confused now. To my understanding this Event inside and APMEvent is only for ES?

With all this being said, it becomes very clear why this bug originated to start with. This was not as simple as I expected. Moreover, once this PR is merged into ES. I have to go make another PR in APM Server to compensate for this, see TestIngestPipelineEventSuccessCount, before closing this issue. So there is a 3rd codebase involved.

@simitt
Copy link
Contributor

simitt commented Feb 21, 2025

@rubvs there are probably also some historical reasons; but one reason I would highlight for the split between logic implemented in the apm-server (through apm-data) vs. ingest pipelines in Elasticsearch is backwards compatibility and the option to potentially derive field values from older apm-server instances, where relevant for the UI.
The product compatibility matrix shows that a newer Elasticsearch version is expected to work with an older APM Server version of the same major version (e.g. ES 8.17 to be compatibel with APM Server 8.15). A major version spans over several years, and during this lifetime we might introduce new fields in the elastic apm protocol, have the APM Server process the fields, respect them in the ES templates and then the UI leverages them. However, maybe an older APM Server is sending data to a newer ES. In such a case, if possible, additional processing can be done in the ingest pipelines as they are shipped with ES. This comes with benefits and complications, but is one way of solving problems that might come into existence over the lifetime of a supported product.

In general, I highly encourage you to trigger conversations around why and how and where is something used during the implementation phase to ensure you have all the necessary knowledge and information to take ownership over the introduced code changes.

@rubvs
Copy link
Contributor

rubvs commented Feb 24, 2025

Created a new issue to update the unit test in APM Server after the next release, see elastic/apm-server#15838

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants