Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTEP: Recording exceptions as log based events #4333

Open
wants to merge 33 commits into
base: main
Choose a base branch
from

Conversation

lmolkova
Copy link
Contributor

@lmolkova lmolkova commented Dec 10, 2024

Related to open-telemetry/semantic-conventions#1536

Changes

Recording exceptions as span events is problematic since it

  • ties recording exceptions to tracing/sampling
  • duplicates exceptions recorded by instrumented libraries on logs
  • does not leverage log features such as typical log filtering based on severity

This OTEP provides guidance on how to record exceptions using OpenTelemetry logs focusing on minimizing duplication and providing context to reduce the noise.

If accepted, the follow-up spec changes are expected to replace existing (stable) documents:


@lmolkova lmolkova changed the title OTEP: Recording exceptions and errors with OpenTelemetry OTEP: Recording exceptions as log based events Dec 10, 2024
@pellared
Copy link
Member

pellared commented Dec 10, 2024

I think this is a related issue:

oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
@tedsuo tedsuo added the OTEP OpenTelemetry Enhancement Proposal (OTEP) label Dec 12, 2024
@lmolkova lmolkova force-pushed the exceptions-on-logs-otep branch 2 times, most recently from b06a09f to 76c7d85 Compare December 17, 2024 17:30
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
@lmolkova lmolkova force-pushed the exceptions-on-logs-otep branch from db27087 to e9f38aa Compare January 3, 2025 01:42
@carlosalberto
Copy link
Contributor

A small doubt:

If this instrumentation supports tracing, it should capture the error in the scope of the processing span.

Although (I think) it's not called out, I'm understanding exceptions should now be explicitly reported as both 1) Span.Event and 2) Log/Event? i.e. coding wise you should do this:

currentSpan.recordException(e);
logger.logRecordBuilder
    .addException(e);

Is this the case?

Copy link
Contributor

@jsuereth jsuereth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I'm very supportive. Just some nits and one mitigation I'd like to see called out/addressed.

oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
@adriangb
Copy link

adriangb commented Feb 6, 2025

In my opinion what matters is not what's possible, what matters is what the real world experience is going to be for users and backend implementers. And the reality is that this change may have a serious negative impact on both.

@cijothomas
Copy link
Member

In my opinion what matters is not what's possible, what matters is what the real world experience is going to be for users and backend implementers. And the reality is that this change may have a serious negative impact on both.

I don't think Otel ever recommended Exceptions MUST/SHOULD be reported via SpanEvents. It had conventions for reporting exception via SpanEvent and Logs. (logs convention came later than Span), but never recommended one over other.
This was also my first comment on this OTEP.

This OTEP would be the first time Otel officially makes a recommendation on the preferred way of reporting exception.

(That is my read. Happy to be corrected!)

@adriangb
Copy link

adriangb commented Feb 6, 2025

I agree with your interpretation. My point is that recording events via logs does have downsides. Right now you have to opt into it so it's okay to have to do something like the LogRecordProcessor you proposed to get that to work correctly (essentially duplicating the info). But if it becomes the recommendation and presumably eventually the default, as this OTEP is proposing, that results in a a worse user experience for both users and OTEL backend engineers.

@cijothomas
Copy link
Member

I agree with your interpretation. My point is that recording events via logs does have downsides. Right now you have to opt into it so it's okay to have to do something like the LogRecordProcessor you proposed to get that to work correctly (essentially duplicating the info). But if it becomes the recommendation and presumably eventually the default, as this OTEP is proposing, that results in a a worse user experience for both users and OTEL backend engineers.

that results in a a worse user experience for both users and OTEL backend engineers.

I think this depends on the backend/vendor.

Recording exceptions via logs has downsides. Recording exceptions via SpanEvents also has downsides too. Having no recommendation from Otel is also not good. It is definitely possible to have something in the Otel Spec for SDKs, to have a feature flag to control this based on user preference. (The feature flag, can do the conversion of exceptions from LogRecord to SpanEvents OR vice-versa, with or without duplicating.). It is also possible to have feature-flag for instrumentations, but it may be easier to have an SDK level thing that'll ensure consistent behavior, irrespective of the instrumentation used.

@lmolkova
Copy link
Contributor Author

lmolkova commented Feb 6, 2025

to @alexmojaki

With span events, it's easy for the backend to store the span and the exception together because OTLP keeps them together. If recording an exception on a span generates a separate log event then those are guaranteed to be in separate exports. For our backend at least it's very hard to then bring them together in storage.

You could bring them on the SDK level with a events-to-span-events processor. It's a great question how the span-events -> logs migration would look like and whether such processor should be provided by contrib component, individual distros or by default.

If a span is excluded by sampling, what happens to its child log events by default? Even if it's included, it will be missing the parent span describing the context of the exception. It's better than nothing, but this doesn't seem like a real solution to exceptions being lost by sampling. Tail sampling seems like the way to deal with that.

Correlated logs with errors and other details without parent span are pretty valuable regardless of tracing or sampling.

I understand that it's useful to be able to log exceptions without the need to create a span, but that's different from saying that spans should no longer contain exceptions, which is where this OTEP seems headed even if it's not requiring it yet. Why remove span events?

this is part of the previous OTEP on Events - https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/0265-event-vision.md#relationship-to-span-events. In the long term, we hope to replace span events with log-based events with migration story for consumers.

We extract special columns like is_exception and exception_type from events which make it trivial for users to query exceptions, they don't have to know about events at all.

We report error.type on spans and it contains exception type if exception was the reason. PTAL at the https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/recording-errors.md.
If you have some scenarios in mind where it's necessary to capture that the error manifested as exception (vs some status code) is useful, please bring them up in semantic conventions.

Span events themselves are not special, but storing exception data directly inside relevant spans is important to us.

Does it include the stack trace? The rest is (or can be) stored for a single 'terminal' exception that the span ends with. Or is your goal to store exception chain and/or handled exceptions that happened during span lifetime?

The counter-argument I have is that there are a lot of exceptions that happen during span execution and most of them are already recorded as logs by runtimes, client libs, frameworks, etc. I.e. you're already in the world where exceptions are not exported along with spans.

It's a fair ask though that user app may want to associate arbitrary exceptions with a span today and we're taking it away when moving to logs.

oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
@trask
Copy link
Member

trask commented Feb 7, 2025

@adriangb @alexmojaki @samuelcolvin would you mind summarizing where we are with each of your concerns above as separate review comments (on the Files changed tab at the top, pick a relevant-ish line and add the summary of where we're at there) so we can have "threaded" discussions on each one of them? I really want to follow along and see what needs to be addressed, but am having trouble following with everything as mainline comments 😅 thank you ❤️

2. Recording exceptions as logs will result in UX degradation for users
leveraging trace-only backends such as Jaeger.

3. Having exceptions exported and stored along with span is beneficial for some backends.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting here to follow the request of commenting somewhere vaguely relevant for a threaded discussion.

Let me check if I understand things correctly. Currently this code:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from requests import ConnectionError

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = provider.get_tracer(__name__)

with tracer.start_as_current_span("foo"):
    raise ConnectionError("bar")

prints a span containing:

{
  "status": {
    "status_code": "ERROR",
    "description": "ConnectionError: bar"
  },
  "attributes": {},
  "events": [
    {
      "name": "exception",
      "timestamp": "2025-02-07T10:35:52.726398Z",
      "attributes": {
        "exception.type": "requests.exceptions.ConnectionError",
        "exception.message": "bar",
        "exception.stacktrace": "Traceback...",
        "exception.escaped": "False"
      }
    }
  ]
}

Am I correct that the goal is to instead emit the following?

{
  "status": {
    "status_code": "ERROR",
    "description": "bar"
  },
  "attributes": {
    "error.type": "requests.exceptions.ConnectionError"
  }
}

The differences being:

  1. Span events are gone, and the stacktrace will only be found in a child event-log.
  2. The span event attribute exception.type containing the fully qualified exception type name is now in the span attribute error.type
  3. The status description no longer contains the (unqualified) exception name.

Copy link
Contributor Author

@lmolkova lmolkova Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct that the goal is to instead emit the following?

Not the goal for this OTEP. This OTEP tells how to record exception and errors on logs and does not change how the span status or attributes are populated.

Span events are gone, and the stacktrace will only be found in a child event-log.

yes

The span event attribute exception.type containing the fully qualified exception type name is now in the span attribute error.type

It's already in the spec, not part of this OTEP - https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/recording-errors.md.
Error.type on span can as well be something more specific depending on what tracing instrumentation decided to put into error.type. Error logs and traces are likely to be captured at different layers and there is no assumption that all exceptions will make it to the tracing layer. Tracing layer may get nothing, or aggregated exception, or a different one suppressing the original one.

The status description no longer contains the (unqualified) exception name.

This OTEP does not change anything about span status.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the goal for this OTEP.

I understand that, but this OTEP is removing the span events so I want to make sure that the spec (or the plan for the spec) ensures that at least exception.type and exception.message are still directly in the span.

The span event attribute exception.type containing the fully qualified exception type name is now in the span attribute error.type

It's already in the spec, not part of this OTEP - https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/recording-errors.md.

But before seeing this issue I'd never heard of error.type before. That page is apparently less than a month old but the semantic convention seems to be significantly older so I'm trying to figure out what happened. The Python SDK currently doesn't set that attribute based on the snippet above, is that just because it's a bit behind? https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/recording-errors.md#recording-exceptions shows error.type and the span status being recorded manually. Are they not supposed to be recorded automatically by the appropriate tracer/span API methods when an exception bubbles through? BTW the snippet doesn't actually end the span.

Copy link
Contributor Author

@lmolkova lmolkova Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This attribute is mentioned in every semantic conventions, e.g. https://github.com/open-telemetry/semantic-conventions/blob/main/docs/http/http-spans.md and the recording-errors page is a generalization. Python should set error.type in specific instrumentations that report spans.
If python supports ending span with exception in a generic way, it should start setting error.type there - please create an issue for them if they don't.

Copy link
Contributor Author

@lmolkova lmolkova Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clarification on "I understand that, but this OTEP is removing the span events "

It does not remove span events. It's a design proposal for the future when events are stabilized to stop using span events. The actionable part of your feedback that I see is that we should consider having something like Span.end(exception) instead of Span.recordException(exception) that would record the exception.* attributes (except the stacktrace) on span and ensure all the details are populated. Did I get it right?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like that, yes.

What's the plan for Span.recordException? Will that method be removed, or will it switch to generating a child event-log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no concrete plan yet. It'd be great if you could create an issue in this repo to start the discussion on migration details.

@ChristopherKlinge
Copy link

Hi,

I have a question regarding the proposed error.type attribute. As far as I can tell from the current spec and this proposal, this is the "technical" type of the exception (e.g. java.lang.NullPointerException). How is this supposed to translate to error codes for business errors?

In a more traditional logging based stack, when logging an exception, the developer should add a human readable error message explaining the underlying exception's effect on the current process. Without these messages, an observer has a very hard time figuring out what broke from a business perspective.

{
  "level": "ERROR",
  "message": "Could not generate invoice for order XYZ.",
  "exception": "java.lang.NullPointerException",
  "stack_trace": "..."
}

If I understand this OTEP correctly, this log message would not change much and the corresponding trace would look like this:

{
  "status": {
    "status_code": "ERROR",
    "description": "Could not generate invoice for order XYZ."
  },
  "attributes": {
    "error.type": "java.lang.NullPointerException"
  }
}

Is this correct?

@lmolkova
Copy link
Contributor Author

lmolkova commented Feb 7, 2025

@ChristopherKlinge please check out https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/recording-errors.md which covers existing error.type attribute.

This OTEP is about exception and errors on logs and encourages to add additional information to logs.

Comment on lines +257 to +258
In their next major version, log bridges SHOULD start using `setException` method or
follow defaults documented for OTel SDK when explicitly recording `exception.*`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In their next major version, log bridges SHOULD start using `setException` method or
follow defaults documented for OTel SDK when explicitly recording `exception.*`
In their next major version, log bridges SHOULD start using the `setException` method or
follow the defaults documented for OTel SDK when explicitly recording `exception.*`

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog.opentelemetry.io OTEP OpenTelemetry Enhancement Proposal (OTEP)
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.