
Recording details as logs, span events, or span attributes #4414

Open
lmolkova opened this issue Feb 11, 2025 · 15 comments
Labels
triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted

Comments

@lmolkova
Contributor

lmolkova commented Feb 11, 2025

Extracting discussion points from #4333 and #4393 around how to record occurrences of things that are not spans.

Examples of controversial things being discussed:

  • HTTP request and HTTP response bodies
  • DB query inputs and results
  • LLM prompts and completions
  • Exceptions

1. Record them as logs

Pros:

  • logs can be reported with or without tracing instrumentation and sampled along with spans or not sampled at all
  • logs provide verbosity control out of the box
  • logs are familiar to and used by everyone. A lot of libs/infra out there report logs/events
  • structured logs are very flexible
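
As a point of reference (not from the original post), here is a minimal sketch of emitting a structured log through the Python Logs SDK and the stdlib logging bridge; the attribute name is illustrative:

import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import ConsoleLogExporter, SimpleLogRecordProcessor

# Configure the Logs SDK (the "minor inconvenience" noted below).
provider = LoggerProvider()
provider.add_log_record_processor(SimpleLogRecordProcessor(ConsoleLogExporter()))
set_logger_provider(provider)
logging.getLogger().addHandler(LoggingHandler(level=logging.NOTSET, logger_provider=provider))

# A structured log record: a message plus arbitrary attributes.
logging.getLogger('my.lib').warning(
    'captured http request body',
    extra={'http.request.body.size': 512},  # illustrative attribute
)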

Minor inconveniences:

  • Need to configure Logs SDK

Cons:

  • logs are exported separately from spans. Backends that want to store things like an HTTP request body along with the corresponding span need to do extra work.
  • [UPDATE] Backends that don't support logs can't ingest them

Mitigation:

  • A log -> span event processor in the SDK could attach the log to the current span (if there is one); see the sketch after this list. We may explore options on how to make this simple for end users.
  • Ingestion on the backend side may append logs to span records in the internal storage if it allows such mutations
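
A minimal sketch of such a log -> span event processor with the Python SDK (illustrative only; the processor hook is named emit in older SDK versions and on_emit in newer ones):

from opentelemetry import trace
from opentelemetry.sdk._logs import LogData, LoggerProvider, LogRecordProcessor


class LogToSpanEventProcessor(LogRecordProcessor):
    """Attach each emitted log record to the current span as a span event."""

    def emit(self, log_data: LogData) -> None:  # on_emit in newer SDK versions
        span = trace.get_current_span()
        record = log_data.log_record
        if span.is_recording():
            span.add_event(
                name=str(record.body),
                attributes=dict(record.attributes or {}),
                timestamp=record.timestamp,
            )

    def shutdown(self) -> None:
        pass

    def force_flush(self, timeout_millis: int = 30000) -> bool:
        return True


# Usage: register it on the LoggerProvider alongside the regular export pipeline.
provider = LoggerProvider()
provider.add_log_record_processor(LogToSpanEventProcessor())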

2. Record them as span events

Pros:

  • exported along with the spans, it's easy to decide on the ingestion side how to store them

Cons:

  • impossible to record if tracer is disabled or span is sampled out
  • creates a separate stream of events (in addition to log-based events) - not clear when to use which
  • no verbosity control
  • can't export them before span ends

Mitigations:

  • span events -> logs pipeline could export events reported on unsampled spans
  • we can add severity to span events

Fundamental problems:

  • we are left with very similar but independent span event and log signals, without a clear understanding of when to use which
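
For reference, recording an exception as a span event with today's Python API looks roughly like this (illustrative sketch):

from opentelemetry import trace

tracer = trace.get_tracer('example.tracer')

with tracer.start_as_current_span('checkout') as span:
    try:
        raise ValueError('payment failed')
    except ValueError as exc:
        # Stored as an "exception" span event with exception.* attributes,
        # exported together with the span when it ends.
        span.record_exception(exc)
        span.set_status(trace.Status(trace.StatusCode.ERROR))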

3. Record them as complex span attributes

E.g. an rpc.message span event with attributes rpc.message.id = 1 and rpc.message.type = SENT could be represented as a complex attribute:

"rpc.message": [  {  "rpc.message.id" : 1,  "rpc.message.type": "SENT", "timestamp":...  }]

An alternative is flattening (including indexes):

  • rpc.message[0].timestamp = ...
  • rpc.message[0].message.id = 1
  • rpc.message[0].type = SENT
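
A minimal sketch of the flattened form with the current Python API, which only accepts flat scalar attribute values (key names are illustrative):

from time import time_ns

from opentelemetry import trace

tracer = trace.get_tracer('example.tracer')

with tracer.start_as_current_span('rpc.call') as span:
    # Flattened representation of one rpc.message occurrence (index 0).
    span.set_attribute('rpc.message[0].timestamp', time_ns())
    span.set_attribute('rpc.message[0].id', 1)
    span.set_attribute('rpc.message[0].type', 'SENT')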

Pros - same as span events.

Cons - same as span events and more:

  • much harder to add severity to attributes
  • timestamp is an attribute
  • query language needs to support complex array attributes on spans

4. Record logs as zero duration spans

Listing it for completeness; I don't understand what it solves, so maybe @adriangb, who brought it up, can provide some details.

@lmolkova
Contributor Author

lmolkova commented Feb 11, 2025

To add some color: Option 3 (events as span attributes) is extensively used by GenAI-focused backends, including Arize and Traceloop.

@adriangb

adriangb commented Feb 11, 2025

Thank you for the summary.

I think the breakdown of options looks good but I disagree with several of the pro/cons.

1. Record them as logs

logs can be reported with or without tracing instrumentation and sampled along with spans or not sampled at all

Don't you need logging instrumentation to report logs? As you say below: Need to configure Logs SDK. So if you already have tracing instrumentation, now you have 2 instrumentations instead of 1. If you wanted to emit events without emitting any other spans, couldn't you just avoid emitting the unwanted telemetry in the first place (e.g. don't auto-instrument your web framework, or use the auto-instrumentation configuration options to selectively apply it)? Or you could create a SpanProcessor to filter out spans without events.

I imagine the main use case for sampling logs separately from traces is that you want to aggressively sample traces to reduce costs but don't want to miss exceptions. Tail Sampling is already supposed to address this: "Some examples of how you can use Tail Sampling include: Always sampling traces that contain an error." As things stand, sampling happens at the trace level, so there is an argument to be made that the entire trace is too high a volume; however, we have had many, many customers ask for tail-sampling-of-exceptions type behavior, but I have never heard a request from our customers, or seen any comments on here, asking for the rest of the trace information related to an exception to be discarded. And you now have two different sampling configurations, but since logs are embedded in the context of a trace, if you don't want to lose that contextual information (which, as per above, I believe is most users) you'd have to use a sampling system that understands the relationship between logs and traces and can make the decision at a global level, including interactions with other use cases of tail sampling. To give a concrete example: I want to preserve the full information of any trace for which there is an exception/error log, but I also want to keep any exception/error logs for any trace with a latency above X. Do the SDKs and OTEL collectors already support this?

logs provide verbosity control out of the box

Totally agree this is an advantage of logs. I think there's a lot to be said for a discussion on how to incorporate verbosity into traces. The largest discussion I know of is #3205. In the context of this issue I think the relevant bit is that logs have an error level. The other levels are more of a general advantage of logs, but I don't see how they fit into this issue in particular. Isn't that what a span's status is supposed to be? I agree span status is much less flexible than arbitrary verbosity levels, but I also don't see why that couldn't be improved instead of moving the data to another signal. I also think verbosity is a bit of an abused concept: it's often used not just to communicate the importance of something to observers (was there an error or not?) but also for sampling. I feel that this is a practice from a time when it was the only option (no context, all you can do is look at one log record at a time) and that we could do much better. For example, for tracing I've experimented with a path-pruning approach where you prune a trace to keep only the nodes on the path from the root to the sampled node, allowing you to retain valuable context of how the exception came to be while minimizing the amount of data. I've also tested a system to sample out branches in a hot loop, so that if e.g. you want to put telemetry in the inner function but it's going to be looped over 1M times, you keep 1% of the iterations as a sample. I'm not saying this is the right solution, I'm just pointing out that there are ideas to be explored that we close ourselves off to by adopting the legacy logs mindset of sampling and verbosity. I would like to see the envelope pushed for what verbosity means for traces and techniques to reduce the data volumes of traces beyond sampling at the trace level.
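
(A minimal sketch of the path-pruning idea, not the implementation referenced above, assuming all spans of a trace have already been buffered somewhere, e.g. by a tail-sampling component:)

from opentelemetry.sdk.trace import ReadableSpan


def prune_to_path(spans: list[ReadableSpan], keep_span_id: int) -> list[ReadableSpan]:
    """Keep only the spans on the path from the trace root to the span we decided to keep."""
    by_id = {s.context.span_id: s for s in spans}
    path: list[ReadableSpan] = []
    current = by_id.get(keep_span_id)
    while current is not None:
        path.append(current)
        parent = current.parent  # SpanContext of the parent, or None at the root
        current = by_id.get(parent.span_id) if parent is not None else None
    return list(reversed(path))  # root first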

logs are familiar to and used by everyone. A lot of libs/infra out there report logs/events

Yes, but there's also a reason tracing was created. Just because the technology is older doesn't mean it's better. Besides, just because logs as a whole exist doesn't mean that OTEL logs have wide adoption: it's a new signal with a new data shape (two structured pieces, a log body and attributes, as well as complex attributes), so backends that only support traces not only have to add the new endpoints, etc., but may also have to come up with a completely different storage model. I couldn't find a clear answer, but I think that Jaeger doesn't support OTEL logs at all. I imagine this is the case with many other OTEL backends at the moment.

structured logs are very flexible

I assume by this you mean they support complex attributes? I think we are both in agreement that complex attributes should be added to spans. If not, we live in a world where more and more things are being moved into logs, resulting in backends having to support both logs and traces, which means they have to support complex attributes regardless in their storage model, query language, etc.

Ingestion on the backend side may append logs to span records in the internal storage if it allows such mutations

I think this is unlikely to materialize. Most large-scale systems are going to be doing something along the lines of storing highly compressed columnar data (e.g. Parquet) in object storage. DataDog actually has an excellent in-depth description of what their system looks like. These systems are not amenable to updates; they are designed around the idea of immutable, append-only data.

I'll summarize what my cons are to using logs:

  • It's another signal for backends, collectors, agents, etc. to support. And as per above with complex attributes and other quirks it's not trivial to add universal support for logs to a system.
  • Users now have to configure logs and deal with two concepts, including confusion from seeing that no exceptions are coming through (they forgot to configure logs) and other things like that.

I do see pros:

  • Verbosity is both an effective communication tool and useful metadata for sampling.
  • Logs support complex attributes
  • It might make it easier to configure something akin to tail sampling.

But I think all of the pros could be workshopped to improve tracing in general and avoid the cons.

2. Record them as span events

Exported along with the spans, it's easy to decide on the ingestion side how to store them.

👍🏻 yes agreed, ingestors have all of the information in one place.

Impossible to record if tracer is disabled or span is sampled out

Impossible is a strong word. There's no technical limitation stopping you from writing a span processor that lets through exceptions, or in fact any other single span, but discards the rest. Yes, there is no specification for sampling at anything more granular than the span level, but there is also no specification saying that's not allowed or will never be allowed. We could work to define something in this space. And like I said above, I think there's a lot of opportunity there. We do also already have tail sampling which, as per the discussion above, I think satisfies most use cases.

Python example

Note: you can run this by installing uv and then running this as a script executable (e.g. chmod +x test.py && ./test.py).

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "opentelemetry-api==1.29.0",
# "opentelemetry-sdk==1.29.0",
# ]
# ///


from opentelemetry import context, trace
from opentelemetry.sdk.trace import ReadableSpan, Span, SpanProcessor, TracerProvider
from opentelemetry.sdk.trace.export import (
  ConsoleSpanExporter,
  SimpleSpanProcessor,
)


class ExceptionSamplingSpanProcessor(SpanProcessor):
  def __init__(self, inner: SpanProcessor) -> None:
      self._inner = inner

  def on_start(self, span: Span, parent_context: context.Context | None = None) -> None:
      self._inner.on_start(span, parent_context)  # note: won't work for BatchSpanProcessor, works fine with SimpleSpanProcessor

  def on_end(self, span: ReadableSpan) -> None:
      if span.status.status_code == trace.StatusCode.ERROR:
          # it would be trivial to also modify the span to discard the context here if that was wanted for some reason
          self._inner.on_end(span)


provider = TracerProvider()
processor = ExceptionSamplingSpanProcessor(SimpleSpanProcessor(ConsoleSpanExporter(formatter=lambda s: f'{s.name}\n')))
provider.add_span_processor(processor)

trace.set_tracer_provider(provider)

tracer = trace.get_tracer('my.tracer.name')

with tracer.start_as_current_span('foo') as span:
  span.set_status(trace.Status(trace.StatusCode.ERROR))

with tracer.start_as_current_span('bar') as span:
  span.set_status(trace.Status(trace.StatusCode.OK))

creates a separate stream of events (in addition to log-based events) - not clear when to use which

I think this assumes that you already have a stream of logs. If you currently only have a stream of traces then it doesn't create an additional stream, and quite the opposite: moving this data to logs would create a separate stream.

can't export them before span ends

I agree this is a problem. I think this should be addressed. But I don't think it's that big of a problem outside of long lived spans. I've seen more users complain about their long lived spans not showing up at all in a backend (i.e. they don't know if the operation is running or not) than I have seen that their exceptions in particular don't show up exactly when they happen. I think this is because it is much more common to kick off a process that you expect will be running and want to confirm is running but less common to expect an error to occur within a long running process (errors are often unexpected).

There is one more BIG pro for the current approach that you don't mention: if backends store the events in the same row as the span (which is reasonable to do since they are received together; this is not possible under other proposals for systems that don't support cheap updates to existing rows), it makes it very easy to find the immediate context of an error (how long was the span that contained it, etc.). If that data is sent at a different time, even if it is over the same signal, it will be much harder for backends to stitch together into one row/entry. And if they can't do that, it will have to be joined across tables, which depending on the query language may be a bit harder or much harder (joining is always harder than not joining), and has the possibility of making the query slower or faster depending on the volumes of logs/events relative to spans.

3. Record them as complex span attributes

This is the first time I've heard of this approach. I don't like it for various reasons, but since I don't see anyone advocating for it I won't dive into details unless requested (I already know this reply is way too long). The bit I will say:

  • I'm surprised OTEL even allows arrays of maps in span event attributes... the whole question of what is allowed in which attributes and what is not is so confusing.
  • To your comment of "query language needs to support complex array attributes on spans": won't any backend that wants to support logs need a query language that can do this? Since logs can have arbitrarily complex attributes, this is necessary to fully support them. So this seems just as much a con of the logs approach as it is of this approach.

4. Record logs as zero duration spans

In the above conversation we've identified the following limitations of tracing:

  1. Events can't be exported before a span ends.
  2. No verbosity levels.
  3. Sampling is not fine grained enough.
  4. Complex attributes are not supported.

And my main complaint about logs:

  1. Backends have to add support for logs.
  2. Users have to do more configuration.

In #4393 (comment) and subsequent discussion I made concrete suggestions that would address Events can't be exported before a span ends and Backends have to add support for logs and Users have to do more configuration:

  1. Explicitly state that zero duration spans are allowed. This is a spec-only change; the wire protocol already allows it and I'd be surprised if any backend has assertions or issues handling this (aside from nonsensical duration bars in a UI or something). There is also currently no statement saying they are disallowed, nor do I think there can be, given the limits of timing precision.
  2. Add a flag to spans to mark them as an event. This could be a new marker field (okay to do at the wire protocol level) or even an attribute such as verbosity: "warn", which could effectively kill two birds with one stone*. Or a verbosity attribute could be added at the wire protocol level and only be supported for zero duration spans / spans that are events, at least at first.
  3. Define a mapping from span event -> span that is an event.
  4. Add SDK APIs to emit events (e.g. `tracer.warn('a bad thing happened')`) making use of the above two.

* I don't think this has to be done; it's not worth throwing away the whole idea just because a verbosity attribute is not a good idea. These things don't have to happen together, but I'm pointing out that maybe they can.

This would then allow sending exceptions (and other events) as spans marked as events.
Users would not have to add more boilerplate to configuring their SDK if they don't otherwise need logs.
Backends would handle this without changes and can slowly add better support (make use of the verbosity attribute, don't show a duration bar for zero length spans marked as events).
Backends can offer a unified interface between the old (span events) and the new (zero duration spans that are events) by unpacking span events into zero duration spans that are events. Offering a unified experience across changes in SDKs is an important QOL for users so that they can e.g. set alerts without having to do a bunch of jumping through hoops in their queries.
SDKs can gradually adopt this by adding the new methods (which I expect would be quite small methods that then delegate to existing methods).
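
As a rough illustration of point 3 and the unpacking mentioned above (the function name and output shape are hypothetical, roughly what a backend or collector might do):

from opentelemetry.sdk.trace import ReadableSpan


def span_events_to_event_spans(span: ReadableSpan) -> list[dict]:
    """Unpack span events into zero duration 'event spans' sharing the parent's trace context."""
    return [
        {
            'trace_id': span.context.trace_id,
            'parent_span_id': span.context.span_id,
            'name': event.name,
            'start_time': event.timestamp,
            'end_time': event.timestamp,  # zero duration
            'attributes': dict(event.attributes or {}),
        }
        for event in span.events
    ]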

Concretely here is what this might look like for users:

Python example

Note: you can run this by installing uv and then running this as a script executable (e.g. chmod +x test.py && ./test.py).

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "opentelemetry-api==1.29.0",
# "opentelemetry-sdk==1.29.0",
# ]
# ///

from collections.abc import Iterator
from contextlib import contextmanager
from time import time_ns
from typing import Any

from opentelemetry import context, trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
  ConsoleSpanExporter,
  SimpleSpanProcessor,
)
from opentelemetry.util import types


# Users don't write this class, I'm proposing we add the methods to the existing Tracer class.
class LoggingTracer(trace.Tracer):
  def __init__(self, inner: trace.Tracer):
      self._inner = inner
      self.start_span = inner.start_span
      self.start_as_current_span = inner.start_as_current_span

  def warn(
      self,
      event_name: str,
      context: context.Context | None = None,
      attributes: types.Attributes | None = None,
  ) -> None:
      attributes = dict(attributes or {})
      attributes['verbosity'] = 'warn'
      time = time_ns()
      span = self._inner.start_span(
          name=event_name,
          context=context,
          attributes=attributes,
          start_time=time,
      )
      span.end(time)

  def start_span(self, *args: Any, **kwargs: Any) -> trace.Span:
      return self._inner.start_span(*args, **kwargs)

  @contextmanager
  def start_as_current_span(  # type: ignore  # implementing a Tracer requires access to private APIs. Why?
      self, *args: Any, **kwargs: Any
  ) -> Iterator[trace.Span]:
      return self._inner.start_as_current_span(*args, **kwargs)  # type: ignore  # Same as above


provider = TracerProvider()
processor = SimpleSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)

trace.set_tracer_provider(provider)

tracer = trace.get_tracer('my.tracer.name')

tracer = LoggingTracer(tracer)

# with context
with tracer.start_as_current_span('foo') as span:
  tracer.warn('bar')

# with no context
tracer.warn('solo')

This does not resolve Sampling is not fine grained enough or Complex attributes are not supported. The former, I think, is a very interesting discussion to be had; the latter we should fix by adding support for complex attributes to spans.

@ChristopherKlinge

ChristopherKlinge commented Feb 11, 2025

Hi,

as an end-user who is fairly new to diving deep into the spec, I'd like to add my thoughts to this discussion. I have been using logs 99% of the time until now. The downsides of them are apparent, which is why I decided to move to a more tracing based approach in the future. This makes the discussions in #4333, #4393 and here somewhat confusing.

Currently I am forced to export logs and traces independently from each other and employ clunky and expensive software to correlate them later on. I often find myself looking at traces and subsequently searching through logs to find relevant events. I strongly believe that moving this information into the tracing ecosystem fully would significantly simplify working on it.

Most of the arguments for using log events feel like trying to focus on retaining a use for legacy infrastructure instead of going for the solution that makes the most sense considering the users' requirements.

@adriangb

adriangb commented Feb 11, 2025

One point that stood out to me from the call today: my understanding of the inception of the Logs API was that it was to make it easier to pipe existing "legacy" (e.g. unstructured, varying structure) logs into OTEL. @austinlparker corrected me that this is not the case and proposed (correct me if I'm wrong Austin) that logs exist to model things that have zero duration and can't exist as spans because spans have a duration. There was a comment in the chat that zero duration spans are an anti-pattern.

I'd like to push back on that: if it's an anti-pattern, it is because it re-uses, for zero duration things, an existing system that wasn't designed for them; but that con is very much a pro for real world users and backends: users don't have to learn new concepts and configure them and backends don't have to support a new signal. See the example above that @ChristopherKlinge gives.

Calling sending events as zero duration spans an anti-pattern seems a bit extreme. In practice we have had zero issues employing this pattern in the real world, and as per my working code example above it would be trivial for SDKs and backends to support it. The only thing needed is some specification around it. If there is a specification, it is no longer an anti-pattern.

@adriangb

More broadly my point is that given the choice between embracing an “anti-pattern” and doing a breaking change for users and backends (moving events to logs) I would gladly choose to embrace the anti-pattern and tweak the specs in a backward compatible way to make it not an anti-pattern.

@adriangb

To the point of "the ship has sailed, Logs are here to stay" my pushback is that:

  1. I am not proposing we get rid of Logs as a signal. I agree there are use cases and for better or worse it's a stable part of the spec.
  2. I do think we are at a tipping point, in particular with OTEP: Recording exceptions as log based events #4333 and new semantic conventions like GenAI using Logs, which means that as a platform I would have to either (1) support logs or (2) tell users to configure their SDK / collector / etc. to send span events instead of logs. (1) is quite a bit of work for a backend. I would be less worried if there were a canonical way to map logs to spans or span events, but that seems to be something that is being left as a TODO, which is worrying. (2) is problematic because it's very, very hard and confusing for users to have to deal with this, and it still suffers from all of the issues that span events suffer from (having to buffer them all until the span ends, limits on the size of spans, etc.).

What I am saying is that we should consider carefully whether we want to go over that tipping point, and I am offering what I see as an alternative that is more backwards compatible and achieves many of the same outcomes.

@pellared
Member

pellared commented Feb 11, 2025

Regarding "Record logs as zero duration spans"

Calling sending events as zero duration spans an anti-pattern seems a bit extreme. In practice we have had zero issues employing this pattern in the real world, and as per my working code example above it would be trivial for SDKs and backends to support it.

Do you have benchmarks showing that zero-duration spans perform well in high-volume scenarios? Have you compared performance metrics and load tests? OpenTelemetry prioritizes performance, as stated in its mission:

OpenTelemetry users should not have to choose between high-quality telemetry and a performant application. High performance is a requirement for OpenTelemetry, and unexpected interference effects in the host application are unacceptable.

but that con is very much a pro for real world users and backends: users don't have to learn new concepts and configure them and backends don't have to support a new signal.

Regarding users, while you argue that this approach simplifies things for users by avoiding new concepts and signals, my experience suggests otherwise. As an OpenTelemetry Go maintainer, I see users eagerly awaiting stable OTel Go Logs rather than adopting workarounds. Our perspectives may differ, but only a proper user survey can truly determine what users expect.

Regarding backends (e.g. you mentioned Jaeger and the backend you maintain), OpenTelemetry is vendor neutral. Neutral definition by Cambridge Dictionary:

not saying or doing anything that would encourage or help any of the groups involved in an argument

Anyway, would thousands of zero-duration spans provide a good experience for e.g. Jaeger users?

OpenTelemetry has real pain points, but avoiding a signal when it's appropriate isn't a sustainable solution. If configuring the OTel Logs SDK is complex, improving the configuration user experience should be the focus, rather than repurposing spans.

Finally, per the Cambridge Dictionary, a span refers to a period of time during which something exists or happens, reinforcing that spans inherently involve duration.

@trask added the triage:deciding:community-feedback label Feb 11, 2025
@trask added this to Logs SIG Feb 11, 2025
@adriangb

adriangb commented Feb 11, 2025

Do you have benchmarks showing that zero-duration spans perform well in high-volume scenarios? Have you compared performance metrics and load tests?

I do not have conclusive data spanning multiple languages, ways to set up things, etc. but here is what I can offer:

  1. Anecdotal evidence from our implementation that zero duration spans are not a performance issue. Overhead issues show up more often from things like instrumenting a hot loop and not having a good way to sample branches of that loop, or noisy auto instrumentation. This applies just as much to logs as to traces: if you pipe all debug logs in your system (which may be a lot if you bridge your language's existing logging solution) you're going to hit overhead issues with logging; if you auto-instrument everything and don't do sampling you're going to hit overhead issues with tracing.
  2. Having worked in the Python SDK extensively, I can say that at least in Python there doesn't seem to be a reason why logs would have significantly lower overhead than traces, and also that the traces performance could be improved substantially (low-level things like optimizing the involved classes via __slots__, offering optional C implementations like psycopg does, using PEP 563 annotations, etc.; I think there is also likely work that could be done to generally analyze hotspots and reduce overhead).
  3. I'll offer a small benchmark using the Python SDKs. This is only a rough benchmark because numbers are going to change depending on batching, network speed, etc. I don't mean this to be conclusive evidence, but at least this should put into doubt the assumption that tracing overhead is going to be much higher than logging.
Logs

Note: you can run this by installing uv and then running this as a script executable (e.g. chmod +x test.py && ./test.py).

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "opentelemetry-api==1.29.0",
# "opentelemetry-sdk==1.29.0",
# ]
# ///

import datetime
import logging
import sys
from collections.abc import Sequence

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LogData, LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import LogExporter, SimpleLogRecordProcessor


class NoOpLogExporter(LogExporter):
  def export(self, batch: Sequence[LogData]) -> None:
      pass

  def shutdown(self):
      pass


logger_provider = LoggerProvider()
set_logger_provider(logger_provider)

logger_provider.add_log_record_processor(SimpleLogRecordProcessor(NoOpLogExporter()))
handler = LoggingHandler(level=logging.NOTSET, logger_provider=logger_provider)

logging.getLogger().addHandler(handler)

logger = logging.getLogger('my.logger1')

n = int(sys.argv[1])
start = datetime.datetime.now(tz=datetime.UTC)
for _ in range(n):
  logger.warning('test')
duration = datetime.datetime.now(tz=datetime.UTC) - start

print(f'{n} logs took {duration}')
Zero duration spans

Note: you can run this by installing uv and then running this as a script executable (e.g. chmod +x test.py && ./test.py).

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "opentelemetry-api==1.29.0",
# "opentelemetry-sdk==1.29.0",
# ]
# ///

import datetime
import sys
from collections.abc import Iterator, Sequence
from contextlib import contextmanager
from time import time_ns
from typing import Any

from opentelemetry import context, trace
from opentelemetry.sdk.trace import ReadableSpan, TracerProvider
from opentelemetry.sdk.trace.export import (
  SimpleSpanProcessor,
  SpanExporter,
  SpanExportResult,
)
from opentelemetry.util import types


# An exporter that does nothing, I don't want to benchmark the network or console.
class NoOpSpanExporter(SpanExporter):
  def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
      return SpanExportResult.SUCCESS


# Users don't write this class, I'm proposing we add the methods to the existing Tracer class.
class LoggingTracer(trace.Tracer):
  def __init__(self, inner: trace.Tracer):
      self._inner = inner
      self.start_span = inner.start_span
      self.start_as_current_span = inner.start_as_current_span

  def warning(
      self,
      event_name: str,
      context: context.Context | None = None,
      attributes: types.Attributes | None = None,
  ) -> None:
      attributes = dict(attributes or {})
      attributes['verbosity'] = 'warn'
      time = time_ns()
      span = self._inner.start_span(
          name=event_name,
          context=context,
          attributes=attributes,
          start_time=time,
      )
      span.end(time)

  def start_span(self, *args: Any, **kwargs: Any) -> trace.Span:
      return self._inner.start_span(*args, **kwargs)

  @contextmanager
  def start_as_current_span(  # type: ignore  # implementing a Tracer requires access to private APIs. Why?
      self, *args: Any, **kwargs: Any
  ) -> Iterator[trace.Span]:
      return self._inner.start_as_current_span(*args, **kwargs)  # type: ignore  # Same as above


provider = TracerProvider()
processor = SimpleSpanProcessor(NoOpSpanExporter())
provider.add_span_processor(processor)

trace.set_tracer_provider(provider)

tracer = trace.get_tracer('my.tracer.name')

tracer = LoggingTracer(tracer)

n = int(sys.argv[1])
start = datetime.datetime.now(tz=datetime.UTC)
for _ in range(n):
  tracer.warning('test')
duration = datetime.datetime.now(tz=datetime.UTC) - start

print(f'{n} spans took {duration}')

I get:

❯ ./example_traces.py 1_000_000
Reading inline script metadata from `./example_traces.py`
1000000 spans took 0:00:06.141151

❯ ./example_logs.py 1_000_000
Reading inline script metadata from `./example_logs.py`
1000000 logs took 0:00:07.798453

Regarding users, while you argue that this approach simplifies things for users by avoiding new concepts and signals, my experience suggests otherwise. As an OpenTelemetry Go maintainer, I see users eagerly awaiting stable OTel Go Logs rather than adopting workarounds. Our perspectives may differ, but only a proper user survey can truly determine what users expect.

Are you sure users are awaiting OTEL logs or are they just wanting a way to send logs? I see a lot of users that have used logs before or want to send logs for things that don't naturally have a duration / scope, but I don't think users care if that's via OTEL logs or zero duration spans that we call logs or events.

Regarding users being confused by more signals and things to configure: I will point to @ChristopherKlinge 's feedback above. I did not prompt this feedback nor do I know their use case in detail but I have seen a lot of feedback like this from our users, from customer interviews, from previous companies, etc.

Anyway, would thousands of zero-duration spans provide a good experience for e.g. Jaeger users?

Certainly a better experience than that data not arriving at all or not being supported. And no worse than if Jaeger added first-class support for logs, unless you want to sample logs separately from traces; but again, I think that sort of use case is orthogonal and can be solved via tail sampling or similar, depending on what exactly the user wants.

Neutral definition by Cambridge Dictionary: not saying or doing anything that would encourage or help any of the groups involved in an argument

I don't think this is very helpful. Any decision is going to help some vendors more than others. I don't think my proposal is super unique to my system; I do think it benefits backends that support traces but not logs more than it helps backends that support logs but not traces, but that's a pretty wide net. And for the particular case of moving exceptions from span events to logs, that clearly hurts vendors that support traces but not logs and benefits vendors that do support logs.

Finally, per the Cambridge Dictionary, a span refers to a period of time during which something exists or happens, reinforcing that spans inherently involve duration.

So call it log or event at the SDK level to avoid confusing users. That's what we've done and it's worked great. Users know how to send an info level log, they don't care that under the hood it's sent via a protobuf layer that uses the nomenclature span. Making breaking changes because we want to follow the Cambridge Dictionary precisely seems a bit pedantic to me. The definition of log is "a full written record of a trip, a period of time, or an event" which also involves a period of time (or alternatively "a thick piece of tree trunk or branch, especially one cut for burning on a fire" but I think that's even less helpful).

@austinlparker
Member

Without breaking out the dictionary, OpenTelemetry has a pretty stable convention for 'span' being 'a record of a unit of work with a beginning and an end'. Our log signal is evolving to support a highly structured, schema-first approach, as this most comports with our vision of unified telemetry.

While I respect that you have found a solution that works for you and your users, our mission and vision is slightly broader than that. We must consider the myriad stakeholders, existing users, existing telemetry systems, and existing database designs when considering changes to this model. OpenTelemetry is not simply repackaging "what works today"; it is a comprehensive framework for telemetry data that bridges the current world into a future that is, inherently, driven by a contextual and correlated stream of telemetry data. We make no promises or guarantees about where this data is stored, or how it should be queried. We do not believe that every user will find equal value from each signal. This means that we can't just "treat logs as zero-duration spans" -- what if someone doesn't have a span database, to turn your example around.

Thankfully, OpenTelemetry is highly composable so you are welcome to continue distributing extensions to it that adapt its functionality to work with your database/analysis backend. Even better, thanks to its composability, your users can still benefit from the instrumentation ecosystem as it continues to grow and mature.

@adriangb

@austinlparker I agree with your general vision: OpenTelemetry is a vendor agnostic project with an ambitious goal and a lot of stakeholders. But how does that relate to what I am concretely discussing and proposing here? Just because there are a lot of stakeholders does that mean it's not possible to address the concrete proposals and issues I've brought up?

Without breaking out the dictionary

The dictionary was broken out above already and it says logs can also be related to durations, but I agree that making decision based on dictionary definitions is not a good path forward.

what if someone doesn't have a span database, to turn your example around

Again, I'm not saying logs shouldn't exist at all, and I welcome the option for users to transform trace information into logs if they must send it somewhere that doesn't support traces. But what we are talking about here is taking information that used to exist in tracing and moving it to logs on the premise that tracing doesn't support that sort of data. That is what I am opposing, on the basis that it could be done without moving this data to another signal and without making a breaking change for existing backends.

Our log signal is evolving to support a highly structured, schema-first approach, as this most comports with our vision of unified telemetry

I am interested in what this vision of unified telemetry is. I've been in this community for years and it is very hard to track what the long term vision is. It seems to me that this vision is being created as we speak. And are traces not highly structured?

@lmolkova
Contributor Author

lmolkova commented Feb 12, 2025

taking information that used to exist in tracing and moving it to logs on the premise that tracing doesn't support that sort of data

I think you're referring to #4333 and the long term vision to replace span events with logs.
The existing solution (span events) is limiting for these reasons and there is a huge intersection with logs which only adds to the complexity and confusion.

While you may choose to build everything on top of traces, other vendors made a different choice and they have good reasons behind it too. OTel should enable both with the minimum complexity for the end users.

The long term vision for events was discussed in all possible directions in OTEP 265 and now lives here.
This does not mean breaking. If you want to receive span events instead of logs, you should be able to, even as producers of telemetry switch to logs.

What could be productive is to contribute to the migration story.

@adriangb

adriangb commented Feb 12, 2025

While you may choose to build everything on top of traces, other vendors made a different choice and they have good reasons behind it too.

Could you share the reasons why that choice has been made? I have made a legitimate effort to look through old issues and discussions but I have not found compelling reasons as to why using logs is better than what I am proposing here.

This does not mean breaking. If you want to receive span events instead of logs, you should be able to, even as producers of telemetry switch to logs.

If a user updates their SDK and/or instrumentation and suddenly data is missing, I would consider that a breaking change, even if they can configure things to avoid it or I can implement a whole new signal. I think there was agreement on the calls today that this is a breaking change, so saying that it is not a breaking change now is confusing.

The long term vision for events was discussed in all possible directions open-telemetry/oteps#265 and now lives here.

I obviously missed the boat on that discussion. I'm sorry I'm showing up now and dredging it back up. I have a wide span of open source contribution and it can be hard to keep up with everything. To be fair, it's not Logs themselves I have a problem with, or the merging of the Events and Logs API (which is a good part of that OTEP); it's mainly moving exceptions to logs. And because I don't want to oppose progress without offering a solution, I've proposed a way to send exceptions via the existing traces signal that I believe addresses most of the concerns with span events, is less of a breaking change, is less of a departure from where we are now, and will be easier for users, the SDK and backends to implement. I have not seen any serious discussion of this approach before, and the most concrete feedback I've gotten on it has been "the ship has sailed", dictionary definitions and incomplete workarounds for the breaking changes being introduced.

I even compiled all comments and the diff from open-telemetry/oteps#265, this issue and a couple of others that I've been browsing and fed it into Claude, asking it to find any previous discussions of this approach to see if I'm missing an important bit of discussion; as far as I can tell it has not been proposed before in this level of detail. Issues where I see concrete discussion of zero duration spans are #4123 and #2125. The former has a different context and the conclusions aren't necessarily useful, but there is a good bit of discussion around what a span is and when something should be a span or not. The latter issue has really no response at all, and it is not being discussed in the context of replacing span events.

Let's talk about #4123 (comment): I generally agree with this definition, and I think it can still be true at the SDK and semantic level even if the wire protocol is a bit less constrained. There is already precedent for the wire protocol being more flexible or differing from the semantic and SDK layers (e.g. complex attributes). What I am proposing means zero changes at the wire protocol level, only semantic changes and additional SDK APIs.

I think I have done a fair share of trying to catch up on previous discussions and understand how we got to where we are. I was asked for concrete plans and have given them. I was asked for benchmarks and have given them. I'd appreciate a bit more concrete discussion on the actual proposal.

@austinlparker
Member

austinlparker commented Feb 12, 2025

I'd appreciate a bit more concrete discussion on the actual proposal.

If your desire is "I want to record exceptions as span events" then you'll be able to do that under the current direction of the log/event API. I do not believe there is support for relaxing the specification in order to canonize zero-duration spans, for the reasons previously stated.

ed: I would suggest that the current direction actually works much better for users, as they can have a single exception handler that writes an exception event which will be emitted via the span or log pipeline based on context.

@adriangb

adriangb commented Feb 12, 2025

I'd appreciate a bit more concrete discussion on the actual proposal.
If your desire is "I want to record exceptions as span events" then you'll be able to do that under the current direction of the log/event API

I am not proposing to record exceptions as span events. I am proposing a mechanism for sending independent events (or logs, whatever your preferred terminology is) over the existing traces signal in a backwards compatible way so that OTEL doesn't have to force every backend to support logs and traces in what I think we agree is a breaking change. And that's what I'm trying to steer the discussion towards by asking concrete feedback on.

I do not believe there is support for relaxing the specification in order to canonize zero-duration spans, for the reasons previously stated.

What part of the specification has to be relaxed? Is there anywhere that says that zero duration spans are not allowed over the wire? Or is there any specification that says they are explicitly meaningless, not allowed, etc.? What are the reasons previously stated? The dictionary definition of the generic term "span"? Could you be more concrete please?

I would suggest that the current direction actually works much better for users, as they can have a single exception handler that writes an exception event which will be emitted via the span or log pipeline based on context

Much better than my proposal here or existing span events? I am not clear which one of the two things your comment is responding to based on the confusion in the previous paragraph.

Assuming a world exists where you can choose any one of the options, why not choose the default one to be something that minimizes breakage relative to the current solution? Just as you are stating that it will be possible to send exceptions via spans instead of logs why not flip that and say it is possible to send exceptions via logs, which is a smaller change given they are currently sent via spans?

@adriangb

adriangb commented Feb 12, 2025

I forgot to touch on

Using logs to record errors has the following advantages over using span events: …

I believe this proposal addresses all of the issues raised there.

They can be recorded for operations that don't have any tracing instrumentation

I understand this complaint as "why should I have to open a span just to record an exception?" or, in other words, "how do I record an exception if I don't have a currently active span?". As I showed in the example in #4414 (comment), it is totally possible to do this under this proposal. You can have a top level logger.exception(...). If the argument is that you avoid having to configure the tracing SDK, well, you now need to configure the logs SDK, so I don't see much benefit to that (maybe you always have both, in which case there is no pro or con).

They can be sampled along with or separately from spans

Doesn't tail sampling sort of achieve this already? I can set up a collector or other sampler to only keep traces where there are exceptions / spans with their status set to error (which would continue to work under this proposal). I think it's rarer, but there is a use case for keeping only the exception and discarding the rest of the information in that trace, which cannot be achieved under current sampling paradigms. But IMO that is a limitation of trace sampling that should be addressed separately; I gave more detail in #4414 (comment).

They can have different severity levels to reflect how critical the error is

I addressed this in #4414 (comment):

logs provide verbosity control out of the box

Totally agree this is an advantage of logs. I think there's a lot to be said for a discussion on how to incorporate verbosity into traces...

They are already reported natively by many frameworks and libraries

And I think it's reasonable that if users want to capture this signal, instead of or in addition to using tracing auto instrumentation, they will have to use a backend that supports logs or do some SDK or collector configuration to pipe logs into spans. As an example we support that already under the paradigm being proposed here: https://logfire.pydantic.dev/docs/integrations/logging/#oh-no-too-many-logs-from

Additionally, as mentioned in that PR and by @trask in the call today, there is the buffering of events until a span ends and the memory or truncation issues that can lead to. This proposal addresses that because you can fire off an event via tracing at any time and it will be sent out with the next export.
