Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Span Stats processor #403

Closed
ashwinidulams opened this issue Jul 7, 2020 · 38 comments · Fixed by #2225
Closed

Proposal: Span Stats processor #403

ashwinidulams opened this issue Jul 7, 2020 · 38 comments · Fixed by #2225
Labels
enhancement New feature or request help wanted Extra attention is needed
Milestone

Comments

@ashwinidulams
Copy link

Span Statistics Proposal

A processor for aggregating spans to derive RED (Request, Error, Distribution) metrics across a set of dimensions (dynamic or otherwise) and exporting the results to a metrics sink (prometheus endpoint, stream etc).

Motivation

A trace contains end-to-end data about the service call chain in the context of a request/transaction. This data can be mined to derive context rich metrics that can help us identify and isolate the problem as quickly as possible. For example, span level aggregation can be used to detect any performance problems within the process boundary defined by the span.

Instead of forwarding the spans to a separate processor (stream processing application) for aggregation, a otel processor can process the spans in flight and emit RED metrics at configurable time intervals.

Proposed Configuration

processors:
  # name of the processor 
  spanstatsprocessor:
    # time window for aggregation in seconds.
    timewindowseconds: 15
    # dimensions for aggregation. could be an array of aggregations as well.
    # for each set of dimensions, RED (request, error and distribution) metrics will be emitted aggregated per timewindow.
    dimensions:
    - http.url
    - custom_tag1
    # sink for metrics
    sink:
      # prometheus sink with a endpoint exposed at port 24880
      # alternately could be a kafka stream.
      prometheusport: 24880
@jrcamp
Copy link
Contributor

jrcamp commented Jul 30, 2020

Just so it's not lost, @jmacd wrote in the original issue:

See open-telemetry/opentelemetry-specification#657, which aims to specify conventions for this kind of processor.

@jrcamp jrcamp added the help wanted Extra attention is needed label Jul 30, 2020
@jrcamp jrcamp added this to the Backlog milestone Jul 30, 2020
@jmacd
Copy link
Contributor

jmacd commented Aug 20, 2020

There is a related discussion open-telemetry/opentelemetry-specification#739.

Also open-telemetry/opentelemetry-specification#381.

This is a great thing and we should have such a processor. I would expect this processor to output OTLP metrics into a metrics pipeline, which I believe is implied by the config snippet above. 👍

@pauldraper
Copy link

pauldraper commented Aug 28, 2020

What are the metrics? span.count and span.duration?

Does it make sense to allow attributes as metrics (e.g. response.bytes)?

@albertteoh
Copy link
Contributor

I'd like to take on this task if no one's put their hand up yet.

I would expect this processor to output OTLP metrics into a metrics pipeline

@jmacd, this suggestion makes a lot of sense, as I've recently learned from @jpkrohling. As I'm new to OTEL, what do you suggest would be the best approach to achieving this? I haven't put too much thought into it, but my initial thinking is to copy the OTLP exporter's metrics exporting functionality.

If this doesn't sound like the right approach, can you please point me in the right general direction?

@albertteoh
Copy link
Contributor

albertteoh commented Dec 10, 2020

I've put together a Proof of Concept. Here's a screenshot of a Grafana dashboard based on the R.E.D metrics aggregated from span data. It contains a line graph of hit counts (Request rate), Error counts and rates and a histogram of latencies for a specific endpoint (Duration):

Screen Shot 2020-10-26 at 9 07 30 pm

@albertteoh
Copy link
Contributor

albertteoh commented Dec 10, 2020

Wondering if I could get some early feedback on this from the community, particularly, the approach taken to transform trace telemetry to metrics, which I've illustrated here:

Metrics from trace processor 3 pipelines (1)

The approach is to have a common trace receiver that feeds two pipelines, a trace-only pipeline for trace ingestion, and a trace->metrics pipeline that performs aggregations on span data and writes them out to a configured metrics exporter.

The exporter is chosen using a processor configuration like so:

exporters:
   prometheus:
     endpoint: "0.0.0.0:8889"
     ...

processors:
  spanmetrics:
    metrics_exporter: prometheus
    ...

And the code that performs the metrics exporter discovery:

// Start implements the component.Component interface.
func (p *processorImp) Start(ctx context.Context, host component.Host) error {
	exporters := host.GetExporters()

	for k, exp := range exporters[configmodels.MetricsDataType] {
                ...
                // Check if the exporter k has the same name as the configured exporter. e.g. prometheus
		if k.Name() == p.config.MetricsExporter {
			p.metricsExporter = metricsExp
			break
		}
	}
	if p.metricsExporter == nil {
		return fmt.Errorf( "failed to find metrics exporter: '%s'", p.config.MetricsExporter)
	}
	return nil
}

That all seems okay to me so far, though please suggest any improvements to the above proposed approach.

The part that I'm less sure of is the user experience when configuring pipelines, which seems less than ideal due to the following OTEL collector constraints (which make sense by the way):

  1. Needing at least one receiver and at least one exporter in a pipeline.
  2. Exporters are only discoverable if they belong to a pipeline.

Here's the relevant sections of my POC config. Note the need for:

  • a dummy otlp/spanmetrics receiver (requirement 1.).
  • an unused exporter (I just used logger) in the traces/spanmetrics pipeline, which aggregates span data to metrics (requirement 1.).
  • a dummy metrics/spanmetrics pipeline to "house" the prometheus metrics exporter to ensure it's discoverable in the processor Start() function (requirement 2.).
 receivers:
   # Dummy receiver that's never used, because a pipeline is required to have one.
   otlp/spanmetrics:
     protocols:
       grpc:
         endpoint: "localhost:12345"

 service:
   pipelines:

     # Trace-only pipeline for ingesting traces/spans.
     traces:
       receivers: [jaeger]
       processors: [batch, queued_retry]
       exporters: [logging]

     # Dedicated pipeline for receiving spans and emitting metrics.
     # Note the logging exporter is not really used, but is mandatory for pipelines (must have at least one receiver and exporter)
     traces/spanmetrics:
       receivers: [jaeger]
       processors: [spanmetrics]
       exporters: [logging]

     # A dummy 'internal' pipeline dedicated to exposing a metrics exporter for the spanmetrics pipeline.
     # The exporter name must match the metrics_exporter name.
     # The receiver is just a dummy and never used; added because it's mandatory.
     metrics/spanmetrics:
       receivers: [otlp/spanmetrics]
       exporters: [prometheus]

To me, the need for these dummy configuration entries makes for a less than ideal configuration experience for those using this processor and wondering if folks have any suggestions to improve this?

@objectiser
Copy link
Contributor

@albertteoh Could there be a concept of a local metrics receiver, that any component could generate metrics for? So then the spanmetrics processor could be part of the normal trace pipeline, and the local metrics receiver could be used to trigger a single metrics pipeline.

receivers:
  otlp:
    protocols:
      grpc:
         endpoint: "localhost:12345"

  local/mymetrics:

exporters:
   prometheus:
     endpoint: "0.0.0.0:8889"
     ...

processors:
  spanmetrics:
    metrics_receiver: local/mymetrics

service:
  pipelines:
   traces:
     receivers: [otlp]
     processors: [spanmetrics, batch, queued_retry]
     exporters: [logging]
   metrics:
     receivers: [local/mymetrics]
     exporters: [prometheus]

@albertteoh
Copy link
Contributor

albertteoh commented Dec 11, 2020

Thanks @objectiser, that config looks cleaner. Please correct me if I'm wrong; I think the spanmetrics processor still needs an exporter as they expose the ConsumeMetrics function for sending metrics (i.e. to trigger a pipeline), whereas receivers don't offer such means.

Your suggestion got me thinking though... I'd like to understand the motivation for needing at least one receiver in a pipeline? Removing this requirement would mean a local receiver is no longer needed for the use case of producing metrics from traces, and we simply configure what's necessary (@tigrannajaryan @bogdandrutu, what do you think?):

receivers:
   jaeger:
     protocols:
       thrift_http:
         endpoint: "0.0.0.0:14278"

exporters:
   prometheus:
     endpoint: "0.0.0.0:8889"
     ...

processors:
  spanmetrics:
    metrics_exporter: prometheus

service:
  pipelines:
   traces:
     receivers: [jaeger]
     processors: [spanmetrics, batch, queued_retry]
     exporters: [logging]
   metrics:
     exporters: [prometheus]

@objectiser
Copy link
Contributor

@albertteoh My suggestions were not based on the current internal implementation, but focusing first on how it might be simplified for configuration - and then address how to deal with this "local" receiver as a second consideration.

I like your updated suggestion as well - my only comment would be that it would not allow any processors to be defined on the metrics generated by the spanmetrics processor - whereas if those metrics could be passed to a local receiver, it still offers the full metrics pipeline to process the metrics and potentially export to multiple destinations.

@djaglowski
Copy link
Member

Here is a very rough draft of an alternate proposal. I'm providing here in case it may generate useful discussion or ideas.

@objectiser
Copy link
Contributor

@djaglowski Sounds reasonable - although for clarity could you add an example OTC config that addresses this use case, for example:

  • tracing data, processed to derive metrics
  • tracing data then needs to be further processed before being exported
  • derived metrics then exported to another metrics pipeline that maybe does additional processing before exporting to the desired backend

@djaglowski
Copy link
Member

@objectiser, I've updated the doc to include your example scenario and a possible configuration for it. This example highlights a gap in my proposal, which I've also addressed at the end. Thanks for reviewing, and for the suggestion.

@objectiser
Copy link
Contributor

@djaglowski Looks good, thanks for the update.

@albertteoh
Copy link
Contributor

albertteoh commented Dec 21, 2020

Thanks for putting together the proposal, @djaglowski.

I feel an important requirement for trace -> metrics aggregation is having access to all span data before any filtering (like tail-based sampling) and potentially after span modification processors are applied.

If I understand correctly, the signal translators are implemented as exporters, which I believe would be subject to all processors in the pipeline being executed first before reaching the exporter and hence potentially result in lower-fidelity metrics.

@djaglowski
Copy link
Member

djaglowski commented Dec 22, 2020

@albertteoh I believe my proposal covers this, though the usability may be a bit poor.

In the example shown in the doc, copied below, pipeline a is a forking mechanism for the raw traces, so that they can be 1) passed along to pipeline b without modification, and 2) processed into metrics and passed to pipeline c.

receivers:
  some/tracingreceiver:
  otlp/tracingreceiver:
    endpoint: "0.0.0.0:1111"
  otlp/metricsreceiver:
    endpoint: "0.0.0.0:2222"
  
processors:
  some/tracingprocessor:
  some/metricsprocessor:
  
exporters:
  some/tracingexporter:
  some/metricsexporter:
  otlp/tracingexporter:
    endpoint: "0.0.0.0:1111"
  tracestometrics: # perform some derivation and behave as otlp/metricsexporter
    endpoint: "0.0.0.0:2222"

service:
  pipelines:
    a:
      receivers: [some/tracingreceiver]
      exporters: [otlp/tracingexporter, tracestometrics]
    b:
      receivers: [some/tracingreceiver]
      processors: [some/tracingprocessor]
      exporters: [some/tracingexporter]
    c:
      receivers: [some/metricsreceiver]
      processors: [some/metricsprocessor]
      exporters: [some/metricsexporter]

Diagram:
image

@djaglowski
Copy link
Member

djaglowski commented Dec 22, 2020

A new type of component would simplify the above proposal. This is pretty similar to the "local receiver" idea proposed by @objectiser.

I've expanded upon this in the doc, but at a high level, the idea is that translators could be used to link pipelines together:

receivers:
  some/tracingreceiver:
  
processors:
  some/tracingprocessor:
  some/metricsprocessor:
  
exporters:
  some/tracingexporter:
  some/metricsexporter:

translators: # must be used as both exporter and receiver
  tracestotraces:
  tracestometrics:

service:
  pipelines:
    a:
      receivers: [some/tracingreceiver]
      exporters: [tracestotraces, tracestometrics]
    b:
      receivers: [tracestotraces]
      processors: [some/tracingprocessor]
      exporters: [some/tracingexporter]
    c:
      receivers: [tracestometrics]
      processors: [some/metricsprocessor]
      exporters: [some/metricsexporter]

image

@objectiser
Copy link
Contributor

@djaglowski Haven't looked in the doc, so possibly explained there - but what is the purpose of the Traces to Traces Translater? Seems redundant as just dealing with the trace pipeline.

@djaglowski
Copy link
Member

djaglowski commented Dec 22, 2020

@objectiser It would probably make more sense if described as a Traces Forwarder. The reasoning for this to be included here at all, is that it would be a bridge between pipelines.

Ultimately, this is all in the context of trying to work with the existing pipeline format, where a fan out happens before Exporters. A "Translator" would be an abstraction of an Exporter/Receiver combo that bridges two pipelines.

image

@objectiser
Copy link
Contributor

@djaglowski Ok thanks for the explanation - I'm not familiar with the internals of the collector so may be wrong - but I thought Traces Receiver could just be shared by pipeline A and B (i.e. also used as the receiver for pipeline B instead of the Traces to Traces Translator) to achieve the same result?

@djaglowski
Copy link
Member

@objectiser You might be right about being able to reuse components.

However, for the purpose of chaining together pipelines, I believe it would become ambiguous which pipeline you intend to emit to.

The Traces Forwarder would be an abstraction on both a receiver and an exporter, both of which are configured to communicate on a specific port.
image

So the idea here was to encapsulate all the details of linking the two together, and just have a single component instead.

@albertteoh
Copy link
Contributor

Thanks for the additional details @djaglowski. I like the translator/forwarder/bridge concept, especially given it tightly couples the exporter and receiver link between pipelines which avoids ambiguity on where exporters will route requests to, and hence a better user experience. It satisfies the requirement for chaining pipelines and "mixed" telemetry type pipelines, while IIUC, maintaining backwards compatibility with existing interfaces.

I also agree that your proposal satisfies the requirement of aggregating traces to metrics before processing.

@tigrannajaryan
Copy link
Member

@djaglowski I commented on the design doc.

@andrewhsu andrewhsu added the enhancement New feature or request label Jan 6, 2021
tigrannajaryan pushed a commit that referenced this issue Jan 28, 2021
**Description:** Adds spanmetricsprocessor logic as a follow-up to #1917.

**Link to tracking Issue:** #403

**Testing:**
- Unit and integration tested locally (by adding to `cmd/otelcontribcol/components.go`).
- Performance tested (added metrics key cache to improve performance) and added benchmark function.

**Documentation:** Add relevant documentation to exported structs and functions.
@tigrannajaryan
Copy link
Member

Is this now done or more work is planned?

@albertteoh
Copy link
Contributor

@tigrannajaryan, I've been following the contributing guidelines so there's one last PR required which is to enable the spanmetrics processor component, if you're happy for me to go ahead with it.

tigrannajaryan pushed a commit that referenced this issue Jan 29, 2021
Signed-off-by: albertteoh <[email protected]>

**Description:** Enables the spanmetricsprocessor for use in otel collector configs as a follow-up to #2012.

**Link to tracking Issue:** Fixes #403

**Testing:** Integration tested locally against a prebuilt Grafana dashboard to confirm metrics look correct.
@halayli
Copy link

halayli commented Sep 11, 2021

If the goal of this processor is to convert traces to a metrics, then It should focus on extraction and shouldn't do any aggregation but rather pass the metrics to a metrics processor. And as far as I can tell, in its current implementation it's ignoring the span events which are as important as span attributes.

@albertteoh
Copy link
Contributor

@halayli could you elaborate more on what you mean, particularly about extraction and span events?

@Tenaria
Copy link

Tenaria commented Jan 20, 2022

Agree with @halayli's comment above that the span metrics processor shouldn't aggregate but simply extract the metrics for the trace and forward. I feel like the span metrics processor is becoming a bit convoluted due to the added layer of complexity introduced by aggregation that is further complicated as we add in more features.

In my opinion if aggregation is desired, that can be a part of a separate processor - I saw this open issue #4968 which is highly relevant in that case. That way we don't have all these processors that are all doing multiple things (e.g it sounds like the metric transform processor is also doing some aggregation). I think it'll help us moving forward.

What are other people's thoughts around this (maintainers/contributors/other users /etc.), would you be receptive to this change in removing the aggregation?

@tonychoe
Copy link

I agree that we better keep this processor to generating span-metrics only. I prefer to do aggregation via the metrics backend or other otelcol processor.

@albertteoh
Copy link
Contributor

Thanks @Tenaria for bringing up this topic again.

+1 to separating the span to metrics translation from the metrics aggregation concern.

Reading over the comments again, I think I've got a better appreciation of @halayli's the suggestion (although I'm still not sure what span events refer to). I noticed in the last few PRs that there was a fair bit of discussion around locking, concurrency and performance, and I suspect these concerns should be mitigated if we remove aggregation.

When spanmetrics processor first came to being, the goal was just to get something working quickly to prove its utility. Clearly, it's evolved to the point where the need to decouple the aggregation and translation concerns is quite evident.

Fortunately, in @tigrannajaryan's greater wisdom, the spanmetrics processor is still in "experimental" mode so I'm not too concerned about introducing breaking changes at this stage, and this will be a breaking change.

@tigrannajaryan
Copy link
Member

I support the idea of separating the concerns of extracting metric values from spans (which can be this specialized processor) vs aggregating the extracted metrics (which can be a generic metric aggregation processor).

However if we are doing breaking change I suggest that we first announce it before start making changes (in Slack and in SIG meeting), put together a 1-pager that describes how the new approach will work, how it will substitute the current approach and how the changes will be rolled out (very rough plan with rollout steps would be great to have). Since the new approach relies on having the metric aggregation processor we must ensure that metric aggregation processor exists before the aggregation features are removed from this processor (otherwise we will end up in temporary a state where we don't have feature parity with the current processor).

If anyone is willing to drive this change then the maintainers/approvers will be happy to review.

@jamesmoessis
Copy link
Contributor

@tigrannajaryan it seems the metrictransformprocessor has aggregation functionalities, albeit it looks a little outdated from first glance. Is there still interest in a completely new and separate metricsaggregatorprocessor?

@tigrannajaryan
Copy link
Member

I would expect that metrictransformprocessor should be able to do the aggregation that we need.

There is an effort to refine the metrictransformprocessor, which I am not following closely, so I don't know where exactly they are right now: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues?q=is%3Aissue+is%3Aopen++label%3A%22proc%3A+metricstransformprocessor%22+

@jmacd
Copy link
Contributor

jmacd commented Jan 24, 2022

What I'd like to see in the future / it would be cool if a span exporter would use an OpenTelemetry metrics API MeterProvider configured to use a private instance of an OTel-Go Metrics SDK configured to output to a metrics receiver.

There's a 1:1 correspondence between Spans arriving and Meter(library_name).Counter(span_name).Add(adjusted_count) API events, so the span receiver would be extremely simple and stateless itself.

The OTel-Go SDK would do the aggregation for you, and note that it's performing aggregation over counter events, not arbitrary metrics data points. The logic required inside an SDK for aggregating events into OTLP is substantially simpler than a generic metrics aggregation processor, that's why I think this is a good idea.

@tigrannajaryan
Copy link
Member

What I'd like to see in the future / it would be cool if a span exporter would use an OpenTelemetry metrics API MeterProvider configured to use a private instance of an OTel-Go Metrics SDK configured to output to a metrics receiver.

It will likely not have the best performance due to extra conversions and (de)serialization. For better performance we need to stay in pdata space.

@MovieStoreGuy
Copy link
Contributor

MovieStoreGuy commented Jan 25, 2022

It will likely not have the best performance due to extra conversions and (de)serialization. For better performance we need to stay in pdata space.

Did someone say generics? :trollface:

But I agree, moving towards a scoped meter that can remain in pdata space sounds like the best solution for this.

@ambition-consulting
Copy link

ambition-consulting commented Oct 31, 2022

I've put together a Proof of Concept. Here's a screenshot of a Grafana dashboard based on the R.E.D metrics aggregated from span data. It contains a line graph of hit counts (Request rate), Error counts and rates and a histogram of latencies for a specific endpoint (Duration):

Screen Shot 2020-10-26 at 9 07 30 pm

Hey @albertteoh, do you have the grafana config for this available somewhere? I'm using your plugin and would like to visualize those metrcis now that I got it working.

Sweet work btw. :)

@albertteoh
Copy link
Contributor

@ambition-consulting you can try: https://snapshots.raintank.io/dashboard/snapshot/J1haxtp7oBF1BpgL7JBqHzZgVNxlAlWW

@ambition-consulting
Copy link

ambition-consulting commented Nov 2, 2022

@ambition-consulting you can try: https://snapshots.raintank.io/dashboard/snapshot/J1haxtp7oBF1BpgL7JBqHzZgVNxlAlWW

@albertteoh this looks great, but the values get hard coded on export. can I please (!) get a json with the functions being applied on the output of the spanmetricsprocessor?

my asumption is that this is what the dashboard does? or are those just dummy values that don't come from spanmetric processor?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.