-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Span Stats processor #403
Comments
Just so it's not lost, @jmacd wrote in the original issue:
|
There is a related discussion open-telemetry/opentelemetry-specification#739. Also open-telemetry/opentelemetry-specification#381. This is a great thing and we should have such a processor. I would expect this processor to output OTLP metrics into a metrics pipeline, which I believe is implied by the config snippet above. 👍 |
What are the metrics? Does it make sense to allow attributes as metrics (e.g. |
I'd like to take on this task if no one's put their hand up yet.
@jmacd, this suggestion makes a lot of sense, as I've recently learned from @jpkrohling. As I'm new to OTEL, what do you suggest would be the best approach to achieving this? I haven't put too much thought into it, but my initial thinking is to copy the OTLP exporter's metrics exporting functionality. If this doesn't sound like the right approach, can you please point me in the right general direction? |
Wondering if I could get some early feedback on this from the community, particularly, the approach taken to transform trace telemetry to metrics, which I've illustrated here: The approach is to have a common trace receiver that feeds two pipelines, a trace-only pipeline for trace ingestion, and a trace->metrics pipeline that performs aggregations on span data and writes them out to a configured metrics exporter. The exporter is chosen using a processor configuration like so: exporters:
prometheus:
endpoint: "0.0.0.0:8889"
...
processors:
spanmetrics:
metrics_exporter: prometheus
... And the code that performs the metrics exporter discovery: // Start implements the component.Component interface.
func (p *processorImp) Start(ctx context.Context, host component.Host) error {
exporters := host.GetExporters()
for k, exp := range exporters[configmodels.MetricsDataType] {
...
// Check if the exporter k has the same name as the configured exporter. e.g. prometheus
if k.Name() == p.config.MetricsExporter {
p.metricsExporter = metricsExp
break
}
}
if p.metricsExporter == nil {
return fmt.Errorf( "failed to find metrics exporter: '%s'", p.config.MetricsExporter)
}
return nil
} That all seems okay to me so far, though please suggest any improvements to the above proposed approach. The part that I'm less sure of is the user experience when configuring pipelines, which seems less than ideal due to the following OTEL collector constraints (which make sense by the way):
Here's the relevant sections of my POC config. Note the need for:
receivers:
# Dummy receiver that's never used, because a pipeline is required to have one.
otlp/spanmetrics:
protocols:
grpc:
endpoint: "localhost:12345"
service:
pipelines:
# Trace-only pipeline for ingesting traces/spans.
traces:
receivers: [jaeger]
processors: [batch, queued_retry]
exporters: [logging]
# Dedicated pipeline for receiving spans and emitting metrics.
# Note the logging exporter is not really used, but is mandatory for pipelines (must have at least one receiver and exporter)
traces/spanmetrics:
receivers: [jaeger]
processors: [spanmetrics]
exporters: [logging]
# A dummy 'internal' pipeline dedicated to exposing a metrics exporter for the spanmetrics pipeline.
# The exporter name must match the metrics_exporter name.
# The receiver is just a dummy and never used; added because it's mandatory.
metrics/spanmetrics:
receivers: [otlp/spanmetrics]
exporters: [prometheus] To me, the need for these dummy configuration entries makes for a less than ideal configuration experience for those using this processor and wondering if folks have any suggestions to improve this? |
@albertteoh Could there be a concept of a
|
Thanks @objectiser, that config looks cleaner. Please correct me if I'm wrong; I think the Your suggestion got me thinking though... I'd like to understand the motivation for needing at least one receiver in a pipeline? Removing this requirement would mean a local receiver is no longer needed for the use case of producing metrics from traces, and we simply configure what's necessary (@tigrannajaryan @bogdandrutu, what do you think?): receivers:
jaeger:
protocols:
thrift_http:
endpoint: "0.0.0.0:14278"
exporters:
prometheus:
endpoint: "0.0.0.0:8889"
...
processors:
spanmetrics:
metrics_exporter: prometheus
service:
pipelines:
traces:
receivers: [jaeger]
processors: [spanmetrics, batch, queued_retry]
exporters: [logging]
metrics:
exporters: [prometheus] |
@albertteoh My suggestions were not based on the current internal implementation, but focusing first on how it might be simplified for configuration - and then address how to deal with this "local" receiver as a second consideration. I like your updated suggestion as well - my only comment would be that it would not allow any processors to be defined on the metrics generated by the |
Here is a very rough draft of an alternate proposal. I'm providing here in case it may generate useful discussion or ideas. |
@djaglowski Sounds reasonable - although for clarity could you add an example OTC config that addresses this use case, for example:
|
@objectiser, I've updated the doc to include your example scenario and a possible configuration for it. This example highlights a gap in my proposal, which I've also addressed at the end. Thanks for reviewing, and for the suggestion. |
@djaglowski Looks good, thanks for the update. |
Thanks for putting together the proposal, @djaglowski. I feel an important requirement for trace -> metrics aggregation is having access to all span data before any filtering (like tail-based sampling) and potentially after span modification processors are applied. If I understand correctly, the signal translators are implemented as exporters, which I believe would be subject to all processors in the pipeline being executed first before reaching the exporter and hence potentially result in lower-fidelity metrics. |
@albertteoh I believe my proposal covers this, though the usability may be a bit poor. In the example shown in the doc, copied below,
|
A new type of component would simplify the above proposal. This is pretty similar to the "local receiver" idea proposed by @objectiser. I've expanded upon this in the doc, but at a high level, the idea is that
|
@djaglowski Haven't looked in the doc, so possibly explained there - but what is the purpose of the |
@objectiser It would probably make more sense if described as a Ultimately, this is all in the context of trying to work with the existing pipeline format, where a fan out happens before Exporters. A "Translator" would be an abstraction of an Exporter/Receiver combo that bridges two pipelines. |
@djaglowski Ok thanks for the explanation - I'm not familiar with the internals of the collector so may be wrong - but I thought |
@objectiser You might be right about being able to reuse components. However, for the purpose of chaining together pipelines, I believe it would become ambiguous which pipeline you intend to emit to. The So the idea here was to encapsulate all the details of linking the two together, and just have a single component instead. |
Thanks for the additional details @djaglowski. I like the translator/forwarder/bridge concept, especially given it tightly couples the exporter and receiver link between pipelines which avoids ambiguity on where exporters will route requests to, and hence a better user experience. It satisfies the requirement for chaining pipelines and "mixed" telemetry type pipelines, while IIUC, maintaining backwards compatibility with existing interfaces. I also agree that your proposal satisfies the requirement of aggregating traces to metrics before processing. |
@djaglowski I commented on the design doc. |
**Description:** Adds spanmetricsprocessor logic as a follow-up to #1917. **Link to tracking Issue:** #403 **Testing:** - Unit and integration tested locally (by adding to `cmd/otelcontribcol/components.go`). - Performance tested (added metrics key cache to improve performance) and added benchmark function. **Documentation:** Add relevant documentation to exported structs and functions.
Is this now done or more work is planned? |
@tigrannajaryan, I've been following the contributing guidelines so there's one last PR required which is to enable the spanmetrics processor component, if you're happy for me to go ahead with it. |
Signed-off-by: albertteoh <[email protected]> **Description:** Enables the spanmetricsprocessor for use in otel collector configs as a follow-up to #2012. **Link to tracking Issue:** Fixes #403 **Testing:** Integration tested locally against a prebuilt Grafana dashboard to confirm metrics look correct.
If the goal of this processor is to convert traces to a metrics, then It should focus on extraction and shouldn't do any aggregation but rather pass the metrics to a metrics processor. And as far as I can tell, in its current implementation it's ignoring the span events which are as important as span attributes. |
@halayli could you elaborate more on what you mean, particularly about extraction and span events? |
Agree with @halayli's comment above that the span metrics processor shouldn't aggregate but simply extract the metrics for the trace and forward. I feel like the span metrics processor is becoming a bit convoluted due to the added layer of complexity introduced by aggregation that is further complicated as we add in more features. In my opinion if aggregation is desired, that can be a part of a separate processor - I saw this open issue #4968 which is highly relevant in that case. That way we don't have all these processors that are all doing multiple things (e.g it sounds like the metric transform processor is also doing some aggregation). I think it'll help us moving forward. What are other people's thoughts around this (maintainers/contributors/other users /etc.), would you be receptive to this change in removing the aggregation? |
I agree that we better keep this processor to generating span-metrics only. I prefer to do aggregation via the metrics backend or other otelcol processor. |
Thanks @Tenaria for bringing up this topic again. +1 to separating the span to metrics translation from the metrics aggregation concern. Reading over the comments again, I think I've got a better appreciation of @halayli's the suggestion (although I'm still not sure what span events refer to). I noticed in the last few PRs that there was a fair bit of discussion around locking, concurrency and performance, and I suspect these concerns should be mitigated if we remove aggregation. When spanmetrics processor first came to being, the goal was just to get something working quickly to prove its utility. Clearly, it's evolved to the point where the need to decouple the aggregation and translation concerns is quite evident. Fortunately, in @tigrannajaryan's greater wisdom, the spanmetrics processor is still in "experimental" mode so I'm not too concerned about introducing breaking changes at this stage, and this will be a breaking change. |
I support the idea of separating the concerns of extracting metric values from spans (which can be this specialized processor) vs aggregating the extracted metrics (which can be a generic metric aggregation processor). However if we are doing breaking change I suggest that we first announce it before start making changes (in Slack and in SIG meeting), put together a 1-pager that describes how the new approach will work, how it will substitute the current approach and how the changes will be rolled out (very rough plan with rollout steps would be great to have). Since the new approach relies on having the metric aggregation processor we must ensure that metric aggregation processor exists before the aggregation features are removed from this processor (otherwise we will end up in temporary a state where we don't have feature parity with the current processor). If anyone is willing to drive this change then the maintainers/approvers will be happy to review. |
@tigrannajaryan it seems the |
I would expect that metrictransformprocessor should be able to do the aggregation that we need. There is an effort to refine the metrictransformprocessor, which I am not following closely, so I don't know where exactly they are right now: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues?q=is%3Aissue+is%3Aopen++label%3A%22proc%3A+metricstransformprocessor%22+ |
What I'd like to see in the future / it would be cool if a span exporter would use an OpenTelemetry metrics API There's a 1:1 correspondence between Spans arriving and The OTel-Go SDK would do the aggregation for you, and note that it's performing aggregation over counter events, not arbitrary metrics data points. The logic required inside an SDK for aggregating events into OTLP is substantially simpler than a generic metrics aggregation processor, that's why I think this is a good idea. |
It will likely not have the best performance due to extra conversions and (de)serialization. For better performance we need to stay in |
Did someone say generics? But I agree, moving towards a scoped meter that can remain in pdata space sounds like the best solution for this. |
Hey @albertteoh, do you have the grafana config for this available somewhere? I'm using your plugin and would like to visualize those metrcis now that I got it working. Sweet work btw. :) |
@albertteoh this looks great, but the values get hard coded on export. can I please (!) get a json with the functions being applied on the output of the spanmetricsprocessor? my asumption is that this is what the dashboard does? or are those just dummy values that don't come from spanmetric processor? |
Span Statistics Proposal
A processor for aggregating spans to derive RED (Request, Error, Distribution) metrics across a set of dimensions (dynamic or otherwise) and exporting the results to a metrics sink (prometheus endpoint, stream etc).
Motivation
A trace contains end-to-end data about the service call chain in the context of a request/transaction. This data can be mined to derive context rich metrics that can help us identify and isolate the problem as quickly as possible. For example, span level aggregation can be used to detect any performance problems within the process boundary defined by the span.
Instead of forwarding the spans to a separate processor (stream processing application) for aggregation, a otel processor can process the spans in flight and emit RED metrics at configurable time intervals.
Proposed Configuration
The text was updated successfully, but these errors were encountered: