Exemplars support for Prometheus Histogram #2812

jonatan-ivanov · 2021-10-12T04:10:22Z

This issue belongs to #2672 (Support for Exemplars), see the details there.

neiser · 2022-01-27T11:54:55Z

@jonatan-ivanov I saw that you already pushed #2813 some time ago. Are you planning to propose a PR for his as well? Is there anything blocking you? Would you mind if I gave it a shot (similar to your implementation for Counter)?

jonatan-ivanov · 2022-01-28T17:16:44Z

@neiser I am (was? 😃), but we are always happy to see contributions so if you want to work on it, please feel free to do so.
I took a look at the Histogram support at the time I wrote support for Counters but it is a little more complicated case and I needed to focus on other things. We are planning to add this in the next minor release of Micrometer (1.9.0) so your timing is perfect.

neiser · 2022-01-30T08:17:45Z

@jonatan-ivanov Alright, then let me see if I can provide a PR in the coming days.

neiser · 2022-01-31T06:08:17Z

@jonatan-ivanov I've looked into the code and figured that I'd need to start at PrometheusDistributionSummary#recordNonNegative to trigger recording an exemplar for count/amount/max and all histogram buckets. I didn't quite get why PrometheusDistributionSummary manages its own histogram, while the base class AbstractDistributionSummary already has such a field, though configuration of it seems to be complex, in particular with the weirdly integrated case of VictoriaMetrics which also surprisingly appear in core module as io.micrometer.core.instrument.distribution.FixedBoundaryVictoriaMetricsHistogram (my impression was that micrometer tries to abstract away any particular backend system for metrics collection... which is kinda broken by this class IMHO).

Anyway, my question is now is: Should I attempt a refactoring which would make getting exemplars into histogram buckets easier? Or should I just hack a lot of stuff around the "prometheus" part of Micrometer to achieve whatever resolving this issue takes?

fscellos · 2022-02-02T16:29:45Z

Hello i did some test to integrate exemplars in histogram.
For test i implemented it by define a custom TimeWindowFixedBoundaryHistogram (some private attributes (bucket) does not allow a direct inheritance from AbstractTimeWindowHistogram) and try to inspire from code of @jonatan-ivanov in its counter.

ReadMe is in french but you can take a look at : github / micrometer-prometheus-examplar . If you want to check it, i'll be happy to try to find an implementation with you.

It seems to work from a spring boot application (by defining PrometheusMeterRegistry Bean to override default actuator behaviour) but i always have a little problem....
Some span values are not correct (too low for my test application) and i suspect that my "exemplar-sampling" time is too low versus prometheus scraping..
I have to refine it to be sure that exemplars associated with a scraped metrics contains the full time of the span, and not a partial one...

jonatan-ivanov · 2022-02-02T21:01:21Z

@neiser If my memory serves me well, I wanted to extend TimeWindowFixedBoundaryHistogram as PrometheusHistogram and add exemplars support to it.

Micrometer "abstracts away backends" in the sense that a MeterRegistry implementation will define a couple of things:

The dimensionality (dimensional vs. hierarchical)
Push or Pull based
Client or Server side aggregation (e.g.: sending discrete samples, rates, cumulative values, etc.)
The "line"/"network" protocol
etc?

Some MeterRegistry implementations support multiple protocols, e.g.: StatsdMeterRegistry supports four flavors of line protocols (see StatsdFlavor) and three flavors of "network" protocols (TCP, UPD, UDS).

In case of Prometheus, there are multiple backends that support the Prometheus format and Prometheus itself supports multiple formats ("Prometheus Text" and OpenMetrics). VictoriaMetrics also supports the Prometheus format but they define their own histogram flavor, also this histogram flavor is reused in OpenTSDB. :)

So if you use an XyzMeterRegistry, that does not mean you need use the Xyz backend. It rather means that you need to use a backend that supports the XyZ format (e.g.: with the StatsdMeterRegistry you can use many backends).

Or in the other way: if you use Xyz as a backend you don't need to use XyzMeterRegistry but you can use a different one if that backend supports multiple standards (e.g.: if you use DataDog, you can use the DatadogMeterRegistry(HTTP API) or the StatsdMeterRegistry with the DataDog flavor since the backend supports multiple standards).

neiser · 2022-02-03T18:34:12Z

@fscellos @jonatan-ivanov Thank you for your answers. This helped me a bit. But I can't promise that I can make progress before next week. So if you find time over the weekend to work on this, please let me know so that we don't do the work twice!

fscellos · 2022-02-10T11:00:16Z

Hello. As target is to apply trace sampling, i will take a look to see if it is possible to include TraceFlags (cf https://www.w3.org/TR/trace-context/#sampled-flag) information to choose if a specific trace is eligible to be set as exemplars in bucket (or counter).

Because if you have client sampling (what can be done with Instrumentation sampling on client side), suppose 1%, we can't use 99 % of this traces as examplars as we're not be able to find them in storage backend (ie. in a prometheus instance, if you click on an exemplar, you won't find it in backend storage).

So we must only keep traces that have a sampled flag set to "01" (other with TraceFlags set to "00" won't be store)

fscellos · 2022-02-10T11:01:15Z

Note that previous remarks il also available for Counter.

fscellos · 2022-03-25T09:05:42Z

Hello. That's cool. But i always don't understand how you can manage sampling on trace as my PR on prometheus java client haven't be merge yet (prometheus/client_java#766)

jonatan-ivanov · 2022-03-25T17:33:37Z

@fscellos Haven't we discussed this here: #3019 (comment)?

If the SpanContextSupplier returns null when the span is not sampled, your changes on the Prometheus client are not really needed for Micrometer to make this work. Also, they are completely transparent from Micrometer's perspective: we don't need any code change to integrate with them (we only need to change tests).

I tested this using Spring Cloud Sleuth (SleuthSpanContextSupplier), please see the description of the PR to try it out: #3091 (comment).
Please let us know if it works for you.

neiser · 2022-03-31T19:47:02Z

@fscellos @jonatan-ivanov I've patched a demo spring boot app such that the actuator endpoint includes exemplars. As long as spring-projects/spring-boot#30472 isn't merged, it requires to explicitly pass the exemplar sampler into the Prometheus registry, assuming it's instrumented with the opentelemetry agent (see Dockerfile for details).

Note that when testing all this locally with curl, you need to use an Accept header, for example curl -s http://localhost:8080/actuator/prometheus -H 'Accept: application/openmetrics-text;version=1.0.0'.

Note that the demo uses micrometer with 1.9.0-SNAPSHOT, I hope the release will happen soon 🙂

Thanks y'all for finishing this feature. Exemplars are really something I've always wished I had when debugging some nasty cloud bugs.

jonatan-ivanov · 2022-03-31T22:15:15Z

@neiser Please take a look at the description of the PR that belongs to this issue: #3091
It will tell you that this is only available in the OpenMetrics format and it will also show the header that you need to send. Your browser/curl does not send that header, Prometheus does.

In the samples (see the description of the PR) you will find a demo spring boot app (with the sampler bean), here's the boot PR to auto-configure it: spring-projects/spring-boot#30472

1.9.0-RC1 should be released in mid April, 1.9.0 should be in mid May, Boot 2.7.0 should be a week after that.

neiser · 2022-04-01T03:03:19Z

@jonatan-ivanov Thanks, I was aware of that Accept header, I just wanted to let @fscellos know about it (I've also updated my previous comment with a specific curl request). It has led to confusion in the past though, see here for example.

I agree that spring-projects/spring-boot#30472 is the correct solution, however, as shown in my little Spring Boot Demo (see previous comment), you don't need to wait to have exemplars reported already now, if you're not using Spring Sleuth but OpenTelemetry agent instrumentation. Again, thanks for pushing this feature!

fscellos · 2022-04-01T14:57:57Z

Hello @neiser.
Be careful because your code seems to be functionnal but it is not for all (i just it to be sure in my own stack).

My stack is as follows : Some Microservices (java, node js, go, php) and for some of them instrumentation through Open Telemetry Operator (i'm in k8s; this operator just inject otel agent in target pods).

In my configuration i specify that trace sampling is set to 25% with a strategy of "parentbased_traceidratio".
It meens that is one microservices doesn't have any trace information in input request, it decide what to do :

If not trace exist in input, it decide if a trace will be generated (1 chance / 4 with 25 % sampling). If yes, it records its span to backend trace storage (or opentelemetry collector) and call subsequent services with a trace flags of "01", means that for this trace, span must be recorded. If no, it does not record its span and call subsequent services with a trace flags of "00" means that for all other services involve, no span have to be recorded.

For OpenTelemetry (for sleuth i don't know how it works), sampling cannot be ok if this "trace flags" information is not taking account. It's why is submit this PR (prometheus/client_java#766) to prometheus java client (the important point in this PR is the add of "spanContextSupplier.isSampled()" test in DefaultExemplarSampler that take account of this trace flags).

So until, this PR for prometheus client is not merge and integrate, yes you will generate trace_id in exported metrics, but when you'll want to explore it from dashboard in grafana by example (with a link to tempo or jaeger), sometimes you'll retrieve a trace (when sampling decision put trace flags to 01) and sometimes you will get "trace not found" (when sampling decision put trace flags to 00).

If you set sampling to 100%, you won't see this problem, but in production environnement, it is impossible to have such sampling strategy because you'll have to many traces post to you trace storage backend (ratio depends on how many request you will have but a 1% sampling seems to be a realistic value).

If my explanation are not clear, i encourage you to apply a low sampling strategy and check your trace storage system (or grafana dashboard) to check by your own.

See micrometer-metrics/micrometer#2812 (comment)

neiser · 2022-04-02T05:07:53Z

@fscellos Thanks for your detailed answer. I've applied a little patch here which should respect the fact that not all traces are samples and thus not suitable for exemplars.

fscellos · 2022-04-02T07:03:38Z

@neiser : yes it should work like this.

jonatan-ivanov added enhancement A general enhancement registry: prometheus A Prometheus Registry related issue labels Oct 12, 2021

jonatan-ivanov mentioned this issue Oct 12, 2021

Support for Prometheus Exemplars #2672

Closed

2 tasks

This was referenced Feb 10, 2022

Exemplars support for Prometheus Counter #2813

Merged

Integrate exemplars in counter and histogram. Take account of client sampling in counter #3019

Closed

jonatan-ivanov mentioned this issue Mar 22, 2022

Prometheus histogram exemplars #3091

Merged

jonatan-ivanov linked a pull request Mar 22, 2022 that will close this issue

Prometheus histogram exemplars #3091

Merged

jonatan-ivanov added this to the 1.9.0-M5 milestone Mar 23, 2022

shakuzen closed this as completed in eaee830 Mar 25, 2022

neiser added a commit to qaware/cloud-observability-grafana-spring-boot that referenced this issue Apr 2, 2022

Only use sampled trace ids as exemplars

09417f6

See micrometer-metrics/micrometer#2812 (comment)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exemplars support for Prometheus Histogram #2812

Exemplars support for Prometheus Histogram #2812

jonatan-ivanov commented Oct 12, 2021

neiser commented Jan 27, 2022 •

edited

Loading

jonatan-ivanov commented Jan 28, 2022

neiser commented Jan 30, 2022

neiser commented Jan 31, 2022 •

edited

Loading

fscellos commented Feb 2, 2022

jonatan-ivanov commented Feb 2, 2022 •

edited

Loading

neiser commented Feb 3, 2022

fscellos commented Feb 10, 2022

fscellos commented Feb 10, 2022

fscellos commented Mar 25, 2022

jonatan-ivanov commented Mar 25, 2022

neiser commented Mar 31, 2022 •

edited

Loading

jonatan-ivanov commented Mar 31, 2022

neiser commented Apr 1, 2022

fscellos commented Apr 1, 2022

neiser commented Apr 2, 2022

fscellos commented Apr 2, 2022

Exemplars support for Prometheus Histogram #2812

Exemplars support for Prometheus Histogram #2812

Comments

jonatan-ivanov commented Oct 12, 2021

neiser commented Jan 27, 2022 • edited Loading

jonatan-ivanov commented Jan 28, 2022

neiser commented Jan 30, 2022

neiser commented Jan 31, 2022 • edited Loading

fscellos commented Feb 2, 2022

jonatan-ivanov commented Feb 2, 2022 • edited Loading

neiser commented Feb 3, 2022

fscellos commented Feb 10, 2022

fscellos commented Feb 10, 2022

fscellos commented Mar 25, 2022

jonatan-ivanov commented Mar 25, 2022

neiser commented Mar 31, 2022 • edited Loading

jonatan-ivanov commented Mar 31, 2022

neiser commented Apr 1, 2022

fscellos commented Apr 1, 2022

neiser commented Apr 2, 2022

fscellos commented Apr 2, 2022

neiser commented Jan 27, 2022 •

edited

Loading

neiser commented Jan 31, 2022 •

edited

Loading

jonatan-ivanov commented Feb 2, 2022 •

edited

Loading

neiser commented Mar 31, 2022 •

edited

Loading