Feature/health metrics micrometer [WIP] #485
Conversation
Is it possible to add an integration test that spins up Prometheus and Grafana in a test container, to test out the metric integration?
From ChatGPT: How do you configure Prometheus to scrape Micrometer metrics? To configure Prometheus to scrape metrics from a Micrometer registry, you need to add the appropriate configuration to the Prometheus server. This typically involves adding a new scrape configuration that specifies the target for the scrape, the metrics path, and any additional settings or options. For example, you might add a configuration like the following to the Prometheus server:
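A minimal scrape-config sketch (the job name is illustrative; the port and /prometheus path are assumptions that happen to match the example HTTP endpoint shown later in this thread):

scrape_configs:
  - job_name: 'parallel-consumer-example'
    metrics_path: '/prometheus'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:7001']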
Once this configuration is added, Prometheus will begin scraping metrics from the application at the specified interval. The scraped metrics are stored in the Prometheus database and can be queried and visualized with tools like Grafana. Overall, configuring Prometheus to scrape Micrometer metrics comes down to adding a scrape configuration that specifies the target, metrics path, and other settings; this gives valuable insight into the performance and health of your application and helps you monitor and alert on potential issues.
How do you set up Grafana to visualise Prometheus data? To set up Grafana to visualize Prometheus data, you need to add a Prometheus data source to Grafana and configure it to connect to your Prometheus server. This allows Grafana to query the Prometheus server and retrieve the metrics it has collected. To add the data source: log in to Grafana, go to the "Data Sources" page and add a Prometheus source pointing at your server, then go to the "Dashboards" page and click "New dashboard" to start building visualizations. Overall, setting up Grafana to visualize Prometheus data means adding and configuring a Prometheus data source, after which you can build dashboards and visualizations over the metrics Prometheus has collected, giving valuable insight into the performance and health of your application.

Antony Stubbs
Yes, you can use Testcontainers to spin up Grafana and Prometheus containers in your integration tests. Testcontainers is a Java library that provides a simple and lightweight way to run containers in tests, so you can bring up Grafana and Prometheus alongside your integration tests without configuring them manually on the host. To do this, add the Testcontainers library as a dependency to your project, define containers for the Grafana and Prometheus images, start them at the beginning of the test, run your assertions against them, and stop them when you are done. This provides a convenient way to test the integration between your application and these tools, and helps ensure that the metrics exposed by your application are being collected and stored correctly by Prometheus. For example:
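A hedged sketch using Testcontainers' GenericContainer (the image tags, ports, and class name are assumptions, not taken from this PR):

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

class MetricsStackSketch {
    public static void main(String[] args) {
        try (GenericContainer<?> prometheus = new GenericContainer<>(DockerImageName.parse("prom/prometheus:v2.40.1"))
                     .withExposedPorts(9090);
             GenericContainer<?> grafana = new GenericContainer<>(DockerImageName.parse("grafana/grafana:9.3.1"))
                     .withExposedPorts(3000)) {
            // start the containers before running assertions against them
            prometheus.start();
            grafana.start();
            String prometheusUrl = "http://" + prometheus.getHost() + ":" + prometheus.getMappedPort(9090);
            System.out.println("Prometheus reachable at " + prometheusUrl);
            // ... run assertions against the scraped metrics here ...
        } // containers are stopped automatically by try-with-resources
    }
}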
@@ -626,5 +628,30 @@ boolean checkIfWorkIsStale(final WorkContainer<?, ?> workContainer) {
        return false;
    }

}
    public PCMetrics.PCPartitionMetrics getMetrics() {
Should we invert this? Move the construction into the metrics class, instead of the state class? (pass the metrics class the state)
eosStreamProcessor.subscribe(of(inputTopic)); // <4>
pcMetricsTracker = new PCMetricsTracker(((AbstractParallelEoSStreamProcessor) eosStreamProcessor)::calculateMetricsWithIncompletes,
gosh we need to sort out the need for that cast... I guess the calculateMetrics method should be part of the interface?
@@ -34,11 +42,11 @@
 */
@Slf4j
public class CoreApp {

    final MeterRegistry metricsRegistry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
how do you view these now?
I had to add an endpoint to serve those metrics:

void setupPrometheusEndpoint() {
    try {
        HttpServer server = HttpServer.create(new InetSocketAddress(7001), 0);
        server.createContext("/prometheus", httpExchange -> {
            // scrape() renders the registry in the Prometheus text format
            // (it is defined on PrometheusMeterRegistry, hence the cast)
            String response = ((PrometheusMeterRegistry) metricsRegistry).scrape();
            httpExchange.sendResponseHeaders(200, response.getBytes().length);
            try (OutputStream os = httpExchange.getResponseBody()) {
                os.write(response.getBytes());
            }
        });
        // start the server off the main thread
        Executors.newSingleThreadExecutor().submit(server::start);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
I'm working on a tentative Grafana dashboard to better understand which metrics we are missing or can be derived from other sources (e.g. executor services). I will push all my changes with the docker-compose setup and all the config and dashboard assets, for repeatability and feedback.
@@ -37,6 +37,8 @@
 *
 * @author Antony Stubbs
 */
// metrics: avg time spent encoding, most common best encoder, number of times run length used, number of times
// bitset used
Last encoding used size
And actually - these are all partition-specific, so they should be in partition state
*/
@Value
@SuperBuilder(toBuilder = true)
public class PCMetrics {
This is all going to be a lot of metrics?
Need a debug level like streams does?
Metrics aren't built if not quieter? (Except for gauges and timers of course)
Debug-level metrics like we have in Kafka Streams could be a good approach. Also, Micrometer allows filtering when setting up the registry, so developers can selectively drop metrics.
registry.config()
.meterFilter(MeterFilter.ignoreTags("too.much.information"))
.meterFilter(MeterFilter.denyNameStartsWith("jvm"));
Also, Micrometer allows filtering when setting up the registry, so developers can selectively drop metrics.

MeterFilters are powerful but annoying. IMHO, it would be much more pleasant to be able to enable "basic" and "detailed" metrics using specific binders or by configuring the binder. One can still use MeterFilters in specific circumstances, but they shouldn't be required on the happy path.
I think per-shard metrics are not possible with key-level parallelism and a non-trivial number of keys.
- IIRC the memory footprint of a timer is in the 2..8 kB range, depending on whether percentiles / SLOs / histograms are enabled. That's likely bearable for most use cases, but not unremarkable.
- I think that most TSDBs won't be able to handle such cardinality, at least not in a cost-effective way.
I wouldn't enable them by default, but would look for a way to spot anomalies using low-cardinality meters, logs, or other means.

Per shard metrics become impossible

It would be nice to not have to rely on metrics filtering to opt out of shard-level metrics; I find it annoying and inefficient compared to just binding the desired meters. Splitting PCMetricsTracker or making it configurable may be one way to do it.
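One possible shape for a level-aware binder, sketched purely for illustration (the MetricsLevel enum, class name, and meter names are made up; only MeterBinder and the Gauge API are real Micrometer types):

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.MeterBinder;

enum MetricsLevel { BASIC, DETAILED }

class LevelAwarePcMetricsBinder implements MeterBinder {
    private final MetricsLevel level;

    LevelAwarePcMetricsBinder(MetricsLevel level) {
        this.level = level;
    }

    @Override
    public void bindTo(MeterRegistry registry) {
        // always-on, low-cardinality meter (placeholder value supplier)
        Gauge.builder("pc.incomplete.offsets.total", () -> 0).register(registry);
        if (level == MetricsLevel.DETAILED) {
            // opt-in, higher-cardinality meters, e.g. per shard / per partition
            Gauge.builder("pc.shard.queue.size", () -> 0)
                    .tag("shard", "example-key")
                    .register(registry);
        }
    }
}

Users who only want the basics would bind just the BASIC level, without needing any MeterFilter on the happy path.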
I see discrepancies between this document and PcMetricsTracker. Which one should be considered as the reference?
That's my bad 😅 I tried to collect all the metrics ideas from the different experimental branches, remove the ones that can be calculated or derived from their primitive form, and also check whether my understanding of each metric's semantics was right.
So, PCMetricsTracker should be the canonical reference. The document was just an artefact to centralise and discuss potential metrics to consider.
Thank you so much for the input @cykl!
I think per shard metrics are not possible with key level parallelism and a non trivial amount of keys.
Do you mean all types of shard metrics? or only timers? or ..?
Likely both.
Timers are the worst offenders because each timer generates multiple time-series in the TSDB (e.g. count, totalTime, maxTime, plus buckets for histograms, etc.), while a gauge or a counter creates only one.
But if you have many shards, even a single gauge with a shard label could be problematic.
It is often recommended not to use user-supplied input for tag values and to be mindful of the cardinality.
Micrometer:
Beware of the potential for tag values coming from user-supplied sources to blow up the cardinality of a metric. You should always carefully normalize and add bounds to user-supplied input. Sometimes, the cause is sneaky. Consider the URI tag for recording HTTP requests on service endpoints. If we do not constrain 404’s to a value like NOT_FOUND, the dimensionality of the metric would grow with each resource that cannot be found.
Prometheus:
Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
My understanding is that with key parallelism, shardId = key, which can be a very large set of values. In that case you will likely disable shard-level metrics because they are too expensive.
A vaguely related question: is PC able to deal with a large number of keys (e.g. millions) with key-level parallelism? My understanding is that there is a per-shard memory overhead, and the offset maps are going to contain many entries, possibly exceeding the 4 KB limit. Is there any benchmark / feedback available about that?
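If shard-level meters do ship, Micrometer's built-in tag-limit filter is one way to put a hard bound on the cardinality discussed above; a sketch, assuming a hypothetical "pc.shard" meter-name prefix and "shard" tag key:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;

class CardinalityGuard {
    // Caps the "shard" tag on meters starting with "pc.shard" at 100 distinct values;
    // meters beyond that limit are denied instead of creating new time-series.
    static void boundShardCardinality(MeterRegistry registry) {
        registry.config().meterFilter(
                MeterFilter.maximumAllowableTags("pc.shard", "shard", 100, MeterFilter.deny()));
    }
}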
@@ -37,6 +38,7 @@
 *
 * @author Antony Stubbs
 */
// metrics: number of queues, average queue length
@Slf4j
public class ShardManager<K, V> {
slow work gauge of some sort - how many records are "slow", average wait time, something like that. average size of the slow work collection every time we try to take work.
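Purely as a sketch of the shape such a meter could take (the meter name and helper are hypothetical; the value supplier would be whatever accessor ShardManager ends up exposing for the slow-work collection):

import java.util.function.Supplier;

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;

class SlowWorkMetricsSketch {
    // Samples the size of the slow-work collection each time the gauge is read,
    // e.g. pass in a method reference to the slow-work count accessor.
    static void registerSlowWorkGauge(MeterRegistry registry, Supplier<Number> slowWorkCount) {
        Gauge.builder("pc.slow.work.count", slowWorkCount)
                .description("Number of records currently considered slow")
                .register(registry);
    }
}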
👋 @astubbs, I've just added the Prometheus/Grafana plumbing and the Testcontainers-based integration tests that we agreed on, and wired the MeterRegistry dependency using PCModule as per the example in your remote. On a different note, I also played with some potential Grafana visualizations and documented most of the metrics I have been looking at.
this.registeredMeterIds.stream()
        .filter(id -> !this.meterRegistry
                .find(id.getName())
                .tags(METRIC_CATEGORY, subsystemTag)
What about common tags? I'm not sure why the filter clause is required, but if we must ensure that the meter really exists in the registry, then we likely don't want false positives from another binder.
OMG these look amazing! Can't wait to try them out!
Big-ticket discussion points sent on Slack. Great work! Getting there, now for the polish...
// call the user's function
List<R> resultsFromUserFunction;
PollContextInternal<K, V> context = new PollContextInternal<>(workContainerBatch);
public PCMetrics calculateMetrics() {
does it make sense to still have this manual call? what should our "health" endpoint look like?
👋 @rkolesnev @astubbs, as per our discussion I have made the following changes:
If you are happy with the Structured Event approach, I can continue by adding the remaining initial metrics documented here.
Any updates? I need this feature. FYI, I also made simple changes using Micrometer here (I didn't know this PR existed).
Hi! I'm back online now, so this will really start moving. It's my top priority. |
BTW - we'd love to get your input on this PR!
Closed by #613 |
Fixes:
Seeking feedback on the following areas:
Zero-dependencies policy: should we add the Micrometer SPI as a dependency to parallel-consumer-core, or extract the PCMetricsTracker into a separate parallel-consumer-metrics project? Another option would be to send it upstream with the rest of the existing binders for well-known libraries.
Devex/usage: this approach leaves the registry setup and binding in the hands of the developer. Another approach could be to pass the registry as a ParallelConsumerOption and take care of the binding inside of AbstractParallelEoSStreamProcessor.
Healthchecks: it seems it is not possible to propagate a string as a Gauge value, so I'm using a tag to decorate the numerical value associated with the PC state (e.g. running, paused, closed, etc.); see the sketch below.
This current draft only binds a subset of the metrics. Happy to keep iterating on it if the approach is okay with you 😊
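A sketch of that tag-decorated state gauge (the meter name, tag name, and helper method are illustrative, not the ones in this PR):

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;

class StateGaugeSketch {
    // Gauges only carry numeric values, so the state name rides along as a tag while
    // the gauge value itself is a number (here the state's ordinal).
    static void registerStateGauge(MeterRegistry registry, String stateName, int stateOrdinal) {
        Gauge.builder("pc.status", () -> stateOrdinal)
                .tag("state", stateName) // e.g. "running", "paused", "closed"
                .register(registry);
    }
}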
Checklist
Replaces / Makes redundant: