
Feature/health metrics micrometer [WIP] #485

Closed

Conversation

nachomdo
Contributor

@nachomdo nachomdo commented Nov 29, 2022

Fixes:

Seeking some feedback on the following areas:

  • Zero-dependencies policy: should we add the Micrometer SPI as a dependency of parallel-consumer-core, or extract the PCMetricsTracker into a separate parallel-consumer-metrics project? Another option would be to send it upstream with the rest of the existing binders for well-known libraries.

  • DevEx/usage: this approach leaves the registry setup and binding in the hands of the developer. Another approach could be to pass the registry as a ParallelConsumerOption and take care of the binding inside AbstractParallelEoSStreamProcessor.

  • Health checks: it doesn't seem possible to expose a string as a Gauge value, so I'm using a tag to decorate the numerical value associated with the PCState (e.g. running, paused, closed); a rough sketch of one way to do this is below.
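For illustration, a rough sketch of the gauge-plus-tag idea (not the exact code in this PR - Micrometer gauges only accept numbers, so here one gauge per state value is registered and the active state reports 1):

    // Sketch only - the PCState values and meter name are illustrative
    import io.micrometer.core.instrument.Gauge;
    import io.micrometer.core.instrument.MeterRegistry;

    import java.util.function.Supplier;

    enum PCState { RUNNING, PAUSED, CLOSED }   // simplified stand-in for the real enum

    class StateGaugeSketch {
        static void bindStateGauges(MeterRegistry registry, Supplier<PCState> currentState) {
            for (PCState state : PCState.values()) {
                Gauge.builder("pc.status", () -> currentState.get() == state ? 1 : 0)
                        .tag("state", state.name().toLowerCase())
                        .description("Parallel Consumer lifecycle state")
                        .register(registry);
            }
        }
    }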

This current draft only binds a subset of the metrics. Happy to keep iterating on it if the approach is okay with you 😊

Checklist

  • Documentation (if applicable)
  • Changelog

Replaces / Makes redundant:

@nachomdo nachomdo force-pushed the feature/health-metrics-micrometer branch from 5fa1fe8 to 4a68a86 on November 29, 2022 13:00
@nachomdo nachomdo marked this pull request as ready for review November 29, 2022 13:03
@nachomdo nachomdo marked this pull request as draft November 29, 2022 13:07
@nachomdo nachomdo force-pushed the feature/health-metrics-micrometer branch 2 times, most recently from 42f8844 to 5fc3558 on November 29, 2022 15:10
@astubbs astubbs requested review from rkolesnev and astubbs December 2, 2022 13:30
@astubbs
Contributor

astubbs commented Dec 3, 2022

Is it possible to add an integration test that spins up Prometheus and Grafana in test containers, to test out the metrics integration?

@astubbs
Contributor

astubbs commented Dec 3, 2022

From ChatGPT

How do you configure Prometheus to scrape Micrometer?

To configure Prometheus to scrape metrics from Micrometer, you will need to add the appropriate configuration to the Prometheus server. This typically involves adding a new scrape configuration that specifies the target for the scrape, the metrics to collect, and any additional settings or options.

For example, to scrape metrics from Micrometer, you might add a configuration like the following to the Prometheus server:

    scrape_configs:
      - job_name: "micro-meter"
        scrape_interval: 15s
        metrics_path: /metrics
        static_configs:
          - targets: ["localhost:8080"]

This configuration specifies a new scrape job named "micro-meter", with a scrape interval of 15 seconds. It specifies that Prometheus should scrape metrics from the "/metrics" endpoint on the target "localhost:8080".

Once this configuration is added to the Prometheus server, it will begin scraping metrics from the Micrometer endpoint at the specified interval. The scraped metrics will be stored in the Prometheus database, and they can be queried and visualized using tools like Grafana.

Overall, configuring Prometheus to scrape metrics from Micrometer involves adding the appropriate configuration to the Prometheus server, specifying the target, metrics path, and other settings for the scrape. This can provide valuable insights into the performance and health of your application, and it can help you to monitor and alert on potential issues and problems.

@astubbs
Contributor

astubbs commented Dec 3, 2022

How do you set up Grafana to visualise Prometheus data?

To set up Grafana to visualize Prometheus data, you will need to add a Prometheus data source to Grafana and configure it to connect to your Prometheus server. This will allow Grafana to query the Prometheus server and retrieve the metrics that have been collected by Prometheus.

To add a Prometheus data source to Grafana, follow these steps:

  1. Log in to Grafana and go to the "Data Sources" page.
  2. Click the "Add data source" button.
  3. Select "Prometheus" from the list of data source types.
  4. Enter a name for the data source, such as "Prometheus Server".
  5. In the "URL" field, enter the URL of your Prometheus server, including the port number (e.g., "http://localhost:9090").
  6. Leave the "Access" field set to "Direct".
  7. Click the "Save & Test" button to save the data source and test the connection to the Prometheus server.

Once the Prometheus data source has been added to Grafana, you can use it to create dashboards and visualizations that display the metrics collected by Prometheus. To do this, follow these steps:

  1. Go to the "Dashboards" page in Grafana and click the "New dashboard" button.
  2. Select "Graph" from the list of panel types.
  3. In the "Metrics" tab, select the Prometheus data source that you added earlier.
  4. Enter a Prometheus query in the "Query" field to retrieve the metrics that you want to visualize.
  5. Use the other settings and options in the "Metrics" tab to customize the appearance and behavior of the graph.
  6. Click the "Save" button to save the dashboard and the graph.

Once the dashboard and graph have been saved, Grafana will automatically refresh the graph with the latest data from Prometheus, and it will display the metrics that you queried. You can then use the dashboard and graph to monitor and analyze the metrics collected by Prometheus, and you can use the other features and capabilities of Grafana to customize and enhance the visualizations.

Overall, setting up Grafana to visualize Prometheus data involves adding a Prometheus data source to Grafana and configuring it to connect to your Prometheus server. This will allow you to use Grafana to create dashboards and visualizations that display the metrics collected by Prometheus, and it will provide valuable insights into the performance and health of your application.
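For a repeatable setup (for example in a docker-compose or test-container harness), the same data source can instead be provisioned from a file rather than through the UI. A minimal sketch of such a provisioning file - the file path and URL here are assumptions, not part of this PR:

    # e.g. provisioning/datasources/prometheus.yml
    apiVersion: 1

    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true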

Antony Stubbs
Are there Testcontainers modules for Grafana and Prometheus? And are there ones that come already set up?

Yes, there are TestContainers for Grafana and Prometheus that you can use to spin up containers for these tools in your integration tests. TestContainers is a Java library that provides a simple and lightweight way to run containers in tests, and it allows you to easily spin up containers for tools like Grafana and Prometheus in your integration tests.

TestContainers provides a pre-configured container for Grafana, as well as a container for Prometheus. These containers come with the appropriate settings and configuration already applied, so you can spin them up in your tests without having to configure them manually.

To use TestContainers to spin up Grafana and Prometheus containers in your tests, you will need to add the TestContainers library as a dependency to your project, and you will need to import the appropriate containers from the TestContainers library. For example, you can use the following code to import the Grafana and Prometheus containers from TestContainers:

import org.testcontainers.containers.GraphiteContainer;
import org.testcontainers.containers.PrometheusContainer;
Once the Grafana and Prometheus containers are imported, you can use them to spin up the containers in your tests. For example, you might use the following code to spin up the Grafana and Prometheus containers in your tests:

// Create the Grafana and Prometheus containers
GraphiteContainer grafana = new GraphiteContainer();
PrometheusContainer prometheus = new PrometheusContainer();

// Start the containers
grafana.start();
prometheus.start();

// Use the containers in your tests...

// Stop the containers when you are done
grafana.stop();
prometheus.stop();
This code will spin up the Grafana and Prometheus containers in your tests, and it will start the containers with the appropriate settings and configuration. You can then use the containers in your tests, and you can stop the containers when you are done.

Overall, TestContainers provides pre-configured containers for Grafana and Prometheus that you can use to spin up these tools in your integration tests. This can provide a convenient and lightweight way to test the integration between your application and these tools, and it can help you to ensure that the metrics exposed by your application are being collected and stored correctly by Prometheus.
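Worth noting: those class names appear to be invented by the model - as far as I can tell the core Testcontainers library ships no dedicated GraphiteContainer or PrometheusContainer classes (and Graphite is a different tool from Grafana anyway). A rough sketch of the same idea using plain GenericContainer with the official images - the image tags, ports and mounted config path are assumptions:

    // Sketch only - spins up Prometheus and Grafana via GenericContainer
    import org.testcontainers.containers.GenericContainer;
    import org.testcontainers.utility.DockerImageName;
    import org.testcontainers.utility.MountableFile;

    public class MetricsContainersSketch {
        public static void main(String[] args) {
            GenericContainer<?> prometheus =
                    new GenericContainer<>(DockerImageName.parse("prom/prometheus:latest"))
                            .withExposedPorts(9090)
                            .withCopyFileToContainer(
                                    MountableFile.forClasspathResource("prometheus.yml"),
                                    "/etc/prometheus/prometheus.yml");
            GenericContainer<?> grafana =
                    new GenericContainer<>(DockerImageName.parse("grafana/grafana:latest"))
                            .withExposedPorts(3000);

            // Start the containers
            prometheus.start();
            grafana.start();

            // e.g. point test assertions (or a local Grafana data source) at the mapped port
            String prometheusUrl =
                    "http://" + prometheus.getHost() + ":" + prometheus.getMappedPort(9090);
            System.out.println("Prometheus available at " + prometheusUrl);

            // Stop the containers when done
            grafana.stop();
            prometheus.stop();
        }
    }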

@@ -626,5 +628,30 @@ boolean checkIfWorkIsStale(final WorkContainer<?, ?> workContainer) {
return false;
}

}
public PCMetrics.PCPartitionMetrics getMetrics() {
Contributor

Should we invert this? Move the construction into the metrics class, instead of the state class? (pass the metrics class the state)


eosStreamProcessor.subscribe(of(inputTopic)); // <4>
pcMetricsTracker = new PCMetricsTracker(((AbstractParallelEoSStreamProcessor)eosStreamProcessor)::calculateMetricsWithIncompletes,
Contributor

gosh we need to sort out the need for that cast... I guess the calculateMetrics method should be part of the interface?

@@ -34,11 +42,11 @@
*/
@Slf4j
public class CoreApp {

final MeterRegistry metricsRegistry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
Contributor

how do you view these now?

Contributor Author

I had to add an endpoint to serve those metrics

void setupPrometheusEndpoint() {
    try {
        // expose the Prometheus scrape endpoint on http://localhost:7001/prometheus
        HttpServer server = HttpServer.create(new InetSocketAddress(7001), 0);
        server.createContext("/prometheus", httpExchange -> {
            // scrape() is exposed by PrometheusMeterRegistry and renders the Prometheus text format
            String response = metricsRegistry.scrape();
            httpExchange.sendResponseHeaders(200, response.getBytes().length);
            try (OutputStream os = httpExchange.getResponseBody()) {
                os.write(response.getBytes());
            }
        });
        // start the HTTP server off the calling thread
        Executors.newSingleThreadExecutor().submit(server::start);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

I'm working on a tentative Grafana dashboard to better understand which metrics we are missing or which can be derived from other sources (e.g. executor services). I will push all my changes with the docker-compose setup and all the config and dashboard assets for repeatability and feedback.

@@ -37,6 +37,8 @@
*
* @author Antony Stubbs
*/
// metrics: avg time spend encoding, most common best encoder, number of times run length used, number of times
// bitset used
Contributor

Last encoding used size

And actually - these are all partition specific, so should be in partition state

*/
@Value
@SuperBuilder(toBuilder = true)
public class PCMetrics {
Contributor

This is all going to be a lot of metrics?
Need a debug level like streams does?
Metrics aren't built if not quieter? (Except for gauges and timers of course)

Contributor Author

Debug-level metrics like we have in Kafka Streams could be a good approach. Also, Micrometer allows filtering when setting up the registry, so developers can selectively drop metrics:

registry.config()
    .meterFilter(MeterFilter.ignoreTags("too.much.information"))
    .meterFilter(MeterFilter.denyNameStartsWith("jvm"));

cykl

Also, Micrometer allows filtering when setting up the registry so developers can selectively drop metrics.

MeterFilters are powerful but annoying. IMHO it would be much more pleasant to be able to enable "basic" and "detailed" metrics using specific binders, or by configuring the binder. One can still use MeterFilters in specific circumstances, but they shouldn't be required on the happy path.

I think per-shard metrics are not possible with key-level parallelism and a non-trivial number of keys.

  1. IIRC the memory footprint of a timer is in the 2-8 kB range, depending on whether percentiles / SLOs / histograms are enabled. It's likely bearable for most use cases, but not negligible.
  2. I think most TSDBs won't be able to handle such cardinality, at least not in a cost-effective way.

I wouldn't enable them by default, but would look for a way to spot anomalies using low-cardinality meters, logs or any other means.

Per-shard metrics become impossible

It would be nice not to have to rely on metrics filtering to opt out of shard-level metrics; I find it annoying and inefficient compared to just binding the desired meters. Splitting PCMetricsTracker or making it configurable may be one way to do it.
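For instance (purely hypothetical sketch, not this PR's API - MetricsLevel and the meter names are made up here), the binder could take a level and only register the expensive per-shard meters when asked:

    // Hypothetical sketch only - illustrative names, not parallel-consumer's actual API
    import io.micrometer.core.instrument.Gauge;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.binder.MeterBinder;

    enum MetricsLevel { BASIC, DETAILED }

    class LevelledMetricsBinder implements MeterBinder {
        private final MetricsLevel level;

        LevelledMetricsBinder(MetricsLevel level) {
            this.level = level;
        }

        @Override
        public void bindTo(MeterRegistry registry) {
            // cheap, low-cardinality meters are always bound
            Gauge.builder("pc.incomplete.offsets", () -> 0).register(registry);

            if (level == MetricsLevel.DETAILED) {
                // expensive, potentially high-cardinality meters (e.g. per shard) are opt-in
                Gauge.builder("pc.shard.queue.size", () -> 0)
                        .tag("shard", "example-shard") // placeholder tag value
                        .register(registry);
            }
        }
    }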

Contributor Author

@nachomdo nachomdo Dec 20, 2022

I see discrepancies between this document and PcMetricsTracker. Which one should be considered the reference?

That's my bad 😅 I tried to collect all the metrics ideas from different experimental branches, remove the ones that can be calculated or derived from their primitive forms, and also check whether my understanding of each metric's semantics was right.
So, PCMetricsTracker should be the canonical reference. The document was just an artefact to centralise and discuss potential metrics to consider.

Contributor

Thank you so much for the input, @cykl!

I think per-shard metrics are not possible with key-level parallelism and a non-trivial number of keys.

Do you mean all types of shard metrics? Or only timers? Or...?

cykl

Likely both.

Timers are the worst offenders because each timer generates multiple time series in the TSDB (e.g. count, totalTime, maxTime, plus buckets for histograms, etc.), while a gauge or a counter creates only one.

But if you have many shards, even a single gauge with a shard label could be problematic.

It is often recommended not to use user-supplied input for tag values and to be mindful of the cardinality.

Micrometer:

Beware of the potential for tag values coming from user-supplied sources to blow up the cardinality of a metric. You should always carefully normalize and add bounds to user-supplied input. Sometimes, the cause is sneaky. Consider the URI tag for recording HTTP requests on service endpoints. If we do not constrain 404’s to a value like NOT_FOUND, the dimensionality of the metric would grow with each resource that cannot be found.

Prometheus:

Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

My understanding is that with key parallelism, shardId = key, which can be a very large set of values. In this case you will likely disable shard-level metrics because they are too expensive.
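If shard-level meters do get enabled, Micrometer's built-in filters can at least put a hard upper bound on the damage. A rough sketch - the meter name prefix and the limits are assumptions, not anything in this PR:

    // Sketch only - caps tag cardinality and total meter count as a safety net
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.config.MeterFilter;
    import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

    class CardinalityCapSketch {
        static MeterRegistry configuredRegistry() {
            MeterRegistry registry = new SimpleMeterRegistry();
            registry.config()
                    // allow at most 100 distinct "shard" tag values on pc.* meters, deny the rest
                    .meterFilter(MeterFilter.maximumAllowableTags("pc", "shard", 100, MeterFilter.deny()))
                    // absolute ceiling on the total number of meters in this registry
                    .meterFilter(MeterFilter.maximumAllowableMetrics(1000));
            return registry;
        }
    }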

Vaguely related question: is PC able to deal with a large number of keys (e.g. millions) with key-level parallelism? My understanding is that there is a per-shard memory overhead, and the offset maps are going to contain many entries, possibly exceeding the 4 KB limit. Is there any benchmark / feedback available about that?

@@ -37,6 +38,7 @@
*
* @author Antony Stubbs
*/
// metrics: number of queues, average queue length
@Slf4j
public class ShardManager<K, V> {
Contributor

Slow work gauge of some sort - how many records are "slow", average wait time, something like that. Average size of the slow-work collection every time we try to take work.

cc @Ehud-Lev-Forter
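Something along these lines, perhaps (rough sketch only - the names and wiring are illustrative, not actual PC code):

    // Sketch: expose the current "slow work" backlog as a gauge
    import io.micrometer.core.instrument.Gauge;
    import io.micrometer.core.instrument.MeterRegistry;

    import java.util.Collection;

    class SlowWorkMetricsSketch {
        static void bind(MeterRegistry registry, Collection<?> slowWork) {
            // number of records currently considered "slow"
            Gauge.builder("pc.slow.work.count", slowWork, Collection::size)
                    .description("Records whose processing has exceeded the slow-work threshold")
                    .register(registry);
        }
    }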

@nachomdo nachomdo force-pushed the feature/health-metrics-micrometer branch 8 times, most recently from 60f68ef to 6a3df9d on December 14, 2022 21:26
@nachomdo nachomdo force-pushed the feature/health-metrics-micrometer branch from 6a3df9d to 15131ea on December 14, 2022 22:52
@nachomdo
Contributor Author

👋 @astubbs , I've just added the Prometheus/Grafana plumbing and integration tests based on Testcontainers that we agreed on, and wired the MeterRegistry dependency using PCModule as per the example in your remote.

On a different note, I also played with some potential Grafana visualizations and documented most of the metrics I have been looking at.

(screenshots: tentative Grafana dashboard panels)

@cykl

cykl commented Dec 20, 2022

and most of the metrics I have been looking at.

I see discrepancies between this document and PcMetricsTracker. Which one should be considered the reference?

this.registeredMeterIds.stream()
.filter(id -> !this.meterRegistry
.find(id.getName())
.tags(METRIC_CATEGORY, subsystemTag)

What about common tags? I'm not sure why the filter clause is required, but if we must ensure that the meter really exists in the registry then we likely don't want false positives from another binder.


@astubbs
Contributor

astubbs commented Dec 20, 2022

👋 @astubbs , I've just added the Prometheus/Grafana plumbing and integration tests based on Testcontainers that we agreed on, and wired the MeterRegistry dependency using PCModule as per the example in your remote.

OMG these look amazing! Can't wait to try it out!

Contributor

@astubbs astubbs left a comment

Big-ticket discussion points sent on Slack. Great work! Getting there, now for the polish...

// call the user's function
List<R> resultsFromUserFunction;
PollContextInternal<K, V> context = new PollContextInternal<>(workContainerBatch);
public PCMetrics calculateMetrics() {
Contributor

Does it make sense to still have this manual call? What should our "health" endpoint look like?

@nachomdo nachomdo force-pushed the feature/health-metrics-micrometer branch from e3b56f8 to 2a451e3 on January 6, 2023 12:34
@nachomdo
Contributor Author

nachomdo commented Jan 6, 2023

👋 @rkolesnev @astubbs as per our discussion I have made the following changes:

  • Instrument classes now surface structured events via an EventBus instead of using meters directly. If you prefer, we can use the actor-based subsystem from PC - just wanted to validate that this is aligned with @rkolesnev's suggestion 😊 An example of raising an event is below:

    module.eventBus().post(MetricsEvent.builder()
            .name(PCMetricsTracker.METRIC_NAME_PROCESSED_RECORDS)
            .value(1.0)
            .tags(Tags.of(computeTagsForWorkContainer(wc))).build());

    and the processing is done in PCMetricsTracker:

    @Subscribe
    public void eventHandler(final MetricsEvent event) {
        switch (event.getType()) {
            case COUNTER -> Optional.ofNullable(
                    meterRegistry.find(event.getName()).tags(event.getTags()).counter()).orElseGet(() ->
                    bindAndGetCounter(event.getName(), event.getDescription(), event.getTags())).increment();
            case TIMER -> Optional.ofNullable(
                    meterRegistry.find(event.getName()).tags(event.getTags()).timer()).orElseGet(() ->
                    bindAndGetTimer(event.getName(), event.getDescription(), event.getTags()))
                    .record(event.getTimerValue());
            case GAUGE -> throw new UnsupportedOperationException("Not implemented yet");
        }
    }

  • Removed the docker-compose harness and Grafana dashboards to reduce PR noise

  • Moved the Prometheus metrics example into its own module

  • Merged master to apply these changes on top of the most up-to-date version

If you are happy with the Structured Event approach, I can continue by adding the remaining initial metrics documented here.

@nachomdo nachomdo requested review from astubbs and removed request for rkolesnev January 6, 2023 16:25
@nachomdo nachomdo marked this pull request as ready for review January 6, 2023 16:28
@acktsap
Contributor

acktsap commented Feb 23, 2023

Any updates? I need this feature..

FYI, I also made simple changes using Micrometer here (I didn't know this PR existed).

@astubbs
Contributor

astubbs commented Apr 5, 2023

Any updates? I need this feature..

FYI, I also made simple changes using Micrometer here (I didn't know this PR existed).

Hi! I'm back online now, so this will really start moving. It's my top priority.

@astubbs
Contributor

astubbs commented Apr 5, 2023

Any updates? I need this feature..

FYI, I also made simple changes using Micrometer here (I didn't know this PR existed).

BTW - we'd love to get your input on this PR!

@cla-assistant

cla-assistant bot commented Aug 8, 2023

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 3 committers have signed the CLA.

❌ nachomdo
❌ astubbs
❌ rkolesnev
You have signed the CLA already but the status is still pending? Let us recheck it.

@eddyv
Member

eddyv commented Sep 11, 2023

Closed by #613

@eddyv eddyv closed this Sep 11, 2023