
OkHttp instrumentation generates metrics with very high cardinality #9972

Closed
gaeljw opened this issue Nov 29, 2023 · 14 comments
Labels: bug, needs author feedback, needs triage

Comments

gaeljw commented Nov 29, 2023

Describe the bug

The OkHttp instrumentation generates the http_client_duration metric with http_url and http_response_content_length labels. These values are almost unique by nature, so each request generates a new label set for the metric; said differently, these labels have very high cardinality.

This causes issues for metric storage.

Example of such metrics (Prometheus format):

http_client_duration_milliseconds_count{
  otel_scope_name="io.opentelemetry.okhttp-3.0",
  http_method="GET",
  http_response_content_length="1081",
  http_status_code="200",
  http_url="http://trino.mycompany.net:8080/v1/statement/executing/20231128_072415_20801_d2gny/y412c0ae3189948116bfc9ee9c3498b380a28be93/9",
  net_peer_name="trino.mycompany.net",
  net_peer_port="8080",
  net_protocol_name="http",
  net_protocol_version="1.1",
  user_agent_original="Trino JDBC Driver/433"} 1.0 1701260869528

http_client_duration_milliseconds_count{otel_scope_name="io.opentelemetry.okhttp-3.0",
  http_method="GET",
  http_response_content_length="923",
  http_status_code="200",
  http_url="http://trino.mycompany.net:8080/v1/statement/executing/20231128_072917_21203_d2gny/y7080dd845e468e8853803535c1eed6efd08625cc/2",
  net_peer_name="trino.mycompany.net",
  net_peer_port="8080",
  net_protocol_name="http",
  net_protocol_version="1.1",
  user_agent_original="Trino JDBC Driver/433"} 1.0 1701260869528

http_client_duration_milliseconds_count{otel_scope_name="io.opentelemetry.okhttp-3.0",
  http_method="GET",
  http_response_content_length="909",
  http_status_code="200",
  http_url="http://trino.mycompany.net:8080/v1/statement/executing/20231128_072624_20987_d2gny/y4d8797c5b88e82d46e0382f3986817f1af5e8009/124",
  net_peer_name="trino.mycompany.net",
  net_peer_port="8080",
  net_protocol_name="http",
  net_protocol_version="1.1",
  user_agent_original="Trino JDBC Driver/433"} 1.0 1701260869528

As these are histograms, there is also a bunch of accompanying http_client_duration_milliseconds_bucket series.

I don't think these labels make much sense in the first place. Is this expected?

Related issue in a library using OkHttp instrumentation explicitly: trinodb/trino#19958

Steps to reproduce

N/A

(I can provide a reproduction project if needed, but I don't think it's necessary.)

Expected behavior

No such labels. Low cardinality of this metric.

Actual behavior

See above.

Javaagent or library instrumentation version

1.32.0 (also observed in 1.29.0)

Environment

JDK: 11
OS: Linux

Additional context

No response

gaeljw added the bug and needs triage labels on Nov 29, 2023
trask (Member) commented Nov 30, 2023

Is this expected?

This is definitely not expected. Can you try using the latest okhttp library instrumentation directly to confirm the issue?
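
For reference, a minimal sketch of wiring the okhttp library instrumentation directly, assuming an SDK has already been configured and registered globally; the target URL is a placeholder:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.instrumentation.okhttp.v3_0.OkHttpTelemetry;
import okhttp3.Call;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

public class OkHttpInstrumentationSmokeTest {
  public static void main(String[] args) throws Exception {
    // Assumes the SDK (with a metric reader, e.g. Prometheus or OTLP) was set
    // up elsewhere and registered as the global instance.
    OpenTelemetry openTelemetry = GlobalOpenTelemetry.get();

    // Wrap a plain OkHttpClient; the returned Call.Factory records the
    // http.client.duration metric for each request it executes.
    Call.Factory client =
        OkHttpTelemetry.builder(openTelemetry).build().newCallFactory(new OkHttpClient());

    Request request = new Request.Builder().url("http://localhost:8080/").build();
    try (Response response = client.newCall(request).execute()) {
      System.out.println(response.code());
    }
  }
}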

gaeljw (Author) commented Dec 1, 2023

can you try using the latest okhttp library instrumentation directly to confirm the issue?

For sure, I can provide a minimal reproduction case with the latest okhttp instrumentation library.

I cannot test my real project against the latest okhttp instrumentation, though: the Trino JDBC driver, which is the component setting up the okhttp instrumentation, shades the OTEL dependency.

mateuszrzeszutek (Member) commented:

Hey @gaeljw , what version of the OpenTelemetry SDK are you using? I think there might be a version mismatch between the two, and as a result the metrics advice might not be applied.

gaeljw (Author) commented Dec 1, 2023

@mateuszrzeszutek I'm using OTEL SDK 1.32.0, and from what I can see the dependency I'm using shades OTEL (API) 1.31.0.

I can try using OTEL SDK 1.31.0 to see if the problem still occurs, if that's what you're suggesting?

I'll do it tomorrow or the day after.

gaeljw (Author) commented Dec 2, 2023

I was not able to reproduce the issue in a minimal case with only OTEL 1.32.0 and OkHttp, nor in another case where I shaded OTEL API 1.31.0 into one module and used it from another module with OTEL SDK 1.32.0 (to be closer to the real case I'm experiencing).

I'll investigate further!

gaeljw (Author) commented Dec 2, 2023

@mateuszrzeszutek you mentioned a "metrics advice": where is it defined, and what should make it applied?

laurit added the needs author feedback label on Dec 5, 2023
gaeljw (Author) commented Dec 5, 2023

I can try using OTEL SDK 1.31.0 to see if the problem still occurs, if that's what you're suggesting?

Using SDK 1.31.0, the same version as what Trino shades: still the same issue.

Still trying to come up with a minimal reproduction repository... :)

github-actions bot removed the needs author feedback label on Dec 5, 2023
gaeljw (Author) commented Dec 5, 2023

I was able to come up with a minimal reproduction case. It does not use the OkHttp instrumentation directly, but goes through the Trino library, which itself uses the OkHttp instrumentation.

https://github.com/gaeljw/otel9972

I hope it helps.

To reproduce:

  • clone the repo
  • run the org.example.Main class; it will start a Trino container and send some silly requests to it every second for 30 seconds
  • while it runs, open http://localhost:19000/

You'll see a bunch of metrics having the issue.

For the record, here are the top lines I get in my case:

# TYPE target info
# HELP target Target metadata
target_info{service_name="my-app",telemetry_sdk_language="java",telemetry_sdk_name="opentelemetry",telemetry_sdk_version="1.31.0"} 1
# TYPE otel_scope_info info
# HELP otel_scope_info Scope metadata
otel_scope_info{otel_scope_name="io.opentelemetry.okhttp-3.0"} 1
# TYPE http_client_duration_milliseconds histogram
# HELP http_client_duration_milliseconds The duration of the outbound HTTP request
http_client_duration_milliseconds_count{otel_scope_name="io.opentelemetry.okhttp-3.0",http_method="GET",http_response_content_length="455",http_status_code="200",http_url="http://localhost:32785/v1/statement/executing/20231205_194230_00023_dtni2/yafaffd20c244c96661c43b612e3d59f054e2474e/1",net_peer_name="localhost",net_peer_port="32785",net_protocol_name="http",net_protocol_version="1.1",user_agent_original="Trino JDBC Driver/434"} 1.0 1701805352774
http_client_duration_milliseconds_sum{otel_scope_name="io.opentelemetry.okhttp-3.0",http_method="GET",http_response_content_length="455",http_status_code="200",http_url="http://localhost:32785/v1/statement/executing/20231205_194230_00023_dtni2/yafaffd20c244c96661c43b612e3d59f054e2474e/1",net_peer_name="localhost",net_peer_port="32785",net_protocol_name="http",net_protocol_version="1.1",user_agent_original="Trino JDBC Driver/434"} 5.401179 1701805352774
http_client_duration_milliseconds_bucket{otel_scope_name="io.opentelemetry.okhttp-3.0",http_method="GET",http_response_content_length="455",http_status_code="200",http_url="http://localhost:32785/v1/statement/executing/20231205_194230_00023_dtni2/yafaffd20c244c96661c43b612e3d59f054e2474e/1",net_peer_name="localhost",net_peer_port="32785",net_protocol_name="http",net_protocol_version="1.1",user_agent_original="Trino JDBC Driver/434",le="0.0"} 0.0 1701805352774

You can see the http_response_content_length and http_url labels.

(I'll cross-post the reproduction scenario to trinodb/trino#19958)

If that helps, here is the Trino code that sets up the instrumentation: https://github.com/trinodb/trino/blob/aa431c77a6a187920f5d6433532a19647d901742/client/trino-jdbc/src/main/java/io/trino/jdbc/NonRegisteringTrinoDriver.java#L69

laurit (Contributor) commented Dec 6, 2023

This is a Trino issue. Trino does not shade opentelemetry-api classes like io.opentelemetry.api.metrics.DoubleHistogramBuilder, but it does shade the opentelemetry-extension-incubator classes like io.opentelemetry.extension.incubator.metrics.ExtendedDoubleHistogramBuilder to io.trino.jdbc.$internal.opentelemetry.extension.incubator.metrics.ExtendedDoubleHistogramBuilder. Because of this, the instanceof test in the instrumentation's advice code (quoted in a comment below) fails: it uses the shaded name, but the actual builder object implements the interface under its unshaded name.
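
To see why the relocation breaks that check, here is a minimal, self-contained sketch; the two interfaces are stand-ins for the unshaded and shaded ExtendedDoubleHistogramBuilder, not the real OpenTelemetry types:

// Minimal illustration of why bytecode relocation breaks instanceof checks.
// The interfaces below mirror the shaded/unshaded pair from this issue but
// are stand-ins, not the real OpenTelemetry types.
public class ShadingInstanceofDemo {

  // stands in for io.opentelemetry.extension.incubator.metrics.ExtendedDoubleHistogramBuilder
  interface ExtendedBuilder {}

  // stands in for the relocated copy under io.trino.jdbc.$internal....
  interface ShadedExtendedBuilder {}

  // the SDK hands back an object implementing the *unshaded* interface
  static class SdkBuilder implements ExtendedBuilder {}

  public static void main(String[] args) {
    Object builder = new SdkBuilder();
    System.out.println(builder instanceof ExtendedBuilder);       // true
    System.out.println(builder instanceof ShadedExtendedBuilder); // false: the relocated
    // name is a different class to the JVM, so the shaded advice code silently falls through
  }
}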

laurit added the needs author feedback label on Dec 6, 2023
gaeljw (Author) commented Dec 6, 2023

Thanks for the investigation @laurit .

Sorry to bother you, but two things are still unclear to me:

  • what makes the advice applied? Where is it called?
  • the advice not being applied explains why the labels are not filtered out, but the original metric labels still have to come from somewhere; where do they come from? It must be the OkHttp instrumentation, right? Or is it Trino adding custom labels?

laurit (Contributor) commented Dec 6, 2023

what makes the advice applied? Where is it called?

It is applied in:

static void applyClientDurationAdvice(DoubleHistogramBuilder builder) {
  if (!(builder instanceof ExtendedDoubleHistogramBuilder)) {
    return;
  }
  ((ExtendedDoubleHistogramBuilder) builder)
      .setAttributesAdvice(
          asList(
              SemanticAttributes.HTTP_REQUEST_METHOD,
              SemanticAttributes.HTTP_RESPONSE_STATUS_CODE,
              HttpAttributes.ERROR_TYPE,
              SemanticAttributes.NETWORK_PROTOCOL_NAME,
              SemanticAttributes.NETWORK_PROTOCOL_VERSION,
              SemanticAttributes.SERVER_ADDRESS,
              SemanticAttributes.SERVER_PORT));
}

Note that in the release version this code is slightly different.

the advice not being applied explains why the labels are not filtered out, but the original metric labels still have to come from somewhere; where do they come from?

The instrumentation sets all the attributes; the advice determines which of them are actually added to the metric by default. The advice can be overridden by configuring a metrics view, which allows adding attributes that are not included in the advice to the metric (or dropping attributes that are included).
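
For illustration, a minimal sketch of overriding the advice with a view at SDK setup time; the instrument name matches this issue's metric, and the attribute keys are the dotted forms of the labels shown above (which keys to keep is an example choice):

import io.opentelemetry.sdk.metrics.InstrumentSelector;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.View;
import java.util.Set;

public class HttpClientDurationViewExample {

  // Registers a view that keeps only the listed attribute keys on
  // http.client.duration; every other key (http.url,
  // http.response_content_length, ...) is dropped regardless of advice.
  static SdkMeterProvider buildMeterProvider() {
    return SdkMeterProvider.builder()
        .registerView(
            InstrumentSelector.builder().setName("http.client.duration").build(),
            View.builder()
                .setAttributeFilter(
                    Set.of("http.method", "http.status_code",
                        "net.peer.name", "net.peer.port"))
                .build())
        .build();
  }
}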

laurit added the needs author feedback label on Dec 6, 2023
gaeljw (Author) commented Dec 6, 2023

Oh okay, I think I get it :)

There is a "standard definition" of HTTP client metrics (the OpenTelemetry HTTP semantic conventions), and the instrumentation implements it.

And you are saying that it still makes sense for the http_response_content_length label to be set originally, since it might be used in some cases by overriding the metrics view/advice?
I'd disagree a bit on that, but I don't necessarily have all possible usages in mind; having it disabled by default is enough for me 👌

Now, back to the Trino issue: I may try to provide a fix in their implementation, but I guess it won't be that easy, because they'd want to shade everything except GlobalOpenTelemetry (and maybe a couple of other things?).

It's not clear to me yet how a shaded implementation can live together with the official implementation 🤔

I wonder if it even makes sense to shade OTEL in the first place given their usage. Maybe they should just mark the dependency as provided and let users optionally bring OTEL if they do want it.

If you have any ideas or opinions on this, I'd love to hear them.

Anyway, thanks a lot for your time.

github-actions bot removed the needs author feedback label on Dec 6, 2023
laurit (Contributor) commented Dec 8, 2023

And you are saying that it still makes sense for the http_response_content_length label to be set originally, since it might be used in some cases by overriding the metrics view/advice?

I believe http_response_content_length does not make sense as a metric attribute, but for us it is easier to set all the attributes and let the SDK filter them based on the advice/view. If some user finds that http_response_content_length makes sense as a metric attribute, they can enable it. Before the advice API was added we had a hard-coded list of attributes that were added to the metrics; it did not satisfy the needs of all users.

Now, back to the Trino issue: I may try to provide a fix in their implementation, but I guess it won't be that easy, because they'd want to shade everything except GlobalOpenTelemetry (and maybe a couple of other things?).

I'm sure they are reasonable people and understand that things can't be shaded when doing so breaks them. I'd try removing https://github.com/trinodb/trino/blob/aa431c77a6a187920f5d6433532a19647d901742/client/trino-jdbc/pom.xml#L424-L427 and adding a dependency on https://central.sonatype.com/artifact/io.opentelemetry/opentelemetry-extension-incubator the same way they added https://github.com/trinodb/trino/blob/aa431c77a6a187920f5d6433532a19647d901742/client/trino-jdbc/pom.xml#L86-L90 (note the provided scope).
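
Concretely, the suggestion amounts to a pom.xml change along these lines (a sketch only: the exact version is an assumption and should match the opentelemetry-api version the pom already declares):

<!-- Stop relocating the incubator classes (remove them from the shade plugin
     configuration) and declare the artifact with provided scope instead, so
     the user's unshaded copy is used at runtime. The version below is an
     assumption. -->
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-extension-incubator</artifactId>
    <version>1.31.0-alpha</version>
    <scope>provided</scope>
</dependency>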

laurit added the needs author feedback label on Dec 8, 2023
gaeljw closed this as completed on Dec 8, 2023