
Can't avoid duplicate metrics in short-lived cloud functions, e.g. GCP Cloud Function #35522

Closed
AkselAllas opened this issue Oct 1, 2024 · 9 comments

Comments

@AkselAllas

Component(s)

exporter/googlecloud

Describe the issue you're reporting

I am calling forceFlush multiple times in quick succession (e.g. 2x in 0.5 sec) because GCP Cloud Functions run once and then detach the CPU. As a result, a periodic metric exporter will often either fail to export (because CPU/network was detached before the export ran) or spam errors by attempting to export after CPU/network has been detached.

Is it possible to call forceFlush (e.g. Node.js metricReader.forceFlush()) multiple times and somehow not end up with duplicate-metrics errors in the OTel Collector?
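
Roughly the pattern I mean (a minimal sketch; metricReader, meter and the ./otel module are just illustrative names for however the SDK is wired up):

import * as functions from '@google-cloud/functions-framework';
// Illustrative module that sets up the SDK and exports the PeriodicExportingMetricReader and a Meter.
import { metricReader, meter } from './otel';

const counter = meter.createCounter('function_invocations');

functions.http('helloWorld', async (req, res) => {
  counter.add(1);
  // Flush before responding, because CPU/network may be detached as soon as the response is sent.
  await metricReader.forceFlush();
  res.send('Hello, World\n');
});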

For example, can I somehow use an OTel Collector processor to remove duplicates before export? My main problem is that the duplicate errors create noise in the otelcol_exporter_send_failed_metric_points_total metric, which I use to detect lost metrics.

AkselAllas added the needs triage label on Oct 1, 2024
Contributor

github-actions bot commented Oct 1, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

dashpole removed the needs triage label on Oct 1, 2024
@dashpole
Contributor

dashpole commented Oct 1, 2024

@psx95, I know you looked into this recently. Can you respond?

Also, @AkselAllas, can you share more about your setup? Is your application sending to a collector, or directly to Google Cloud?

@dashpole
Contributor

dashpole commented Oct 1, 2024

@AkselAllas can you share your collector config? Since Cloud Monitoring can only accept points every 5 seconds, you will need to aggregate over time to avoid errors. Something like https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/intervalprocessor should be what you need, but it is listed as being under development right now.

@AkselAllas
Author

AkselAllas commented Oct 1, 2024

@dashpole How would the linked processor work with a 10-second batch? I currently have e.g.:

processors:
  batch:
    # batch metrics before sending to reduce API usage
    send_batch_max_size: 200
    send_batch_size: 200
    timeout: 10s
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  transform:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["exported_location"], attributes["location"])
          - delete_key(attributes, "location")
          - set(attributes["exported_cluster"], attributes["cluster"])
          - delete_key(attributes, "cluster")
          - set(attributes["exported_namespace"], attributes["namespace"])
          - delete_key(attributes, "namespace")
          - set(attributes["exported_job"], attributes["job"])
          - delete_key(attributes, "job")
          - set(attributes["exported_instance"], attributes["instance"])
          - delete_key(attributes, "instance")
          - set(attributes["exported_project_id"], attributes["project_id"])
          - delete_key(attributes, "project_id")

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, transform, batch]
      exporters: [googlemanagedprometheus]

@dashpole
Contributor

dashpole commented Oct 1, 2024

I haven't tried it, but since it is aggregating metrics over time, I would expect it to replace the batch processor in your setup.
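
Untested, but roughly something like this (the processor is still under development, so the exact options may change):

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  transform:
    # ...same statements as in your config above...
  interval:
    # aggregate data points over a fixed window instead of batching
    interval: 60s

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, transform, interval]
      exporters: [googlemanagedprometheus]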

@psx95
Contributor

psx95 commented Oct 1, 2024

@AkselAllas, force-flushing metrics in such quick succession will not help, because the granularity of metrics for Cloud Monitoring is at minimum a 10-second interval. You are likely to run into errors if you export more frequently than every 10 seconds.

The core of your problem, as you have correctly identified, is that the CPU is detached once your function completes, which prevents background export.
I'm not exactly sure how you're currently deploying a Cloud Function + OTel Collector, but you can deploy a multi-container Cloud Function using the gcloud run deploy command, for example:

# --no-cpu-throttling keeps the CPU always allocated
gcloud beta run deploy cloud-func-helloworld2 \
  --no-cpu-throttling \
  --container app-function \
  --function org.example.HelloWorld \
  --build-env-vars-file=config/build-env-vars.yaml \
  --source=build/libs \
  --port=8080 \
  --container otel-collector \
  --image=us-central1-docker.pkg.dev/your-gcp-project/your-artifact-registry/otel-collector:latest

Alternatively, once you have deployed a Cloud Function, you can do the same manually from the GCP console:

  1. Open your function in Cloud Run.

  2. Edit and re-deploy a new revision with always-allocated CPU.

I have personally tried this with the following function code:

package org.example;

import com.google.cloud.functions.HttpFunction;
import com.google.cloud.functions.HttpRequest;
import com.google.cloud.functions.HttpResponse;
import io.opentelemetry.api.metrics.LongCounter;
import java.util.Random;

public class HelloWorld implements HttpFunction {
  private static final OpenTelemetryConfig openTelemetryConfig = OpenTelemetryConfig.getInstance();
  private static final LongCounter counter =
      openTelemetryConfig
          .getMeterProvider()
          .get("sample-function-library")
          .counterBuilder("function_counter_psx")
          .setDescription("random counter")
          .build();
  private static final Random random = new Random();

  public HelloWorld() {
    super();
    Runtime.getRuntime()
        .addShutdownHook(
            new Thread(
                () -> {
                  System.out.println("Closing OTel SDK");
                  openTelemetryConfig.closeSdk();
                  System.out.println("Sdk closed");
                }));
  }

  @Override
  public void service(HttpRequest request, HttpResponse response) throws Exception {
    System.out.println("received request: " + request.toString());
    counter.add(random.nextInt(100));
    response.getWriter().write("Hello, World\n");
    System.out.println("Function exited");
  }
}

With this setup, I was able to export metrics from a Cloud Function to a collector running in a sidecar. The PeriodicMetricReader was configured with an export interval of 10 seconds.

With always-allocated CPU, I was also able to verify that the shutdown hook was called and that the close() method flushes any pending metrics you may have remaining (at least in the Java implementation).

I have tried this only with Java, but I imagine it would work in NodeJS as well.
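
In Node.js, the equivalent would presumably be a SIGTERM handler that shuts down the MeterProvider (Cloud Run sends SIGTERM before stopping an instance, and with always-allocated CPU that handler actually gets to run). A rough, untested sketch, where the ./otel module is just a stand-in for wherever your SDK is initialized:

// Illustrative module that wires up the SDK and exports the MeterProvider.
import { meterProvider } from './otel';

// Cloud Run sends SIGTERM before stopping an instance; with always-allocated CPU
// this handler gets CPU time to run.
process.on('SIGTERM', async () => {
  try {
    // shutdown() flushes any pending metrics before stopping the readers.
    await meterProvider.shutdown();
  } finally {
    process.exit(0);
  }
});
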
Let me know if this helps.

@AkselAllas
Author

@psx95 Thank you for the in-depth response!

Our current problem is that we have tens of v1 Cloud Functions, and v1 doesn't support shutdown hooks.
But I think the correct solution might indeed be to migrate to v2.

Contributor

github-actions bot commented Dec 2, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Dec 2, 2024
@dashpole
Contributor

dashpole commented Dec 2, 2024

Feel free to reopen if you have further questions.

dashpole closed this as completed on Dec 2, 2024