
Can't avoid duplicate metrics in short-lived cloud functions, e.g. GCP Cloud Function #35522

Closed
AkselAllas opened this issue Oct 1, 2024 · 9 comments

Comments

@AkselAllas

Component(s)

exporter/googlecloud

Describe the issue you're reporting

I am calling forceFlush multiple times in quick succession (e.g. 2x in 0.5 sec) because GCP Cloud Functions run once and then detach the CPU. As a result, a periodic metric exporter will often either fail to export (because CPU/network was detached before the export ran) or spam errors by attempting to export after CPU/network has been detached.

Is it possible to call forceFlush (e.g. Node.js metricReader.forceFlush()) multiple times and somehow not end up with duplicate-metrics errors in the OTel Collector?
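
Roughly the pattern I mean (a minimal sketch; metricReader, meter and the ./otel module are just illustrative names for however the SDK is wired up):

import * as functions from '@google-cloud/functions-framework';
// Illustrative module that sets up the SDK and exports the PeriodicExportingMetricReader and a Meter.
import { metricReader, meter } from './otel';

const counter = meter.createCounter('function_invocations');

functions.http('helloWorld', async (req, res) => {
  counter.add(1);
  // Flush before responding, because CPU/network may be detached as soon as the response is sent.
  await metricReader.forceFlush();
  res.send('Hello, World\n');
});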

For example, can I somehow use an OTel Collector processor to remove duplicates before export? My main problem is that the duplicate errors create noise in the otelcol_exporter_send_failed_metric_points_total metric, which I use to detect lost metrics.

AkselAllas added the needs triage label on Oct 1, 2024
Contributor

github-actions bot commented Oct 1, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

dashpole removed the needs triage label on Oct 1, 2024
@dashpole
Contributor

dashpole commented Oct 1, 2024

@psx95, I know you looked into this recently. Can you respond?

Also, @AkselAllas, can you share more about your setup? Is your application sending to a collector, or directly to Google Cloud?

@dashpole
Contributor

dashpole commented Oct 1, 2024

@AkselAllas can you share your collector config? Since Cloud Monitoring can only accept points every 5 seconds, you will need to aggregate over time to avoid errors. Something like https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/intervalprocessor should be what you need, but it is listed as being under development right now.

@AkselAllas
Author

AkselAllas commented Oct 1, 2024

@dashpole How would the linked processor work with a 10-second batch? I currently have e.g.:

processors:
  batch:
    # batch metrics before sending to reduce API usage
    send_batch_max_size: 200
    send_batch_size: 200
    timeout: 10s
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  transform:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["exported_location"], attributes["location"])
          - delete_key(attributes, "location")
          - set(attributes["exported_cluster"], attributes["cluster"])
          - delete_key(attributes, "cluster")
          - set(attributes["exported_namespace"], attributes["namespace"])
          - delete_key(attributes, "namespace")
          - set(attributes["exported_job"], attributes["job"])
          - delete_key(attributes, "job")
          - set(attributes["exported_instance"], attributes["instance"])
          - delete_key(attributes, "instance")
          - set(attributes["exported_project_id"], attributes["project_id"])
          - delete_key(attributes, "project_id")

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, transform, batch]
      exporters: [googlemanagedprometheus]

@dashpole
Contributor

dashpole commented Oct 1, 2024

I haven't tried it, but since it is aggregating metrics over time, I would expect it to replace the batch processor in your setup.
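
Untested, but roughly something like this (the processor is still under development, so the exact options may change):

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  transform:
    # ...same statements as in your config above...
  interval:
    # aggregate data points over a fixed window instead of batching
    interval: 60s

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, transform, interval]
      exporters: [googlemanagedprometheus]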

@psx95
Contributor

psx95 commented Oct 1, 2024

@AkselAllas, force-flushing metrics in such quick succession will not help, because the granularity of metrics for Cloud Monitoring is at minimum a 10-second interval. You are likely to run into errors if you export more frequently than every 10 seconds.

The core of your problem, as you have correctly identified, is that the CPU is detached once your function completes, which prevents background export.
I'm not exactly sure how you're currently deploying a Cloud Function + OTel Collector, but you can deploy a multi-container Cloud Function using the gcloud run deploy command, for example:

# --no-cpu-throttling keeps the CPU always allocated
gcloud beta run deploy cloud-func-helloworld2 \
  --no-cpu-throttling \
  --container app-function \
  --function org.example.HelloWorld \
  --build-env-vars-file=config/build-env-vars.yaml \
  --source=build/libs \
  --port=8080 \
  --container otel-collector \
  --image=us-central1-docker.pkg.dev/your-gcp-project/your-artifact-registry/otel-collector:latest

Alternatively, once you have deployed a Cloud Function, you can do the same manually from the GCP console:

  1. Open your function in Cloud Run.

  2. Edit and re-deploy a new revision with always-allocated CPU.

I have personally tried this with the following function code:

package org.example;

import com.google.cloud.functions.HttpFunction;
import com.google.cloud.functions.HttpRequest;
import com.google.cloud.functions.HttpResponse;
import io.opentelemetry.api.metrics.LongCounter;
import java.util.Random;

public class HelloWorld implements HttpFunction {
  private static final OpenTelemetryConfig openTelemetryConfig = OpenTelemetryConfig.getInstance();
  private static final LongCounter counter =
      openTelemetryConfig
          .getMeterProvider()
          .get("sample-function-library")
          .counterBuilder("function_counter_psx")
          .setDescription("random counter")
          .build();
  private static final Random random = new Random();

  public HelloWorld() {
    super();
    Runtime.getRuntime()
        .addShutdownHook(
            new Thread(
                () -> {
                  System.out.println("Closing OTel SDK");
                  openTelemetryConfig.closeSdk();
                  System.out.println("Sdk closed");
                }));
  }

  @Override
  public void service(HttpRequest request, HttpResponse response) throws Exception {
    System.out.println("received request: " + request.toString());
    counter.add(random.nextInt(100));
    response.getWriter().write("Hello, World\n");
    System.out.println("Function exited");
  }
}

With this setup, I was able to export metrics from a Cloud Function to a collector running in a sidecar. The PeriodicMetricReader was configured with an export interval of 10 seconds.

With always-allocated CPU, I was also able to verify that the shutdown hook was called and that the close() method flushes any pending metrics you may have remaining (at least in the Java implementation).

I have tried this only with Java, but I imagine it would work in NodeJS as well.
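
In Node.js, the equivalent would presumably be a SIGTERM handler that shuts down the MeterProvider (Cloud Run sends SIGTERM before stopping an instance, and with always-allocated CPU that handler actually gets to run). A rough, untested sketch, where the ./otel module is just a stand-in for wherever your SDK is initialized:

// Illustrative module that wires up the SDK and exports the MeterProvider.
import { meterProvider } from './otel';

// Cloud Run sends SIGTERM before stopping an instance; with always-allocated CPU
// this handler gets CPU time to run.
process.on('SIGTERM', async () => {
  try {
    // shutdown() flushes any pending metrics before stopping the readers.
    await meterProvider.shutdown();
  } finally {
    process.exit(0);
  }
});
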
Let me know if this helps.

@AkselAllas
Author

@psx95 Thank you for the in-depth response!

Our current problem is that we have tens of v1 Cloud Functions, and v1 doesn't support shutdown hooks.
But I think the correct solution might indeed be to migrate to v2.

Contributor

github-actions bot commented Dec 2, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Dec 2, 2024
@dashpole
Contributor

dashpole commented Dec 2, 2024

Feel free to reopen if you have further questions.

dashpole closed this as completed on Dec 2, 2024