
[BUG] Cached metrics makes us crazy #234

Closed
ViseLuca opened this issue May 16, 2024 · 6 comments
Labels: bug (Something isn't working)

@ViseLuca

Hi, we have a system with some cursors where we calculate metrics on how far the cursors lag behind the last events emitted. We have built a custom plugin for this, but sometimes the system gets stuck with stale values in the cache and does not move forward until we restart the Kubernetes pods.

What can we do about that?

@ViseLuca ViseLuca added the bug Something isn't working label May 16, 2024
@akoutmos
Owner

Hello Luca!

Unfortunately I don't have enough information to help you with this one. Could you put together a repo that minimally reproduces the issue you are seeing? Do you think it is a problem with the plugin that you created, as opposed to the PromEx library itself? A shot-in-the-dark guess would be that you want to use a gauge (via last_value, https://hexdocs.pm/telemetry_metrics/Telemetry.Metrics.html#last_value/2) in your plugin if you just want the last value of some measurement.
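
For reference, a last_value gauge pairs a metric definition with the telemetry event that carries the measurement. A rough sketch with placeholder names (nothing here is taken from your plugin):

# Sketch only: [:my_app, :cursor, :delay] is a placeholder event/metric name.
last_value(
  [:my_app, :cursor, :delay],
  event_name: [:my_app, :cursor, :delay],
  measurement: :cursor_delay,
  description: "Last observed delay for each cursor",
  tags: [:cursor_name]
)

# The polling callback then publishes the latest value for the gauge:
:telemetry.execute([:my_app, :cursor, :delay], %{cursor_delay: 42}, %{cursor_name: "Cursor1"})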

Happy to help, but need some more information :).

@ViseLuca
Author

ViseLuca commented May 20, 2024

We have an eventstore with events and cursors, so a table like:

Table: Events
| event_name | offset |
| Event1     | 1      |
| Event1     | 2      |
| Event2     | 3      |

Every cursor has a set of event types that it can read and process.

Table: Cursors
| name    | event_types      | last_offset_read |
| Cursor1 | [Event1]         | 0                |
| Cursor2 | [Event2]         | 0                |
| Cursor3 | [Event1, Event2] | 0                |

We are checking with our plugin how many events each cursor still has to read before reaching the end. So in this case, for example, the metrics would be:
promex_cursor_1 2 (Event1 types)
promex_cursor_2 1 (Event2 types)
promex_cursor_3 3 (Event1 and Event2 types)

We are querying the DB to check how many events are still available after the last_offset_read. Sometimes the metric comes back with the same data for a while and then changes. The problem is that we have alerts on those metrics, and sometimes we get false positive alerts.

Do you have any idea about it?

The problem is that this does not happen locally, while in staging, with all the cursors running (~50 cursors), the metrics are sometimes stuck or totally absent. I'm polling every 10 seconds.

I am already using last_value:

# Polling metric group: every `poll_rate` ms PromEx calls
# {__MODULE__, :cursor_delay_metrics_metrics, []}, and the emitted telemetry
# events are recorded as a last_value (gauge) per cursor_name.
defp cursor_delay_metrics(metric_prefix, poll_rate) do
  Polling.build(
    :cursor_delay_polling,
    poll_rate,
    {__MODULE__, :cursor_delay_metrics_metrics, []},
    [
      last_value(
        metric_prefix ++ @event_delay_cursor,
        event_name: @event_delay_cursor,
        description: "The number of events that all the cursor must process to be aligned",
        measurement: :cursor_delay,
        tags: [:cursor_name]
      )
    ]
  )
end
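
For context, @event_delay_cursor is a module attribute holding the telemetry event name. Its actual value is not shown here, but it would be something along these lines (assumed example only):

# Assumed shape of the attribute referenced above; the real value is not shown in this thread.
@event_delay_cursor [:cursor, :delay]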

and this is the function that calculates the metrics:

# Called by the polling job: for each cursor, count the events it has not
# processed yet and emit a telemetry event that feeds the last_value metric.
def cursor_delay_metrics_metrics do
  CursorSchema
  |> Repo.all()
  |> Enum.each(fn %{id: id, name: name} ->
    # Note: this match raises if the count query fails (e.g. a DB timeout),
    # which crashes the polling callback.
    {:ok, count} = Cursors.count_events_not_processed_yet(id)

    :telemetry.execute(
      @event_delay_cursor,
      %{cursor_delay: count},
      %{cursor_name: name}
    )
  end)
end
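
The count function itself is not shown here; based on the tables above it would presumably be an Ecto query roughly like the following (an assumed sketch, not the actual implementation):

# Assumed sketch of Cursors.count_events_not_processed_yet/1; EventSchema and
# the field names are guesses based on the tables described above.
# Requires `import Ecto.Query` in the module.
def count_events_not_processed_yet(cursor_id) do
  cursor = Repo.get!(CursorSchema, cursor_id)

  count =
    EventSchema
    |> where([e], e.event_name in ^cursor.event_types)
    |> where([e], e.offset > ^cursor.last_offset_read)
    |> Repo.aggregate(:count)

  {:ok, count}
end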

@ViseLuca
Author

Sometimes there are also only a few metrics reported, with the cursor ones missing.

I am wondering: we have 3 pods on k8s; is it possible that the PromEx process is only starting on one of those 3, so the call only succeeds on 1 of the 3?

@fedme

fedme commented Jun 6, 2024

@ViseLuca could it be related to this issue I just opened, #236?

We are observing the same thing, and I have pinpointed it to errors thrown from within the MFA callback that collects the metric in the plugin.

@ViseLuca
Author

ViseLuca commented Jun 7, 2024

@fedme it could be; the DB query was sometimes timing out, so it was raising an error and getting stuck for the same reason. Technically, it fits.
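
If that is the cause, one way to keep a slow or failing query from crashing the polling callback (and, per #236, detaching the polling job) is to rescue around the count and skip the emit for that cursor. A rough sketch, not from the actual codebase:

# Rough sketch (not from the actual codebase): rescue DB errors per cursor so
# a single timeout does not crash the whole polling callback.
# Assumes `require Logger` at the module level.
def cursor_delay_metrics_metrics do
  CursorSchema
  |> Repo.all()
  |> Enum.each(fn %{id: id, name: name} ->
    case safe_count(id) do
      {:ok, count} ->
        :telemetry.execute(@event_delay_cursor, %{cursor_delay: count}, %{cursor_name: name})

      {:error, reason} ->
        Logger.warning("Skipping cursor delay metric for #{name}: #{inspect(reason)}")
    end
  end)
end

# Hypothetical helper that converts raised DB errors (e.g. timeouts) into an
# error tuple instead of letting them crash the caller.
defp safe_count(cursor_id) do
  Cursors.count_events_not_processed_yet(cursor_id)
rescue
  error -> {:error, error}
end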

@akoutmos
Owner

akoutmos commented Aug 9, 2024

Closing this ticket for now as a release will be cut soon with the ability to not detach the polling job when an error is encountered (example in #236).

@akoutmos akoutmos closed this as completed Aug 9, 2024