
[BUG] Cached metrics makes us crazy #234

Closed
ViseLuca opened this issue May 16, 2024 · 6 comments
Labels: bug (Something isn't working)

@ViseLuca

Hi, we have a system with some cursors where we calculate metrics on how far the cursors lag behind the last events emitted. We have built a custom plugin for this, but sometimes the system gets stuck with stale values in the cache and does not move forward until we restart the Kubernetes pods.

What can we do about that?

@ViseLuca ViseLuca added the bug Something isn't working label May 16, 2024
@akoutmos
Owner

Hello Luca!

Unfortunately I don't have enough information to help you with this one. Could you put together a repo that minimally reproduces the issue you are seeing? Do you think it is a problem with the plugin that you created, as opposed to the PromEx library itself? A shot-in-the-dark guess would be that you want to use a gauge (via last_value, https://hexdocs.pm/telemetry_metrics/Telemetry.Metrics.html#last_value/2) in your plugin if you just want the last value of some measurement.
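
For reference, a last_value gauge pairs a metric definition with the telemetry event that carries the measurement. A rough sketch with placeholder names (nothing here is taken from your plugin):

# Sketch only: [:my_app, :cursor, :delay] is a placeholder event/metric name.
last_value(
  [:my_app, :cursor, :delay],
  event_name: [:my_app, :cursor, :delay],
  measurement: :cursor_delay,
  description: "Last observed delay for each cursor",
  tags: [:cursor_name]
)

# The polling callback then publishes the latest value for the gauge:
:telemetry.execute([:my_app, :cursor, :delay], %{cursor_delay: 42}, %{cursor_name: "Cursor1"})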

Happy to help, but need some more information :).

@ViseLuca
Author

ViseLuca commented May 20, 2024

We have an eventstore with events and cursors, so a table like:

Table: Events
| event_name | offset |
| Event1     | 1      |
| Event1     | 2      |
| Event2     | 3      |

Every cursor has a set of event types that it can read and process.

Table: Cursors
| name    | event_types      | last_offset_read |
| Cursor1 | [Event1]         | 0                |
| Cursor2 | [Event2]         | 0                |
| Cursor3 | [Event1, Event2] | 0                |

We are checking with our plugin how many events each cursor still has to read before reaching the end. So in this case, for example, the metrics would be:
promex_cursor_1 2 (Event1 types)
promex_cursor_2 1 (Event2 types)
promex_cursor_3 3 (Event1 and Event2 types)

We are querying the DB to check how many events are still available after the last_offset_read. Sometimes the metric comes back with the same data for a while and then changes. The problem is that we have alerts on those metrics, and sometimes we get false positive alerts.

Do you have any idea about it?

The problem is that this does not happen locally, while in staging, with all the cursors running (~50 cursors), the metrics are sometimes stuck or totally absent. I'm polling every 10 seconds.

I am already using last_value:

# Polling metric group: every `poll_rate` ms PromEx calls
# {__MODULE__, :cursor_delay_metrics_metrics, []}, and the emitted telemetry
# events are recorded as a last_value (gauge) per cursor_name.
defp cursor_delay_metrics(metric_prefix, poll_rate) do
  Polling.build(
    :cursor_delay_polling,
    poll_rate,
    {__MODULE__, :cursor_delay_metrics_metrics, []},
    [
      last_value(
        metric_prefix ++ @event_delay_cursor,
        event_name: @event_delay_cursor,
        description: "The number of events that all the cursor must process to be aligned",
        measurement: :cursor_delay,
        tags: [:cursor_name]
      )
    ]
  )
end
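
For context, @event_delay_cursor is a module attribute holding the telemetry event name. Its actual value is not shown here, but it would be something along these lines (assumed example only):

# Assumed shape of the attribute referenced above; the real value is not shown in this thread.
@event_delay_cursor [:cursor, :delay]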

and this is the function that calculates the metrics:

# Called by the polling job: for each cursor, count the events it has not
# processed yet and emit a telemetry event that feeds the last_value metric.
def cursor_delay_metrics_metrics do
  CursorSchema
  |> Repo.all()
  |> Enum.each(fn %{id: id, name: name} ->
    # Note: this match raises if the count query fails (e.g. a DB timeout),
    # which crashes the polling callback.
    {:ok, count} = Cursors.count_events_not_processed_yet(id)

    :telemetry.execute(
      @event_delay_cursor,
      %{cursor_delay: count},
      %{cursor_name: name}
    )
  end)
end
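
The count function itself is not shown here; based on the tables above it would presumably be an Ecto query roughly like the following (an assumed sketch, not the actual implementation):

# Assumed sketch of Cursors.count_events_not_processed_yet/1; EventSchema and
# the field names are guesses based on the tables described above.
# Requires `import Ecto.Query` in the module.
def count_events_not_processed_yet(cursor_id) do
  cursor = Repo.get!(CursorSchema, cursor_id)

  count =
    EventSchema
    |> where([e], e.event_name in ^cursor.event_types)
    |> where([e], e.offset > ^cursor.last_offset_read)
    |> Repo.aggregate(:count)

  {:ok, count}
end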

@ViseLuca
Author

Sometimes there are also only a few metrics reported, with the cursor ones missing.

I am wondering: we have 3 pods on k8s; is it possible that the PromEx process is only starting on one of those 3, so the call only succeeds on 1 of the 3?

@fedme

fedme commented Jun 6, 2024

@ViseLuca could it be related to this issue I just opened, #236?

We are observing the same thing, and I have pinpointed it to errors thrown from within the MFA callback that collects the metric in the plugin.

@ViseLuca
Author

ViseLuca commented Jun 7, 2024

@fedme it could be; the DB query was sometimes timing out, so it was raising an error and getting stuck for the same reason. Technically, it fits.
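
If that is the cause, one way to keep a slow or failing query from crashing the polling callback (and, per #236, detaching the polling job) is to rescue around the count and skip the emit for that cursor. A rough sketch, not from the actual codebase:

# Rough sketch (not from the actual codebase): rescue DB errors per cursor so
# a single timeout does not crash the whole polling callback.
# Assumes `require Logger` at the module level.
def cursor_delay_metrics_metrics do
  CursorSchema
  |> Repo.all()
  |> Enum.each(fn %{id: id, name: name} ->
    case safe_count(id) do
      {:ok, count} ->
        :telemetry.execute(@event_delay_cursor, %{cursor_delay: count}, %{cursor_name: name})

      {:error, reason} ->
        Logger.warning("Skipping cursor delay metric for #{name}: #{inspect(reason)}")
    end
  end)
end

# Hypothetical helper that converts raised DB errors (e.g. timeouts) into an
# error tuple instead of letting them crash the caller.
defp safe_count(cursor_id) do
  Cursors.count_events_not_processed_yet(cursor_id)
rescue
  error -> {:error, error}
end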

@akoutmos
Owner

akoutmos commented Aug 9, 2024

Closing this ticket for now as a release will be cut soon with the ability to not detach the polling job when an error is encountered (example in #236).

@akoutmos akoutmos closed this as completed Aug 9, 2024