Behaviour with zapi collector data schedule - After losing cluster comms #356

Closed
hashi825 opened this issue Jul 28, 2021 · 10 comments · Fixed by #436 or #444
Labels: bug (Something isn't working), status/done

Comments

hashi825 commented Jul 28, 2021

Describe the bug
The default data schedule for the Zapi collector is 180s, defined in conf/zapi/default.yaml. When using the Prometheus exporter, the data is normally always available after the collector starts, regardless of the schedule. However, if these collectors fail and then recover (in my case, we lost comms to one of the clusters), the exporter begins to report data samples for Zapi metrics only every 180s, when it should be returning cached instances between those intervals.

Environment
Provide accurate information about the environment to help us reproduce the issue.

  • Harvest version: harvest version 21.05.4-2 (commit 19f8f25) (build date 2021-07-22T15:41:23+0000) linux/amd64
  • Command line arguments used: [e.g. bin/harvest start --config=foo.yml --collectors Zapi]
  • OS: RHEL 7.9
  • Install method: yum
  • ONTAP Version: 9.4P8
  • Other:

To Reproduce
Lose connection to a cluster (possibly for longer than the schedule interval? At least that's what happened for us).

Expected behavior
Zapi collectors should recover and cache data appropriately for the Prometheus exporter.

Actual behavior
Zapi collectors do not cache data, and the exporter only returns samples according to the schedule interval.

Possible solution, workaround, fix
Restarting Harvest restores correct behaviour.

Additional context
There's nothing in the logs to indicate any issues. Funnily enough, when Harvest loses connection to a cluster, the only logging shown is "context deadline exceeded" for ZapiPerf collectors, and there is zero logging for Zapi collectors.

EDIT
This seems to affect ZapiPerf as well. I'm actually not sure about the correlation between the data schedules; all I can say is that after collector recovery they seem to only produce metrics every 15 mins.

cgrinds (Collaborator) commented Jul 28, 2021

Hi @hashi825, let me describe what I did, and maybe you can point out whether I've understood your ask.

I simulated an unreachable cluster by modifying PollData() and PollInstance() in zapi.go to return a connection error after 3 successful polls. The idea is that default.yaml defines a ratio of one instance poll to every three data polls, which means the schedule looks something like this:

i = instance
d = data
i d d d i d d d i d d ...
0 1 2 3 4 5 6 7 8 9 10   poll_index

My hack was to add a global counter, poll_index, that's incremented every time PollData() and PollInstance() are called. When poll_index is between 3 and 5, those two Poll methods return a connection error instead; see the sketch below.
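
A minimal, self-contained sketch of that fault-injection idea (hypothetical names and a standalone program, not the actual zapi.go code):

package main

import (
    "errors"
    "fmt"
)

// pollIndex mirrors the global counter described above (hypothetical sketch,
// not the real collector code).
var pollIndex int

// pollData stands in for PollData(): polls 3 through 5 return a simulated
// connection error, every other poll succeeds.
func pollData() (int, error) {
    pollIndex++
    if pollIndex >= 3 && pollIndex <= 5 {
        return 0, errors.New("simulated connection error")
    }
    return 21564, nil // pretend we collected the usual number of metrics
}

func main() {
    for i := 0; i < 8; i++ {
        if n, err := pollData(); err != nil {
            fmt.Printf("poll %d: %v\n", pollIndex, err)
        } else {
            fmt.Printf("poll %d: %d metrics\n", pollIndex, n)
        }
    }
}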

In a separate terminal I ran this fish-shell loop

while true; printf "%s %s\n" (date) (curl -s 'http://localhost:13001/metrics' | wc -l); sleep 60s; end

which prints the following (with the out-of-the-box defaults of instance=10m and data=3m)

Wed Jul 28 15:05:23 EDT 2021    21561
Wed Jul 28 15:06:23 EDT 2021    21565
Wed Jul 28 15:07:23 EDT 2021    21565
Wed Jul 28 15:08:23 EDT 2021       38
Wed Jul 28 15:09:23 EDT 2021       38
Wed Jul 28 15:10:23 EDT 2021    21564
Wed Jul 28 15:11:23 EDT 2021    21564
Wed Jul 28 15:12:23 EDT 2021    21564

That means I get the expected 21,564 metrics for the first three polls, then a period of only 38 lines while the connection is down, followed by a return to normal once the connection problem clears.

Are you suggesting that Harvest should return ~21,500 metrics instead of 38 in my example above, or that after a connection error you only see 38 even after the connection is restored?

hashi825 (Author) commented

Basically, after the collector recovered, my data looks like this. The dotted data points are 15 mins apart; this appears the same for Zapi and ZapiPerf collector data (this particular metric was aggr_space_physical_used_percent). The data past those 15-min data points is from after I restarted Harvest.

[screenshot: aggr metrics graph]

cgrinds (Collaborator) commented Jul 29, 2021

@hashi825 thanks for the screenshot. By chance, have you checked Prometheus's logs? Those gaps are certainly unexpected. Here's a similar graph from when I lost connection: no gaps after the connection is restored. And you said there was nothing interesting in the Harvest logs for these pollers during the dot times? Any edits to the out-of-the-box schedules?

[screenshot: graph with no gaps after the connection is restored]

hashi825 (Author) commented

There were no scrape errors in the log. This Prometheus instance sits on the same server as Harvest, which collects metrics from 10 other clusters. I'll double-check both logs again, but for what it's worth, this cluster was unreachable for about 3 hours, so I'm not sure if the amount of time it was unreachable plays into reproducing this issue.

hashi825 (Author) commented Aug 1, 2021

@cgrinds

This happened again, this time because we were doing an ONTAP upgrade on this particular cluster. I haven't found the root cause, but essentially the metrics become available for about 5 minutes before disappearing; I tested this by repeatedly curling the metrics URL. Once they disappear, the only metrics returned are metadata_component_status and metadata_component_count, and metadata_component_count reports 0 counts for all ZapiPerf and Zapi metrics.

For example:

metadata_component_count{hostname="xxx",instance="xxx",job="harvest",name="Zapi",poller="xxx",reason="running",target="Aggregate",type="collector",version="21.05.4"} 0

cgrinds (Collaborator) commented Aug 2, 2021

Thanks for the update @hashi825. We haven't figured out how to reproduce this yet, but we'll try a longer period of being offline.

vgratian (Contributor) commented

Hi @hashi825, regarding:

if these collectors fail and then recover (in my case, we lost comms to one of the clusters), the exporter begins to only report data samples for Zapi metrics every 180s, when it should be returning cached instances between those intervals.

The Prometheus exporter caches metrics only for a limited amount of time; by default this is 180s. You can change this to a longer interval; see the cache_max_keep parameter in the docs.
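
Conceptually, the behaviour is that of a time-bounded cache. Here is a minimal sketch, assuming a simple freshness check; the names are hypothetical and this is not the exporter's actual implementation:

package main

import (
    "fmt"
    "time"
)

// cacheMaxKeep mirrors the idea behind cache_max_keep: metrics older than
// this are no longer served. (Hypothetical sketch, not the real exporter.)
const cacheMaxKeep = 180 * time.Second

type cachedMetrics struct {
    lines    []string
    storedAt time.Time // time of the last successful poll
}

// render serves the cached metrics only while they are fresh; once
// cacheMaxKeep has elapsed without a new poll, nothing is returned.
func (c *cachedMetrics) render(now time.Time) []string {
    if now.Sub(c.storedAt) > cacheMaxKeep {
        return nil
    }
    return c.lines
}

func main() {
    c := cachedMetrics{
        lines:    []string{"aggr_space_used_percent 42"},
        storedAt: time.Now(),
    }
    fmt.Println(len(c.render(time.Now())))                       // 1: still fresh
    fmt.Println(len(c.render(time.Now().Add(10 * time.Minute)))) // 0: expired
}

So with the default 180s cache and a 180s data schedule, any poll that slips past its interval leaves a window with no samples.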

@vgratian added the wontfix (This will not be worked on) label and removed the status/needs-triage label Aug 12, 2021
hashi825 (Author) commented

Hi @hashi825, regarding:

if these collectors fail and then recover (in my case, we lost comms to one of the clusters), the exporter begins to only report data samples for Zapi metrics every 180s, when it should be returning cached instances between those intervals.

The Prometheus exporter will cache metrics only for a limited amount of time, by default this is 180s. You can change this to a longer interval, check the cache_max_keep parameter in the doc.

That still doesn't explain why, after a long disconnect, once the collectors recover, metrics are only available every 15 mins (for 180s at a time) when the schedule is 60s.

vgratian (Contributor) commented

Thanks, I'll respond to that as well.

@vgratian reopened this Aug 12, 2021
@cgrinds assigned cgrinds and unassigned vgratian Aug 20, 2021
cgrinds added a commit that referenced this issue Aug 20, 2021: "add more logging when connection fails" (Fixes #356)
cgrinds added a commit that referenced this issue Aug 20, 2021: "add more logging when connection fails" (Fixes #356)
@cgrinds added the bug (Something isn't working) and status/testme labels and removed the wontfix label Aug 20, 2021
@rahulguptajss reopened this Aug 24, 2021
@rahulguptajss self-assigned this Aug 24, 2021
rahulguptajss (Contributor) commented

There is a problem with the schedule framework, which changes the task interval. Resetting the retry delay fixes the 17-minute gap issue. We still need to handle the change of task interval for any other edge cases. A sketch of the failure mode is below.
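
A minimal, hypothetical sketch of that failure mode (not Harvest's actual schedule code): if error handling backs off the retry delay but a successful poll never resets it, the effective interval stays inflated after the cluster comes back.

package main

import (
    "fmt"
    "time"
)

// task is a hypothetical scheduled poll with exponential backoff on errors;
// it is not Harvest's schedule framework.
type task struct {
    interval   time.Duration // configured poll interval, e.g. 60s or 180s
    retryDelay time.Duration // grows while the cluster is unreachable
}

// onError backs off the retry delay, doubling it on each failed poll.
func (t *task) onError() {
    if t.retryDelay == 0 {
        t.retryDelay = t.interval
    }
    t.retryDelay *= 2
}

// onSuccess must reset the retry delay; forgetting to do so is the kind of
// bug that leaves a task polling only every ~15-17 minutes after recovery.
func (t *task) onSuccess() {
    t.retryDelay = 0
}

// next returns the wait until the next poll: the backed-off delay while
// failing, otherwise the configured interval.
func (t *task) next() time.Duration {
    if t.retryDelay > 0 {
        return t.retryDelay
    }
    return t.interval
}

func main() {
    t := task{interval: 60 * time.Second}
    for i := 0; i < 4; i++ { // four failed polls while the cluster is down
        t.onError()
    }
    fmt.Println("after the outage, next poll in:", t.next()) // 16m0s if never reset
    t.onSuccess()
    fmt.Println("after resetting the delay:", t.next()) // back to 1m0s
}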
