Behaviour with zapi collector data schedule - After losing cluster comms #356

Closed
hashi825 opened this issue Jul 28, 2021 · 10 comments · Fixed by #436 or #444
Labels: bug (Something isn't working), status/done

Comments

hashi825 commented Jul 28, 2021

Describe the bug
The default data schedule for the Zapi collector is 180s, defined in conf/zapi/default.yaml. When using the Prometheus exporter, the data is normally always available after the collector starts, regardless of the schedule. However, if these collectors fail and then recover (in my case, we lost comms to one of the clusters), the exporter begins to report data samples for Zapi metrics only every 180s, when it should be returning cached instances between those intervals.

Environment
Provide accurate information about the environment to help us reproduce the issue.

  • Harvest version: harvest version 21.05.4-2 (commit 19f8f25) (build date 2021-07-22T15:41:23+0000) linux/amd64
  • Command line arguments used: [e.g. bin/harvest start --config=foo.yml --collectors Zapi]
  • OS: RHEL 7.9
  • Install method: yum
  • ONTAP Version: 9.4P8
  • Other:

To Reproduce
Lose connection to a cluster (possibly for longer than the schedule interval? At least that's what happened for us).

Expected behavior
Zapi collectors should recover and cache data appropriately for the Prometheus exporter.

Actual behavior
Zapi collectors do not cache data, and the exporter only returns samples according to the schedule interval.

Possible solution, workaround, fix
Restarting Harvest restores correct behaviour.

Additional context
There's nothing in the logs to indicate any issues. Funnily enough, when Harvest loses connection to a cluster, the only logging shown is "context deadline exceeded" for ZapiPerf collectors, and there is zero logging for Zapi collectors.

EDIT
This seems to affect ZapiPerf as well. I'm actually not sure about the correlation between the data schedules; all I can say is that after collector recovery they seem to only produce metrics every 15 mins.

cgrinds (Collaborator) commented Jul 28, 2021

Hi @hashi825, let me describe what I did, and maybe you can point out whether I've understood your ask.

I simulated an unreachable cluster by modifying PollData() and PollInstance() in zapi.go to return a connection error after 3 successful polls. The idea is that default.yaml defines a ratio of one instance poll to every three data polls, which means the schedule looks something like this:

i = instance
d = data
i d d d i d d d i d d ...
0 1 2 3 4 5 6 7 8 9 10   poll_index

My hack was to add a global counter, poll_index, that's incremented every time PollData() and PollInstance() are called. When poll_index is between 3 and 5, those two Poll methods return a connection error instead; see the sketch below.
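
A minimal, self-contained sketch of that fault-injection idea (hypothetical names and a standalone program, not the actual zapi.go code):

package main

import (
    "errors"
    "fmt"
)

// pollIndex mirrors the global counter described above (hypothetical sketch,
// not the real collector code).
var pollIndex int

// pollData stands in for PollData(): polls 3 through 5 return a simulated
// connection error, every other poll succeeds.
func pollData() (int, error) {
    pollIndex++
    if pollIndex >= 3 && pollIndex <= 5 {
        return 0, errors.New("simulated connection error")
    }
    return 21564, nil // pretend we collected the usual number of metrics
}

func main() {
    for i := 0; i < 8; i++ {
        if n, err := pollData(); err != nil {
            fmt.Printf("poll %d: %v\n", pollIndex, err)
        } else {
            fmt.Printf("poll %d: %d metrics\n", pollIndex, n)
        }
    }
}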

In a separate terminal I ran this fish-shell loop

while true; printf "%s %s\n" (date) (curl -s 'http://localhost:13001/metrics' | wc -l); sleep 60s; end

which prints the following (with the out-of-the-box defaults of instance=10m and data=3m)

Wed Jul 28 15:05:23 EDT 2021    21561
Wed Jul 28 15:06:23 EDT 2021    21565
Wed Jul 28 15:07:23 EDT 2021    21565
Wed Jul 28 15:08:23 EDT 2021       38
Wed Jul 28 15:09:23 EDT 2021       38
Wed Jul 28 15:10:23 EDT 2021    21564
Wed Jul 28 15:11:23 EDT 2021    21564
Wed Jul 28 15:12:23 EDT 2021    21564

That means I get the expected 21,564 metrics for the first three polls, then a period of only 38 lines while the connection is down, followed by a return to normal once the connection problem clears.

Are you suggesting that Harvest should return ~21,500 metrics instead of 38 in my example above, or that after a connection error you only see 38 even after the connection is restored?

hashi825 (Author) commented

Basically, after the collector recovered, my data looks like this. The dotted data points are 15 mins apart; this appears the same for Zapi and ZapiPerf collector data (this particular metric was aggr_space_physical_used_percent). The data past those 15-min data points is from after I restarted Harvest.

[screenshot: aggr metrics graph]

cgrinds (Collaborator) commented Jul 29, 2021

@hashi825 thanks for the screenshot. By chance, have you checked Prometheus's logs? Those gaps are certainly unexpected. Here's a similar graph from when I lost connection: no gaps after the connection is restored. And you said there was nothing interesting in the Harvest logs for these pollers during the dot times? Any edits to the out-of-the-box schedules?

[screenshot: graph with no gaps after the connection is restored]

hashi825 (Author) commented

There were no scrape errors in the log. This Prometheus instance sits on the same server as Harvest, which collects metrics from 10 other clusters. I'll double-check both logs again, but for what it's worth, this cluster was unreachable for about 3 hours, so I'm not sure if the amount of time it was unreachable plays into reproducing this issue.

hashi825 (Author) commented Aug 1, 2021

@cgrinds

This happened again, this time because we were doing an ONTAP upgrade on this particular cluster. I haven't found the root cause, but essentially the metrics become available for about 5 minutes before disappearing; I tested this by repeatedly curling the metrics URL. Once they disappear, the only metrics returned are metadata_component_status and metadata_component_count, and metadata_component_count reports 0 counts for all ZapiPerf and Zapi metrics.

For example:

metadata_component_count{hostname="xxx",instance="xxx",job="harvest",name="Zapi",poller="xxx",reason="running",target="Aggregate",type="collector",version="21.05.4"} 0

cgrinds (Collaborator) commented Aug 2, 2021

Thanks for the update @hashi825. We haven't figured out how to reproduce this yet, but we'll try a longer period of being offline.

vgratian (Contributor) commented

Hi @hashi825, regarding:

if these collectors fail and then recover (in my case, we lost comms to one of the clusters), the exporter begins to only report data samples for Zapi metrics every 180s, when it should be returning cached instances between those intervals.

The Prometheus exporter caches metrics only for a limited amount of time; by default this is 180s. You can change this to a longer interval; see the cache_max_keep parameter in the docs.
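
Conceptually, the behaviour is that of a time-bounded cache. Here is a minimal sketch, assuming a simple freshness check; the names are hypothetical and this is not the exporter's actual implementation:

package main

import (
    "fmt"
    "time"
)

// cacheMaxKeep mirrors the idea behind cache_max_keep: metrics older than
// this are no longer served. (Hypothetical sketch, not the real exporter.)
const cacheMaxKeep = 180 * time.Second

type cachedMetrics struct {
    lines    []string
    storedAt time.Time // time of the last successful poll
}

// render serves the cached metrics only while they are fresh; once
// cacheMaxKeep has elapsed without a new poll, nothing is returned.
func (c *cachedMetrics) render(now time.Time) []string {
    if now.Sub(c.storedAt) > cacheMaxKeep {
        return nil
    }
    return c.lines
}

func main() {
    c := cachedMetrics{
        lines:    []string{"aggr_space_used_percent 42"},
        storedAt: time.Now(),
    }
    fmt.Println(len(c.render(time.Now())))                       // 1: still fresh
    fmt.Println(len(c.render(time.Now().Add(10 * time.Minute)))) // 0: expired
}

So with the default 180s cache and a 180s data schedule, any poll that slips past its interval leaves a window with no samples.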

@vgratian added the wontfix (This will not be worked on) label and removed the status/needs-triage label Aug 12, 2021
hashi825 (Author) commented

Hi @hashi825, regarding:

if these collectors fail and then recover (in my case, we lost comms to one of the clusters), the exporter begins to only report data samples for Zapi metrics every 180s, when it should be returning cached instances between those intervals.

The Prometheus exporter will cache metrics only for a limited amount of time, by default this is 180s. You can change this to a longer interval, check the cache_max_keep parameter in the doc.

That still doesn't explain why, after a long disconnect, once the collectors recover, metrics are only available every 15 mins (for 180s at a time) when the schedule is 60s.

vgratian (Contributor) commented

Thanks, I'll respond to that as well.

@vgratian reopened this Aug 12, 2021
@cgrinds assigned cgrinds and unassigned vgratian Aug 20, 2021
cgrinds added a commit that referenced this issue Aug 20, 2021: "add more logging when connection fails" (Fixes #356)
cgrinds added a commit that referenced this issue Aug 20, 2021: "add more logging when connection fails" (Fixes #356)
@cgrinds added the bug (Something isn't working) and status/testme labels and removed the wontfix label Aug 20, 2021
@rahulguptajss reopened this Aug 24, 2021
@rahulguptajss self-assigned this Aug 24, 2021
rahulguptajss (Contributor) commented

There is a problem with the schedule framework, which changes the task interval. Resetting the retry delay fixes the 17-minute gap issue. We still need to handle the change of task interval for any other edge cases. A sketch of the failure mode is below.
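
A minimal, hypothetical sketch of that failure mode (not Harvest's actual schedule code): if error handling backs off the retry delay but a successful poll never resets it, the effective interval stays inflated after the cluster comes back.

package main

import (
    "fmt"
    "time"
)

// task is a hypothetical scheduled poll with exponential backoff on errors;
// it is not Harvest's schedule framework.
type task struct {
    interval   time.Duration // configured poll interval, e.g. 60s or 180s
    retryDelay time.Duration // grows while the cluster is unreachable
}

// onError backs off the retry delay, doubling it on each failed poll.
func (t *task) onError() {
    if t.retryDelay == 0 {
        t.retryDelay = t.interval
    }
    t.retryDelay *= 2
}

// onSuccess must reset the retry delay; forgetting to do so is the kind of
// bug that leaves a task polling only every ~15-17 minutes after recovery.
func (t *task) onSuccess() {
    t.retryDelay = 0
}

// next returns the wait until the next poll: the backed-off delay while
// failing, otherwise the configured interval.
func (t *task) next() time.Duration {
    if t.retryDelay > 0 {
        return t.retryDelay
    }
    return t.interval
}

func main() {
    t := task{interval: 60 * time.Second}
    for i := 0; i < 4; i++ { // four failed polls while the cluster is down
        t.onError()
    }
    fmt.Println("after the outage, next poll in:", t.next()) // 16m0s if never reset
    t.onSuccess()
    fmt.Println("after resetting the delay:", t.next()) // back to 1m0s
}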
