Behaviour with zapi collector data schedule - After losing cluster comms #356
Comments
hi @hashi825 Let me describe what I did and maybe you can point out if I understand your ask. I simulated an unreachable cluster by modifying […]; my hack was to add a global counter. In a separate terminal I ran while true; printf "%s %s\n" (date) (curl -s 'http://localhost:13001/metrics' | wc -l); sleep 60s; end (a fish loop), which prints a timestamp and the number of metric lines every 60s (with the out-of-the-box defaults).
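A bash equivalent of that fish loop, as a sketch (port 13001 and the 60s interval are taken from the command above; adjust them to your exporter):

```bash
# Poll the Harvest Prometheus exporter once a minute and print a timestamp
# plus the number of metric lines it currently serves.
while true; do
  printf "%s %s\n" "$(date)" "$(curl -s 'http://localhost:13001/metrics' | wc -l)"
  sleep 60
done
```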
That means I get the expected 21,564 metrics for the first three polls, then a period of 30+ metrics while the connection is down, followed by a return to normal when the connection problem clears. Are you suggesting that Harvest should return 21,500 instead of 38 in my example above, or that after a connection error you only see 38 even after the connection is restored?
@hashi825 thanks for the screenshot. By chance, have you checked Prometheus's logs? Those gaps are certainly unexpected. Here's a similar graph from when I lost connection: no gaps after the connection is restored. And you said there was nothing interesting in the harvest logs for these pollers during the gap times? Any edits to the out-of-the-box schedules?
There were no scrape errors in the log. This Prometheus instance sits on the same server as Harvest, which collects metrics from 10 other clusters. I'll double check both logs again, but for what it's worth this cluster was unreachable for about 3 hours, so I'm not sure if the amount of time it was unreachable plays into reproducing this issue.
Had this happen again, this time due to us doing an ONTAP upgrade on this particular cluster. I haven't found the root cause, but the metrics become available for about 5 minutes before disappearing; I tested this by repeatedly curling the metrics URL. Once they disappear, the only metrics returned are metadata_component_status and metadata_component_count, and metadata_component_count reports 0 counts for all ZapiPerf and Zapi metrics.
thanks for the update @hashi825 - we haven't figured out how to reproduce this yet, but we will try a longer period of being offline.
Hi @hashi825, regarding:
The Prometheus exporter will cache metrics only for a limited amount of time; by default this is […]
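If you want to lengthen that window, the cache can be tuned on the Prometheus exporter in harvest.yml. A minimal sketch, assuming the option is named cache_max_keep (verify the exact name and default against your Harvest version's docs):

```yaml
# harvest.yml excerpt (sketch): Prometheus exporter with an explicit cache window.
# The cache_max_keep option name and the 300s value are assumptions here;
# confirm them against your Harvest release before using.
Exporters:
  prometheus:
    exporter: Prometheus
    port: 13001
    cache_max_keep: 300s   # how long metrics are served after the last successful poll
```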
That still doesn't explain why, after a long disconnect and once the collectors recover, metrics are only available every 15 mins for 180s when the schedule is 60s.
thanks, I'll respond to that as well
- add more logging when connection fails Fixes #356
There is a problem with the schedule framework which changes the task interval. Resetting retrydelay fixes the 17 min gap issue. We still need to handle the change of task interval for any other edge cases.
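A minimal sketch of the failure mode described there, assuming a backoff-style retry delay in the scheduler (the Task type and method names below are illustrative, not Harvest's actual schedule framework):

```go
package schedule

import "time"

// Task is a simplified stand-in for a scheduled collector task.
// Field names are illustrative, not Harvest's real schedule code.
type Task struct {
	interval   time.Duration // configured schedule, e.g. 60s or 180s
	retryDelay time.Duration // backoff applied while the cluster is unreachable
	nextRun    time.Time
}

// OnPollFailed grows the retry delay so an unreachable cluster is not hammered.
func (t *Task) OnPollFailed() {
	if t.retryDelay == 0 {
		t.retryDelay = t.interval
	} else {
		t.retryDelay *= 2
	}
	t.nextRun = time.Now().Add(t.retryDelay)
}

// OnPollSucceeded must reset the retry delay; if it is left at its backed-off
// value, the task keeps running on the stretched interval (the ~15-17 min
// gaps reported above) even though the cluster is reachable again.
func (t *Task) OnPollSucceeded() {
	t.retryDelay = 0
	t.nextRun = time.Now().Add(t.interval)
}
```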
Describe the bug
The default data schedule for the Zapi collector is 180s, defined in conf/zapi/default.yaml. By default, when using the Prometheus exporter, the data is always available after the collector starts, regardless of the schedule. If these collectors fail and then recover (in my case, we lost comms to one of the clusters), the exporter begins to report data samples for Zapi metrics only every 180s, when it should be returning cached instances between those intervals.
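For reference, the schedule block in conf/zapi/default.yaml looks roughly like this (the 600s instance interval is my recollection of the default; verify against your installed file):

```yaml
# conf/zapi/default.yaml (approximate excerpt): the data task is the 180s
# poll referenced above; the instance interval shown is an assumption.
schedule:
  - instance: 600s
  - data: 180s
```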
Environment
Provide accurate information about the environment to help us reproduce the issue.
bin/harvest start --config=foo.yml --collectors Zapi
To Reproduce
Lose connection to a cluster (possibly for longer than the schedule interval? At least that's what happened for us).
Expected behavior
Zapi collectors should recover and cache data appropriately for the Prometheus exporter.
Actual behavior
Zapi collectors do not cache data, and the exporter only returns samples according to the schedule interval.
Possible solution, workaround, fix
Restarting Harvest restores correct behaviour.
Additional context
There's nothing in the logs to indicate any issues. Funnily enough, when Harvest loses connection to a cluster, the only logging shown is context deadline exceeded for ZapiPerf collectors and zero logging for Zapi collectors.

EDIT
This seems to affect ZapiPerf as well. I'm actually not sure about the correlation between the data schedules; all I can say is that after collector recovery they seem to only produce metrics every 15 mins.