Pollers abruptly exiting #300
What version of Harvest are you using? Judging by the logs, this is an older version. There have been improvements over the past month around pollers panicking. Can you grab the latest version (21.05.3) and try with it?
Harvest 2.0 GA version.
The customer has upgraded to the latest version and will share feedback in a few days.
After upgrading Harvest to the latest version, Prometheus is failing to scrape the metrics (see the screenshot below). This is happening across all 5 Harvest pollers post-upgrade, which makes me believe there is some issue with the pollers after the upgrade. The customer has opened a GitHub issue as well: #308.
A customer of mine is complaining about pollers abruptly exiting and is looking for ways to identify and fix the issue.
[root@dc6-netapp-harvest-01 harvest]# harvest status | grep unknow
dc6-engineering-scratch dc6-cdot01 3370 unknown: os: process already finished
dc6-engineering-scratch dc6-cdot03 3382 unknown: os: process already finished
dc6-engineering-ec dc6-cdot04 3394 unknown: os: process already finished
dc6-engineering-chamber dc6-cdot13 3566 unknown: os: process already finished
dc6-engineering-scratch dc6-cdot14 3583 unknown: os: process already finished
dc6-engineering-flexclone dc6-cdot16 3616 unknown: os: process already finished
dc6-corporate dc6-cdot18 3636 unknown: os: process already finished
dc6-engineering-ec dc6-cdot20 3679 unknown: os: process already finished
dc6-engineering-scratch dc6-cdot21 3701 unknown: os: process already finished
dc6-engineering-scratch dc6-cdot24 3719 unknown: os: process already finished
[root@dc6-netapp-harvest-01 harvest]# ps -aef | grep 3370
root 15648 15586 0 22:03 pts/0 00:00:00 grep --color=auto 3370
The log file gives no clue as to why the pollers are exiting. You can see that the log stopped recording activity after the 27th of June.
[root@dc6-netapp-harvest-01 harvest]# tail -f /var/log/harvest/poller_dc6-cdot01.log
2021/06/27 22:21:19 (warning) (collector) (ZapiPerf:ExtCacheObj): target unreachable, entering standby mode (retry to connect in 16 s)
2021/06/27 22:21:19 (warning) (collector) (ZapiPerf:NFSv3): target unreachable, entering standby mode (retry to connect in 16 s)
2021/06/27 22:21:22 (error ) (collector) (ZapiPerf:FcpPort): instance request: connection error => Post https://dc6-cdot01.nvidia.com:443/servlets/netapp.servlets.admin.XMLrequest_filer: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021/06/27 22:21:22 (info ) (collector) (ZapiPerf:FcpPort): no [FcpPort] instances on system, entering standby mode
2021/06/27 22:21:24 (warning) (collector) (ZapiPerf:CIFSNode): target unreachable, entering standby mode (retry to connect in 16 s)
2021/06/27 22:21:24 (warning) (collector) (ZapiPerf:HeadroomAggr): target unreachable, entering standby mode (retry to connect in 16 s)
2021/06/27 22:21:29 (warning) (collector) (ZapiPerf:CopyManager): target unreachable, entering standby mode (retry to connect in 16 s)
2021/06/27 22:22:12 (info ) (poller) (dc6-cdot01): updated status, up collectors: 0 (of 41), up exporters: 1 (of 1)
2021/06/27 22:23:10 (info ) (collector) (ZapiPerf:Disk): recovered from standby mode, back to normal schedule
2021/06/27 22:23:10 (warning) (collector) (ZapiPerf:Disk): lagging behind schedule 15.17607ms
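One way to triage an abrupt exit like this is to scan the poller logs for Go panic traces: a panicking goroutine kills the process without writing further log lines, which would match the sudden stop seen above. A minimal sketch (not part of the Harvest CLI), assuming the log directory and `poller_*.log` naming shown in the `tail` command:

```shell
#!/bin/sh
# find_panics: list poller logs that contain Go crash markers.
# $1 is the log directory; the patterns cover the usual Go panic lines.
find_panics() {
  grep -l -E 'panic:|runtime error|goroutine [0-9]+ \[' "$1"/poller_*.log 2>/dev/null
}

# Path taken from the tail command in this thread:
find_panics /var/log/harvest || echo "no panic traces found"
```

If nothing matches, the process was likely killed from outside (for example by the kernel OOM killer), so checking `dmesg | grep -i 'killed process'` is a reasonable next step.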
I can't even stop and restart the dead pollers.
[root@dc6-netapp-harvest-01 harvest]# harvest stop dc6-cdot01 dc6-cdot03 dc6-cdot04
Datacenter Poller PID PromPort Status
++++++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
os: process already finished
dc6-engineering-scratch dc6-cdot01 3370 unknown: os: process already finished
os: process already finished
dc6-engineering-ec dc6-cdot04 3394 unknown: os: process already finished
os: process already finished
dc6-engineering-scratch dc6-cdot03 3382 unknown: os: process already finished
++++++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
[root@dc6-netapp-harvest-01 harvest]# harvest start dc6-cdot01 dc6-cdot03 dc6-cdot04
Datacenter Poller PID PromPort Status
++++++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
can't verify status of [dc6-cdot01]: kill poller and try again
can't verify status of [dc6-cdot04]: kill poller and try again
can't verify status of [dc6-cdot03]: kill poller and try again
++++++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
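Since `harvest stop`/`start` are tripping over stale process state ("process already finished"), a POSIX-level check can confirm whether a PID reported by `harvest status` is truly gone before you clear anything. A minimal sketch (not part of the Harvest CLI), using `kill -0`, which sends no signal and only tests whether the process exists:

```shell
#!/bin/sh
# poller_alive: return 0 if the PID exists, nonzero otherwise.
# kill -0 delivers no signal; it only checks for the process.
poller_alive() {
  kill -0 "$1" 2>/dev/null
}

# PID 3370 comes from the `harvest status` output above.
if poller_alive 3370; then
  echo "poller 3370 is still running"
else
  echo "poller 3370 is gone; its recorded state is stale"
fi
```

Once a PID is confirmed dead, removing the corresponding stale pidfile (its location depends on your Harvest install) should let `harvest start` launch a fresh poller instead of failing the verify step.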
How can we identify and fix the issue here?