
Pollers abruptly exiting #300

Closed
demalik opened this issue Jul 13, 2021 · 5 comments

demalik commented Jul 13, 2021

A customer of mine is complaining about pollers abruptly exiting and is looking for ways to identify and fix the issue:

        We are still seeing issues of pollers abruptly exiting.

[root@dc6-netapp-harvest-01 harvest]# harvest status | grep unknow
dc6-engineering-scratch dc6-cdot01 3370 unknown: os: process already finished
dc6-engineering-scratch dc6-cdot03 3382 unknown: os: process already finished
dc6-engineering-ec dc6-cdot04 3394 unknown: os: process already finished
dc6-engineering-chamber dc6-cdot13 3566 unknown: os: process already finished
dc6-engineering-scratch dc6-cdot14 3583 unknown: os: process already finished
dc6-engineering-flexclone dc6-cdot16 3616 unknown: os: process already finished
dc6-corporate dc6-cdot18 3636 unknown: os: process already finished
dc6-engineering-ec dc6-cdot20 3679 unknown: os: process already finished
dc6-engineering-scratch dc6-cdot21 3701 unknown: os: process already finished
dc6-engineering-scratch dc6-cdot24 3719 unknown: os: process already finished
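
The "unknown: os: process already finished" status is the error Go's os package returns when a process that has already exited is signaled, so the manager is probing PIDs it has on record that are no longer running. The same liveness check can be done by hand with kill -0 (a generic shell check, not a Harvest command):

# kill -0 sends no signal; it only tests whether the PID still exists
kill -0 3370 && echo "alive" || echo "already exited"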

[root@dc6-netapp-harvest-01 harvest]# ps -aef | grep 3370
root 15648 15586 0 22:03 pts/0 00:00:00 grep --color=auto 3370

The log file gives no clue as to why the pollers are exiting. You can see the log stopped recording activity after the 27th of June:

[root@dc6-netapp-harvest-01 harvest]# tail -f /var/log/harvest/poller_dc6-cdot01.log
2021/06/27 22:21:19 (warning) (collector) (ZapiPerf:ExtCacheObj): target unreachable, entering standby mode (retry to connect in 16 s)
2021/06/27 22:21:19 (warning) (collector) (ZapiPerf:NFSv3): target unreachable, entering standby mode (retry to connect in 16 s)
2021/06/27 22:21:22 (error ) (collector) (ZapiPerf:FcpPort): instance request: connection error => Post https://dc6-cdot01.nvidia.com:443/servlets/netapp.servlets.admin.XMLrequest_filer: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021/06/27 22:21:22 (info ) (collector) (ZapiPerf:FcpPort): no [FcpPort] instances on system, entering standby mode
2021/06/27 22:21:24 (warning) (collector) (ZapiPerf:CIFSNode): target unreachable, entering standby mode (retry to connect in 16 s)
2021/06/27 22:21:24 (warning) (collector) (ZapiPerf:HeadroomAggr): target unreachable, entering standby mode (retry to connect in 16 s)
2021/06/27 22:21:29 (warning) (collector) (ZapiPerf:CopyManager): target unreachable, entering standby mode (retry to connect in 16 s)
2021/06/27 22:22:12 (info ) (poller) (dc6-cdot01): updated status, up collectors: 0 (of 41), up exporters: 1 (of 1)
2021/06/27 22:23:10 (info ) (collector) (ZapiPerf:Disk): recovered from standby mode, back to normal schedule
2021/06/27 22:23:10 (warning) (collector) (ZapiPerf:Disk): lagging behind schedule 15.17607ms
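
Since the poller's own log ends abruptly with no fatal error, one generic next step (not something shown in this thread) is to check the kernel log for an OOM kill or a crash around the time the log stops, e.g. on a systemd host:

# look for the OOM killer or a segfault around the time the poller died
dmesg -T | grep -iE 'oom|killed process|segfault'
journalctl -k --since "2021-06-27" | grep -i oom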

We can't even stop and start the dead pollers.

[root@dc6-netapp-harvest-01 harvest]# harvest stop dc6-cdot01 dc6-cdot03 dc6-cdot04
Datacenter Poller PID PromPort Status
++++++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
os: process already finished
dc6-engineering-scratch dc6-cdot01 3370 unknown: os: process already finished
os: process already finished
dc6-engineering-ec dc6-cdot04 3394 unknown: os: process already finished
os: process already finished
dc6-engineering-scratch dc6-cdot03 3382 unknown: os: process already finished
++++++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++

[root@dc6-netapp-harvest-01 harvest]# harvest start dc6-cdot01 dc6-cdot03 dc6-cdot04
Datacenter Poller PID PromPort Status
++++++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
can't verify status of [dc6-cdot01]: kill poller and try again
can't verify status of [dc6-cdot04]: kill poller and try again
can't verify status of [dc6-cdot03]: kill poller and try again
++++++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
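
The error message itself points at a workaround: assuming the manager supports the kill subcommand its output refers to, force-killing the stale pollers should clear their recorded state so they can be started again:

harvest kill dc6-cdot01 dc6-cdot03 dc6-cdot04
harvest start dc6-cdot01 dc6-cdot03 dc6-cdot04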

How can we identify the issue here and fix it?

cgrinds (Collaborator) commented Jul 13, 2021

What version of Harvest are you using? Judging by the logs, this is an older version. There have been improvements made over the past month around pollers panicking. Can you grab the latest version (21.05.3) and try with it?
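
A quick way to confirm the installed version, assuming the manager exposes the version subcommand present in recent builds:

harvest version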

demalik (Author) commented Jul 13, 2021

Harvest 2.0 GA version

demalik (Author) commented Jul 14, 2021

The customer has upgraded to the latest version and will share feedback in a few days.

demalik (Author) commented Jul 16, 2021

After upgrading Harvest to the latest version, Prometheus is failing to scrape the metrics. Below is a screenshot.

This is happening across all 5 Harvest pollers post-upgrade, which makes me believe there is some issue with the Harvest pollers after the upgrade. The customer has opened an issue on GitHub for this as well: #308.

[screenshot: prometheus-failure]
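
A quick way to narrow down whether the failure is on the Prometheus side or the Harvest side (a generic check, not from this thread) is to query a poller's exporter endpoint directly, using the PromPort reported by harvest status; the port below is a placeholder:

# substitute the PromPort shown by `harvest status` for this poller
curl -s http://localhost:12990/metrics | head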

cgrinds (Collaborator) commented Jul 17, 2021

Hi @demalik, I asked some questions in #308 a few days ago. Let's continue the conversation there.

cgrinds closed this as completed Jul 17, 2021