
Pollers abruptly exiting #300

Closed
demalik opened this issue Jul 13, 2021 · 5 comments

demalik commented Jul 13, 2021

A customer of mine is complaining about pollers abruptly exiting and is looking for ways to identify and fix the issue:

        We are still seeing issues of pollers abruptly exiting.

[root@dc6-netapp-harvest-01 harvest]# harvest status | grep unknow
dc6-engineering-scratch dc6-cdot01 3370 unknown: os: process already finished
dc6-engineering-scratch dc6-cdot03 3382 unknown: os: process already finished
dc6-engineering-ec dc6-cdot04 3394 unknown: os: process already finished
dc6-engineering-chamber dc6-cdot13 3566 unknown: os: process already finished
dc6-engineering-scratch dc6-cdot14 3583 unknown: os: process already finished
dc6-engineering-flexclone dc6-cdot16 3616 unknown: os: process already finished
dc6-corporate dc6-cdot18 3636 unknown: os: process already finished
dc6-engineering-ec dc6-cdot20 3679 unknown: os: process already finished
dc6-engineering-scratch dc6-cdot21 3701 unknown: os: process already finished
dc6-engineering-scratch dc6-cdot24 3719 unknown: os: process already finished
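
The "unknown: os: process already finished" status is the error Go's os package returns when a process that has already exited is signaled, so the manager is probing PIDs it has on record that are no longer running. The same liveness check can be done by hand with kill -0 (a generic shell check, not a Harvest command):

# kill -0 sends no signal; it only tests whether the PID still exists
kill -0 3370 && echo "alive" || echo "already exited"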

[root@dc6-netapp-harvest-01 harvest]# ps -aef | grep 3370
root 15648 15586 0 22:03 pts/0 00:00:00 grep --color=auto 3370

The log file gives no clue as to why the pollers are exiting. You can see the log stopped recording activity after the 27th of June:

[root@dc6-netapp-harvest-01 harvest]# tail -f /var/log/harvest/poller_dc6-cdot01.log
2021/06/27 22:21:19 (warning) (collector) (ZapiPerf:ExtCacheObj): target unreachable, entering standby mode (retry to connect in 16 s)
2021/06/27 22:21:19 (warning) (collector) (ZapiPerf:NFSv3): target unreachable, entering standby mode (retry to connect in 16 s)
2021/06/27 22:21:22 (error ) (collector) (ZapiPerf:FcpPort): instance request: connection error => Post https://dc6-cdot01.nvidia.com:443/servlets/netapp.servlets.admin.XMLrequest_filer: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021/06/27 22:21:22 (info ) (collector) (ZapiPerf:FcpPort): no [FcpPort] instances on system, entering standby mode
2021/06/27 22:21:24 (warning) (collector) (ZapiPerf:CIFSNode): target unreachable, entering standby mode (retry to connect in 16 s)
2021/06/27 22:21:24 (warning) (collector) (ZapiPerf:HeadroomAggr): target unreachable, entering standby mode (retry to connect in 16 s)
2021/06/27 22:21:29 (warning) (collector) (ZapiPerf:CopyManager): target unreachable, entering standby mode (retry to connect in 16 s)
2021/06/27 22:22:12 (info ) (poller) (dc6-cdot01): updated status, up collectors: 0 (of 41), up exporters: 1 (of 1)
2021/06/27 22:23:10 (info ) (collector) (ZapiPerf:Disk): recovered from standby mode, back to normal schedule
2021/06/27 22:23:10 (warning) (collector) (ZapiPerf:Disk): lagging behind schedule 15.17607ms
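
Since the poller's own log ends abruptly with no fatal error, one generic next step (not something shown in this thread) is to check the kernel log for an OOM kill or a crash around the time the log stops, e.g. on a systemd host:

# look for the OOM killer or a segfault around the time the poller died
dmesg -T | grep -iE 'oom|killed process|segfault'
journalctl -k --since "2021-06-27" | grep -i oom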

We can't even stop and start the dead pollers.

[root@dc6-netapp-harvest-01 harvest]# harvest stop dc6-cdot01 dc6-cdot03 dc6-cdot04
Datacenter Poller PID PromPort Status
++++++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
os: process already finished
dc6-engineering-scratch dc6-cdot01 3370 unknown: os: process already finished
os: process already finished
dc6-engineering-ec dc6-cdot04 3394 unknown: os: process already finished
os: process already finished
dc6-engineering-scratch dc6-cdot03 3382 unknown: os: process already finished
++++++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++

[root@dc6-netapp-harvest-01 harvest]# harvest start dc6-cdot01 dc6-cdot03 dc6-cdot04
Datacenter Poller PID PromPort Status
++++++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
can't verify status of [dc6-cdot01]: kill poller and try again
can't verify status of [dc6-cdot04]: kill poller and try again
can't verify status of [dc6-cdot03]: kill poller and try again
++++++++++++++++++++++++ +++++++++++++++++++++ ++++++++++ +++++++++++++++ ++++++++++++++++++++
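
The error message itself points at a workaround: assuming the manager supports the kill subcommand its output refers to, force-killing the stale pollers should clear their recorded state so they can be started again:

harvest kill dc6-cdot01 dc6-cdot03 dc6-cdot04
harvest start dc6-cdot01 dc6-cdot03 dc6-cdot04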

How can we identify the issue here and fix it?

cgrinds (Collaborator) commented Jul 13, 2021

What version of Harvest are you using? Judging by the logs, this is an older version. There have been improvements made over the past month around pollers panicking. Can you grab the latest version (21.05.3) and try with it?
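
A quick way to confirm the installed version, assuming the manager exposes the version subcommand present in recent builds:

harvest version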

demalik (Author) commented Jul 13, 2021

Harvest 2.0 GA version

demalik (Author) commented Jul 14, 2021

The customer has upgraded to the latest version and will share feedback in a few days.

demalik (Author) commented Jul 16, 2021

After upgrading Harvest to the latest version, Prometheus is failing to scrape the metrics. Below is a screenshot.

This is happening across all 5 Harvest pollers post-upgrade, which makes me believe there is some issue with the Harvest pollers after the upgrade. The customer has opened an issue on GitHub for this as well: #308.

[screenshot: prometheus-failure]
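
A quick way to narrow down whether the failure is on the Prometheus side or the Harvest side (a generic check, not from this thread) is to query a poller's exporter endpoint directly, using the PromPort reported by harvest status; the port below is a placeholder:

# substitute the PromPort shown by `harvest status` for this poller
curl -s http://localhost:12990/metrics | head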

cgrinds (Collaborator) commented Jul 17, 2021

Hi @demalik, I asked some questions in #308 a few days ago. Let's continue the conversation there.

cgrinds closed this as completed Jul 17, 2021