Harvest 22.08 is making many queries to internal DNS #1353

Closed
faguayot opened this issue Oct 18, 2022 · 11 comments

@faguayot

Describe the bug
Harvest 22.08 seems to be making a huge number of queries to our internal DNS. Most of the IPs being looked up are those of the vserver "Cluster", i.e. internal cluster IPs. Every 2-3 minutes there are two peaks of requests, reaching roughly 1.5K-2.5K within seconds. We tried stopping our Harvest instances and the high demand stopped.

Environment
Provide accurate information about the environment to help us reproduce the issue.

  • Harvest version: [harvest version 22.08.0-1 (commit 93db10a) (build date 2022-08-19T09:09:07-0400) linux/amd64]
  • Command line arguments used: [e.g. bin/harvest start --config=foo.yml --collectors Zapi]
  • OS: [Red Hat Enterprise Linux release 8.4 (Ootpa)]
  • Install method: [rhel]
  • ONTAP Version: [9.7 and 9.10.1]

Expected behavior
Harvest should not trigger any DNS requests, at least not for the internal interfaces.

Actual behavior
The line in red shows the requests for the storage arrays.
[screenshot]

Same test but with the harvest instances stopped.
[screenshot]

Additional context
I would like to find out which collector or object is constantly making those queries, so that I can comment it out and avoid any problem for the DNS service. I don't understand why something is requesting lookups for IP addresses that are internal and exist on every cluster.

@cgrinds
Collaborator

cgrinds commented Oct 18, 2022

hi @faguayot let me make sure I understand the problem. You're saying that Harvest is causing too many DNS requests, in the range of 2-3K every two to three minutes? Do you know if those requests are causing a problem, or are you trying to understand why they are being made? Or is the concern simply that Harvest triggers that many requests?

Harvest uses ZAPI or REST protocols to gather metrics from ONTAP, typically by talking to the cluster management lif.

With the out-of-the-box templates, a single Harvest poller makes concurrent, per-object requests to the cluster for each object listed in the collector's default.yaml, on the following schedule (see the sketch after this list):

  • ZapiPerf metrics are collected roughly every 1m
  • Zapi metrics are collected roughly every 3m

In cases where there are many ONTAP objects, say 50 thousand qtrees, ONTAP won't return all of them in a single response; instead, Harvest requests them 500 at a time, which means 100 requests to gather all the qtrees. In other words, the number of requests Harvest sends is a function of the number of objects being monitored, since we request them in chunks.

Perhaps the spikes you're seeing are when the schedules for multiple objects overlap? From a DNS perspective, this shouldn't be a problem though. Are these concurrent requests causing errors?

In terms of DNS, Harvest isn't doing anything DNS related. Harvest only talks HTTPS to ONTAP. The OS will make DNS lookups when those HTTPS requests contain hostnames instead of IPs, but all of that happens further down the stack than Harvest.

Some Questions

  1. Does your harvest.yml file specify the cluster addr as a hostname or an IP address? Is it possible that you have listed a hostname and that's causing the OS to do DNS translations to IPs? If so, switching to IPs might reduce the number of DNS queries (see the sketch after this list).

  2. Can you share how you are collecting the DNS stats shown in your screenshots?

  3. Are those the number of requests from the poller side or ONTAP side? If poller side, how many pollers are running on the host?

  4. Was the number of requests different with release 22.05?

  5. Can you email a poller log file to [email protected] and we can dig out some of the object counts from there.
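
Regarding question 1, the relevant piece is the addr field of each poller in harvest.yml. A minimal sketch (the cluster names and addresses below are made up) showing an IP-based entry next to a hostname-based one:

Pollers:
  cluster-01:
    datacenter: dc-01
    addr: 10.0.0.10                # IP address: nothing for the OS to resolve
    collectors:
      - Zapi
      - ZapiPerf
  cluster-02:
    datacenter: dc-01
    addr: cluster-02.example.com   # hostname: the OS resolves this via DNS
    collectors:
      - Zapi
      - ZapiPerf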

@faguayot
Author

Hello @cgrinds,
Yes, you understood it correctly: Harvest isn't directly the owner of those DNS requests, but every cluster queries the DNS when Harvest talks to it. Sorry, I tried to explain the issue well but didn't manage to. I mentioned the 2-3 minute interval because we were trying to find the frequency at which the event occurs, since I know every collector has its own collection interval. So the objects causing this problem could be in the Zapi collector.

For the moment the DNS requests aren't impacting the DNS in terms of performance or availability, but that is something that could happen.

  1. No, that's not it; we don't use domains or hostnames, as I said before. We only use IPs in the harvest.yml configuration. The worst part is that most of the queries are for the cluster's internal IPs,
    e.g. 169.254.13.63

The IP addresses that the DNS requests have to resolve are the following (this is an example from a single cluster):

[screenshot]

As I said before, these are internal IPs which don't have any name resolution.

  2. Our networking team has the DNS monitored, just like we have the storage arrays monitored, so they use different tools; in the case of the screenshots, they are from Wireshark. They captured traffic for 10 minutes with Harvest collecting and then with Harvest stopped, and those are the results.

  3. The requests are made from the ONTAP side, to the internal IPs.

  4. We don't have that information; the problem was found when the networking team detected the increase in requests.

  5. I will send you a poller log file to that email.

@cgrinds
Collaborator

cgrinds commented Oct 19, 2022

Thanks for the details and log files @faguayot. We don't see a problem in Harvest that would cause higher than expected DNS queries. So far, it appears these requests are a consequence of Harvest sending REST & ZAPI requests to ONTAP. I'm going to see if we can get permission to wireshark one of our large clusters.

I pulled out some counts from your log file (see table below). These stand out because of their high instance and/or metric counts.

  1. Can you share your Zapi:NFSLock template? The number of metrics is quite high and I don't understand why.
  2. As a way to narrow in on the problem, can you disable these collectors and see if that reduces the number of DNS requests? (See the sketch after the table below.)
  • Rest:NFSClients
  • Rest:NetConnections
  • Zapi:NFSLock
  • ZapiPerf:Workload
  • ZapiPerf:WorkloadDetail
Name                     Instances  Metrics
Rest:NFSClients              3,194   22,351
Rest:NFSClients              3,203   22,414
Rest:NetConnections         25,225   71,616
Zapi:NFSLock                    12  439,665
Zapi:NFSLock                   473  439,383
ZapiPerf:Workload            3,827   99,502
ZapiPerf:Workload            3,846   99,996
ZapiPerf:WorkloadDetail     34,551  829,096
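
If it helps, one way to disable those objects is to comment out their entries in the relevant collector default.yaml files. A sketch, where the file names and object keys are assumptions based on a standard install layout rather than copied from your system:

# conf/rest/default.yaml (sketch) -- comment out the objects to skip
objects:
#  NFSClients:      nfs_clients.yaml
#  NetConnections:  netconnections.yaml
  Volume:            volume.yaml

# conf/zapi/default.yaml (sketch)
objects:
#  Lock:             lock.yaml
  Volume:            volume.yaml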

@faguayot
Author

Hello @cgrinds,

Thanks for your checks and the detailed information shared.

  1. Here is the Zapi:NFSLock template that we are using; if I'm not wrong, it is the same code that appears in some of your tickets.
#### NFS Locks
name:             Lock
query:            lock-get-iter
object:           lock

schedule:
  - instance: 180s
  - data: 180s

counters:
  lock-info:
    - ^^lockid                => lock_id
    - ^volume                 => volume
    - ^vserver                => svm
    - ^client-address         => client_address
    - ^is-constituent         => is_constituent
    - ^is-sharelock-soft      => is_share_lock_soft
    - ^lif                    => lif
    - ^lock-state             => lock_state
    - ^lock-type              => lock_type
    - ^node                   => node
    - ^path                   => path
    - ^protocol               => protocol
    - ^sharelock-mode         => share_lock_mode


collect_only_labels: true

export_options:
  instance_keys:
    - lock_id
  instance_labels:
    - volume
    - svm
    - client_address
    - is_constituent
    - is_share_lock_soft
    - lif
    - lock_state
    - lock_type
    - path
    - protocol
    - share_lock_mode

  2. This morning we ran some tests in which we disabled the following objects:
  • Rest:NetConnections
  • Rest:NFSClients
  • Zapi:NFSLock

The result was that the DNS queries disappeared. So I think you have narrowed down where the problem we are having comes from.

Regarding the Workload objects, we didn't disable them because we have been using them for some time and we don't believe they were the problem.

To give you more information, the log shared with you is from a storage array that serves NFS.

@cgrinds
Collaborator

cgrinds commented Oct 20, 2022

Thanks for the Zapi:NFSLock template; yep, that's the one we posted. We found a logging bug that causes the number of instances in your log files to be wrong (#1366). No other problem, just that the logged number is wrong (fixed now).

We're confident you have ~39,943 locks, which means it takes Harvest around 80 ZAPI requests (500 locks at a time) to return them all. And while it only takes about 5s to do that, it would not be surprising if those 80 ZAPI requests became multiple DNS requests when ONTAP gathers the lock information.

Now that you've narrowed it down to Rest:NetConnections, Rest:NFSClients, and Zapi:NFSLock, would it be possible to enable each individually until the DNS requests return again?

It could be that when ONTAP queries the active network connections, it needs to do DNS queries to find/validate the connections, in particular when it tries to return the remote hosts, the connected clients, and the client IPs connected to each interface.

Understood on the Workloads, and yes, those have been there since day one and have not changed much, so it's unlikely they're related.

@faguayot
Author

Good morning @cgrinds,

In a first step we tested Zapi:NFSLock on its own, and the result was no queries to the DNS, so it seems this object wasn't the problem. Today we want to continue testing the others at different points in time. When we have the results, I will share them with you.

My suspicion is that the Rest:NFSClients data collection could be the problem; as you said, when it checks the active network connections, ONTAP performs the name resolution, but I can't understand why ONTAP does that for the internal IP addresses, which are only used by the cluster.

@faguayot
Author

@cgrinds Today we ran the tests with the other two objects and discovered that the object generating the many DNS queries was Rest:NetConnections.

@cgrinds
Collaborator

cgrinds commented Oct 24, 2022

Thanks for the confirmation @faguayot!

That means you will see the same "DNS storm" from the ONTAP CLI, since Harvest's REST template for NetConnections calls api/private/cli/network/connections/active, which is the same as network connections active show in the ONTAP CLI.

I don't think there is anything Harvest can do about this. It's probably worth opening a case with ONTAP if you want clarification or would like them to reduce the number of DNS requests.

@faguayot
Author

Sorry for the delay in answering.

I didn't know which API or ZAPI query that object was making, so thanks for sharing that helpful information again, Chris.

What kind of request/query does NFSClients make? I was almost sure that this object was going to be the culprit of the "DNS storm".

Thanks for your time during this issue, which wasn't directly a Harvest problem even though Harvest was indirectly involved.

@cgrinds
Collaborator

cgrinds commented Oct 31, 2022

The NFSClients template calls the ONTAP REST api/protocols/nfs/connected-clients endpoint, which retrieves the NFS configuration of SVMs as described in the ONTAP REST documentation for 9.7 and 9.10.1.
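
For reference, the query line of the NFSClients REST template points at that endpoint. A minimal sketch of the template shape, where the counters listed are illustrative assumptions rather than the shipped list:

name:    NFSClients
query:   api/protocols/nfs/connected-clients
object:  nfs_clients

counters:
  - ^^svm.name     => svm          # assumed fields, for illustration only
  - ^client_ip     => client_ip
  - ^volume.name   => volume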

@cgrinds
Collaborator

cgrinds commented Mar 22, 2023

The remote_host counter in the NetConnections template causes ONTAP to attempt to resolve the IP address of every active connection on the cluster. This can cause a DNS storm that is mostly harmless, but it may produce noisy logging, especially when DNS is misconfigured or timeouts happen.

Since this template is not used by any dashboards, we're going to:

  1. Disable the template
  2. Disable the remote_host counter to reduce the noise when the template is enabled (see the sketch below)
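
In template terms, the change amounts to something like the following in the NetConnections REST template; a sketch where only the query path comes from this thread and the other counter names are assumptions:

name:    NetConnections
query:   api/private/cli/network/connections/active
object:  net_connection

counters:
  - ^^cid           => cid           # assumed key field
  - ^service        => service
  - ^local_address  => local_address
#  - ^remote_host   => remote_host   # disabled: causes ONTAP to reverse-resolve every active connection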

Thanks to Alessandro for reporting and LeonardoA for providing the details on remote_host.
