Harvest 22.08 is making many queries to internal DNS #1353

Closed
faguayot opened this issue Oct 18, 2022 · 11 comments

@faguayot

Describe the bug
Harvest 22.08 seems to be making a huge number of queries to our internal DNS. Most of the IPs being looked up are those of the vserver "Cluster", i.e. internal cluster IPs. Every 2-3 minutes there are two peaks of requests, reaching roughly 1.5K-2.5K within seconds. We tried stopping our Harvest instances and the high demand stopped.

Environment
Provide accurate information about the environment to help us reproduce the issue.

  • Harvest version: [harvest version 22.08.0-1 (commit 93db10a) (build date 2022-08-19T09:09:07-0400) linux/amd64]
  • Command line arguments used: [e.g. bin/harvest start --config=foo.yml --collectors Zapi]
  • OS: [Red Hat Enterprise Linux release 8.4 (Ootpa)]
  • Install method: [rhel]
  • ONTAP Version: [9.7 and 9.10.1]

Expected behavior
Harvest should not trigger any DNS requests, at least not for the internal interfaces.

Actual behavior
The line in red shows the requests for the storage arrays.
[screenshot]

Same test but with the harvest instances stopped.
[screenshot]

Additional context
I would like to find out which collector or object is constantly making those queries, so that I can comment it out and avoid any problem for the DNS service. I don't understand why something is requesting lookups for IP addresses that are internal and exist on every cluster.

@cgrinds
Collaborator

cgrinds commented Oct 18, 2022

hi @faguayot let me make sure I understand the problem. You're saying that Harvest is causing too many DNS requests, in the range of 2-3K every two to three minutes? Do you know if those requests are causing a problem, or are you trying to understand why they are being made? Or is the concern simply that Harvest triggers that many requests?

Harvest uses ZAPI or REST protocols to gather metrics from ONTAP, typically by talking to the cluster management lif.

With the out-of-the-box templates, a single Harvest poller makes concurrent, per-object requests to the cluster for each object listed in the collector's default.yaml, on the following schedule (see the sketch after this list):

  • ZapiPerf metrics are collected roughly every 1m
  • Zapi metrics are collected roughly every 3m

In cases where there are many ONTAP objects, say 50 thousand qtrees, ONTAP won't return all of them in a single response; instead, Harvest requests them 500 at a time, which means 100 requests to gather all the qtrees. In other words, the number of requests Harvest sends is a function of the number of objects being monitored, since we request them in chunks.

Perhaps the spikes you're seeing are when the schedules for multiple objects overlap? From a DNS perspective, this shouldn't be a problem though. Are these concurrent requests causing errors?

In terms of DNS, Harvest isn't doing anything DNS related. Harvest only talks HTTPS to ONTAP. The OS will make DNS lookups when those HTTPS requests contain hostnames instead of IPs, but all of that happens further down the stack than Harvest.

Some Questions

  1. Does your harvest.yml file specify the cluster addr as a hostname or an IP address? Is it possible that you have listed a hostname and that's causing the OS to do DNS translations to IPs? If so, switching to IPs might reduce the number of DNS queries (see the sketch after this list).

  2. Can you share how you are collecting the DNS stats shown in your screenshots?

  3. Are those the number of requests from the poller side or ONTAP side? If poller side, how many pollers are running on the host?

  4. Was the number of requests different with release 22.05?

  5. Can you email a poller log file to [email protected] and we can dig out some of the object counts from there.
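
Regarding question 1, the relevant piece is the addr field of each poller in harvest.yml. A minimal sketch (the cluster names and addresses below are made up) showing an IP-based entry next to a hostname-based one:

Pollers:
  cluster-01:
    datacenter: dc-01
    addr: 10.0.0.10                # IP address: nothing for the OS to resolve
    collectors:
      - Zapi
      - ZapiPerf
  cluster-02:
    datacenter: dc-01
    addr: cluster-02.example.com   # hostname: the OS resolves this via DNS
    collectors:
      - Zapi
      - ZapiPerf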

@faguayot
Author

Hello @cgrinds,
Yes, you understood it correctly: Harvest isn't directly the owner of those DNS requests, but every cluster queries the DNS when Harvest talks to it. Sorry, I tried to explain the issue well but didn't manage to. I mentioned the 2-3 minute interval because we were trying to find the frequency at which the event occurs, since I know every collector has its own collection interval. So the objects causing this problem could be in the Zapi collector.

For the moment the DNS requests aren't impacting the DNS in terms of performance or availability, but that is something that could happen.

  1. No, that's not it; we don't use domains or hostnames, as I said before. We only use IPs in the harvest.yml configuration. The worst part is that most of the queries are for the cluster's internal IPs,
    e.g. 169.254.13.63

The IP addresses that the DNS requests have to resolve are the following (this is an example from a single cluster):

[screenshot]

As I said before, these are internal IPs which don't have any name resolution.

  2. Our networking team has the DNS monitored, just like we have the storage arrays monitored, so they use different tools; in the case of the screenshots, they are from Wireshark. They captured traffic for 10 minutes with Harvest collecting and then with Harvest stopped, and those are the results.

  3. The requests are made from the ONTAP side, to the internal IPs.

  4. We don't have that information; the problem was found when the networking team detected the increase in requests.

  5. I will send you a poller log file to that email.

@cgrinds
Collaborator

cgrinds commented Oct 19, 2022

Thanks for the details and log files @faguayot. We don't see a problem in Harvest that would cause higher than expected DNS queries. So far, it appears these requests are a consequence of Harvest sending REST & ZAPI requests to ONTAP. I'm going to see if we can get permission to wireshark one of our large clusters.

I pulled out some counts from your log file (see table below). These stand out because of their high instance and/or metric counts.

  1. Can you share your Zapi:NFSLock template? The number of metrics is quite high and I don't understand why.
  2. As a way to narrow in on the problem, can you disable these collectors and see if that reduces the number of DNS requests? (See the sketch after the table below.)
  • Rest:NFSClients
  • Rest:NetConnections
  • Zapi:NFSLock
  • ZapiPerf:Workload
  • ZapiPerf:WorkloadDetail
Name                     Instances  Metrics
Rest:NFSClients              3,194   22,351
Rest:NFSClients              3,203   22,414
Rest:NetConnections         25,225   71,616
Zapi:NFSLock                    12  439,665
Zapi:NFSLock                   473  439,383
ZapiPerf:Workload            3,827   99,502
ZapiPerf:Workload            3,846   99,996
ZapiPerf:WorkloadDetail     34,551  829,096
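
If it helps, one way to disable those objects is to comment out their entries in the relevant collector default.yaml files. A sketch, where the file names and object keys are assumptions based on a standard install layout rather than copied from your system:

# conf/rest/default.yaml (sketch) -- comment out the objects to skip
objects:
#  NFSClients:      nfs_clients.yaml
#  NetConnections:  netconnections.yaml
  Volume:            volume.yaml

# conf/zapi/default.yaml (sketch)
objects:
#  Lock:             lock.yaml
  Volume:            volume.yaml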

@faguayot
Author

Hello @cgrinds,

Thanks for your checks and the detailed information shared.

  1. Here is the Zapi:NFSLock template that we are using; if I'm not wrong, it is the same code that appears in some of your tickets.
#### NFS Locks
name:             Lock
query:            lock-get-iter
object:           lock

schedule:
  - instance: 180s
  - data: 180s

counters:
  lock-info:
    - ^^lockid                => lock_id
    - ^volume                 => volume
    - ^vserver                => svm
    - ^client-address         => client_address
    - ^is-constituent         => is_constituent
    - ^is-sharelock-soft      => is_share_lock_soft
    - ^lif                    => lif
    - ^lock-state             => lock_state
    - ^lock-type              => lock_type
    - ^node                   => node
    - ^path                   => path
    - ^protocol               => protocol
    - ^sharelock-mode         => share_lock_mode


collect_only_labels: true

export_options:
  instance_keys:
    - lock_id
  instance_labels:
    - volume
    - svm
    - client_address
    - is_constituent
    - is_share_lock_soft
    - lif
    - lock_state
    - lock_type
    - path
    - protocol
    - share_lock_mode

  2. This morning we ran some tests in which we disabled the following objects:
  • Rest:NetConnections
  • Rest:NFSClients
  • Zapi:NFSLock

The result was that the DNS queries disappeared. So I think you have narrowed down where the problem we are having comes from.

Regarding the Workload objects, we didn't disable them because we have been using them for some time and we don't believe they were the problem.

To give you more information, the log shared with you is from a storage array that serves NFS.

@cgrinds
Collaborator

cgrinds commented Oct 20, 2022

Thanks for the Zapi:NFSLock template; yep, that's the one we posted. We found a logging bug that causes the number of instances in your log files to be wrong (#1366). No other problem, just that the logged number is wrong (fixed now).

We're confident you have ~39,943 locks, which means it takes Harvest around 80 ZAPI requests (500 locks at a time) to return them all. And while it only takes about 5s to do that, it would not be surprising if those 80 ZAPI requests became multiple DNS requests when ONTAP gathers the lock information.

Now that you've narrowed it down to Rest:NetConnections, Rest:NFSClients, and Zapi:NFSLock, would it be possible to enable each individually until the DNS requests return again?

It could be that when ONTAP queries the active network connections, it needs to do DNS queries to find/validate the connections, in particular when it tries to return the remote hosts, the connected clients, and the client IPs connected to each interface.

Understood on the Workloads, and yes, those have been there since day one and have not changed much, so it's unlikely they're related.

@faguayot
Author

Good morning @cgrinds,

In a first step we tested Zapi:NFSLock on its own, and the result was no queries to the DNS, so it seems this object wasn't the problem. Today we want to continue testing the others at different points in time. When we have the results, I will share them with you.

My suspicion is that the Rest:NFSClients data collection could be the problem; as you said, when it checks the active network connections, ONTAP performs the name resolution, but I can't understand why ONTAP does that for the internal IP addresses, which are only used by the cluster.

@faguayot
Author

@cgrinds Today we ran the tests with the other two objects and discovered that the object generating the many DNS queries was Rest:NetConnections.

@cgrinds
Collaborator

cgrinds commented Oct 24, 2022

Thanks for the confirmation @faguayot!

That means you will see the same "DNS storm" from the ONTAP CLI, since Harvest's REST template for NetConnections calls api/private/cli/network/connections/active, which is the same as network connections active show in the ONTAP CLI.

I don't think there is anything Harvest can do about this. It's probably worth opening a case with ONTAP if you want clarification or would like them to reduce the number of DNS requests.

@faguayot
Author

Sorry for the delay in answering.

I didn't know which API or ZAPI query that object was making, so thanks for sharing that helpful information again, Chris.

What kind of request/query does NFSClients make? I was almost sure that this object was going to be the culprit of the "DNS storm".

Thanks for your time during this issue, which wasn't directly a Harvest problem even though Harvest was indirectly involved.

@cgrinds
Collaborator

cgrinds commented Oct 31, 2022

The NFSClients template calls the ONTAP REST api/protocols/nfs/connected-clients endpoint, which retrieves the NFS configuration of SVMs as described in the ONTAP REST documentation for 9.7 and 9.10.1.
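
For reference, the query line of the NFSClients REST template points at that endpoint. A minimal sketch of the template shape, where the counters listed are illustrative assumptions rather than the shipped list:

name:    NFSClients
query:   api/protocols/nfs/connected-clients
object:  nfs_clients

counters:
  - ^^svm.name     => svm          # assumed fields, for illustration only
  - ^client_ip     => client_ip
  - ^volume.name   => volume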

@cgrinds
Collaborator

cgrinds commented Mar 22, 2023

The remote_host counter in the NetConnections template causes ONTAP to attempt to resolve the IP address of every active connection on the cluster. This can cause a DNS storm that is mostly harmless, but it may produce noisy logging, especially when DNS is misconfigured or timeouts happen.

Since this template is not used by any dashboards, we're going to:

  1. Disable the template
  2. Disable the remote_host counter to reduce the noise when the template is enabled (see the sketch below)
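
In template terms, the change amounts to something like the following in the NetConnections REST template; a sketch where only the query path comes from this thread and the other counter names are assumptions:

name:    NetConnections
query:   api/private/cli/network/connections/active
object:  net_connection

counters:
  - ^^cid           => cid           # assumed key field
  - ^service        => service
  - ^local_address  => local_address
#  - ^remote_host   => remote_host   # disabled: causes ONTAP to reverse-resolve every active connection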

Thanks to Alessandro for reporting and LeonardoA for providing the details on remote_host.
