Centralized Telegraf Manager #7478
Hey @raider111111, I appreciate you raising this issue, as it mirrors some discussions we've been having on the team. I might like to split off the "monitoring health of agents" conversation into something separate, since Telegraf agents already have an internal plugin for self-monitoring.

For the config server, we're thinking of something like a Telegraf-config service that can store and manage a central config repository, holding config templates for certain types of machines. Machines would then pull the config type they want, use local environment variables for some settings, and start Telegraf.

I'd love to hear your thoughts and what you currently use for configuration management.
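The pull side of such a config service could be very small: fetch a template, fill in local environment variables, and write the result where Telegraf expects it. A minimal sketch of the substitution step, where the config-service URL and template names are purely illustrative (Telegraf itself also supports `${VAR}` substitution natively, so this is only to show the idea):

```python
import re

def render_config(template: str, env: dict) -> str:
    """Substitute ${VAR} placeholders with values from the given environment,
    leaving unknown placeholders untouched."""
    return re.sub(r"\$\{(\w+)\}", lambda m: env.get(m.group(1), m.group(0)), template)

# Fetching from a hypothetical config service would look something like:
#   import urllib.request
#   template = urllib.request.urlopen("http://config-server/telegraf/webserver.conf").read().decode()

template = '[[outputs.influxdb]]\n  urls = ["${INFLUX_URL}"]\n'
print(render_config(template, {"INFLUX_URL": "http://influxdb:8086"}))
```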
@ssoroka, how about integration with ZooKeeper?
I've been thinking there are two separate pieces: key/value stores for service discovery and secrets, and full config storage. With ZooKeeper/Consul/etcd you may want to pull certain variables and use them in an existing configuration. There is also configuration storage, where you store the full configs, plugin configs, or configuration templates. These two sources of information need to be combined to produce the final configuration.

Right now we have analogues of both: the full config is local-only and can be split across the config directory, and for key/value lookups we have environment variable support in the config file. I think we would probably keep it a two-step process, so that you can mix and match where the configuration is kept and where the variables are kept. So for ZooKeeper, we may support a way of grabbing variables and using them in the configuration. Maybe something along the lines of:

```toml
[[variables.zookeeper]]
  # zookeeper connection settings

[[inputs.http]]
  urls = ["$zookeeper{some_key}"]
```

I'm not sure how we would handle advanced tasks such as producing multiple plugins from a list of keys. I'm not sold on introducing an official template format built into Telegraf; it may make more sense to layer this on the outside.
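The resolution step for such `$zookeeper{...}` placeholders could be layered on the outside exactly as suggested. A minimal sketch, using a plain dict in place of a real ZooKeeper client (in practice the lookup would call something like kazoo's `get()`); the placeholder syntax is taken from the proposal above, everything else is illustrative:

```python
import re

def resolve_zookeeper_vars(config: str, lookup) -> str:
    """Replace $zookeeper{key} placeholders using a key/value lookup function."""
    return re.sub(r"\$zookeeper\{([^}]+)\}", lambda m: lookup(m.group(1)), config)

# A dict stands in for a real ZooKeeper client here:
kv = {"some_key": "http://app-server:9100/metrics"}
config = '[[inputs.http]]\n  urls = ["$zookeeper{some_key}"]\n'
print(resolve_zookeeper_vars(config, kv.__getitem__))
```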
@raider111111 I would like to share the solution for monitoring agent status that we use in our environment:

```toml
[[inputs.internal]]

[[inputs.procstat]]
  systemd_unit = "telegraf.service"
  namepass = [ "procstat_lookup" ]
  [inputs.procstat.tags]
    appl = "telegraf"
```

With these plugins, each agent marks itself in the procstat_lookup measurement, which reports its status and the host where it is located. We then display their status on a Grafana dashboard, where a lack of data is interpreted as a problem with the agent. This is not an ideal solution for a number of reasons, but it works. Hope you find this helpful.
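The "lack of data means a broken agent" check described above can be expressed as a per-host query in Grafana. A sketch in InfluxQL, assuming the `procstat_lookup` measurement from the config above; the `running` field name and the 5-minute window are assumptions, not from the thread:

```sql
-- Hosts that have reported recently; any host missing from this result
-- (or with a stale last() value) is treated as a problem agent.
SELECT last("running") FROM "procstat_lookup"
WHERE "appl" = 'telegraf' AND time > now() - 5m
GROUP BY "host"
```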
@danielnelson, didn't you think about downloading configurations to a directory defined through a config option?
What's different in the fork? Does it add a new plugin that downloads files and sends SIGHUP? You could definitely do this without a fork using the execd input. Another popular tool for doing this is confd.
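The download-and-SIGHUP pattern mentioned here is simple enough to sketch without a fork. A minimal illustration, assuming the standard `/etc/telegraf/telegraf.conf` path and a `pidof`-discoverable telegraf process; the download URL and the decision to compare by hash are assumptions, not anything from the thread:

```python
import hashlib
import os
import signal
import subprocess

def config_changed(new: bytes, path: str) -> bool:
    """Return True if the downloaded config differs from the file on disk."""
    if not os.path.exists(path):
        return True
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).digest() != hashlib.sha256(new).digest()

def reload_telegraf() -> None:
    """Send SIGHUP to the running telegraf process so it reloads its config."""
    pid = int(subprocess.check_output(["pidof", "telegraf"]).split()[0])
    os.kill(pid, signal.SIGHUP)

# Usage sketch (URL is illustrative): download the config with urllib, then
#   if config_changed(data, "/etc/telegraf/telegraf.conf"):
#       write the file and call reload_telegraf()
```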
Thanks for the idea, I'll try to do it the way you suggest! I should add that, as far as I know, the fork was made from version 1.6-1.7, when Telegraf did not yet have the execd plugin, probably could not work with environment variables (I could be wrong), and could not load the configuration via HTTP. At the time, the approach I described seemed appropriate, but now your solution is more correct.
Thanks @M0rdecay, @danielnelson, and @ssoroka!!
@raider111111 What we do is use a repo with the config setup and an Ansible playbook to install the agents across our inventory. We use Ansible AWX to maintain the state of all the agents by running the playbook multiple times a day. We provide sane defaults for what should be monitored, but also allow "extras" in our inventory; when a target is configured with an extra, it is deployed with an additional conf.d fragment. We have several thousand agents deployed in this manner.

With regard to state monitoring, we only care if an agent is not submitting as it should, so we have global alerts on agents that stop reporting. Does that help?
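The defaults-plus-extras layout described above maps naturally onto two Ansible tasks. An illustrative fragment, where the template names, destination paths, and the `telegraf_extras` inventory variable are all assumptions rather than details from the thread:

```yaml
- name: Deploy baseline Telegraf config
  ansible.builtin.template:
    src: telegraf.conf.j2
    dest: /etc/telegraf/telegraf.conf
  notify: restart telegraf

- name: Deploy "extra" conf.d fragments for hosts that request them
  ansible.builtin.template:
    src: "extras/{{ item }}.conf.j2"
    dest: "/etc/telegraf/telegraf.d/{{ item }}.conf"
  loop: "{{ telegraf_extras | default([]) }}"
  notify: restart telegraf
```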
This is a great way to do it, and I would follow @pberlowski's practice of monitoring for agents that stop reporting data.
Closing. See discussion in #272 |
Thanks @ssoroka, @pberlowski! Yes, this is what we're currently doing. |
@raider111111 I can walk you through what we have for the centralized Telegraf management if you want to connect. I'll put a blog post on the topic down on my todo list as well, so I can share this knowledge and start a discussion. |
Ansible is a push model, and I'd be interested in more of a central config repository for Telegraf itself, though as far as individual solutions go, it's just whatever fits your organization best. I'm curious whether anyone is using Kubernetes to deploy Telegraf alongside other service pods?
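Deploying Telegraf alongside a service pod is usually done as a sidecar container sharing the pod. A minimal sketch, in which every name (the app image, the ConfigMap, the Telegraf version) is illustrative only:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-telegraf
spec:
  containers:
    - name: app
      image: example/app:latest
    - name: telegraf          # sidecar scraping/forwarding metrics for the app
      image: telegraf:1.14
      volumeMounts:
        - name: telegraf-config
          mountPath: /etc/telegraf
  volumes:
    - name: telegraf-config
      configMap:
        name: telegraf-sidecar-config
```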
We're just finalizing an operator for Kubernetes that installs "metrics" custom resources for our internal customers. Our main driver is that we want to maintain control over the agent (to allow version and config updates) while sharding it across namespaces.
@pberlowski Did you end up writing a blog post on this? I wasn't able to track down a blog from your GitHub profile. I would be very interested in what you were able to get done in that space. Thanks.
Feature Request
Proposal:
A centralized way to monitor the health of the agents as well as deploy new configuration files to a server or groups of servers remotely.
Current behavior:
Telegraf appears to need to be monitored using a separate tool, like Wavefront, and the configurations have to be deployed using a desired-state tool like Puppet.
Desired behavior:
A centralized way to monitor the health of the agents as well as deploy new configuration files to a server or groups of servers remotely.
Use case:
The ability to monitor the health of the Telegraf agents is important.
The ability to deploy new configuration files to thousands of servers would be extremely useful in our environment.