-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Procstat: collect connections and listener for each proc #8755
Procstat: collect connections and listener for each proc #8755
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla
!signed-cla |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla
4a0d64e
to
b5e0415
Compare
I want to add someway to only gather the procs with connections, to avoid the extra cost of parsing all procs. |
!signed-cla |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤝 ✅ CLA has been signed. Thank you!
@adrianlzt Sorry for the bot being a nuisance, there was a bug. Thanks for working on this PR! |
Procstat modified to add a new metric "procstat_tcp" with to values "conn" and "listen". It also adds number of connections for each TCP state. Those metrics will be added only if it they are activated specifically. Those values will contain a comma separated value of the endpoints where the proc is connecting to or listening (IPv4 and IPv6). Only for linux. Local, virtual and docker interfaces are ignored. If some proc is listening on 0.0.0.0, an endpoint for each of the "public" (those not ignored as internal) IPv4 IPs is created. If it is listening on :: (IPv6) an endpoint is created for each IPv4 and IPv6. For programs with one parent and several children, all listening in some endpoint, only the parent process is taken into account. Child endpoints are ignored. Connections made to this cost (local port is one of the listening ports) are ignored. To avoid having servers with thousands of connections. We prefer to collect that info in the clients. It is also added connection info (number of connections in each of the TCP states) for each proc. Improving PR influxdata#5402 To gather tcp connections netlink is used, to avoid the cost of parsing /proc/net/tcp(6), but /proc should be readed to get the mapping between inodes and pids.
b5e0415
to
4698e5b
Compare
Added an option to skip some metrics as suggested |
If we want to collect only the metric procstat_lookup or procstat_tcp is a big overhead having to parse lots of files in /proc to get number of fds, CPU time, memory usage, etc. Added a new configuration option to be able to select which metrics to collect. If the configuration parameter, 'metrics_include', is undefined, the plugin behaves like before adding this parameter. The difference (measured with ~1100 procs) between gathering the procstat_tcp metrics alone or with the default collection of metrics is around one tenth (~50ms alone, ~500ms with default collection).
/proc/n/fd dirs were not being closed till the end of mapInodesToPid function, leaving a lot of file descriptors open. This could reach the hard limit in Linux breaking the functionality of Telegraf.
4698e5b
to
81657b5
Compare
netInfo.Fetch() should read all files in /proc/n/fd dirs to map connections to pids. This operation should only be executed if we are going to use that information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty thorough. I think it needs a bit of cleanup on error handling and logging, and some flow control around that. if you add Log telegraf.Logger
to the struct the plugin system will populate it, and you can use testutils.Logger{}
in tests.
Q: does github.com/elastic/gosigar/ use cgo?
// TODO allow passing a list of MAC headers to ignore | ||
// TODO allow passing exclude list of regexp to match interface names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider removing this, I think you could accomplish it with tagexclude or a processor
fmt.Printf("Error opening /proc: %v", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be using the struct's Log telegraf.Logger
to log this.
Missing return
|
||
dirContents, err := fd.Readdirnames(0) | ||
if err != nil { | ||
fmt.Printf("Error reading files in /proc: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be using the struct's Log telegraf.Logger
to log this.
Missing return
pid, err := strconv.ParseUint(pidStr, 10, 32) | ||
if err != nil { | ||
// exclude files with a not numeric name. We only want to access pid directories | ||
return | ||
} | ||
|
||
pidDir, err := os.Open("/proc/" + pidStr + "/fd/") | ||
if err != nil { | ||
// ignore errors: | ||
// - missing directory, pid has already finished | ||
// - permission denied | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be nice to handle the errors specifically and return others
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be really good to log all of the errors in this function at the Debug level
@@ -10,6 +10,7 @@ import ( | |||
|
|||
type Process interface { | |||
PID() PID | |||
Ppid() (PID, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider uppercase
Ppid() (PID, error) | |
PPID() (PID, error) |
if (p.metricEnabled(MetricsTCPStats) || p.metricEnabled(MetricsConnectionsEndpoints)) && len(p.procs) > 0 { | ||
err = netInfo.Fetch() | ||
if err != nil { | ||
acc.AddError(fmt.Errorf("E! [inputs.procstat] getting TCP network info: %v", err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
acc.AddError(fmt.Errorf("E! [inputs.procstat] getting TCP network info: %v", err)) | |
acc.AddError(fmt.Errorf("getting TCP network info: %w", err)) |
} else { | ||
// Ignore errors because pid was not found. It is normal to have procs without connections | ||
if !errors.Is(err, ErrorPIDNotFound) { | ||
acc.AddError(fmt.Errorf("D! [inputs.procstat] not able to get connections for pid=%v: %v", proc.PID(), err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add Log telegraf.Logger
to the struct and use that instead for debug.
acc.AddError(fmt.Errorf("D! [inputs.procstat] not able to get connections for pid=%v: %v", proc.PID(), err)) | |
p.Log.Debug("not able to get connections for pid=%v: %v", proc.PID(), err) |
func addConnectionStats(pidConnections []ConnInfo, fields map[string]interface{}, prefix string) { | ||
} | ||
|
||
func addConnectionEnpoints(acc telegraf.Accumulator, proc Process, netInfo NetworkInfo) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing d change to addConnectionEndpoints
and update uses of it
@@ -60,6 +60,35 @@ Processes can be selected for monitoring using one of several methods: | |||
## the native finder performs the search directly in a manor dependent on the | |||
## platform. Default is 'pgrep' | |||
# pid_finder = "pgrep" | |||
|
|||
## Select wich extra metrics should be added: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing h update to which
@@ -90,6 +142,35 @@ var sampleConfig = ` | |||
## the native finder performs the search directly in a manor dependent on the | |||
## platform. Default is 'pgrep' | |||
# pid_finder = "pgrep" | |||
|
|||
## Select wich extra metrics should be added: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing h update to which
fields[prefix+"tcp_closing"] = counts[linux.TCP_CLOSING] | ||
} | ||
|
||
// addConnectionEnpoints add listen and connection endpoints to the procstat_tcp metric. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing d change toaddConnectionEndpoints
// If listen is 0.0.0.0 or ::, it will be added one value for each of the IP addresses of the host. | ||
// Listeners in private IPs are ignored (maybe a flag could be added, but now the reasoning is matching connections between hosts). | ||
// Connections made to this server are ignored (the local port is one of the listening ports). | ||
func addConnectionEnpoints(acc telegraf.Accumulator, proc Process, netInfo NetworkInfo) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing d change to addConnectionEndpoints
@@ -157,11 +186,49 @@ the `win_perf_counters` input plugin as a more mature alternative. | |||
- running (int) | |||
- result_code (int, success = 0, lookup_error = 1) | |||
|
|||
If ``connections_stats`` enabled, added fields: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be tcp_stats?
Closing as there hasn't been any response to the latest reviews, but if you are still seeing this pull request merged please comment or re-open and we can help work through any outstanding issues. |
Procstat modified to add a new metric "procstat_tcp" with to values
"conn" and "listen".
It also adds number of connections for each TCP state.
Those metrics will be added only if it they are activated specifically.
Those values will contain a comma separated value of the endpoints where
the proc is connecting to or listening (IPv4 and IPv6).
Only for linux.
Local, virtual and docker interfaces are ignored.
If some proc is listening on 0.0.0.0, an endpoint for each of the
"public" (those not ignored as internal) IPv4 IPs is created.
If it is listening on :: (IPv6) an endpoint is created for each IPv4 and
IPv6.
For programs with one parent and several children, all listening in some
endpoint, only the parent process is taken into account. Child endpoints
are ignored.
Connections made to this cost (local port is one of the listening ports)
are ignored. To avoid having servers with thousands of connections.
We prefer to collect that info in the clients.
It is also added connection info (number of connections in each of the
TCP states) for each proc. Improving PR #5402
To gather tcp connections netlink is used, to avoid the cost of parsing
/proc/net/tcp(6), but /proc should be readed to get the mapping between
inodes and pids.
Required for all PRs: