-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server, build: add time series for disk and net IO stats #27137
Conversation
26ad28b
to
e1c0fdd
Compare
Ready to go now; PTAL @couchand |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems generally fine, one copy-pasta bug
Reviewed 2 of 2 files at r1, 2 of 2 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained
pkg/server/status/runtime.go, line 138 at r2 (raw file):
} metaHostDiskReadCount = metric.Metadata{
why "host" on all of these? what host?
pkg/server/status/runtime.go, line 139 at r2 (raw file):
metaHostDiskReadCount = metric.Metadata{ Name: "sys.disk.read.count.host",
suggest removing .host
from all of these. if it's an important organizing detail (per above comment), maybe put it immediately after (or instead of?) sys
?
pkg/server/status/runtime.go, line 141 at r2 (raw file):
Name: "sys.disk.read.count.host", Unit: metric.Unit_COUNT, Measurement: "Disk read operations",
all of these Measurement
fields should be pithier. suggestions:
- Operations or Reads and Writes
- Disk Time
- Throughput
- Packets
etc.
pkg/server/status/runtime.go, line 142 at r2 (raw file):
Unit: metric.Unit_COUNT, Measurement: "Disk read operations", Help: "Disk read operations across all disks since this process started",
suggest removing the "since this process started"
pkg/server/status/runtime.go, line 154 at r2 (raw file):
Unit: metric.Unit_BYTES, Measurement: "Bytes read from disk", Help: "Bytes read to all disks since this process started",
s/to/from/
pkg/server/status/runtime.go, line 254 at r2 (raw file):
CPUSysNS *metric.Gauge CPUSysPercent *metric.GaugeFloat64 // Memory stat.
it is nice to organize these... perhaps all four ...Bytes
metrics should go down here?
pkg/server/status/runtime.go, line 260 at r2 (raw file):
FDSoftLimit *metric.Gauge // Disk and network stats. // The values here are counters of the number of operations, packets, bytes, and cumulative time
This comment seems a bit out of place. It's useful information, it just seems weird to put here.
pkg/server/status/runtime.go, line 320 at r2 (raw file):
CPUSysPercent: metric.NewGaugeFloat64(metaCPUSysPercent), Rss: metric.NewGauge(metaRSS), HostDiskReadTime: metric.NewGauge(metaHostDiskReadTime),
three different places these have to be repeated and they're in three different orders! please sort consistently.
pkg/server/status/runtime.go, line 341 at r2 (raw file):
log.Errorf(ctx, "could not get initial disk IO counters: %v", err) } rsr.initialDiskCounters = diskCounters
maybe initialize these above so you can continue to just return the struct?
pkg/server/status/runtime.go, line 424 at r2 (raw file):
goTotal := ms.Sys - ms.HeapReleased if err := rsr.recordDiskStats(ctx); err != nil {
it seems a little bit odd to stick these lines here...
pkg/server/status/runtime.go, line 484 at r2 (raw file):
rsr.HostDiskReadCount.Update(int64(summedDiskCounters.ReadCount)) rsr.HostDiskWriteBytes.Update(int64(summedDiskCounters.WriteBytes)) rsr.HostDiskWriteCount.Update(int64(summedDiskCounters.WriteCount))
is one missing?
pkg/server/status/runtime.go, line 567 at r2 (raw file):
from.PacketsRecv -= sub.PacketsRecv from.PacketsSent -= sub.PacketsSent }
maybe a test or two?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/server/status/runtime.go, line 138 at r2 (raw file):
Previously, couchand (Andrew Couch) wrote…
why "host" on all of these? what host?
Done.
pkg/server/status/runtime.go, line 139 at r2 (raw file):
Previously, couchand (Andrew Couch) wrote…
suggest removing
.host
from all of these. if it's an important organizing detail (per above comment), maybe put it immediately after (or instead of?)sys
?
Done. Changed to sys.host.
pkg/server/status/runtime.go, line 141 at r2 (raw file):
Previously, couchand (Andrew Couch) wrote…
all of these
Measurement
fields should be pithier. suggestions:
- Operations or Reads and Writes
- Disk Time
- Throughput
- Packets
etc.
Done.
pkg/server/status/runtime.go, line 142 at r2 (raw file):
Previously, couchand (Andrew Couch) wrote…
suggest removing the "since this process started"
This seems pretty important to interpreting the numbers. Is it enough to just say it above in a comment instead of duplicating it in each metric?
pkg/server/status/runtime.go, line 154 at r2 (raw file):
Previously, couchand (Andrew Couch) wrote…
s/to/from/
Done.
pkg/server/status/runtime.go, line 254 at r2 (raw file):
Previously, couchand (Andrew Couch) wrote…
it is nice to organize these... perhaps all four
...Bytes
metrics should go down here?
You're suggesting putting the disk and net IO metrics under memory? Not following.
pkg/server/status/runtime.go, line 260 at r2 (raw file):
Previously, couchand (Andrew Couch) wrote…
This comment seems a bit out of place. It's useful information, it just seems weird to put here.
Wasn't sure where to put it… Moved to alongside metric metadata structs.
pkg/server/status/runtime.go, line 320 at r2 (raw file):
Previously, couchand (Andrew Couch) wrote…
three different places these have to be repeated and they're in three different orders! please sort consistently.
Done.
pkg/server/status/runtime.go, line 341 at r2 (raw file):
Previously, couchand (Andrew Couch) wrote…
maybe initialize these above so you can continue to just return the struct?
Done.
pkg/server/status/runtime.go, line 424 at r2 (raw file):
Previously, couchand (Andrew Couch) wrote…
it seems a little bit odd to stick these lines here...
Done. Hm yeah. Moved them to the bottom.
pkg/server/status/runtime.go, line 484 at r2 (raw file):
Previously, couchand (Andrew Couch) wrote…
is one missing?
Done. Yes, WriteTime
was missing. Whoops!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL @couchand
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/server/status/runtime.go, line 567 at r2 (raw file):
Previously, couchand (Andrew Couch) wrote…
maybe a test or two?
Done, although they are pretty silly...
pkg/server/status/runtime.go
Outdated
@@ -280,4 +450,118 @@ func (rsr *RuntimeStatSampler) SampleEnvironment(ctx context.Context) { | |||
rsr.FDSoftLimit.Update(int64(fds.SoftLimit)) | |||
rsr.Rss.Update(int64(mem.Resident)) | |||
rsr.Uptime.Update((now - rsr.startTimeNanos) / 1e9) | |||
|
|||
if err := rsr.recordDiskStats(ctx); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: these methods aren't actually recording disk stats (which typically means writing to disks), they're sampling these statistics. The word "sample" is used for other methods on this class (e.g. "SampleEnvironment"), so I would suggest using "sample" as the prefix instead to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 5 files at r3, 1 of 3 files at r5.
Reviewable status:complete! 0 of 0 LGTMs obtained
pkg/server/status/runtime.go, line 142 at r2 (raw file):
Previously, vilterp (Pete Vilter) wrote…
This seems pretty important to interpreting the numbers. Is it enough to just say it above in a comment instead of duplicating it in each metric?
You're right, it's good as-is.
pkg/server/status/runtime.go, line 254 at r2 (raw file):
Previously, vilterp (Pete Vilter) wrote…
You're suggesting putting the disk and net IO metrics under memory? Not following.
Sorry, no, not your new ones, I mean the four other memory stats from up above. It's only relevant here since you added the headings.
pkg/server/status/runtime.go, line 567 at r2 (raw file):
Previously, vilterp (Pete Vilter) wrote…
Done, although they are pretty silly...
would it have caught the copy-paste bug above, though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Reviewable status:
complete! 0 of 0 LGTMs obtained
Use new library `gopsutil` to gather IO stats from the host kernel and save them as time series. Caveat: these counters are for the entire host over its lifetime. As such, the first value cockroach records may be very high, and if we plot these charts as derivatives, the derivative from zero to the first recorded data point will be much higher than subsequent derivatives, making a plot where it's hard to see changes after that. A future commit will gather per-process IO stats, but gopsutil only provides that on Linux. Release note (admin ui change): gather disk and network IO statistics and save them as time series.
Release note: None
TFTR! (Reviewable/GH still thinks you haven't approved… Must be that new disposition stuff…) bors r+ |
Build failed (retrying...) |
27137: server, build: add time series for disk and net IO stats r=vilterp a=vilterp Use new library `gopsutil` to gather IO stats from the host kernel and save them as time series. Caveat: these counters are for the entire host over its lifetime, not the cockroach process's lifetime. As such, the first value cockroach records may be very high, and if we plot these charts as derivatives, the derivative from zero to the first recorded data point will be much higher than subsequent derivatives, making a plot where it's hard to see changes after that. A future commit will gather per-process IO stats, but gopsutil only provides that on Linux. Release note (admin ui change): gather disk and network IO statistics and save them as time series. Co-authored-by: Pete Vilter <[email protected]>
Build succeeded |
27626: ui: add hardware metrics to new "Hardware" dashboard r=vilterp a=vilterp First two commits are from #27137; that PR will go in first. data:image/s3,"s3://crabby-images/4dc8c/4dc8c6c6aebc4ec16a048d9445b0a8e3132c66b6" alt="image" TODO - [ ] tooltips - [ ] format CPU percentage (currently 100% of one core = 1) Co-authored-by: Pete Vilter <[email protected]>
Use new library
gopsutil
to gather IO stats from the host kernel and save them as time series.Caveat: these counters are for the entire host over its lifetime, not the cockroach process's lifetime. As such, the first value cockroach records may be very high, and if we plot these charts as derivatives, the derivative from zero to the first recorded data point will be much higher than subsequent derivatives, making a plot where it's hard to see changes after that.
A future commit will gather per-process IO stats, but gopsutil only provides that on Linux.
Release note (admin ui change): gather disk and network IO statistics and
save them as time series.