Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server, build: add time series for disk and net IO stats #27137

Merged
merged 2 commits into from
Jul 23, 2018

Conversation

vilterp
Copy link
Contributor

@vilterp vilterp commented Jul 3, 2018

Use new library gopsutil to gather IO stats from the host kernel and save them as time series.

Caveat: these counters are for the entire host over its lifetime, not the cockroach process's lifetime. As such, the first value cockroach records may be very high, and if we plot these charts as derivatives, the derivative from zero to the first recorded data point will be much higher than subsequent derivatives, making a plot where it's hard to see changes after that.

A future commit will gather per-process IO stats, but gopsutil only provides that on Linux.

Release note (admin ui change): gather disk and network IO statistics and
save them as time series.

@vilterp vilterp requested review from couchand and a team July 3, 2018 22:08
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@vilterp vilterp changed the title server: add time series for disk and net IO stats server, build: add time series for disk and net IO stats Jul 3, 2018
@vilterp vilterp changed the title server, build: add time series for disk and net IO stats [wip] server, build: add time series for disk and net IO stats Jul 9, 2018
@vilterp vilterp added the do-not-merge bors won't merge a PR with this label. label Jul 9, 2018
@vilterp vilterp force-pushed the host-io-stats branch 2 times, most recently from 26ad28b to e1c0fdd Compare July 16, 2018 21:03
@vilterp vilterp changed the title [wip] server, build: add time series for disk and net IO stats server, build: add time series for disk and net IO stats Jul 16, 2018
@vilterp vilterp removed the do-not-merge bors won't merge a PR with this label. label Jul 16, 2018
@vilterp
Copy link
Contributor Author

vilterp commented Jul 16, 2018

Ready to go now; PTAL @couchand

Copy link
Contributor

@couchand couchand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems generally fine, one copy-pasta bug

Reviewed 2 of 2 files at r1, 2 of 2 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/status/runtime.go, line 138 at r2 (raw file):

	}

	metaHostDiskReadCount = metric.Metadata{

why "host" on all of these? what host?


pkg/server/status/runtime.go, line 139 at r2 (raw file):

	metaHostDiskReadCount = metric.Metadata{
		Name:        "sys.disk.read.count.host",

suggest removing .host from all of these. if it's an important organizing detail (per above comment), maybe put it immediately after (or instead of?) sys?


pkg/server/status/runtime.go, line 141 at r2 (raw file):

		Name:        "sys.disk.read.count.host",
		Unit:        metric.Unit_COUNT,
		Measurement: "Disk read operations",

all of these Measurement fields should be pithier. suggestions:

  • Operations or Reads and Writes
  • Disk Time
  • Throughput
  • Packets
    etc.

pkg/server/status/runtime.go, line 142 at r2 (raw file):

		Unit:        metric.Unit_COUNT,
		Measurement: "Disk read operations",
		Help:        "Disk read operations across all disks since this process started",

suggest removing the "since this process started"


pkg/server/status/runtime.go, line 154 at r2 (raw file):

		Unit:        metric.Unit_BYTES,
		Measurement: "Bytes read from disk",
		Help:        "Bytes read to all disks since this process started",

s/to/from/


pkg/server/status/runtime.go, line 254 at r2 (raw file):

	CPUSysNS       *metric.Gauge
	CPUSysPercent  *metric.GaugeFloat64
	// Memory stat.

it is nice to organize these... perhaps all four ...Bytes metrics should go down here?


pkg/server/status/runtime.go, line 260 at r2 (raw file):

	FDSoftLimit *metric.Gauge
	// Disk and network stats.
	// The values here are counters of the number of operations, packets, bytes, and cumulative time

This comment seems a bit out of place. It's useful information, it just seems weird to put here.


pkg/server/status/runtime.go, line 320 at r2 (raw file):

		CPUSysPercent:      metric.NewGaugeFloat64(metaCPUSysPercent),
		Rss:                metric.NewGauge(metaRSS),
		HostDiskReadTime:   metric.NewGauge(metaHostDiskReadTime),

three different places these have to be repeated and they're in three different orders! please sort consistently.


pkg/server/status/runtime.go, line 341 at r2 (raw file):

		log.Errorf(ctx, "could not get initial disk IO counters: %v", err)
	}
	rsr.initialDiskCounters = diskCounters

maybe initialize these above so you can continue to just return the struct?


pkg/server/status/runtime.go, line 424 at r2 (raw file):

	goTotal := ms.Sys - ms.HeapReleased

	if err := rsr.recordDiskStats(ctx); err != nil {

it seems a little bit odd to stick these lines here...


pkg/server/status/runtime.go, line 484 at r2 (raw file):

	rsr.HostDiskReadCount.Update(int64(summedDiskCounters.ReadCount))
	rsr.HostDiskWriteBytes.Update(int64(summedDiskCounters.WriteBytes))
	rsr.HostDiskWriteCount.Update(int64(summedDiskCounters.WriteCount))

is one missing?


pkg/server/status/runtime.go, line 567 at r2 (raw file):

	from.PacketsRecv -= sub.PacketsRecv
	from.PacketsSent -= sub.PacketsSent
}

maybe a test or two?

Copy link
Contributor Author

@vilterp vilterp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/status/runtime.go, line 138 at r2 (raw file):

Previously, couchand (Andrew Couch) wrote…

why "host" on all of these? what host?

Done.


pkg/server/status/runtime.go, line 139 at r2 (raw file):

Previously, couchand (Andrew Couch) wrote…

suggest removing .host from all of these. if it's an important organizing detail (per above comment), maybe put it immediately after (or instead of?) sys?

Done. Changed to sys.host.


pkg/server/status/runtime.go, line 141 at r2 (raw file):

Previously, couchand (Andrew Couch) wrote…

all of these Measurement fields should be pithier. suggestions:

  • Operations or Reads and Writes
  • Disk Time
  • Throughput
  • Packets
    etc.

Done.


pkg/server/status/runtime.go, line 142 at r2 (raw file):

Previously, couchand (Andrew Couch) wrote…

suggest removing the "since this process started"

This seems pretty important to interpreting the numbers. Is it enough to just say it above in a comment instead of duplicating it in each metric?


pkg/server/status/runtime.go, line 154 at r2 (raw file):

Previously, couchand (Andrew Couch) wrote…

s/to/from/

Done.


pkg/server/status/runtime.go, line 254 at r2 (raw file):

Previously, couchand (Andrew Couch) wrote…

it is nice to organize these... perhaps all four ...Bytes metrics should go down here?

You're suggesting putting the disk and net IO metrics under memory? Not following.


pkg/server/status/runtime.go, line 260 at r2 (raw file):

Previously, couchand (Andrew Couch) wrote…

This comment seems a bit out of place. It's useful information, it just seems weird to put here.

Wasn't sure where to put it… Moved to alongside metric metadata structs.


pkg/server/status/runtime.go, line 320 at r2 (raw file):

Previously, couchand (Andrew Couch) wrote…

three different places these have to be repeated and they're in three different orders! please sort consistently.

Done.


pkg/server/status/runtime.go, line 341 at r2 (raw file):

Previously, couchand (Andrew Couch) wrote…

maybe initialize these above so you can continue to just return the struct?

Done.


pkg/server/status/runtime.go, line 424 at r2 (raw file):

Previously, couchand (Andrew Couch) wrote…

it seems a little bit odd to stick these lines here...

Done. Hm yeah. Moved them to the bottom.


pkg/server/status/runtime.go, line 484 at r2 (raw file):

Previously, couchand (Andrew Couch) wrote…

is one missing?

Done. Yes, WriteTime was missing. Whoops!

Copy link
Contributor Author

@vilterp vilterp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL @couchand

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/status/runtime.go, line 567 at r2 (raw file):

Previously, couchand (Andrew Couch) wrote…

maybe a test or two?

Done, although they are pretty silly...

@@ -280,4 +450,118 @@ func (rsr *RuntimeStatSampler) SampleEnvironment(ctx context.Context) {
rsr.FDSoftLimit.Update(int64(fds.SoftLimit))
rsr.Rss.Update(int64(mem.Resident))
rsr.Uptime.Update((now - rsr.startTimeNanos) / 1e9)

if err := rsr.recordDiskStats(ctx); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: these methods aren't actually recording disk stats (which typically means writing to disks), they're sampling these statistics. The word "sample" is used for other methods on this class (e.g. "SampleEnvironment"), so I would suggest using "sample" as the prefix instead to avoid confusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@couchand couchand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 5 files at r3, 1 of 3 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/status/runtime.go, line 142 at r2 (raw file):

Previously, vilterp (Pete Vilter) wrote…

This seems pretty important to interpreting the numbers. Is it enough to just say it above in a comment instead of duplicating it in each metric?

You're right, it's good as-is.


pkg/server/status/runtime.go, line 254 at r2 (raw file):

Previously, vilterp (Pete Vilter) wrote…

You're suggesting putting the disk and net IO metrics under memory? Not following.

Sorry, no, not your new ones, I mean the four other memory stats from up above. It's only relevant here since you added the headings.


pkg/server/status/runtime.go, line 567 at r2 (raw file):

Previously, vilterp (Pete Vilter) wrote…

Done, although they are pretty silly...

would it have caught the copy-paste bug above, though?

Copy link
Contributor

@couchand couchand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Use new library `gopsutil` to gather IO stats from the host kernel and
save them as time series.

Caveat: these counters are for the entire host over its lifetime. As
such, the first value cockroach records may be very high, and if we plot
these charts as derivatives, the derivative from zero to the first
recorded data point will be much higher than subsequent derivatives,
making a plot where it's hard to see changes after that.

A future commit will gather per-process IO stats, but gopsutil only
provides that on Linux.

Release note (admin ui change): gather disk and network IO statistics and
save them as time series.
@vilterp
Copy link
Contributor Author

vilterp commented Jul 23, 2018

TFTR! (Reviewable/GH still thinks you haven't approved… Must be that new disposition stuff…)

bors r+

@couchand couchand dismissed their stale review July 23, 2018 15:15

reviewable

@craig
Copy link
Contributor

craig bot commented Jul 23, 2018

Build failed (retrying...)

craig bot pushed a commit that referenced this pull request Jul 23, 2018
27137: server, build: add time series for disk and net IO stats r=vilterp a=vilterp

Use new library `gopsutil` to gather IO stats from the host kernel and save them as time series.

Caveat: these counters are for the entire host over its lifetime, not the cockroach process's lifetime. As such, the first value cockroach records may be very high, and if we plot these charts as derivatives, the derivative from zero to the first recorded data point will be much higher than subsequent derivatives, making a plot where it's hard to see changes after that.

A future commit will gather per-process IO stats, but gopsutil only provides that on Linux.

Release note (admin ui change): gather disk and network IO statistics and
save them as time series.

Co-authored-by: Pete Vilter <[email protected]>
@craig
Copy link
Contributor

craig bot commented Jul 23, 2018

Build succeeded

@craig craig bot merged commit a20b48f into cockroachdb:master Jul 23, 2018
@vilterp vilterp deleted the host-io-stats branch July 23, 2018 16:11
craig bot pushed a commit that referenced this pull request Jul 23, 2018
27626: ui: add hardware metrics to new "Hardware" dashboard r=vilterp a=vilterp

First two commits are from #27137; that PR will go in first.

![image](https://user-images.githubusercontent.com/7341/42787591-2166749e-8929-11e8-886b-fa89ff580d88.png)

TODO
- [ ] tooltips
- [ ] format CPU percentage (currently 100% of one core = 1)


Co-authored-by: Pete Vilter <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants