rpc: don't close gRPC connections on heartbeat timeouts #14424

Merged (1 commit) on Apr 7, 2017


andreimatei
Contributor

@andreimatei andreimatei commented Mar 28, 2017

Fixes #13989

Before this patch, the rpc.Context would perform heartbeats (a dedicated
RPC) to see if a connection is healthy. If the heartbeats failed, the
connection was closed (causing in-flight RPCs to fail) and the node was
marked as unhealthy.
These heartbeats, being regular RPCs, were subject to gRPC's flow
control. They were therefore easily blocked by other large RPCs, in
particular by large DistSQL streams, which made them an unreliable
signal of connection health.

This patch moves to using gRPC's internal HTTP2 ping frames for checking
conn health. These are not subject to flow control. The grpc
transport-level connection is closed when they fail (and so in-flight
RPCs still fail), but otherwise gRPC reconnects transparently.
Heartbeats stay for the other current uses - clock skew detection and
node health marking. Marking a node as unhealthy is debatable, given the
shortcomings of these RPCs. However, this marking currently doesn't have
big consequences - it only affects the order in which replicas are tried
when a leaseholder is unknown.
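
For readers less familiar with the gRPC side, here is a minimal sketch of how a client can turn on HTTP/2 keepalive pings through dial options. The helper name keepaliveDialOptions, the durations, and the PermitWithoutStream setting are assumptions made for illustration; they are not the values or the code this patch uses. grpc.WithKeepaliveParams and keepalive.ClientParameters are the grpc-go names quoted in the review below.

```go
package main

import (
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// keepaliveDialOptions is a hypothetical helper that returns dial options
// enabling transport-level keepalive pings on a client connection.
func keepaliveDialOptions(interval, timeout time.Duration) []grpc.DialOption {
	return []grpc.DialOption{
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			// Ping the server after this much time without activity.
			Time: interval,
			// Close the transport if no activity is seen within this
			// time after a ping was sent.
			Timeout: timeout,
			// Assumption for this sketch: keep pinging even when no
			// RPCs are in flight.
			PermitWithoutStream: true,
		}),
	}
}

func main() {
	opts := keepaliveDialOptions(3*time.Second, 3*time.Second)
	fmt.Println("dial options configured:", len(opts))
}
```

The options would be passed to the dial call alongside the other options the rpc.Context already builds.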

cc @petermattis @bdarnell



@andreimatei
Contributor Author

By popular demand, I've extracted this from #14376

@tamird
Contributor

tamird commented Mar 28, 2017

This should also close:

#13734
#13838
#13886


Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks pending.


pkg/rpc/context.go, line 239 at r1 (raw file):

		}

		dialOpts := make([]grpc.DialOption, 0, 2+len(opts))

s/2/3/ or else remove this preallocation completely.
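
To illustrate the point about the capacity hint, here is a small sketch in which strings stand in for grpc.DialOption. The option names are placeholders, not the options the real code appends. When the hint matches the number of appends, the slice is never reallocated.

```go
package main

import "fmt"

// buildOpts mirrors the shape of the reviewed code: a fixed number of
// options is appended first, followed by the caller's extra options.
// The strings are placeholders for real dial options.
func buildOpts(extra []string) []string {
	const fixed = 3 // must match the number of fixed appends below
	opts := make([]string, 0, fixed+len(extra))
	opts = append(opts, "creds")
	opts = append(opts, "backoff")
	opts = append(opts, "keepalive")
	opts = append(opts, extra...)
	return opts
}

func main() {
	opts := buildOpts([]string{"a", "b"})
	// len and cap are equal because the hint was exact.
	fmt.Println(len(opts), cap(opts))
}
```

If the constant drifts out of sync with the appends the code still works, it just reallocates, which is why dropping the preallocation entirely is also a reasonable choice.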


pkg/rpc/context.go, line 243 at r1 (raw file):

		dialOpts = append(dialOpts, grpc.WithBackoffMaxDelay(maxBackoff))
		dialOpts = append(dialOpts, grpc.WithKeepaliveParams(
			keepalive.ClientParameters{

move this up a line for symmetry with all the closers })) at the end. This will also fix this weird looking indentation.


pkg/rpc/context.go, line 244 at r1 (raw file):

		dialOpts = append(dialOpts, grpc.WithKeepaliveParams(
			keepalive.ClientParameters{
				// Send periodic pings on the connection.

isn't there a comment on the struct definition? this comment doesn't seem helpful to me.


pkg/rpc/context.go, line 245 at r1 (raw file):

			keepalive.ClientParameters{
				// Send periodic pings on the connection.
				Time: 3 * time.Second,

base.NetworkTimeout here and below?


pkg/rpc/context.go, line 246 at r1 (raw file):

				// Send periodic pings on the connection.
				Time: 3 * time.Second,
				// If the pings don't get a response within the timeout, the connection

I think we need a very basic smoke test for this. Perhaps a simple GRPC server-client pair with a custom dialer that uses 2 net.Pipes so you can selectively start discarding bytes from the client, simulating a partition.

server <-net.Pipe-> test machinery <-net.Pipe-> client
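
A rough sketch of the proposed test machinery, using only the standard library, could look like this. It handles a single direction (client to server), and the names lossyRelay and demoPartition are made up for the example; the PartitionableConn that ended up in the patch buffers data rather than discarding it, so treat this only as an illustration of the two-pipe idea.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"sync/atomic"
)

// lossyRelay copies bytes from src to dst. While drop is non-zero, every
// chunk read from src is discarded and its size is reported on dropped.
type lossyRelay struct {
	drop    int32    // non-zero while the simulated partition is active
	dropped chan int // receives the size of each discarded chunk
}

func (r *lossyRelay) run(dst io.Writer, src io.Reader) {
	buf := make([]byte, 1024)
	for {
		n, err := src.Read(buf)
		if n > 0 {
			if atomic.LoadInt32(&r.drop) != 0 {
				r.dropped <- n
			} else if _, werr := dst.Write(buf[:n]); werr != nil {
				return
			}
		}
		if err != nil {
			return
		}
	}
}

// demoPartition wires client <-net.Pipe-> relay <-net.Pipe-> server, sends
// one message during a partition and one after it, and returns what the
// server read plus the number of bytes that were discarded.
func demoPartition() (string, int) {
	clientEnd, relayIn := net.Pipe()
	relayOut, serverEnd := net.Pipe()
	r := &lossyRelay{dropped: make(chan int)}
	go r.run(relayOut, relayIn)

	received := make(chan string, 1)
	go func() {
		buf := make([]byte, 16)
		n, _ := serverEnd.Read(buf)
		received <- string(buf[:n])
	}()

	atomic.StoreInt32(&r.drop, 1) // start the partition
	if _, err := clientEnd.Write([]byte("lost")); err != nil {
		return "", 0
	}
	n := <-r.dropped              // wait until the relay has discarded it
	atomic.StoreInt32(&r.drop, 0) // heal the partition
	if _, err := clientEnd.Write([]byte("kept")); err != nil {
		return "", n
	}
	got := <-received
	clientEnd.Close() // lets the relay goroutine exit on EOF
	return got, n
}

func main() {
	got, n := demoPartition()
	fmt.Printf("server saw %q after %d bytes were dropped\n", got, n)
}
```

Waiting on the dropped channel before healing the partition keeps the example deterministic without any sleeps.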


Comments from Reviewable

@andreimatei andreimatei force-pushed the heartbeats-keepalive branch from 8e157b2 to 15fdae1 Compare March 30, 2017 22:49
@andreimatei
Contributor Author

I'll stress #13886 to see if it fixes it. The other 2 are now independently closed.


Review status: 0 of 4 files reviewed at latest revision, 5 unresolved discussions, some commit checks pending.


pkg/rpc/context.go, line 239 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/2/3/ or else remove this preallocation completely.

Done.


pkg/rpc/context.go, line 243 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

move this up a line for symmetry with all the closers })) at the end. This will also fix this weird looking indentation.

Done but I find this worse


pkg/rpc/context.go, line 244 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

isn't there a comment on the struct definition? this comment doesn't seem helpful to me.

there is, but this comment serves as the description of the purpose of the whole struct. Otherwise the 2nd comment, that I think is more useful, would be more out of context.


pkg/rpc/context.go, line 245 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

base.NetworkTimeout here and below?

Done.


pkg/rpc/context.go, line 246 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

I think we need a very basic smoke test for this. Perhaps a simple GRPC server-client pair with a custom dialer that uses 2 net.Pipes so you can selectively start discarding bytes from the client, simulating a partition.

server <-net.Pipe-> test machinery <-net.Pipe-> client

Done.



@andreimatei andreimatei force-pushed the heartbeats-keepalive branch from 15fdae1 to 08244a7 Compare March 30, 2017 23:33
@tamird
Contributor

tamird commented Mar 31, 2017

This is looking good. Will finish the review tomorrow.


Reviewed 3 of 4 files at r2.
Review status: 3 of 4 files reviewed at latest revision, 22 unresolved discussions, some commit checks failed.


pkg/rpc/context.go, line 262 at r2 (raw file):

			// Send periodic pings on the connection.
			Time: base.NetworkTimeout,
			// If the pings don't get a response within the timeout, the connection

I think this comment is incorrect; the connection won't be closed in the GRPC sense.


pkg/testutils/net.go, line 35 at r2 (raw file):

and
// pipes every read and write to it.

what does "pipe" mean here?


pkg/testutils/net.go, line 37 at r2 (raw file):

// pipes every read and write to it.
//
// While a direction is partitioned, data send in that direction doesn't flow. A

s/send/sent/


pkg/testutils/net.go, line 39 at r2 (raw file):

// While a direction is partitioned, data send in that direction doesn't flow. A
// write done while partitioned may block. Data written to the conn after the
// partition has been established is not delivered to the remote party until the

this comment is generally very unclear - is data dropped or not? is there a delay or is data dropped completely?


pkg/testutils/net.go, line 46 at r2 (raw file):

type PartitionableConn struct {
	// We embed a net.Conn so that we inherit the interface. Note that we override
	// Read() and Write() though.

s/ though//

add an empty comment line after this to preserve the paragraph.


pkg/testutils/net.go, line 93 at r2 (raw file):

		c.mu.Unlock()
		if err := c.clientConn.Close(); err != nil {
			log.Fatalf(context.TODO(), "unexpected error closing internal pipe: %s", err)

let's avoid log.Fatal here - if we want to be good about error reporting we should pass a channel to this constructor, or just hang one on the connection and export it.


pkg/testutils/net.go, line 148 at r2 (raw file):

	c.mu.Lock()
	if !c.mu.c2sPartitioned {
		panic("not partitioned")

there's no equivalent panic in Partition, why is this one here?


pkg/testutils/net.go, line 174 at r2 (raw file):

is part of the net.Conn interface.

"implements net.Conn."


pkg/testutils/net.go, line 220 at r2 (raw file):

		nr, err := args.src.Read(buf)

		args.mu.Lock()

there is no need to pass the mutex, bool, or sync.Cond; just pass a closure that returns when it's time to unblock.

also, you have the exact same pattern in the test; seems like you could pass a closure here and reuse the function in the test.
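
Passing a closure instead of the mutex, flag, and sync.Cond might look roughly like the sketch below. The names gate, waitOpen, and copyGated are hypothetical and not taken from the patch. The sketch also ignores the requirement, raised later in this review, that the partition check be atomic with consuming buffered data.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
	"sync"
)

// gate tracks whether traffic is currently partitioned.
type gate struct {
	mu          sync.Mutex
	cond        *sync.Cond
	partitioned bool
}

func newGate(partitioned bool) *gate {
	g := &gate{partitioned: partitioned}
	g.cond = sync.NewCond(&g.mu)
	return g
}

// waitOpen blocks until the gate is not partitioned.
func (g *gate) waitOpen() {
	g.mu.Lock()
	for g.partitioned {
		g.cond.Wait()
	}
	g.mu.Unlock()
}

func (g *gate) set(partitioned bool) {
	g.mu.Lock()
	g.partitioned = partitioned
	g.mu.Unlock()
	g.cond.Broadcast()
}

// copyGated copies src to dst, calling wait before every write. The copy
// loop knows nothing about mutexes or condition variables.
func copyGated(dst io.Writer, src io.Reader, wait func()) error {
	buf := make([]byte, 4)
	for {
		nr, rerr := src.Read(buf)
		if nr > 0 {
			wait()
			if _, werr := dst.Write(buf[:nr]); werr != nil {
				return werr
			}
		}
		if rerr == io.EOF {
			return nil
		}
		if rerr != nil {
			return rerr
		}
	}
}

// demoGate starts a copy while partitioned, lifts the partition, and
// returns what reached the destination.
func demoGate() string {
	g := newGate(true)
	var out bytes.Buffer
	done := make(chan error, 1)
	go func() {
		done <- copyGated(&out, strings.NewReader("hello world"), g.waitOpen)
	}()
	g.set(false)
	if err := <-done; err != nil {
		return "error: " + err.Error()
	}
	return out.String()
}

func main() {
	fmt.Println(demoGate())
}
```

Because the loop only depends on a func(), a test could pass its own blocking function and reuse the same copy routine.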


pkg/testutils/net.go, line 227 at r2 (raw file):

		if nr > 0 {
			nw, ew := args.dst.Write(buf[0:nr])

is this code lifted from somewhere? reference it.


pkg/testutils/net_test.go, line 33 at r2 (raw file):

	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/netutil"
	"github.com/pkg/errors"

nit: group this with "context" above.


pkg/testutils/net_test.go, line 48 at r2 (raw file):

		netutil.FatalIfUnexpected(err)
		if err != nil {
			return

return the error.


pkg/testutils/net_test.go, line 58 at r2 (raw file):

	if err != nil {
		log.Warning(context.TODO(), err)

no reason not to return this error. discard it in the caller if you want, but better yet, send it on a channel.


pkg/testutils/net_test.go, line 126 at r2 (raw file):

		t.Fatalf("expecting: %q , got %q", exp, got)
	}
	pConn.Close()

defer?


pkg/testutils/net_test.go, line 132 at r2 (raw file):

	defer leaktest.AfterTest(t)()
	if testing.Short() {
		t.Skip("short flag")

is this test actually slow?


pkg/testutils/net_test.go, line 169 at r2 (raw file):

	// In the background, the client waits on a read.
	clientDoneCh := make(chan error)
	go func() {

go func() {
  clientDoneCh <- func() error {
   ...
  }()
}()
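
Spelled out, the suggested pattern might look like the sketch below. The helper names expectLine and check are invented for the example. Because the inner function has an error return type, the compiler insists that every path returns a value, so the goroutine performs exactly one send no matter which branch is taken.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// expectLine reads one line from r in a goroutine and reports the outcome
// on the returned channel. All early exits are plain returns inside the
// inner function; the single send wraps the call.
func expectLine(r io.Reader, want string) <-chan error {
	done := make(chan error, 1)
	go func() {
		done <- func() error {
			got, err := bufio.NewReader(r).ReadString('\n')
			if err != nil {
				return err
			}
			if got != want {
				return fmt.Errorf("expected %q, got %q", want, got)
			}
			return nil
		}()
	}()
	return done
}

// check is a small driver so the pattern can be exercised with strings.
func check(input, want string) error {
	return <-expectLine(strings.NewReader(input), want)
}

func main() {
	fmt.Println(check("ping\n", "ping\n"))
	fmt.Println(check("pong\n", "ping\n"))
	fmt.Println(check("no newline", "ping\n"))
}
```

The main test goroutine can then receive from the channel and call t.Fatal itself, which also avoids failing the test from a background goroutine.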

pkg/testutils/net_test.go, line 183 at r2 (raw file):

	timerDoneCh := make(chan error)
	time.AfterFunc(3*time.Millisecond, func() {

looks likely to be flaky =/


pkg/testutils/net_test.go, line 184 at r2 (raw file):

	timerDoneCh := make(chan error)
	time.AfterFunc(3*time.Millisecond, func() {
		var err error

i think you can use the same inner func() error pattern here.


pkg/testutils/net_test.go, line 211 at r2 (raw file):

	}

	pConn.Close()

defer?


pkg/testutils/net_test.go, line 259 at r2 (raw file):

	clientDoneCh := make(chan error)
	go func() {
		got, err := bufio.NewReader(pConn).ReadString('\n')

inner func() error


pkg/testutils/net_test.go, line 275 at r2 (raw file):

		select {
		case err := <-clientDoneCh:
			t.Fatalf("unexpected reply while partitioned: %s", err)

no bueno, this is not the main goroutine.


pkg/testutils/net_test.go, line 289 at r2 (raw file):

	}

	pConn.Close()

defer?



@cuongdo
Contributor

cuongdo commented Apr 3, 2017

Let's get this in soon, because this is blocking serious load testing for distsql.

@tamird
Contributor

tamird commented Apr 3, 2017

@cuongdo if this is blocking you, i think you can ship a custom binary with https://github.com/cockroachdb/cockroach/blob/56d6ed4/pkg/rpc/context.go#L376:L378 commented out or removed.

@andreimatei andreimatei force-pushed the heartbeats-keepalive branch 2 times, most recently from bf32eef to d510c20 Compare April 4, 2017 22:13
@andreimatei
Contributor Author

Review status: 0 of 4 files reviewed at latest revision, 22 unresolved discussions, some commit checks pending.


pkg/rpc/context.go, line 262 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

I think this comment is incorrect; the connection won't be closed in the GRPC sense.

clarified


pkg/testutils/net.go, line 35 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

and
// pipes every read and write to it.

what does "pipe" mean here?

fixed


pkg/testutils/net.go, line 37 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/send/sent/

Done.


pkg/testutils/net.go, line 39 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this comment is generally very unclear - is data dropped or not? is there a delay or is data dropped completely?

see now pls


pkg/testutils/net.go, line 46 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/ though//

add an empty comment line after this to preserve the paragraph.

Done.


pkg/testutils/net.go, line 93 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

let's avoid log.Fatal here - if we want to be good about error reporting we should pass a channel to this constructor, or just hang one on the connection and export it.

I don't know what to do with these errors. I don't particularly want "to be good" - but linter.
In the case of clientConn, there really should be no error cause pipes don't throw errors. We created clientConn so we know the deal.

In the case of serverConn...

I'm not sure how a client would use a channel exported by the PartitionableConn and what its semantics would be. I'd rather we log and ignore as we do now.


pkg/testutils/net.go, line 148 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

there's no equivalent panic in Partition, why is this one here?

I was thinking that it makes more sense for Partition to be idempotent than for Unpartition. But on second thought I put panics everywhere.


pkg/testutils/net.go, line 174 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

is part of the net.Conn interface.

"implements net.Conn."

I've been using "is part of". Saying that a method "implements" an interface seems to me to be an abuse of the language that I don't like much.


pkg/testutils/net.go, line 220 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

there is no need to pass the mutex, bool, or sync.Cond; just pass a closure that returns when it's time to unblock.

also, you have the exact same pattern in the test; seems like you could pass a closure here and reuse the function in the test.

good suggestion with the callback; done.

I don't like the idea of sharing this method with the test. One difference is the handling of the EOF. It could be adapted, but I think this sharing would only make both uses harder to read and also the two implementations might diverge; we're not doing very general things here.


pkg/testutils/net.go, line 227 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

is this code lifted from somewhere? reference it.

Done. referenced io.copy above


pkg/testutils/net_test.go, line 33 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

nit: group this with "context" above.

Done.


pkg/testutils/net_test.go, line 48 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

return the error.

I think what I wanted here is to not look at the error after the FatalIfUnexpected call. And to leave the FatalIU call here, not in the caller. Right?


pkg/testutils/net_test.go, line 58 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

no reason not to return this error. discard it in the caller if you want, but better yet, send it on a channel.

Done.


pkg/testutils/net_test.go, line 126 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

defer?

Done.


pkg/testutils/net_test.go, line 132 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

is this test actually slow?

I think it's a good idea to mark tests with waits in them like this going forward...
This one has a time.After.


pkg/testutils/net_test.go, line 169 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

go func() {
  clientDoneCh <- func() error {
   ...
  }()
}()

Done.


pkg/testutils/net_test.go, line 183 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

looks likely to be flaky =/

if the time is too small, the test will pass and we won't have tested what we want. But it won't fail.


pkg/testutils/net_test.go, line 184 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

i think you can use the same inner func() error pattern here.

but here there's a single send on the channel; don't see what the pattern would give us exactly.
Plus it's not my favourite pattern - I find the inline function harder to read.


pkg/testutils/net_test.go, line 211 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

defer?

Done.


pkg/testutils/net_test.go, line 259 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

inner func() error

Done.


pkg/testutils/net_test.go, line 275 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

no bueno, this is not the main goroutine.

Done.


pkg/testutils/net_test.go, line 289 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

defer?

Done.



@tamird
Contributor

tamird commented Apr 5, 2017

Reviewed 3 of 4 files at r3.
Review status: 3 of 4 files reviewed at latest revision, 22 unresolved discussions.


pkg/testutils/net.go, line 39 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

see now pls

still unclear; what happens to the internal buffer?


pkg/testutils/net.go, line 93 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't know what to do with these errors. I don't particularly want "to be good" - but linter.
In the case of clientConn, there really should be no error cause pipes don't throw errors. We created clientConn so we know the deal.

In the case of serverConn...

I'm not sure how a client would use a channel exported by the PartitionableConn and what its semantics would be. I'd rather we log and ignore as we do now.

Is it really so hard to do the right thing? Just close over a channel, it's a few lines of code.


pkg/testutils/net.go, line 33 at r3 (raw file):

// bufferSize is the size of the buffer used by PartitionableConn. Writes to a
// partitioned connection will block after the buffer gets filled.
const bufferSize int = 16 * 1024

remove int, and use 16 << 10 // 16 KB to follow convention


pkg/testutils/net.go, line 45 at r3 (raw file):

//
// While a direction is partitioned, data sent in that direction doesn't flow. A
// write done while partitioned may after an internal buffer gets filled. Data

missing verb. "block"?


pkg/testutils/net.go, line 45 at r3 (raw file):

//
// While a direction is partitioned, data sent in that direction doesn't flow. A
// write done while partitioned may after an internal buffer gets filled. Data

s/a write done/a write call/


pkg/testutils/net.go, line 48 at r3 (raw file):

// written to the conn after the partition has been established is not delivered
// to the remote party until the partition is lifted. Data written before the
// partition is established may or may not be blocked by a partition; use

blocked? this is incoherent in light of earlier description of blocking.


pkg/testutils/net.go, line 71 at r3 (raw file):

		s2cPartitioned bool

		c2sBuffer *buf

make these values


pkg/testutils/net.go, line 86 at r3 (raw file):

	data     []byte
	capacity int
	closed   bool

isn't this equivalent to closedErr != nil?


pkg/testutils/net.go, line 88 at r3 (raw file):

	closed   bool
	// The error that caused the buffer to be closed.
	closedErr error

how about just err?


pkg/testutils/net.go, line 97 at r3 (raw file):

}

func newBuf(name string, capacity int, mu *syncutil.Mutex) *buf {

why does this return a pointer? the sync.Cond and mutex are both pointers.

s/new/make/


pkg/testutils/net.go, line 112 at r3 (raw file):

//
// The number of bytes written is returned.
func (b *buf) addData(data []byte) (int, error) {

why isn't this called Write? that would implement io.Writer.
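
As a sketch of what the rename buys, here is a tiny bounded buffer whose Write method satisfies io.Writer, so standard helpers such as io.WriteString accept it directly. The type boundedBuf and the helper fill are hypothetical, and unlike the buffer in the patch this one returns io.ErrShortWrite when full instead of blocking.

```go
package main

import (
	"fmt"
	"io"
)

// boundedBuf is a simplified, non-blocking stand-in for the reviewed buffer.
type boundedBuf struct {
	data     []byte
	capacity int
}

// Compile-time check that *boundedBuf implements io.Writer.
var _ io.Writer = (*boundedBuf)(nil)

// Write appends as much of p as fits and returns the number of bytes
// accepted. A partial write is reported with io.ErrShortWrite.
func (b *boundedBuf) Write(p []byte) (int, error) {
	room := b.capacity - len(b.data)
	if room < len(p) {
		b.data = append(b.data, p[:room]...)
		return room, io.ErrShortWrite
	}
	b.data = append(b.data, p...)
	return len(p), nil
}

// fill writes s into a fresh buffer through the io.Writer interface.
func fill(capacity int, s string) (int, error) {
	b := &boundedBuf{capacity: capacity}
	return io.WriteString(b, s)
}

func main() {
	n, err := fill(4, "hello")
	fmt.Println(n, err)
	n, err = fill(8, "hello")
	fmt.Println(n, err)
}
```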


pkg/testutils/net.go, line 132 at r3 (raw file):

}

// errEAgain is raised when a buf.read() call was blocked when bug.signal() was

raised? this isn't C++.

what is bug.signal?


pkg/testutils/net.go, line 169 at r3 (raw file):

}

func (b *buf) signalLocked() {

remove this inconsistently-used method.


pkg/testutils/net.go, line 319 at r3 (raw file):

// readToBuffer copies data from src to buf.
func (c *PartitionableConn) readToBuffer(src net.Conn, buf *buf) error {

why is this a method on PartitionableConn rather than on buf?

This should be written in the style of https://godoc.org/bytes#Buffer.ReadFrom


pkg/testutils/net.go, line 349 at r3 (raw file):

// partitioned. It needs to be called under src.Mutex, as the check needs to be
// done atomically with consuming the buffer's data.
func (c *PartitionableConn) copyFromBuffer(

same as above, except this should emulate https://godoc.org/bytes#Buffer.WriteTo

if you hang waitForNoPartition on the buf, you can exactly emulate the above API.


pkg/testutils/net.go, line 393 at r3 (raw file):

		tasks <- c.copyFromBuffer(buf, dst, waitForNoPartitionLocked)
	}()
	err := <-tasks

why not return the channel and let the caller read from it?

also you can just write

for i := 0; i < cap(tasks); i++ {
  if err := <-tasks; err != nil {
    return err
  }
}
return nil

because the channel is buffered.
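
For context, a self-contained version of that loop could look like the sketch below. The function runTasks and the two sample tasks are made up for the example. Since the channel has room for every result, returning on the first error does not leave the remaining goroutines blocked on their sends.

```go
package main

import (
	"errors"
	"fmt"
)

// runTasks runs every function in its own goroutine and returns the first
// non-nil error received, or nil once all tasks have reported success.
func runTasks(fns ...func() error) error {
	tasks := make(chan error, len(fns)) // one slot per task
	for _, fn := range fns {
		fn := fn // per-iteration copy, needed before Go 1.22
		go func() { tasks <- fn() }()
	}
	for i := 0; i < cap(tasks); i++ {
		if err := <-tasks; err != nil {
			return err
		}
	}
	return nil
}

func okTask() error   { return nil }
func failTask() error { return errors.New("copy failed") }

func main() {
	fmt.Println(runTasks(okTask, okTask))
	fmt.Println(runTasks(okTask, failTask))
}
```

The trade-off raised in the replies still applies: this version may return before every task has finished, whereas receiving exactly len(fns) results unconditionally would wait for all of them.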


pkg/testutils/net_test.go, line 132 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think it's a good idea to mark tests with waits in them like this going forward...
This one has a time.After.

How long does it take? If it's under 1s then remove this.


pkg/testutils/net_test.go, line 184 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

but here there's a single send on the channel; don't see what the pattern would give us exactly.
Plus it's not my favourite pattern - I find the inline function harder to read.

The pattern gives you compiler help, but suit yourself.


pkg/testutils/net_test.go, line 54 at r3 (raw file):

		}
		go func() {
			if err := handleEchoConnection(conn, serverSideCh); err != nil {

why does this need a goroutine? seems to me it's perfectly ok for this server to be "single-threaded"


pkg/testutils/net_test.go, line 285 at r3 (raw file):

		select {
		case err := <-clientDoneCh:
			t.Errorf("unexpected reply while partitioned: %s", err)

should be %v since this can be nil
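
A quick illustration of the difference, using two throwaway helpers (formatS and formatV are names chosen for this example): with a nil error, %s produces fmt's bad-verb marker while %v prints a plain <nil>.

```go
package main

import "fmt"

// formatS formats err with the %s verb.
func formatS(err error) string { return fmt.Sprintf("%s", err) }

// formatV formats err with the %v verb.
func formatV(err error) string { return fmt.Sprintf("%v", err) }

func main() {
	var err error // nil, as can happen on the clientDoneCh path
	fmt.Println(formatS(err)) // %!s(<nil>)
	fmt.Println(formatV(err)) // <nil>
}
```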


pkg/testutils/net_test.go, line 304 at r3 (raw file):

func TestPartitionableConnBuffering(t *testing.T) {
	defer leaktest.AfterTest(t)()
	if testing.Short() {

remove.


pkg/testutils/net_test.go, line 318 at r3 (raw file):

	serverDoneCh := make(chan error)
	go func() {
		conn, err := ln.Accept()

why isn't this in the func() error below? it's no value to continue after an error here.



@andreimatei
Contributor Author

thanks for the review bro! Let's get this in :)


Review status: 3 of 4 files reviewed at latest revision, 21 unresolved discussions.


pkg/testutils/net.go, line 39 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

still unclear; what happens to the internal buffer?

Done.


pkg/testutils/net.go, line 93 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Is it really so hard to do the right thing? Just close over a channel, it's a few lines of code.

so what exactly would we send on that channel? Just the errors from c.serverConn.Close()? What exactly is the point of that? And wouldn't it be confusing with the existence of c.mu.err?
I'm not even sure if errors from conn.Close() are expected and I feel that, if I would send them from some channel, I should understand that. These errors should be ignored I think; but I can't because of the linter. This is the closest thing.


pkg/testutils/net.go, line 33 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

remove int, and use 16 << 10 // 16 KB to follow convention

Done.


pkg/testutils/net.go, line 45 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

missing verb. "block"?

Done.


pkg/testutils/net.go, line 45 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/a write done/a write call/

Done.


pkg/testutils/net.go, line 48 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

blocked? this is incoherent in light of earlier description of blocking.

I believe it's coherent. Clarified a bit more; see now.


pkg/testutils/net.go, line 71 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

make these values

Done.


pkg/testutils/net.go, line 86 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

isn't this equivalent to closedErr != nil?

well this buffer should permit closing without an error. Even though we do only close it with one. I'd leave it.


pkg/testutils/net.go, line 88 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

how about just err?

I'd leave it


pkg/testutils/net.go, line 97 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why does this return a pointer? the sync.Cond and mutex are both pointers.

s/new/make/

but but it's not copiable. Ok done.


pkg/testutils/net.go, line 112 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why isn't this called Write? that would implement io.Writer.

Done.


pkg/testutils/net.go, line 132 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

raised? this isn't C++.

what is bug.signal?

done.

it's supposed to be buf :)


pkg/testutils/net.go, line 169 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

remove this inconsistently-used method.

why do you say it's inconsistently used?
I've added a comment, maybe it helps.

Perhaps you're observing that the cond vars are signaled directly sometimes, which has the same effect as calling this. But this method is specifically about the pconn interrupting the buffer, where other uses of the cond var in the pconn are about the pconn communicating to itself. Similarly for direct signals of the cvar inside buf.


pkg/testutils/net.go, line 319 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why is this a method on PartitionableConn rather than on buf?

This should be written in the style of https://godoc.org/bytes#Buffer.ReadFrom

done; good suggestion.


pkg/testutils/net.go, line 349 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

same as above, except this should emulate https://godoc.org/bytes#Buffer.WriteTo

if you hang waitForNoPartition on the buf, you can exactly emulate the above API.

this one I'd leave on PConn. Given the current layout, waitForNoPartition is not something that the buffer should be concerned with (other than exposing the signalLocked() interface, which seems more general).

I don't care too much about respecting some other buffer interface, since this buffer here is thread-safe and has some particular semantics.


pkg/testutils/net.go, line 393 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why not return the channel and let the caller read from it?

also you can just write

for i := 0; i < cap(tasks); i++ {
  if err := <-tasks; err != nil {
    return err
  }
}
return nil

because the channel is buffered.

I don't like returning the channel. This method is better being sync. No need to burden the caller with understanding that there's several internal tasks.

I changed it to what you were suggesting but then changed it back. Mine is more pedestrian and readable. It also always waits for both tasks to be done before returning; which is probably a good thing


pkg/testutils/net_test.go, line 132 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

How long does it take? If it's under 1s then remove this.

1s?!?!
I've removed this from all the tests. But why don't you like the more liberal use of this flag? Particularly for tests that are not very important or likely to be needed for validating rando changes.


pkg/testutils/net_test.go, line 54 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why does this need a goroutine? seems to me it's perfectly ok for this server to be "single-threaded"

let it support more of them. It's more idiomatic to write the Accept in a loop too, and leave the closing of the listener to the caller, I think.


pkg/testutils/net_test.go, line 285 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

should be %v since this can be nil

Done.


pkg/testutils/net_test.go, line 304 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

remove.

Done.


pkg/testutils/net_test.go, line 318 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why isn't this in the func() error below? it's no value to continue after an error here.

Done.



@andreimatei andreimatei force-pushed the heartbeats-keepalive branch from d510c20 to f1d337a Compare April 6, 2017 01:13
@andreimatei
Contributor Author

forgot to push, sorry. pushed now.


Review status: 1 of 4 files reviewed at latest revision, 21 unresolved discussions.



@tamird
Contributor

tamird commented Apr 6, 2017

Reviewed 2 of 3 files at r4.
Review status: 3 of 4 files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


pkg/testutils/net.go, line 93 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

so what exactly would we send on that channel? Just the errors from c.serverConn.Close()? What exactly is the point of that? And wouldn't it be confusing with the existence of c.mu.err?
I'm not even sure if errors from conn.Close() are expected and I feel that, if I would send them from some channel, I should understand that. These errors should be ignored I think; but I can't because of the linter. This is the closest thing.

You're Fataling on the error, that's nowhere close to ignoring the error.


pkg/testutils/net.go, line 88 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'd leave it

Why? we have seen again and again that naming errors anything other than err is error-prone (no pun intended).


pkg/testutils/net.go, line 169 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

why do you say it's inconsistently used?
I've added a comment, maybe it helps.

Perhaps you're observing that the cond vars are signaled directly sometimes, which has the same effect as calling this. But this method is specifically about the pconn interrupting the buffer, where other uses of the cond var in the pconn are about the pconn communicating to itself. Similarly for direct signals of the cvar inside buf.

this is the definition of over engineering - this buffer will not be used outside this package and anyone who isn't you will either use this method incorrectly or signal the condvar when they should use the method.


pkg/testutils/net.go, line 393 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't like returning the channel. This method is better being sync. No need to burden the caller with understanding that there's several internal tasks.

I changed it to what you were suggesting but then changed it back. Mine is more pedestrian and readable. It also always waits for both tasks to be done before returning; which is probably a good thing

Then why is your channel buffered?

Also, again, it is not a good idea to name errors anything other than err.


pkg/testutils/net.go, line 134 at r4 (raw file):

// errEAgain is returned by buf.read() when the read was blocked at the time
// when buf.signal() was called - the signal interrupted the read. The caller is

signal is not a thing?


pkg/testutils/net.go, line 326 at r4 (raw file):

}

// ReadFrom copies data from src into the buffer until src.Read() return an

returns


pkg/testutils/net.go, line 328 at r4 (raw file):

// ReadFrom copies data from src into the buffer until src.Read() return an
// error (e.g. io.EOF).
func (b *buf) ReadFrom(src net.Conn) error {

make it io.Reader?
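
Taking an io.Reader would also allow the method to match the standard io.ReaderFrom signature. A minimal sketch, with the hypothetical type sink standing in for the patch's buffer and without its locking or capacity limits:

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// sink is a simplified stand-in for the reviewed buffer type.
type sink struct{ data []byte }

// Compile-time check that *sink implements io.ReaderFrom.
var _ io.ReaderFrom = (*sink)(nil)

// ReadFrom reads from r until EOF or error and returns the number of bytes
// read. As the io.ReaderFrom contract asks, io.EOF is not reported.
func (s *sink) ReadFrom(r io.Reader) (int64, error) {
	var total int64
	buf := make([]byte, 512)
	for {
		n, err := r.Read(buf)
		s.data = append(s.data, buf[:n]...)
		total += int64(n)
		if err == io.EOF {
			return total, nil
		}
		if err != nil {
			return total, err
		}
	}
}

// drain feeds a string through ReadFrom; any io.Reader, including a
// net.Conn, could be passed instead.
func drain(input string) (int64, string) {
	var s sink
	n, _ := s.ReadFrom(strings.NewReader(input))
	return n, string(s.data)
}

func main() {
	n, got := drain("partitioned")
	fmt.Println(n, got)
}
```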


pkg/testutils/net_test.go, line 54 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

let it support more of them. It's more idiomatic to write the Accept in a loop too, and leave the closing of the listener to the caller, I think.

Over engineering again, in my opinion.


pkg/testutils/net_test.go, line 285 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Done.

Doesn't look done.



@tamird
Contributor

tamird commented Apr 6, 2017

Reviewed 1 of 3 files at r4.
Review status: all files reviewed at latest revision, 19 unresolved discussions, all commit checks successful.


pkg/rpc/context_test.go, line 585 at r4 (raw file):

		log.AmbientContext{}, testutils.NewNodeTestBaseContext(), clock, stopper)
	// Disable automatic heartbeats. We'll send them by hand.
	clientCtx.heartbeatInterval = time.Hour

nit: math.MaxInt64


pkg/rpc/context_test.go, line 595 at r4 (raw file):

		grpc.WithDialer(
			func(addr string, timeout time.Duration) (net.Conn, error) {
				if atomic.LoadInt32(&firstConn) == 0 {

this code is racy. use CompareAndSwapInt32(&firstConn, 0, 1)


pkg/rpc/context_test.go, line 596 at r4 (raw file):

			func(addr string, timeout time.Duration) (net.Conn, error) {
				if atomic.LoadInt32(&firstConn) == 0 {
					// If we allow gRPC to open a 2nd transport connection, the our RPCs

s/the/then/


pkg/rpc/context_test.go, line 598 at r4 (raw file):

					// If we allow gRPC to open a 2nd transport connection, the our RPCs
					// might succeed if they're sent on that one.
					return nil, errors.Errorf("the test only allows one connection")

i think this is wrong; instead of returning an error, this should just block on a channel that you close at the end of the test.


pkg/rpc/context_test.go, line 605 at r4 (raw file):

					Timeout: timeout,
				}
				conn, err := dialer.Dial("tcp", addr)

https://godoc.org/net#DialTimeout


pkg/rpc/context_test.go, line 613 at r4 (raw file):

				return transportConn, nil
			}),
		// Override the keepalive settings that the grpContext uses to more

what is a grpContext?


pkg/rpc/context_test.go, line 617 at r4 (raw file):

		grpc.WithKeepaliveParams(
			keepalive.ClientParameters{
				// The low timeout makes the connection very flaky for any RPC use,

"the low timeout" - is this referring to the default or to the value you're setting here?

please rewrite this comment.


pkg/rpc/context_test.go, line 651 at r4 (raw file):

		// If the heartbeats didn't timeout, we're going to simulate a network
		// partition and then the heartbeats must timeout.
		log.Infof(context.TODO(), "test returning early; no partition done")

replace this with t.Skipf("test returning early; no partition done: %s", err)


pkg/rpc/context_test.go, line 664 at r4 (raw file):

	// We expect either of two errors which tests revealed that the RPC call might
	// return. We also allow

broken comment.


pkg/rpc/context_test.go, line 673 at r4 (raw file):

more realistically by not accepting new
// connections,
I've suggested a change above that would do this, I think.


Comments from Reviewable

@andreimatei andreimatei force-pushed the heartbeats-keepalive branch from f1d337a to 75b5692 Compare April 6, 2017 21:29
@andreimatei
Contributor Author

Review status: 1 of 4 files reviewed at latest revision, 19 unresolved discussions.


pkg/rpc/context_test.go, line 585 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

nit: math.MaxInt64

Done.


pkg/rpc/context_test.go, line 595 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this code is racy. use CompareAndSwapInt32(&firstConn, 0, 1)

I don't think it was racy, because gRPC is not supposed to open multiple conns at a time. If it does, then this test is in trouble. Which raises the question of why I made this an atomic in the first place.
But, yeah, I've switched to CompareAndSwap.
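
For readers following along, the difference can be sketched roughly as below. This is an illustration only, not the test's actual code: `claimFirst` and `countWinners` are hypothetical names. The point is that `atomic.CompareAndSwapInt32` checks and sets the flag in one step, so exactly one caller can win even if several race.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// claimFirst reports whether the caller is the first to claim the flag.
// CompareAndSwapInt32 checks and sets the flag in one atomic step, so at most
// one caller can ever observe true, even when calls are concurrent.
func claimFirst(flag *int32) bool {
	return atomic.CompareAndSwapInt32(flag, 0, 1)
}

// countWinners runs n concurrent claim attempts against a fresh flag and
// returns how many of them succeeded.
func countWinners(n int) int32 {
	var flag, winners int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if claimFirst(&flag) {
				atomic.AddInt32(&winners, 1)
			}
		}()
	}
	wg.Wait()
	return winners
}

func main() {
	fmt.Println("winners:", countWinners(100))
}
```

A separate `LoadInt32` followed by a later store would leave a window in which two callers both see 0; the single CAS closes that window.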


pkg/rpc/context_test.go, line 596 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/the/then/

Done.


pkg/rpc/context_test.go, line 598 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

i think this is wrong; instead of returning an error, this should just block on a channel that you close at the end of the test.

I don't think it makes much of a difference. Not sure why you said it's wrong. It doesn't matter how exactly we prevent grpc from opening new conns, it just matters that we do.
But perhaps this is a better simulation of a partition, so I've done it. Although note that this test is all about behavior of a single RPC call, not more generally about gRPC's behavior in the face of partitions.

In fact this is what I thought about doing in the first place but I didn't do it for two reasons:

  • to not suggest that it matters for the test
  • I thought that I might run into deadlocks with the order in which this channel needs to be closed versus the grpc conn. But apparently not.

pkg/rpc/context_test.go, line 605 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

https://godoc.org/net#DialTimeout

Done.


pkg/rpc/context_test.go, line 613 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

what is a grpContext?

Done.


pkg/rpc/context_test.go, line 617 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"the low timeout" - is this referring to the default or to the value you're setting here?

please rewrite this comment.

clarified that it's this one here and used the word "aggressive" to link to the comment above, which I've also clarified more.


pkg/rpc/context_test.go, line 651 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

replace this with t.Skipf("test returning early; no partition done: %s", err)

but the test has not been skipped. The test succeeded in demonstrating what it tries to demonstrate. As the comment above says. Is it not clear?


pkg/rpc/context_test.go, line 664 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

broken comment.

done and moved it above to the definition of the regex


pkg/rpc/context_test.go, line 673 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

more realistically by not accepting new
// connections,
I've suggested a change above that would do this, I think.

assuming you're talking about the suggestion to block when grpc tries to open a new conn - like I was saying there, "this" was already done. This comment was stale; I've updated it.
I don't think the test should test more stuff.


pkg/testutils/net.go, line 93 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

You're Fataling on the error, that's nowhere close to ignoring the error.

I removed the fatal. I would have done it from the beginning if I understood that that was the issue.
Although that's an error I'm confident cannot occur, so I think the assertion was a good thing.


pkg/testutils/net.go, line 88 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Why? we have seen again and again that naming errors anything other than err is error-prone (no pun intended).

Because the name is more suggestive. This member is a narrow thing; it's not an error encountered by the buffer while operating - those errors are returned to the parent. This is an error passed by the parent when closing the buffer with a specific purpose - wake up all the reads/writes and make them return this thing. I've added a comment to Close(), maybe that helps.
You'll perhaps say that the buffer should always close itself on any error, but that seems like an odd thing for a buffer to do. It's also not what bytes.Buffer does, and it's probably not what the ReaderFrom interface imagined.

The "never name them anything but err" thing... Have we learned this for member variables?


pkg/testutils/net.go, line 169 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this is the definition of over engineering - this buffer will not be used outside this package and anyone who isn't you will either use this method incorrectly or signal the condvar when they should use the method.

ok... I've removed it. But now, for example, PartitionC2S() does
c.mu.c2sBuffer.wait.Broadcast(). Should it do c.mu.c2sWaiter.Broadcast() instead? In other words, sometimes the PConn wants to communicate with the buffer, otherwise not. Without the method, this communication is more awkward; the method gave a good place for a comment.

It led to more awkwardness for the buffer communicating with itself, which I guess was your point, but I'm not sure we improved anything.


pkg/testutils/net.go, line 393 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Then why is your channel buffered?

Also, again, it is not a good idea to name errors anything other than err.

why wouldn't it be buffered? The tasks don't want to rendezvous with the parent, they just want to pass values to it. But I've removed the buffer anyway.

I think these lines of code are pretty safe.


pkg/testutils/net.go, line 134 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

signal is not a thing?

condVars have the Signal method.


pkg/testutils/net.go, line 326 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

returns

Done.


pkg/testutils/net.go, line 328 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

make it io.Reader?

Done.
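
As a rough illustration of why the narrower parameter type helps (a sketch only; `drain` and `drainString` are hypothetical names, not the methods in `pkg/testutils/net.go`): a function that only calls `Read` can accept any `io.Reader`, which lets a test feed it an in-memory reader instead of a real `net.Conn`.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// drain is a hypothetical stand-in for a ReadFrom-style method: it copies from
// src until src.Read returns an error, and returns the bytes read together
// with that terminating error.
func drain(src io.Reader) ([]byte, error) {
	var out []byte
	chunk := make([]byte, 4)
	for {
		n, err := src.Read(chunk)
		out = append(out, chunk[:n]...)
		if err != nil {
			return out, err
		}
	}
}

// drainString feeds s through drain using an in-memory reader and reports
// whether the read loop ended with io.EOF.
func drainString(s string) (string, bool) {
	b, err := drain(strings.NewReader(s))
	return string(b), err == io.EOF
}

func main() {
	got, eof := drainString("ping frame")
	fmt.Println(got, eof)
}
```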


pkg/testutils/net_test.go, line 54 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Over engineering again, in my opinion.

done, although it was not exactly an inscrutable contraption.
It was questionable if serverSideCh would be useful with more than one conn tho.


pkg/testutils/net_test.go, line 285 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Doesn't look done.

ah there was another one. done.


Comments from Reviewable

@tamird
Contributor

tamird commented Apr 6, 2017

Reviewed 3 of 3 files at r5.
Review status: all files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


pkg/rpc/context_test.go, line 651 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

but the test has not been skipped. The test succeeded in demonstrating what it tries to demonstrate. As the comment above says. Is it not clear?

I see. Why can't you let the test continue? It seems to me that the rest of the test will work just fine, even in this case (though I agree it's a bit nonsensical, I would prefer that the code be exercised than not).


pkg/testutils/net.go, line 169 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

ok... I've removed it. But now, for example, PartitionC2S() does
c.mu.c2sBuffer.wait.Broadcast(). Should it do c.mu.c2sWaiter.Broadcast() instead? In other words, sometimes the PConn wants to communicate with the buffer, otherwise not. Without the method, this communication is more awkward; the method gave a good place for a comment.

It led to more awkwardness for the buffer communicating with itself, which I guess was your point, but I'm not sure we improved anything.

You can bring it back, but then you should always use it.


pkg/testutils/net.go, line 393 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

why wouldn't it be buffered? The tasks don't want to rendezvous with the parent, they just want to pass values to it. But I've removed the buffer anyway.

I think these lines of code are pretty safe.

It should not be buffered because it's sloppy - you always expect those errors to be consumed, so you should be strict about it.


pkg/testutils/net.go, line 134 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

condVars have the Signal method.

s/signal/Signal/, then.

FWIW, you never actually call Signal (I think), only Broadcast, so perhaps rewrite this to "at the time the buffer's Cond was signalled"


pkg/testutils/net.go, line 255 at r5 (raw file):

	}
	c.mu.c2sPartitioned = true
	// Signal the buffer to unblock reads.

it only unblocks reads, not writes? I think the exact machinery here is complicated enough that this comment is more misleading than useful.


pkg/testutils/net_test.go, line 37 at r5 (raw file):

)

// RunEchoServer runs a network server that accepts one connection from ln and

one at a time, not one total - you still have the for loop.


Comments from Reviewable

@andreimatei andreimatei force-pushed the heartbeats-keepalive branch from 75b5692 to 3d3fe6c Compare April 7, 2017 00:41
@andreimatei
Contributor Author

Review status: 1 of 4 files reviewed at latest revision, 4 unresolved discussions.


pkg/rpc/context_test.go, line 651 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

I see. Why can't you let the test continue? It seems to me that the rest of the test will work just fine, even in this case (though I agree it's a bit nonsensical, I would prefer that the code be exercised than not).

I believe I can let the code continue, but I'd much rather not.
The rest of the code is exercised; it's very unlikely that it wouldn't be. In other words, there's not much reason to expect this first RPC to fail (except that once in a blue moon it does). If something were to break such that it would always fail, all our tests would break at that point.


pkg/testutils/net.go, line 169 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

You can bring it back, but then you should always use it.

well I don't like "always use it" very much either. I've brought it back, but I've split the CV into 2 - one for reads (signaled both internally when new data is available and externally on partition) and one for writes - signaled when there's new capacity to fill. The two cases were previously combined only for convenience, but since we want one of the cases to be combined with pConn signals, I think it makes sense to separate them.
The one for reads is now always signaled through this method.


pkg/testutils/net.go, line 393 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

It should not be buffered because it's sloppy - you always expect those errors to be consumed, so you should be strict about it.

well since when does using a buffered channel mean that you don't expect everything to be consumed and otherwise using one makes you sloppy? I don't think that's a thing.
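
To make the two positions concrete, here is a minimal sketch of the buffered variant, assuming two tasks whose errors the parent always collects. `runBoth`, `okTask`, `failTask` and `errCopyFailed` are hypothetical names for illustration, not the code under review.

```go
package main

import (
	"errors"
	"fmt"
)

// errCopyFailed is a hypothetical error used only by this sketch.
var errCopyFailed = errors.New("copy failed")

func okTask() error   { return nil }
func failTask() error { return errCopyFailed }

// runBoth starts both tasks and waits for both to finish, returning the first
// non-nil error received (nil if neither task failed).
func runBoth(a, b func() error) error {
	// Capacity 2: each task can deliver its result and exit without waiting
	// for the parent to be ready to receive.
	errCh := make(chan error, 2)
	for _, task := range []func() error{a, b} {
		task := task
		go func() { errCh <- task() }()
	}
	var firstErr error
	for i := 0; i < 2; i++ {
		if err := <-errCh; err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}

func main() {
	fmt.Println(runBoth(okTask, okTask))
	fmt.Println(runBoth(okTask, failTask))
}
```

With an unbuffered channel the same loop still works, because the parent receives exactly twice; the buffer only changes whether a task has to wait for the parent before exiting.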


pkg/testutils/net.go, line 134 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/signal/Signal/, then.

FWIW, you never actually call Signal (I think), only Broadcast, so perhaps rewrite this to "at the time the buffer's Cond was signalled"

see now


pkg/testutils/net.go, line 255 at r5 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

it only unblocks reads, not writes? I think the exact machinery here is complicated enough that this comment is more misleading than useful.

it unblocks readLocked() (it used to also unblock writes, but only momentarily; they were going back to sleep upon waking). Anyway, the comment is gone.


pkg/testutils/net_test.go, line 37 at r5 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

one at a time, not one total - you still have the for loop.

I meant to get rid of the for too, thanks. I don't think one at a time makes much sense.


Comments from Reviewable

@tamird
Contributor

tamird commented Apr 7, 2017

Reviewed 3 of 3 files at r6.
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks pending.


pkg/rpc/context_test.go, line 651 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I believe I can let the code continue, but I'd much rather not.
The rest of the code is exercised; it's very unlikely that it wouldn't be. In other words, there's not much reason to expect this first RPC to fail (except that once in a blue moon it does). If something were to break such that it would always fail, all our tests would break at that point.

I'm much less convinced of that than you. You'd rather not let it run, but why not?


pkg/testutils/net.go, line 259 at r6 (raw file):

	c.mu.Lock()
	c.mu.c2sPartitioned = false
	c.mu.c2sWaiter.Signal()

is there a reason this wakes up just one reader? i don't expect there to be multiple, i guess, so just wondering.


pkg/testutils/net_test.go, line 53 at r6 (raw file):

	}
	if _, err := copyWithSideChan(conn, conn, serverSideCh); err != nil {
		log.Warning(context.TODO(), err)

return the error now that there's no for loop.


pkg/testutils/net_test.go, line 375 at r6 (raw file):

			received := 0
			for {
				data := make([]byte, 1024*1024)

1 << 20 // 1 MiB


pkg/testutils/net_test.go, line 406 at r6 (raw file):

		select {
		case err = <-serverDoneCh:
			err = errors.Errorf("server was not supposed to see the closing while partitioned: %v", err)

errors.Wrap


Comments from Reviewable

@andreimatei andreimatei force-pushed the heartbeats-keepalive branch from 3d3fe6c to 5994186 Compare April 7, 2017 01:33
@andreimatei
Contributor Author

Review status: 1 of 4 files reviewed at latest revision, 5 unresolved discussions.


pkg/rpc/context_test.go, line 651 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

I'm much less convinced of that than you. You'd rather not let it run, but why not?

Because I'm not sure what the rest of the code would be testing in this case, and I can't speak with any confidence about what to expect. What if gRPC introduces other errors for RPC not sent (because conn couldn't be established) vs RPC killed by keepalive failure?

Anyway, I've let the code run. But:

Turns out that the way the code was, the next RPC would time out. If the transport connection is closed at this point, the next RPC call will block forever. It has something to do with how we block the dialer from returning new conns. I've made the dialer return errors and now the test works. I've read a bit and I now think that returning errors is the better thing to do in that dialer. There's a "TCP socket connect timeout", on the order of 20s apparently (much shorter than the retransmission timeouts on an established conn).
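
A dialer of that shape might look roughly like the following. This is a sketch under assumptions, not the test's code: `onceDialer`, `dialFunc`, `errOnlyOneConn` and `demo` are hypothetical names, and an in-memory `net.Pipe` stands in for a real TCP dial.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"sync/atomic"
	"time"
)

// errOnlyOneConn is a hypothetical sentinel error for this sketch.
var errOnlyOneConn = errors.New("the test only allows one connection")

// dialFunc matches the shape of the dialer callback quoted earlier in the
// review.
type dialFunc func(addr string, timeout time.Duration) (net.Conn, error)

// onceDialer wraps dial so that only the first call reaches it; every later
// call fails immediately with errOnlyOneConn instead of blocking.
func onceDialer(dial dialFunc) dialFunc {
	var dialed int32
	return func(addr string, timeout time.Duration) (net.Conn, error) {
		if !atomic.CompareAndSwapInt32(&dialed, 0, 1) {
			return nil, errOnlyOneConn
		}
		return dial(addr, timeout)
	}
}

// demo dials twice through onceDialer and reports whether the first dial
// succeeded and what error the second dial returned.
func demo() (firstOK bool, secondErr error) {
	fake := func(string, time.Duration) (net.Conn, error) {
		client, server := net.Pipe()
		_ = server.Close()
		return client, nil
	}
	d := onceDialer(fake)
	c, err := d("ignored", time.Second)
	firstOK = err == nil && c != nil
	if c != nil {
		_ = c.Close()
	}
	_, secondErr = d("ignored", time.Second)
	return firstOK, secondErr
}

func main() {
	fmt.Println(demo())
}
```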


pkg/testutils/net.go, line 259 at r6 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

is there a reason this wakes up just one reader? i don't expect there to be multiple, i guess, so just wondering.

this is not waking up readers exactly, this is waking up the single goroutine that copies from the buffer to the serverConn

But this has made me realize that no broadcast was needed in this code and I had been confused. There's only ever a single reader and a single writer to the buffer (but the reader and writer are accessing the buffer concurrently, which makes it different from, say, bytes.Buffer). I've switched all to Signal().
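
The general pattern (one condition variable per direction, one reader, one writer, `Signal` rather than `Broadcast`) can be sketched as below. This is a simplified illustration, not the buffer in `pkg/testutils/net.go`: `boundedBuf` and `roundTrip` are hypothetical names, and partition handling is left out entirely.

```go
package main

import (
	"fmt"
	"sync"
)

// boundedBuf is a hypothetical fixed-capacity byte queue meant for exactly one
// reader goroutine and one writer goroutine.
type boundedBuf struct {
	mu       sync.Mutex
	data     []byte
	capacity int
	readCV   *sync.Cond // signaled when data becomes available
	writeCV  *sync.Cond // signaled when capacity becomes available
}

func newBoundedBuf(capacity int) *boundedBuf {
	b := &boundedBuf{capacity: capacity}
	b.readCV = sync.NewCond(&b.mu)
	b.writeCV = sync.NewCond(&b.mu)
	return b
}

// write blocks until there is room for one more byte, then wakes the reader.
func (b *boundedBuf) write(c byte) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for len(b.data) == b.capacity {
		b.writeCV.Wait()
	}
	b.data = append(b.data, c)
	b.readCV.Signal()
}

// read blocks until a byte is available, then wakes the writer.
func (b *boundedBuf) read() byte {
	b.mu.Lock()
	defer b.mu.Unlock()
	for len(b.data) == 0 {
		b.readCV.Wait()
	}
	c := b.data[0]
	b.data = b.data[1:]
	b.writeCV.Signal()
	return c
}

// roundTrip pushes msg through a 2-byte buffer using one writer goroutine and
// returns what the single reader saw.
func roundTrip(msg string) string {
	b := newBoundedBuf(2)
	go func() {
		for i := 0; i < len(msg); i++ {
			b.write(msg[i])
		}
	}()
	out := make([]byte, 0, len(msg))
	for len(out) < len(msg) {
		out = append(out, b.read())
	}
	return string(out)
}

func main() {
	fmt.Println(roundTrip("hello, partition"))
}
```

Because each condition variable has at most one waiter, `Signal` is enough; `Broadcast` would only matter with several goroutines waiting on the same condition.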


pkg/testutils/net_test.go, line 53 at r6 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

return the error now that there's no for loop.

Done.


pkg/testutils/net_test.go, line 375 at r6 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

1 << 20 // 1 MiB

Done.


pkg/testutils/net_test.go, line 406 at r6 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

errors.Wrap

Done.


Comments from Reviewable

@tamird
Contributor

tamird commented Apr 7, 2017

:lgtm:


Reviewed 3 of 3 files at r7.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


pkg/rpc/context_test.go, line 653 at r7 (raw file):

		// keepalive timeout caused our RPC to fail (happens occasionally under
		// stress -p 100). We're going to let the rest of the test code run, to make
		// sure it's exercised.

add an empty comment line to preserve the paragraph break.


pkg/rpc/context_test.go, line 656 at r7 (raw file):

		// If the heartbeats didn't timeout (the normal case), we're going to
		// simulate a network partition and then the heartbeats must timeout.
		log.Infof(context.TODO(), "test returning early; no partition done")

this lies now; remove?


pkg/testutils/net_test.go, line 406 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Done.

no need for the f


Comments from Reviewable

Fixes cockroachdb#13989

Before this patch, the rpc.Context would perform heartbeats (a dedicated
RPC) to see if a connection is healthy. If the heartbeats failed, the
connection was closed (causing in-flight RPCs to fail) and the node was
marked as unhealthy.
These heartbeats, being regular RPCs, were subject to gRPC's flow
control. This means that they were easily blocked by other large RPCs,
which meant they were too feeble. In particular, they were easily
blocked by large DistSQL streams.

This patch moves to using gRPC's internal HTTP2 ping frames for checking
conn health. These are not subject to flow control. The grpc
transport-level connection is closed when they fail (and so in-flight
RPCs still fail), but otherwise gRPC reconnects transparently.
Heartbeats stay for the other current uses - clock skew detection and
node health marking. Marking a node as unhealthy is debatable, given the
shortcomings of these RPCs. However, this marking currently doesn't have
big consequences - it only affects the order in which replicas are tried
when a leaseholder is unknown.
@andreimatei andreimatei force-pushed the heartbeats-keepalive branch from 5994186 to c65b38b Compare April 7, 2017 17:33
@andreimatei
Contributor Author

Review status: 2 of 4 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/rpc/context_test.go, line 653 at r7 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

add an empty comment line to preserve the paragraph break.

done


pkg/rpc/context_test.go, line 656 at r7 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this lies now; remove?

changed


pkg/testutils/net_test.go, line 406 at r6 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

no need for the f

Done.


Comments from Reviewable

@andreimatei andreimatei merged commit 6ed9d79 into cockroachdb:master Apr 7, 2017
@andreimatei andreimatei deleted the heartbeats-keepalive branch April 7, 2017 18:46
@andreimatei
Contributor Author

TFTR


Review status: 2 of 4 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


Comments from Reviewable

@tamird
Contributor

tamird commented Apr 7, 2017

Reviewed 2 of 2 files at r8.
Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/rpc/context_test.go, line 656 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

changed

Heh, may as well log the error


Comments from Reviewable

Development

Successfully merging this pull request may close these issues.

distsql: tpch query 7 results in context cancelled error
3 participants