-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rpc: don't close gRPC connections on heartbeat timeouts #14424
rpc: don't close gRPC connections on heartbeat timeouts #14424
Conversation
By popular demand, I've extracted this from #14376 |
This should also close: Reviewed 1 of 1 files at r1. pkg/rpc/context.go, line 239 at r1 (raw file):
s/2/3/ or else remove this preallocation completely. pkg/rpc/context.go, line 243 at r1 (raw file):
move this up a line for symmetry with all the closers pkg/rpc/context.go, line 244 at r1 (raw file):
isn't there a comment on the struct definition? this comment doesn't seem helpful to me. pkg/rpc/context.go, line 245 at r1 (raw file):
base.NetworkTimeout here and below? pkg/rpc/context.go, line 246 at r1 (raw file):
I think we need a very basic smoke test for this. Perhaps a simple GRPC server-client pair with a custom dialer that uses 2
Comments from Reviewable |
8e157b2
to
15fdae1
Compare
I'll stress #13886 to see if it fixes it. The other 2 are now independently closed. Review status: 0 of 4 files reviewed at latest revision, 5 unresolved discussions, some commit checks pending. pkg/rpc/context.go, line 239 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/rpc/context.go, line 243 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done but I find this worse pkg/rpc/context.go, line 244 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
there is, but this comment serves as the description of the purpose of the whole struct. Otherwise the 2nd comment, that I think is more useful, would be more out of context. pkg/rpc/context.go, line 245 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/rpc/context.go, line 246 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. Comments from Reviewable |
15fdae1
to
08244a7
Compare
This is looking good. Will finish the review tomorrow. Reviewed 3 of 4 files at r2. pkg/rpc/context.go, line 262 at r2 (raw file):
I think this comment is incorrect; the connection won't be closed in the GRPC sense. pkg/testutils/net.go, line 35 at r2 (raw file):
what does "pipe" mean here? pkg/testutils/net.go, line 37 at r2 (raw file):
s/send/sent/ pkg/testutils/net.go, line 39 at r2 (raw file):
this comment is generally very unclear - is data dropped or not? is there a delay or is data dropped completely? pkg/testutils/net.go, line 46 at r2 (raw file):
s/ though// add an empty comment line after this to preserve the paragraph. pkg/testutils/net.go, line 93 at r2 (raw file):
let's avoid log.Fatal here - if we want to be good about error reporting we should pass a channel to this constructor, or just hang one on the connection and export it. pkg/testutils/net.go, line 148 at r2 (raw file):
there's no equivalent panic in Partition, why is this one here? pkg/testutils/net.go, line 174 at r2 (raw file):
"implements net.Conn." pkg/testutils/net.go, line 220 at r2 (raw file):
there is no need to pass the mutex, bool, or sync.Cond; just pass a closure that returns when it's time to unblock. also, you have the exact same pattern in the test; seems like you could pass a closure here and reuse the function in the test. pkg/testutils/net.go, line 227 at r2 (raw file):
is this code lifted from somewhere? reference it. pkg/testutils/net_test.go, line 33 at r2 (raw file):
nit: group this with "context" above. pkg/testutils/net_test.go, line 48 at r2 (raw file):
return the error. pkg/testutils/net_test.go, line 58 at r2 (raw file):
no reason not to return this error. discard it in the caller if you want, but better yet, send it on a channel. pkg/testutils/net_test.go, line 126 at r2 (raw file):
defer? pkg/testutils/net_test.go, line 132 at r2 (raw file):
is this test actually slow? pkg/testutils/net_test.go, line 169 at r2 (raw file):
pkg/testutils/net_test.go, line 183 at r2 (raw file):
looks likely to be flaky =/ pkg/testutils/net_test.go, line 184 at r2 (raw file):
i think you can use the same inner pkg/testutils/net_test.go, line 211 at r2 (raw file):
defer? pkg/testutils/net_test.go, line 259 at r2 (raw file):
inner pkg/testutils/net_test.go, line 275 at r2 (raw file):
no bueno, this is not the main goroutine. pkg/testutils/net_test.go, line 289 at r2 (raw file):
defer? Comments from Reviewable |
Let's get this in soon, because this is blocking serious load testing for distsql. |
@cuongdo if this is blocking you, i think you can ship a custom binary with https://github.com/cockroachdb/cockroach/blob/56d6ed4/pkg/rpc/context.go#L376:L378 commented out or removed. |
bf32eef
to
d510c20
Compare
Review status: 0 of 4 files reviewed at latest revision, 22 unresolved discussions, some commit checks pending. pkg/rpc/context.go, line 262 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
clarified pkg/testutils/net.go, line 35 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
fixed pkg/testutils/net.go, line 37 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net.go, line 39 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
see now pls pkg/testutils/net.go, line 46 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net.go, line 93 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
I don't know what to do with these errors. I don't particularly want "to be good" - but linter. In the case of serverConn... I'm not sure how a client would use a channel exported by the PartitionableConn and what its semantics would be. I'd rather we log and ignore as we do now. pkg/testutils/net.go, line 148 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
I was thinking that it makes more sense for Partition to be idempotent than for Unpartition. But on second thought I put panics everywhere. pkg/testutils/net.go, line 174 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
I've been using "is part of". Saying that a method "implements" an interface seems to me to be an abuse of the language that I don't like much. pkg/testutils/net.go, line 220 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
good suggestion with the callback; done. I don't like the idea of sharing this method with the test. One difference is the handling of the EOF. It could be adapted, but I think this sharing would only make both uses harder to read and also the two implementations might diverge; we're not doing very general things here. pkg/testutils/net.go, line 227 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. referenced io.copy above pkg/testutils/net_test.go, line 33 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net_test.go, line 48 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
I think what I wanted here is to not look at the error after the FatalIfUnexpected call. And to leave the FatalIU call here, not in the caller. Right? pkg/testutils/net_test.go, line 58 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net_test.go, line 126 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net_test.go, line 132 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
I think it's a good idea to mark tests with waits in them like this going forward... pkg/testutils/net_test.go, line 169 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net_test.go, line 183 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
if the time is too small, the test will pass and we won't have tested what we want. But it won't fail. pkg/testutils/net_test.go, line 184 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
but here there's a single send on the channel; don't see what the pattern would give us exactly. pkg/testutils/net_test.go, line 211 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net_test.go, line 259 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net_test.go, line 275 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net_test.go, line 289 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. Comments from Reviewable |
Reviewed 3 of 4 files at r3. pkg/testutils/net.go, line 39 at r2 (raw file): Previously, andreimatei (Andrei Matei) wrote…
still unclear; what happens to the internal buffer? pkg/testutils/net.go, line 93 at r2 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Is it really so hard to do the right thing? Just close over a channel, it's a few lines of code. pkg/testutils/net.go, line 33 at r3 (raw file):
remove pkg/testutils/net.go, line 45 at r3 (raw file):
missing verb. "block"? pkg/testutils/net.go, line 45 at r3 (raw file):
s/a write done/a write call/ pkg/testutils/net.go, line 48 at r3 (raw file):
blocked? this is incoherent in light of earlier description of blocking. pkg/testutils/net.go, line 71 at r3 (raw file):
make these values pkg/testutils/net.go, line 86 at r3 (raw file):
isn't this equivalent to pkg/testutils/net.go, line 88 at r3 (raw file):
how about just pkg/testutils/net.go, line 97 at r3 (raw file):
why does this return a pointer? the sync.Cond and mutex are both pointers. s/new/make/ pkg/testutils/net.go, line 112 at r3 (raw file):
why isn't this called Write? that would implement io.Writer. pkg/testutils/net.go, line 132 at r3 (raw file):
raised? this isn't C++. what is bug.signal? pkg/testutils/net.go, line 169 at r3 (raw file):
remove this inconsistently-used method. pkg/testutils/net.go, line 319 at r3 (raw file):
why is this a method on PartitionableConn rather than on This should be written in the style of https://godoc.org/bytes#Buffer.ReadFrom pkg/testutils/net.go, line 349 at r3 (raw file):
same as above, except this should emulate https://godoc.org/bytes#Buffer.WriteTo if you hang pkg/testutils/net.go, line 393 at r3 (raw file):
why not return the channel and let the caller read from it? also you can just write
because the channel is buffered. pkg/testutils/net_test.go, line 132 at r2 (raw file): Previously, andreimatei (Andrei Matei) wrote…
How long does it take? If it's under 1s then remove this. pkg/testutils/net_test.go, line 184 at r2 (raw file): Previously, andreimatei (Andrei Matei) wrote…
The pattern gives you compiler help, but suit yourself. pkg/testutils/net_test.go, line 54 at r3 (raw file):
why does this need a goroutine? seems to me it's perfectly ok for this server to be "single-threaded" pkg/testutils/net_test.go, line 285 at r3 (raw file):
should be %v since this can be nil pkg/testutils/net_test.go, line 304 at r3 (raw file):
remove. pkg/testutils/net_test.go, line 318 at r3 (raw file):
why isn't this in the func() error below? it's no value to continue after an error here. Comments from Reviewable |
thanks for the review bro! Let's get this in :) Review status: 3 of 4 files reviewed at latest revision, 21 unresolved discussions. pkg/testutils/net.go, line 39 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net.go, line 93 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
so what exactly would we send on that channel? Just the errors from pkg/testutils/net.go, line 33 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net.go, line 45 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net.go, line 45 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net.go, line 48 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
I believe it's coherent. Clarified a bit more; see now. pkg/testutils/net.go, line 71 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net.go, line 86 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
well this buffer should permit closing without an error. Even though we do only close it with one. I'd leave it. pkg/testutils/net.go, line 88 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
I'd leave it pkg/testutils/net.go, line 97 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
but but it's not copiable. Ok done. pkg/testutils/net.go, line 112 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net.go, line 132 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
done. it's supposed to be pkg/testutils/net.go, line 169 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
why do you say it's inconsistently used? Perhaps you're observing that the cond vars are signaled directly sometimes, which has the same effect as calling this. But this method is specifically about the pconn interrupting the buffer, where other uses of the cond var in the pconn are about the pconn communicating to itself. Similarly for direct signals of the cvar inside buf. pkg/testutils/net.go, line 319 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
done; good suggestion. pkg/testutils/net.go, line 349 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
this one I'd leave on PConn. Given the current layout, I don't care too much about respecting some other buffer interface, since this buffer here is thread-safe and has some particular semantics. pkg/testutils/net.go, line 393 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
I don't like returning the channel. This method is better being sync. No need to burden the caller with understanding that there's several internal tasks. I changed it to what you were suggesting but then changed it back. Mine is more pedestrian and readable. It also always waits for both tasks to be done before returning; which is probably a good thing pkg/testutils/net_test.go, line 132 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
1s?!?! pkg/testutils/net_test.go, line 54 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
let it support more of them. It's more idiomatic to write the Accept in a loop too, and leave the closing of the listener to the caller, I think. pkg/testutils/net_test.go, line 285 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net_test.go, line 304 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net_test.go, line 318 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. Comments from Reviewable |
d510c20
to
f1d337a
Compare
forgot to push, sorry. pushed now. Review status: 1 of 4 files reviewed at latest revision, 21 unresolved discussions. Comments from Reviewable |
Reviewed 2 of 3 files at r4. pkg/testutils/net.go, line 93 at r2 (raw file): Previously, andreimatei (Andrei Matei) wrote…
You're Fataling on the error, that's nowhere close to ignoring the error. pkg/testutils/net.go, line 88 at r3 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Why? we have seen again and again that naming errors anything other than pkg/testutils/net.go, line 169 at r3 (raw file): Previously, andreimatei (Andrei Matei) wrote…
this is the definition of over engineering - this buffer will not be used outside this package and anyone who isn't you will either use this method incorrectly or signal the condvar when they should use the method. pkg/testutils/net.go, line 393 at r3 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Then why is your channel buffered? Also, again, it is not a good idea to name errors anything other than pkg/testutils/net.go, line 134 at r4 (raw file):
signal is not a thing? pkg/testutils/net.go, line 326 at r4 (raw file):
returns pkg/testutils/net.go, line 328 at r4 (raw file):
make it io.Reader? pkg/testutils/net_test.go, line 54 at r3 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Over engineering again, in my opinion. pkg/testutils/net_test.go, line 285 at r3 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Doesn't look done. Comments from Reviewable |
Reviewed 1 of 3 files at r4. pkg/rpc/context_test.go, line 585 at r4 (raw file):
nit: math.MaxInt64 pkg/rpc/context_test.go, line 595 at r4 (raw file):
this code is racy. use pkg/rpc/context_test.go, line 596 at r4 (raw file):
s/the/then/ pkg/rpc/context_test.go, line 598 at r4 (raw file):
i think this is wrong; instead of returning an error, this should just block on a channel that you close at the end of the test. pkg/rpc/context_test.go, line 605 at r4 (raw file):
https://godoc.org/net#DialTimeout pkg/rpc/context_test.go, line 613 at r4 (raw file):
what is a grpContext? pkg/rpc/context_test.go, line 617 at r4 (raw file):
"the low timeout" - is this referring to the default or to the value you're setting here? please rewrite this comment. pkg/rpc/context_test.go, line 651 at r4 (raw file):
replace this with pkg/rpc/context_test.go, line 664 at r4 (raw file):
broken comment. pkg/rpc/context_test.go, line 673 at r4 (raw file):
Comments from Reviewable |
f1d337a
to
75b5692
Compare
Review status: 1 of 4 files reviewed at latest revision, 19 unresolved discussions. pkg/rpc/context_test.go, line 585 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/rpc/context_test.go, line 595 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
I don't think it was racy cause grpc is not supposed to open multiple conns at a time. If it does, then this test is in trouble. Which begs the question about why I've made this an atomic. pkg/rpc/context_test.go, line 596 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/rpc/context_test.go, line 598 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
I don't think it makes much of a difference. Not sure why you said it's wrong. It doesn't matter how exactly we prevent grpc from opening new conns, it just matters that we do. In fact this is what I thought about doing in the first place but I didn't do it for two reasons:
pkg/rpc/context_test.go, line 605 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…Done. pkg/rpc/context_test.go, line 613 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/rpc/context_test.go, line 617 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
clarified that it's this one here and used the word "aggressive" to link to the comment above, which I've also clarified more. pkg/rpc/context_test.go, line 651 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
but the test has not been skipped. The test succeeded in demonstrating what it tries to demonstrate. As the comment above says. Is it not clear? pkg/rpc/context_test.go, line 664 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
done and moved it above to the definition of the regex pkg/rpc/context_test.go, line 673 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
assuming you're talking about the suggestion to block when grpc tries to open a new conn - like I was saying there, "this" was already done. This comment was stale; I've updated it. pkg/testutils/net.go, line 93 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
I removed the fatal. I would have done it from the beginning if I understood that that was the issue. pkg/testutils/net.go, line 88 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Because the name is more suggestive. This member is a narrow thing; it's not an error encountered by the buffer while operating - those errors are returned to the parent. This is an error passed by the parent when closing the buffer with a specific purpose - wake up all the reads/writes and make them return this thing. I've added a comment to Close(), maybe that helps. The "never name them anything but err" thing... Have we learned this for member variables? pkg/testutils/net.go, line 169 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
ok... I've removed it. But now, for example, It lead to more awkwardness for the buffer communicating with itself, which I guess was your point, but I'm not sure we improved anything. pkg/testutils/net.go, line 393 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
why wouldn't it be buffered? The tasks don't want to rendez-vous with the parent, they just want to pass values to it. But I've removed the buffer anyway. I think these lines of code are pretty safe. pkg/testutils/net.go, line 134 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
condVars have the pkg/testutils/net.go, line 326 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net.go, line 328 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net_test.go, line 54 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
done, although it was not exactly an inscrutable contraption. pkg/testutils/net_test.go, line 285 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
ah there was another one. done. Comments from Reviewable |
Reviewed 3 of 3 files at r5. pkg/rpc/context_test.go, line 651 at r4 (raw file): Previously, andreimatei (Andrei Matei) wrote…
I see. Why can't you let the test continue? It seems to me that the rest of the test will work just fine, even in this case (though I agree it's a bit nonsensical, I would prefer that the code be exercised than not). pkg/testutils/net.go, line 169 at r3 (raw file): Previously, andreimatei (Andrei Matei) wrote…
You can bring it back, but then you should always use it. pkg/testutils/net.go, line 393 at r3 (raw file): Previously, andreimatei (Andrei Matei) wrote…
It should not be buffered because it's sloppy - you always expect those errors to be consumed, so you should be strict about it. pkg/testutils/net.go, line 134 at r4 (raw file): Previously, andreimatei (Andrei Matei) wrote…
s/signal/Signal/, then. FWIW, you never actually call Signal (I think), only Broadcast, so perhaps rewrite this to "at the time the buffer's Cond was signalled" pkg/testutils/net.go, line 255 at r5 (raw file):
it only unblocks reads, not writes? I think the exact machinery here is complicated enough that this comment is more misleading than useful. pkg/testutils/net_test.go, line 37 at r5 (raw file):
one at a time, not one total - you still have the Comments from Reviewable |
75b5692
to
3d3fe6c
Compare
Review status: 1 of 4 files reviewed at latest revision, 4 unresolved discussions. pkg/rpc/context_test.go, line 651 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
I believe I can let the code continue, but I'd much rather not. pkg/testutils/net.go, line 169 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
well I don't like "always use it" very much either. I've brought it back, but I've split the CV into 2 - one for reads (signaled both internally when new data is available and externally on partition) and one for writes - signaled when there's new capacity to fill. The two cases were previously combined only for convenience, but since we want one of the cases to be combined with pConn signals, I think it makes sense to separate them. pkg/testutils/net.go, line 393 at r3 (raw file): Previously, tamird (Tamir Duberstein) wrote…
well since when does using a buffered channel mean that you don't expect everything to be consumed and otherwise using one makes you sloppy? I don't think that's a thing. pkg/testutils/net.go, line 134 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
see now pkg/testutils/net.go, line 255 at r5 (raw file): Previously, tamird (Tamir Duberstein) wrote…
it unblocks readLocked() (it used to also unblock writes, but only momentarily; they were going back to sleep upon waking). Anyway, the comment is gone. pkg/testutils/net_test.go, line 37 at r5 (raw file): Previously, tamird (Tamir Duberstein) wrote…
I meant to get rid of the for too, thanks. I don't think one at a time makes much sense. Comments from Reviewable |
Reviewed 3 of 3 files at r6. pkg/rpc/context_test.go, line 651 at r4 (raw file): Previously, andreimatei (Andrei Matei) wrote…
I'm much less convinced of that than you. You'd rather not let it run, but why not? pkg/testutils/net.go, line 259 at r6 (raw file):
is there a reason this wakes up just one reader? i don't expect there to be multiple, i guess, so just wondering. pkg/testutils/net_test.go, line 53 at r6 (raw file):
return the error now that there's no for loop. pkg/testutils/net_test.go, line 375 at r6 (raw file):
1 << 20 // 1 MiB pkg/testutils/net_test.go, line 406 at r6 (raw file):
errors.Wrap Comments from Reviewable |
3d3fe6c
to
5994186
Compare
Review status: 1 of 4 files reviewed at latest revision, 5 unresolved discussions. pkg/rpc/context_test.go, line 651 at r4 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Because I'm not sure what the rest of the code would be testing in this case, and I can't speak with any confidence about what to expect. What if gRPC introduces other errors for RPC not sent (because conn couldn't be established) vs RPC killed by keepalive failure? Anyway, I've let the code run. But: Turns out that the way the code was, the next RPC would timeout. If the transport connection is closed at this point, the next rpc call will block forever. It has something to do with how we block the dialer from returning new conns. I've made the dialer return errors and now the test works. I've read a bit and I now think that returning errors is the better thing to do in that dialer. There's a "TCP socket connect timeout", of the order of 20s apparently (much shorter than the retransmission timeouts on an established conn). pkg/testutils/net.go, line 259 at r6 (raw file): Previously, tamird (Tamir Duberstein) wrote…
this is not waking up readers exactly, this is waking up the single goroutine that copies from the buffer to the But this has made me realize that no broadcast was needed in this code and I had been confused. There's ever a single reader and writer to the buffer too (but the reader and writer are accessing the buffer concurrently, which makes it different from, say, bytes.Buffer). I've switched all to Signal(). pkg/testutils/net_test.go, line 53 at r6 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net_test.go, line 375 at r6 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/testutils/net_test.go, line 406 at r6 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. Comments from Reviewable |
Reviewed 3 of 3 files at r7. pkg/rpc/context_test.go, line 653 at r7 (raw file):
add an empty comment line to preserve the paragraph break. pkg/rpc/context_test.go, line 656 at r7 (raw file):
this lies now; remove? pkg/testutils/net_test.go, line 406 at r6 (raw file): Previously, andreimatei (Andrei Matei) wrote…
no need for the Comments from Reviewable |
Fixes cockroachdb#13989 Before this patch, the rpc.Context would perform heartbeats (a dedicated RPC) to see if a connection is healthy. If the heartbeats failed, the connection was closed (causing in-flight RPCs to fail) and the node was marked as unhealthy. These heartbeats, being regular RPCs, were subject to gRPC's flow control. This means that they were easily blocked by other large RPCs, which meant they were too feeble. In particular, they were easily blocked by large DistSQL streams. This patch moves to using gRPC's internal HTTP2 ping frames for checking conn health. These are not subject to flow control. The grpc transport-level connection is closed when they fail (and so in-flight RPCs still fail), but otherwise gRPC reconnects transparently. Heartbeats stay for the other current uses - clock skew detection and node health marking. Marking a node as unhealthy is debatable, give the shortcomings of these RPCs. However, this marking currently doesn't have big consequences - it only affects the order in which replicas are tried when a leaseholder is unknown.
5994186
to
c65b38b
Compare
Review status: 2 of 4 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful. pkg/rpc/context_test.go, line 653 at r7 (raw file): Previously, tamird (Tamir Duberstein) wrote…
done pkg/rpc/context_test.go, line 656 at r7 (raw file): Previously, tamird (Tamir Duberstein) wrote…
changed pkg/testutils/net_test.go, line 406 at r6 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. Comments from Reviewable |
TFTR Review status: 2 of 4 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful. Comments from Reviewable |
Reviewed 2 of 2 files at r8. pkg/rpc/context_test.go, line 656 at r7 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Heh, may as well log the error Comments from Reviewable |
Fixes #13989
Before this patch, the rpc.Context would perform heartbeats (a dedicated
RPC) to see if a connection is healthy. If the heartbeats failed, the
connection was closed (causing in-flight RPCs to fail) and the node was
marked as unhealthy.
These heartbeats, being regular RPCs, were subject to gRPC's flow
control. This means that they were easily blocked by other large RPCs,
which meant they were too feeble. In particular, they were easily
blocked by large DistSQL streams.
This patch moves to using gRPC's internal HTTP2 ping frames for checking
conn health. These are not subject to flow control. The grpc
transport-level connection is closed when they fail (and so in-flight
RPCs still fail), but otherwise gRPC reconnects transparently.
Heartbeats stay for the other current uses - clock skew detection and
node health marking. Marking a node as unhealthy is debatable, give the
shortcomings of these RPCs. However, this marking currently doesn't have
big consequences - it only affects the order in which replicas are tried
when a leaseholder is unknown.
cc @petermattis @bdarnell
This change isdata:image/s3,"s3://crabby-images/d0bb7/d0bb7f7625ca5bf5c3cf7a2b7a514cf841ab8395" alt="Reviewable"