runtime: make sure blocked channels run operations in FIFO order #11506
Comments
I would make this change. The current behavior sounds unfriendly to me, and in years past I used a similar policy for locks and waits (in Java) and thought it gave good behavior. Starvation is bad.
Language semantics require that the values be FIFO, but the implications of that with multiple goroutines reading the values are unspecified and muddy at best. That said, I believe the claim here that it is more natural for blocked goroutines to go first; it just seems intuitive. I think you could even make a case that the current situation is almost a bug. There may be a noticeable performance hit under heavy contention, though, because I believe this style requires asymptotically more context switches. I still think channels should behave as you suggest.
I'd fully support this. I think this is the more "natural" result I was expecting from one of the discussion threads we had.
I already have a CL for basically exactly this.
From a specification perspective, it's true that our current approach satisfies the happens-before graph, but it does not satisfy liveness. That's a formally defined, reasonable, and desirable property we could specify as a requirement of channels without specifying something as specific as FIFO blocking.

Implementation-wise, I'm concerned about making blocking strictly FIFO order. This seems like the exact same mistake as making map order deterministic or goroutine scheduling deterministic. People can and will come to depend on this and eventually we may want to weaken it. In particular, this is asking for scalability problems.

One of the cardinal rules that came out of my thesis was to not impose strict ordering on operations unless it is a fundamental requirement. It is by no means a fundamental requirement of the blocking order.

I fully support addressing the liveness issue, and I like the approach of eliminating the wake-up race, but if we're going to do that, we should think about whether we can introduce bounded randomness into the blocking order to prevent people from depending on strict FIFO blocking.
@aclements FIFO blocking seems harder to incorrectly rely on than map ordering. Perhaps, like goroutine scheduling, it should only be randomized with
This change will improve tail latency under bursty load, which is what we observed in one of our benchmarks in Vitess.
Flip a coin (linear feedback shift register) and choose one of the top two, until we come up with a better reason for some other policy. Starvation is exponentially rare, service is nearly FIFO, but the order will usually differ if there's more than one waiter.
I think we're overthinking the randomization. With randomized goroutine scheduling (which is already going in, at least for -race testing), we'll get randomized channel ordering for free. The random scheduling should change the order in which waiters get queued and the order in which sends get matched to the waiters.
You seem to assume that the entity that benefits from reduced latency is always a goroutine. This is true when a goroutine services a request and needs to acquire some auxiliary resource (e.g. a database connection). But it is not true for all producer-consumer/pipelining scenarios, where the entity that benefits from reduced latency is a message in a chan. Today messages in chans are serviced strictly FIFO and with minimal latency: the next running goroutine picks up the first message in a chan. What you propose improves the database connection pool scenario but equally worsens the producer-consumer scenario, because under your proposal we hand off the first message in the chan to a goroutine that will run who-knows-when, and a goroutine that drives by the chan the next moment either blocks or services the second message ahead of the first one. If we switch the implementation we will start receiving complaints about the other scenarios.
I like the imagery of a drive-by goroutine a lot; the literature also uses
CL https://golang.org/cl/9345 mentions this issue. |
This change is the same as CL #9345, which was reverted, except for a small bug fix. The only change is to the body of sendDirect and its call site. Also added a test.

The problem was during a channel send operation. The target of the send was a sleeping goroutine waiting to receive. We basically do:

1) Read the destination pointer out of the sudog structure.
2) Copy the value we're sending to that destination pointer.

Unfortunately, the previous change had a goroutine suspend point between 1 & 2 (the call to sendDirect). At that point the destination goroutine's stack could be copied (shrunk), so the pointer read in step 1 is no longer valid for step 2. Fixed by not allowing any suspension points between 1 & 2. I suspect the old code worked correctly basically by accident.

Fixes #13169

The original CL 9345: This change removes the retry mechanism we use for buffered channels. Instead, any sender waking up a receiver (or vice versa) completes the full protocol with its counterpart. This means the counterpart does not need to relock the channel when it wakes up. (Currently buffered channels need to relock on wakeup.)

For sends on a channel with waiting receivers, this change replaces two copies (sender->queue, queue->receiver) with one (sender->receiver). For receives on channels with a waiting sender, two copies are still required. This change unifies to a large degree the algorithm for buffered and unbuffered channels, simplifying the overall implementation.

Fixes #11506

Change-Id: I57dfa3fc219cffa4d48301ee15fe5479299efa09
Reviewed-on: https://go-review.googlesource.com/16740
Reviewed-by: Ian Lance Taylor <[email protected]>
I've had mail conversations with two different projects using Go recently, both centered around the surprise that if many goroutines are blocked on a particular channel, and that channel becomes available, it is still possible for running goroutines who "drive by" at the right time to get their operation in before the goroutines that are blocked, and it is possible for the goroutines who are blocked to be reordered as a result. If this happens repeatedly, the effect is that the blocked goroutines can block arbitrarily long even though the channel is known to be ready at regular intervals. I wonder if we should adjust the channel implementation to ensure that when a channel does become available for sending or receiving, blocked operations take priority over "drive by" operations. It seems to me that this can be done by completing the blocked operation as part of the operation that unblocks it.
Sends and receives on unbuffered channels already behave this way: a send with blocked receivers picks the first receiver goroutine off the queue, delivers the value to it, readies it (puts it on a run queue), and continues execution. That is, the send completes the blocked operation. Receives on unbuffered channels similarly complete blocked sends.
Sends and receives on buffered channels do not behave this way:
A send into a buffered channel with blocked receivers stores the value in the buffer, wakes up a receiver, and continues executing. When the receiver is eventually scheduled, it checks the channel, and maybe it gets lucky and the value is still there. But maybe not, in which case it goes to the end of the queue.
A receive out of a buffered channel copies a value out of the buffer, wakes a blocked sender, and continues. When the sender is eventually scheduled, it checks the channel, and maybe it gets lucky and there is still room in the buffer for the send. But maybe not, in which case it goes to the end of the queue.
It seems to me that it would be easy and symmetric with the unbuffered channel operations for a send with blocked receivers to deliver the value straight to a receiver, completing the first pending receive operation. Similarly a receive with blocked senders could take its own value out of the channel buffer and then complete the send into the newly emptied space. That would be the only operation of the four that needs to transfer a pair of values instead of just one.
It doesn't seem to me that it would hurt performance to do this, and in fact it might help, since there would not be all the unnecessary wakeups that happen now. It would give much more predictable behavior when goroutines queue up on a stuck channel: once queued, that's the order they're going to be served, guaranteed.
From a "happens before" point of view, I'm talking nonsense. There is no "happens before" for two different goroutines blocking on the same channel, and so there is no difference between all the execution orders. I understand that. But from a "what actually happens" point of view, there certainly is a difference: if you know that the goroutines have been stacking up one per minute for an hour before the channel finally gets unblocked and they all complete, you know what order they blocked and might reasonably expect them to unblock in that same order. This remains roughly as true at shorter time scales.
Especially since so many Go programs care about latency, making channels first come, first served seems worth doing. And it looks to me like it can be done trivially and at no performance cost. The current behavior looks more like a bug than an intended result.
I'm not talking about a language change, just a change to our implementation.
Thoughts?
@robpike @dvyukov @randall77 @aclements