-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
transport: client keepalives should be sent every [Time] period #3102
Conversation
This commit makes the following changes: - Keep track of the time of the last read in the transport. - Use this in the keepalive implementation to decide when to send out keepalives. - Address the issue of keepalives being sent every [Time+Timeout] period instead of every [Time] period, as mandated by proposal A8. Proposal A8 is here: https://github.com/grpc/proposal/blob/master/A8-client-side-keepalive.md
This PR supersedes #2790, which it became unmanageable to fix the merge conflicts, given for how long it has been open. |
I ran the benchmarks with the following options::
and it returned the following results::
which is quite surprising that it shows a performance improvement. But I tried running it multiple times and it seems to be consistent. |
32-bit tests on travis seem to be unhappy, even though I have put the Am I missing something here? |
Moving the So, maybe the compiler just optimized the whole Should I just get rid of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This takes care of the client - there should be similar changes necessary on the server too, right?
internal/transport/http2_client.go
Outdated
@@ -1322,36 +1351,37 @@ func (t *http2Client) keepalive() { | |||
return | |||
} | |||
if len(t.activeStreams) < 1 && !t.kp.PermitWithoutStream { | |||
// If a ping was sent out previously (because there were active | |||
// streams at that point) which wasn't acked and it's timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Language nit: it's -> its
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
internal/transport/http2_client.go
Outdated
// acked). | ||
sleepDuration := minTime(t.kp.Time, timeoutLeft) | ||
timeoutLeft -= sleepDuration | ||
prevNano = time.Now().UnixNano() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we do this here? prevNano
is used to compare against lastRead
. If lastRead
is updated between line 1330 and here, we will fail to count that activity.
Consider a crazy race where the ping sent on 1373 is ack'd and that activity is recorded before the code gets here. Yes this is extremely unlikely, but it's theoretically possible which means something is wrong.
Should prevNano
be set to the lastRead
time (always, including line 1333)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to set prevNano
to lastRead
all the time.
On 1336: Where the timer has fired, but we have had read activity since the last time we were here, it definitely makes sense to set prevNano
to lastRead
because we are effectively setting the timer to fire kp.Time
seconds after lastRead
. So, the next time we get here, we should check if there has been read activity since then. This also eliminates the extremely unlikely race of having read_activity after our atomic read of t.lr.timeNano
and the setting of prevNano
to time.Now
here (and no read_activty till the timer fires the next time).
Here around 1384 also, it makes sense to set it to lastRead
. It took me some time to convince myself that this works also for the case where we have just come out of dormancy and have sent out a ping, but I feel convinced now. Please let me know if you feel otherwise.
Thanks for this comment.
internal/transport/keepalive_test.go
Outdated
func TestKeepaliveClientFrequency(t *testing.T) { | ||
serverConfig := &ServerConfig{ | ||
KeepalivePolicy: keepalive.EnforcementPolicy{ | ||
MinTime: 2100 * time.Millisecond, // 2.1 seconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be even more aggressive, like 1.1s? Or will it be flaky (because Travis)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was initially mistaken to think that the server waits for 2 pingStrikes
within the configured MinTime
before sending a GOAWAY.
Changed it to 1.2 seconds, and the test passed with the race detector for 50 runs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL.
internal/transport/http2_client.go
Outdated
@@ -1322,36 +1351,37 @@ func (t *http2Client) keepalive() { | |||
return | |||
} | |||
if len(t.activeStreams) < 1 && !t.kp.PermitWithoutStream { | |||
// If a ping was sent out previously (because there were active | |||
// streams at that point) which wasn't acked and it's timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
internal/transport/http2_client.go
Outdated
// acked). | ||
sleepDuration := minTime(t.kp.Time, timeoutLeft) | ||
timeoutLeft -= sleepDuration | ||
prevNano = time.Now().UnixNano() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to set prevNano
to lastRead
all the time.
On 1336: Where the timer has fired, but we have had read activity since the last time we were here, it definitely makes sense to set prevNano
to lastRead
because we are effectively setting the timer to fire kp.Time
seconds after lastRead
. So, the next time we get here, we should check if there has been read activity since then. This also eliminates the extremely unlikely race of having read_activity after our atomic read of t.lr.timeNano
and the setting of prevNano
to time.Now
here (and no read_activty till the timer fires the next time).
Here around 1384 also, it makes sense to set it to lastRead
. It took me some time to convince myself that this works also for the case where we have just come out of dormancy and have sent out a ping, but I feel convinced now. Please let me know if you feel otherwise.
Thanks for this comment.
internal/transport/keepalive_test.go
Outdated
func TestKeepaliveClientFrequency(t *testing.T) { | ||
serverConfig := &ServerConfig{ | ||
KeepalivePolicy: keepalive.EnforcementPolicy{ | ||
MinTime: 2100 * time.Millisecond, // 2.1 seconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was initially mistaken to think that the server waits for 2 pingStrikes
within the configured MinTime
before sending a GOAWAY.
Changed it to 1.2 seconds, and the test passed with the race detector for 50 runs.
That is true. I could do it in a follow-up PR. |
internal/transport/http2_client.go
Outdated
@@ -47,6 +47,7 @@ import ( | |||
|
|||
// http2Client implements the ClientTransport interface with HTTP2. | |||
type http2Client struct { | |||
lr lastRead // keep this field 64-bit aligned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this int64 and delete the struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
internal/transport/keepalive_test.go
Outdated
@@ -350,6 +350,53 @@ func TestKeepaliveClientStaysHealthyWithResponsiveServer(t *testing.T) { | |||
} | |||
} | |||
|
|||
// TestKeepaliveClientFrequency creates a server which expects at most 2 client | |||
// pings for every 2.1 seconds, while the client is configured to send a ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This text needs to be updated in light of the number changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, didn't realize there was one more comment needing my attention and that I had not updated the thread. PTAL.
Apparently it was my fault because I changed the assignee from me to me instead of me to you. 🤦♂️ |
This commit makes the following changes:
keepalives.
instead of every [Time] period, as mandated by proposal A8.
Proposal A8 is here:
https://github.com/grpc/proposal/blob/master/A8-client-side-keepalive.md