
Reuse message buffer for streaming #1774

Closed · wants to merge 4 commits

Conversation

@coocood (Contributor) commented Dec 31, 2017

For #1455

@thelinuxfoundation

Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement.

After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.

Regards,
CLA GitHub bot

@coocood (Contributor, Author) commented Dec 31, 2017

@MakMukhi
Buffer reuse is only enabled in ClientStream, not in ServerStream.
This differs from what we discussed because I think it's very unlikely that a client keeps long-lived connections to thousands of servers.

@MakMukhi (Contributor) left a comment

Looks good. Can you run our benchmarks:

go run benchmark/benchmain/main.go -benchtime=10s -workloads=all

especially streaming with different message sizes, and post the before and after results on the PR?

Also, the build is failing; can you look into that too? Thanks for the help. :)

// This may allocate more memory than needed, but avoids allocating memory every time in the case when
// each message is slightly larger than the previous one.
allocLen *= 2
}
@MakMukhi (Contributor) commented:

How about:

allocLen := int(math.Pow(2, math.Ceil(math.Log2(length))+1))

@coocood (Contributor, Author) replied Jan 26, 2018:

@MakMukhi
I did a micro-benchmark: using the math functions takes more than 100ns, while multiplying by 2 in a for loop takes less than 10ns, for growing from 64B to 1MB.
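For context, a minimal sketch (not code from this PR) of the kind of micro-benchmark being described, written as a Go _test.go file; the package name and the functions growByDoubling and growByMath are illustrative assumptions:

// Hypothetical micro-benchmark comparing the two growth strategies discussed above.
package rpcutil

import (
	"math"
	"testing"
)

// growByDoubling doubles allocLen until it covers length, mirroring the
// `allocLen *= 2` loop in the diff excerpt above.
func growByDoubling(allocLen, length int) int {
	for allocLen < length {
		allocLen *= 2
	}
	return allocLen
}

// growByMath computes the allocation size with math functions, following the
// suggestion in the review comment.
func growByMath(length int) int {
	return int(math.Pow(2, math.Ceil(math.Log2(float64(length)))+1))
}

// Both benchmarks grow from 64B toward 1MB, the case described in the comment.
func BenchmarkGrowByDoubling(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = growByDoubling(64, 1<<20)
	}
}

func BenchmarkGrowByMath(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = growByMath(1 << 20)
	}
}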

rpc_util_test.go (outdated)
func TestBufferReuse(t *testing.T) {
for _, test := range []struct {
reuse bool
sameBuf bool
@MakMukhi (Contributor) commented:
I don't suppose we need two bools here, one would do.

MakMukhi assigned coocood and unassigned MakMukhi on Jan 25, 2018
@coocood (Contributor, Author) commented Jan 26, 2018

@MakMukhi
I'll post the benchmark results later.

The failing test is:

--- FAIL: TestDialWaitsForServerSettings (5.67s)
	clientconn_test.go:159: Error while dialing. Err: context deadline exceeded
	clientconn_test.go:138: Error while accepting. Err: accept tcp 127.0.0.1:44053: use of closed network connection

It doesn't seem related to this PR, and it only fails on a particular Go version.

@coocood (Contributor, Author) commented Jan 26, 2018

To focus on the issue I've found, I'm only posting the 1MB-response results for now.

Results after the change:

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_1048576B-Compressor_false: 
50_Latency: 6.8084 ms 	90_Latency: 10.3493 ms 	99_Latency: 111.5983 ms 	Avg latency: 16.0996 ms 	Count: 2485 	5123585 Bytes/op	478 Allocs/op	
Histogram (unit: ms)
Count: 2485  Min:   3.6  Max: 114.2  Avg: 16.10
------------------------------------------------------------
[        3.564034,         3.564035)     1    0.0%    0.0%  
[        3.564035,         3.564042)     0    0.0%    0.0%  
[        3.564042,         3.564095)     0    0.0%    0.0%  
[        3.564095,         3.564514)     0    0.0%    0.0%  
[        3.564514,         3.567793)     0    0.0%    0.0%  
[        3.567793,         3.593464)     0    0.0%    0.0%  
[        3.593464,         3.794469)     3    0.1%    0.2%  
[        3.794469,         5.368326)   237    9.5%    9.7%  #
[        5.368326,        17.691521)  2020   81.3%   91.0%  ########
[       17.691521,       114.181329)   223    9.0%  100.0%  #
[      114.181329,              inf)     1    0.0%  100.0%  

Results before the change:

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_1048576B-Compressor_false: 
50_Latency: 7.5285 ms 	90_Latency: 11.0990 ms 	99_Latency: 112.7684 ms 	Avg latency: 15.2175 ms 	Count: 2629 	5964215 Bytes/op	449 Allocs/op	
Histogram (unit: ms)
Count: 2629  Min:   3.9  Max: 116.1  Avg: 15.22
------------------------------------------------------------
[        3.883256,         3.883257)     1    0.0%    0.0%  
[        3.883257,         3.883264)     0    0.0%    0.0%  
[        3.883264,         3.883318)     0    0.0%    0.0%  
[        3.883318,         3.883738)     0    0.0%    0.0%  
[        3.883738,         3.887039)     0    0.0%    0.0%  
[        3.887039,         3.912927)     0    0.0%    0.0%  
[        3.912927,         4.115956)     3    0.1%    0.2%  
[        4.115956,         5.708251)   311   11.8%   12.0%  #
[        5.708251,        18.196155)  2118   80.6%   92.5%  ########
[       18.196155,       116.135119)   195    7.4%  100.0%  #
[      116.135119,              inf)     1    0.0%  100.0%  

The result is odd: the buffer-reuse branch allocates more than before.
Then I removed the length doubling and allocated exactly the needed length whenever it exceeds the capacity, and the result is much better:

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_1048576B-Compressor_false: 
50_Latency: 6.6709 ms 	90_Latency: 8.8538 ms 	99_Latency: 109.6082 ms 	Avg latency: 8.3528 ms 	Count: 4790 	5062523 Bytes/op	432 Allocs/op	
Histogram (unit: ms)
Count: 4790  Min:   3.4  Max: 114.9  Avg: 8.35
------------------------------------------------------------
[        3.432767,         3.432768)     1    0.0%    0.0%  
[        3.432768,         3.432775)     0    0.0%    0.0%  
[        3.432775,         3.432828)     0    0.0%    0.0%  
[        3.432828,         3.433248)     0    0.0%    0.0%  
[        3.433248,         3.436538)     0    0.0%    0.0%  
[        3.436538,         3.462321)     1    0.0%    0.0%  
[        3.462321,         3.664369)     4    0.1%    0.1%  
[        3.664369,         5.247720)   378    7.9%    8.0%  #
[        5.247720,        17.655699)  4338   90.6%   98.6%  #########
[       17.655699,       114.891160)    67    1.4%  100.0%  
[      114.891160,              inf)     1    0.0%  100.0%  

@coocood (Contributor, Author) commented Jan 26, 2018

@MakMukhi
It seems the allocation count and average latency are not stable across benchmark runs.

It has nothing to do with the length doubling.

I ran it many times; the 50th-percentile latency is reduced by about 9%.

Here are the results for different response sizes:

After:

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_1024B-Compressor_false:
50_Latency: 159.7260 µs 	90_Latency: 222.1620 µs 	99_Latency: 485.1390 µs 	Avg latency: 172.4190 µs 	Count: 231633 	8518 Bytes/op	55 Allocs/op
Histogram (unit: µs)
Count: 231633  Min:  75.9  Max: 1593.7  Avg: 172.42
------------------------------------------------------------
[     75.897000,      75.898000)       1    0.0%    0.0%
[     75.898000,      75.901862)       0    0.0%    0.0%
[     75.901862,      75.920638)       0    0.0%    0.0%
[     75.920638,      76.011923)       0    0.0%    0.0%
[     76.011923,      76.455742)       0    0.0%    0.0%
[     76.455742,      78.613530)       4    0.0%    0.0%
[     78.613530,      89.104400)     158    0.1%    0.1%
[     89.104400,     140.109594)   60321   26.0%   26.1%  ###
[    140.109594,     388.089960)  167790   72.4%   98.5%  #######
[    388.089960,    1593.737000)    3358    1.4%  100.0%
[   1593.737000,            inf)       1    0.0%  100.0%

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_65535B-Compressor_false:
50_Latency: 656.3540 µs 	90_Latency: 1211.1790 µs 	99_Latency: 102631.8410 µs 	Avg latency: 3609.9480 µs 	Count: 11080 	375379 Bytes/op	83 Allocs/op
Histogram (unit: µs)
Count: 11080  Min: 263.9  Max: 104126.2  Avg: 3609.95
------------------------------------------------------------
[      263.865000,       263.866000)      1    0.0%    0.0%
[      263.866000,       263.872775)      0    0.0%    0.0%
[      263.872775,       263.925455)      0    0.0%    0.0%
[      263.925455,       264.335059)      0    0.0%    0.0%
[      264.335059,       267.519856)      2    0.0%    0.0%
[      267.519856,       292.282632)      1    0.0%    0.0%
[      292.282632,       484.820832)   2172   19.6%   19.6%  ##
[      484.820832,      1981.864555)   8495   76.7%   96.3%  ########
[     1981.864555,     13621.839964)     97    0.9%   97.2%
[    13621.839964,    104126.229000)    311    2.8%  100.0%
[   104126.229000,              inf)      1    0.0%  100.0%

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_1048576B-Compressor_false:
50_Latency: 6.6519 ms 	90_Latency: 9.0018 ms 	99_Latency: 109.1591 ms 	Avg latency: 8.4781 ms 	Count: 4719 	5153053 Bytes/op	430 Allocs/op
Histogram (unit: ms)
Count: 4719  Min:   3.2  Max: 113.8  Avg: 8.48
------------------------------------------------------------
[        3.162232,         3.162233)     1    0.0%    0.0%
[        3.162233,         3.162240)     0    0.0%    0.0%
[        3.162240,         3.162293)     0    0.0%    0.0%
[        3.162293,         3.162712)     0    0.0%    0.0%
[        3.162712,         3.165991)     0    0.0%    0.0%
[        3.165991,         3.191661)     0    0.0%    0.0%
[        3.191661,         3.392662)     1    0.0%    0.0%
[        3.392662,         4.966478)   192    4.1%    4.1%
[        4.966478,        17.289304)  4451   94.3%   98.4%  #########
[       17.289304,       113.775874)    73    1.5%  100.0%
[      113.775874,              inf)     1    0.0%  100.0%

Before:

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_1024B-Compressor_false:
50_Latency: 162.9540 µs 	90_Latency: 233.4670 µs 	99_Latency: 525.4040 µs 	Avg latency: 177.6840 µs 	Count: 224778 	9666 Bytes/op	56 Allocs/op
Histogram (unit: µs)
Count: 224778  Min:  71.2  Max: 2502.2  Avg: 177.68
------------------------------------------------------------
[     71.225000,      71.226000)       1    0.0%    0.0%
[     71.226000,      71.230123)       0    0.0%    0.0%
[     71.230123,      71.251246)       0    0.0%    0.0%
[     71.251246,      71.359461)       1    0.0%    0.0%
[     71.359461,      71.913856)       0    0.0%    0.0%
[     71.913856,      74.754073)       2    0.0%    0.0%
[     74.754073,      89.304767)     189    0.1%    0.1%
[     89.304767,     163.849307)  114519   50.9%   51.0%  #####
[    163.849307,     545.747826)  108026   48.1%   99.1%  #####
[    545.747826,    2502.249000)    2039    0.9%  100.0%
[   2502.249000,            inf)       1    0.0%  100.0%

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_65535B-Compressor_false:
50_Latency: 777.7100 µs 	90_Latency: 1360.8740 µs 	99_Latency: 102505.5090 µs 	Avg latency: 3696.4090 µs 	Count: 10821 	444362 Bytes/op	84 Allocs/op
Histogram (unit: µs)
Count: 10821  Min: 304.8  Max: 105989.1  Avg: 3696.41
------------------------------------------------------------
[      304.817000,       304.818000)      1    0.0%    0.0%
[      304.818000,       304.824790)      0    0.0%    0.0%
[      304.824790,       304.877689)      0    0.0%    0.0%
[      304.877689,       305.289792)      0    0.0%    0.0%
[      305.289792,       308.500213)      0    0.0%    0.0%
[      308.500213,       333.510505)      7    0.1%    0.1%
[      333.510505,       528.349317)   2143   19.8%   19.9%  ##
[      528.349317,      2046.210983)   8222   76.0%   95.9%  ########
[     2046.210983,     13870.877808)    144    1.3%   97.2%
[    13870.877808,    105989.118000)    303    2.8%  100.0%
[   105989.118000,              inf)      1    0.0%  100.0%

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_1048576B-Compressor_false:
50_Latency: 7.5659 ms 	90_Latency: 12.1183 ms 	99_Latency: 112.5573 ms 	Avg latency: 13.3949 ms 	Count: 3005 	5975819 Bytes/op	435 Allocs/op
Histogram (unit: ms)
Count: 3005  Min:   3.7  Max: 117.6  Avg: 13.39
------------------------------------------------------------
[        3.671680,         3.671681)     1    0.0%    0.0%
[        3.671681,         3.671688)     0    0.0%    0.0%
[        3.671688,         3.671742)     0    0.0%    0.0%
[        3.671742,         3.672165)     0    0.0%    0.0%
[        3.672165,         3.675489)     0    0.0%    0.0%
[        3.675489,         3.701603)     0    0.0%    0.0%
[        3.701603,         3.906753)     3    0.1%    0.1%
[        3.906753,         5.518406)   198    6.6%    6.7%  #
[        5.518406,        18.179525)  2626   87.4%   94.1%  #########
[       18.179525,       117.645016)   176    5.9%  100.0%  #
[      117.645016,              inf)     1    0.0%  100.0%

@dfawley (Member) left a comment

FYI: you could also do buffer re-use in the serverStream (server.go) with presumably similar benefit.

@coocood (Contributor, Author) commented Jan 27, 2018

@dfawley
OK, I'll try to do it.

Since some server applications need to maintain thousands of idle client connections, keeping a buffer around would dramatically increase memory usage in that case, so we need to add a server option to disable it.
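A minimal sketch of the kind of opt-out being described, using the functional-options pattern that grpc-go's server options follow; the option name WithRecvBufferReuse and the serverOptions field are hypothetical, not part of this PR or the gRPC API:

package main

import "fmt"

// serverOptions stands in for grpc-go's internal server options struct;
// the recvBufferReuse field is hypothetical.
type serverOptions struct {
	recvBufferReuse bool
}

// ServerOption mirrors the functional-options pattern used by grpc.NewServer.
type ServerOption func(*serverOptions)

// WithRecvBufferReuse is a hypothetical option letting servers that hold
// thousands of mostly idle connections opt out of keeping a per-stream
// receive buffer alive between messages.
func WithRecvBufferReuse(enabled bool) ServerOption {
	return func(o *serverOptions) { o.recvBufferReuse = enabled }
}

func main() {
	opts := serverOptions{recvBufferReuse: true} // default: reuse enabled
	for _, opt := range []ServerOption{WithRecvBufferReuse(false)} {
		opt(&opts)
	}
	fmt.Println("buffer reuse enabled:", opts.recvBufferReuse)
}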

coocood changed the title from "Reuse message buffer for ClientStream" to "Reuse message buffer for Streaming" on Jan 30, 2018
coocood changed the title from "Reuse message buffer for Streaming" to "Reuse message buffer for streaming" on Jan 30, 2018
@coocood (Contributor, Author) commented Jan 30, 2018

@dfawley
I enabled the server-side buffer; here are the benchmark results.

Client-side buffer only:

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_65536B-respSize_1024B-Compressor_false:
50_Latency: 875.3420 µs 	90_Latency: 1349.9730 µs 	99_Latency: 103020.3270 µs 	Avg latency: 3807.4580 µs 	Count: 10594 	444244 Bytes/op	84 Allocs/op
Histogram (unit: µs)
Count: 10594  Min: 302.8  Max: 105551.6  Avg: 3807.46
------------------------------------------------------------
[      302.849000,       302.850000)      1    0.0%    0.0%
[      302.850000,       302.856787)      0    0.0%    0.0%
[      302.856787,       302.909634)      0    0.0%    0.0%
[      302.909634,       303.321142)      0    0.0%    0.0%
[      303.321142,       306.525459)      0    0.0%    0.0%
[      306.525459,       331.476744)      2    0.0%    0.0%
[      331.476744,       525.766696)   1602   15.1%   15.2%  ##
[      525.766696,      2038.658137)   8594   81.1%   96.3%  ########
[     2038.658137,     13819.197919)     91    0.9%   97.1%
[    13819.197919,    105551.568000)    303    2.9%  100.0%
[   105551.568000,              inf)      1    0.0%  100.0%

Client-side and server-side buffers enabled:

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_65536B-respSize_1024B-Compressor_false:
50_Latency: 695.1510 µs 	90_Latency: 1253.2250 µs 	99_Latency: 102783.2740 µs 	Avg latency: 3156.0240 µs 	Count: 12752 	377945 Bytes/op	83 Allocs/op
Histogram (unit: µs)
Count: 12752  Min: 260.8  Max: 106054.7  Avg: 3156.02
------------------------------------------------------------
[      260.767000,       260.768000)      1    0.0%    0.0%
[      260.768000,       260.774791)      0    0.0%    0.0%
[      260.774791,       260.827703)      0    0.0%    0.0%
[      260.827703,       261.239955)      0    0.0%    0.0%
[      261.239955,       264.451911)      0    0.0%    0.0%
[      264.451911,       289.477036)      2    0.0%    0.0%
[      289.477036,       484.453870)   2218   17.4%   17.4%  ##
[      484.453870,      2003.565754)  10113   79.3%   96.7%  ########
[     2003.565754,     13839.335561)    122    1.0%   97.7%
[    13839.335561,    106054.694000)    295    2.3%  100.0%
[   106054.694000,              inf)      1    0.0%  100.0%

@MakMukhi (Contributor) commented Feb 6, 2018

Hey @coocood, we recently came up with an idea to get rid of this buffer entirely. Currently, the transport reads data into individual buffers and puts them on recvBuffer (which is effectively an unbounded buffer). The gRPC layer looks at the data header to see how big the message is, allocates a buffer for it, and asks the transport to fill it up.
We are thinking about making the transport do all of that work: look at the header, figure out the message size, and allocate memory for the whole message instead of small data chunks. The gRPC layer will then accept a byte slice from the transport instead of asking the transport to fill one.
This will save multiple allocations and one copy for every message. However, it also requires updating the stream-level flow control.
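A rough sketch of that direction (not the eventual implementation), under the assumption that the transport parses the standard 5-byte gRPC message prefix (1-byte compression flag, 4-byte big-endian length) and allocates one buffer per message; readWholeMessage and all other names are illustrative:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readWholeMessage sketches the proposed transport-side behavior: read the
// 5-byte gRPC message prefix, allocate a single buffer sized to the whole
// message, fill it, and hand that slice up to the gRPC layer instead of
// passing up many small data chunks that must be copied into a second buffer.
func readWholeMessage(r io.Reader) (compressed bool, msg []byte, err error) {
	var hdr [5]byte
	if _, err = io.ReadFull(r, hdr[:]); err != nil {
		return false, nil, err
	}
	length := binary.BigEndian.Uint32(hdr[1:])
	msg = make([]byte, length) // one allocation for the whole message
	if _, err = io.ReadFull(r, msg); err != nil {
		return false, nil, err
	}
	return hdr[0] == 1, msg, nil
}

func main() {
	// Frame a 3-byte payload with the gRPC message prefix for demonstration.
	payload := []byte{0xA, 0xB, 0xC}
	var frame bytes.Buffer
	frame.WriteByte(0) // not compressed
	binary.Write(&frame, binary.BigEndian, uint32(len(payload)))
	frame.Write(payload)

	compressed, msg, err := readWholeMessage(&frame)
	fmt.Println(compressed, msg, err) // false [10 11 12] <nil>
}

The sketch leaves out the stream-level flow-control changes mentioned above, which are the harder part of the plan.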

In light of this new plan, I'm going to go ahead and close this PR. We greatly appreciate the time and effort you've put in.

MakMukhi closed this on Feb 6, 2018
@coocood (Contributor, Author) commented Feb 7, 2018

@MakMukhi
The new design sounds good.
How long will it take to implement?

@MakMukhi (Contributor) commented Feb 9, 2018 via email

The lock bot locked this conversation as resolved and limited it to collaborators on Jan 18, 2019.