
Reuse message buffer for streaming #1774

Closed · wants to merge 4 commits

Conversation

@coocood (Contributor) commented Dec 31, 2017

For #1455

@thelinuxfoundation

Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement.

After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.

Regards,
CLA GitHub bot

@coocood (Contributor, Author) commented Dec 31, 2017

@MakMukhi
Buffer reuse is only enabled in ClientStream, not in ServerStream.
This differs from what we discussed because I think it's very unlikely that a client keeps long-lived connections to thousands of servers.

@MakMukhi (Contributor) left a comment

Looks good. Can you run our benchmarks:

go run benchmark/benchmain/main.go -benchtime=10s -workloads=all

especially streaming with different message sizes, and post the before and after results on the PR?

Also, the build is failing; can you look into that too? Thanks for the help. :)

// This may allocate more memory than needed, but avoids allocating memory every time in the case when
// each message is slightly larger than the previous one.
allocLen *= 2
}
@MakMukhi (Contributor) commented:

How about:

allocLen := int(math.Pow(2, math.Ceil(math.Log2(length))+1))

@coocood (Contributor, Author) replied Jan 26, 2018:

@MakMukhi
I did a micro-benchmark: using the math functions takes more than 100ns, while multiplying by 2 in a for loop takes less than 10ns, for growing from 64B to 1MB.
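For context, a minimal sketch (not code from this PR) of the kind of micro-benchmark being described, written as a Go _test.go file; the package name and the functions growByDoubling and growByMath are illustrative assumptions:

// Hypothetical micro-benchmark comparing the two growth strategies discussed above.
package rpcutil

import (
	"math"
	"testing"
)

// growByDoubling doubles allocLen until it covers length, mirroring the
// `allocLen *= 2` loop in the diff excerpt above.
func growByDoubling(allocLen, length int) int {
	for allocLen < length {
		allocLen *= 2
	}
	return allocLen
}

// growByMath computes the allocation size with math functions, following the
// suggestion in the review comment.
func growByMath(length int) int {
	return int(math.Pow(2, math.Ceil(math.Log2(float64(length)))+1))
}

// Both benchmarks grow from 64B toward 1MB, the case described in the comment.
func BenchmarkGrowByDoubling(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = growByDoubling(64, 1<<20)
	}
}

func BenchmarkGrowByMath(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = growByMath(1 << 20)
	}
}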

rpc_util_test.go (outdated)
func TestBufferReuse(t *testing.T) {
for _, test := range []struct {
reuse bool
sameBuf bool
@MakMukhi (Contributor) commented:
I don't suppose we need two bools here, one would do.

MakMukhi assigned coocood and unassigned MakMukhi on Jan 25, 2018
@coocood (Contributor, Author) commented Jan 26, 2018

@MakMukhi
I'll post the benchmark results later.

The failing test is:

--- FAIL: TestDialWaitsForServerSettings (5.67s)
	clientconn_test.go:159: Error while dialing. Err: context deadline exceeded
	clientconn_test.go:138: Error while accepting. Err: accept tcp 127.0.0.1:44053: use of closed network connection

It doesn't seem related to this PR, and it only fails on a particular Go version.

@coocood (Contributor, Author) commented Jan 26, 2018

To focus on the issue I've found, I'm only posting the 1MB-response results for now.

Results after the change:

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_1048576B-Compressor_false: 
50_Latency: 6.8084 ms 	90_Latency: 10.3493 ms 	99_Latency: 111.5983 ms 	Avg latency: 16.0996 ms 	Count: 2485 	5123585 Bytes/op	478 Allocs/op	
Histogram (unit: ms)
Count: 2485  Min:   3.6  Max: 114.2  Avg: 16.10
------------------------------------------------------------
[        3.564034,         3.564035)     1    0.0%    0.0%  
[        3.564035,         3.564042)     0    0.0%    0.0%  
[        3.564042,         3.564095)     0    0.0%    0.0%  
[        3.564095,         3.564514)     0    0.0%    0.0%  
[        3.564514,         3.567793)     0    0.0%    0.0%  
[        3.567793,         3.593464)     0    0.0%    0.0%  
[        3.593464,         3.794469)     3    0.1%    0.2%  
[        3.794469,         5.368326)   237    9.5%    9.7%  #
[        5.368326,        17.691521)  2020   81.3%   91.0%  ########
[       17.691521,       114.181329)   223    9.0%  100.0%  #
[      114.181329,              inf)     1    0.0%  100.0%  

Results before the change:

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_1048576B-Compressor_false: 
50_Latency: 7.5285 ms 	90_Latency: 11.0990 ms 	99_Latency: 112.7684 ms 	Avg latency: 15.2175 ms 	Count: 2629 	5964215 Bytes/op	449 Allocs/op	
Histogram (unit: ms)
Count: 2629  Min:   3.9  Max: 116.1  Avg: 15.22
------------------------------------------------------------
[        3.883256,         3.883257)     1    0.0%    0.0%  
[        3.883257,         3.883264)     0    0.0%    0.0%  
[        3.883264,         3.883318)     0    0.0%    0.0%  
[        3.883318,         3.883738)     0    0.0%    0.0%  
[        3.883738,         3.887039)     0    0.0%    0.0%  
[        3.887039,         3.912927)     0    0.0%    0.0%  
[        3.912927,         4.115956)     3    0.1%    0.2%  
[        4.115956,         5.708251)   311   11.8%   12.0%  #
[        5.708251,        18.196155)  2118   80.6%   92.5%  ########
[       18.196155,       116.135119)   195    7.4%  100.0%  #
[      116.135119,              inf)     1    0.0%  100.0%  

The result is odd: the buffer-reuse branch allocates more than before.
Then I removed the length doubling and allocated exactly the needed length whenever it exceeds the capacity, and the result is much better:

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_1048576B-Compressor_false: 
50_Latency: 6.6709 ms 	90_Latency: 8.8538 ms 	99_Latency: 109.6082 ms 	Avg latency: 8.3528 ms 	Count: 4790 	5062523 Bytes/op	432 Allocs/op	
Histogram (unit: ms)
Count: 4790  Min:   3.4  Max: 114.9  Avg: 8.35
------------------------------------------------------------
[        3.432767,         3.432768)     1    0.0%    0.0%  
[        3.432768,         3.432775)     0    0.0%    0.0%  
[        3.432775,         3.432828)     0    0.0%    0.0%  
[        3.432828,         3.433248)     0    0.0%    0.0%  
[        3.433248,         3.436538)     0    0.0%    0.0%  
[        3.436538,         3.462321)     1    0.0%    0.0%  
[        3.462321,         3.664369)     4    0.1%    0.1%  
[        3.664369,         5.247720)   378    7.9%    8.0%  #
[        5.247720,        17.655699)  4338   90.6%   98.6%  #########
[       17.655699,       114.891160)    67    1.4%  100.0%  
[      114.891160,              inf)     1    0.0%  100.0%  

@coocood (Contributor, Author) commented Jan 26, 2018

@MakMukhi
It seems the allocation count and average latency are not stable across benchmark runs.

It has nothing to do with the length doubling.

I ran it many times; the 50th-percentile latency is reduced by about 9%.

Here are the results for different response sizes:

After:

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_1024B-Compressor_false:
50_Latency: 159.7260 µs 	90_Latency: 222.1620 µs 	99_Latency: 485.1390 µs 	Avg latency: 172.4190 µs 	Count: 231633 	8518 Bytes/op	55 Allocs/op
Histogram (unit: µs)
Count: 231633  Min:  75.9  Max: 1593.7  Avg: 172.42
------------------------------------------------------------
[     75.897000,      75.898000)       1    0.0%    0.0%
[     75.898000,      75.901862)       0    0.0%    0.0%
[     75.901862,      75.920638)       0    0.0%    0.0%
[     75.920638,      76.011923)       0    0.0%    0.0%
[     76.011923,      76.455742)       0    0.0%    0.0%
[     76.455742,      78.613530)       4    0.0%    0.0%
[     78.613530,      89.104400)     158    0.1%    0.1%
[     89.104400,     140.109594)   60321   26.0%   26.1%  ###
[    140.109594,     388.089960)  167790   72.4%   98.5%  #######
[    388.089960,    1593.737000)    3358    1.4%  100.0%
[   1593.737000,            inf)       1    0.0%  100.0%

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_65535B-Compressor_false:
50_Latency: 656.3540 µs 	90_Latency: 1211.1790 µs 	99_Latency: 102631.8410 µs 	Avg latency: 3609.9480 µs 	Count: 11080 	375379 Bytes/op	83 Allocs/op
Histogram (unit: µs)
Count: 11080  Min: 263.9  Max: 104126.2  Avg: 3609.95
------------------------------------------------------------
[      263.865000,       263.866000)      1    0.0%    0.0%
[      263.866000,       263.872775)      0    0.0%    0.0%
[      263.872775,       263.925455)      0    0.0%    0.0%
[      263.925455,       264.335059)      0    0.0%    0.0%
[      264.335059,       267.519856)      2    0.0%    0.0%
[      267.519856,       292.282632)      1    0.0%    0.0%
[      292.282632,       484.820832)   2172   19.6%   19.6%  ##
[      484.820832,      1981.864555)   8495   76.7%   96.3%  ########
[     1981.864555,     13621.839964)     97    0.9%   97.2%
[    13621.839964,    104126.229000)    311    2.8%  100.0%
[   104126.229000,              inf)      1    0.0%  100.0%

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_1048576B-Compressor_false:
50_Latency: 6.6519 ms 	90_Latency: 9.0018 ms 	99_Latency: 109.1591 ms 	Avg latency: 8.4781 ms 	Count: 4719 	5153053 Bytes/op	430 Allocs/op
Histogram (unit: ms)
Count: 4719  Min:   3.2  Max: 113.8  Avg: 8.48
------------------------------------------------------------
[        3.162232,         3.162233)     1    0.0%    0.0%
[        3.162233,         3.162240)     0    0.0%    0.0%
[        3.162240,         3.162293)     0    0.0%    0.0%
[        3.162293,         3.162712)     0    0.0%    0.0%
[        3.162712,         3.165991)     0    0.0%    0.0%
[        3.165991,         3.191661)     0    0.0%    0.0%
[        3.191661,         3.392662)     1    0.0%    0.0%
[        3.392662,         4.966478)   192    4.1%    4.1%
[        4.966478,        17.289304)  4451   94.3%   98.4%  #########
[       17.289304,       113.775874)    73    1.5%  100.0%
[      113.775874,              inf)     1    0.0%  100.0%

Before:

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_1024B-Compressor_false:
50_Latency: 162.9540 µs 	90_Latency: 233.4670 µs 	99_Latency: 525.4040 µs 	Avg latency: 177.6840 µs 	Count: 224778 	9666 Bytes/op	56 Allocs/op
Histogram (unit: µs)
Count: 224778  Min:  71.2  Max: 2502.2  Avg: 177.68
------------------------------------------------------------
[     71.225000,      71.226000)       1    0.0%    0.0%
[     71.226000,      71.230123)       0    0.0%    0.0%
[     71.230123,      71.251246)       0    0.0%    0.0%
[     71.251246,      71.359461)       1    0.0%    0.0%
[     71.359461,      71.913856)       0    0.0%    0.0%
[     71.913856,      74.754073)       2    0.0%    0.0%
[     74.754073,      89.304767)     189    0.1%    0.1%
[     89.304767,     163.849307)  114519   50.9%   51.0%  #####
[    163.849307,     545.747826)  108026   48.1%   99.1%  #####
[    545.747826,    2502.249000)    2039    0.9%  100.0%
[   2502.249000,            inf)       1    0.0%  100.0%

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_65535B-Compressor_false:
50_Latency: 777.7100 µs 	90_Latency: 1360.8740 µs 	99_Latency: 102505.5090 µs 	Avg latency: 3696.4090 µs 	Count: 10821 	444362 Bytes/op	84 Allocs/op
Histogram (unit: µs)
Count: 10821  Min: 304.8  Max: 105989.1  Avg: 3696.41
------------------------------------------------------------
[      304.817000,       304.818000)      1    0.0%    0.0%
[      304.818000,       304.824790)      0    0.0%    0.0%
[      304.824790,       304.877689)      0    0.0%    0.0%
[      304.877689,       305.289792)      0    0.0%    0.0%
[      305.289792,       308.500213)      0    0.0%    0.0%
[      308.500213,       333.510505)      7    0.1%    0.1%
[      333.510505,       528.349317)   2143   19.8%   19.9%  ##
[      528.349317,      2046.210983)   8222   76.0%   95.9%  ########
[     2046.210983,     13870.877808)    144    1.3%   97.2%
[    13870.877808,    105989.118000)    303    2.8%  100.0%
[   105989.118000,              inf)      1    0.0%  100.0%

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_64B-respSize_1048576B-Compressor_false:
50_Latency: 7.5659 ms 	90_Latency: 12.1183 ms 	99_Latency: 112.5573 ms 	Avg latency: 13.3949 ms 	Count: 3005 	5975819 Bytes/op	435 Allocs/op
Histogram (unit: ms)
Count: 3005  Min:   3.7  Max: 117.6  Avg: 13.39
------------------------------------------------------------
[        3.671680,         3.671681)     1    0.0%    0.0%
[        3.671681,         3.671688)     0    0.0%    0.0%
[        3.671688,         3.671742)     0    0.0%    0.0%
[        3.671742,         3.672165)     0    0.0%    0.0%
[        3.672165,         3.675489)     0    0.0%    0.0%
[        3.675489,         3.701603)     0    0.0%    0.0%
[        3.701603,         3.906753)     3    0.1%    0.1%
[        3.906753,         5.518406)   198    6.6%    6.7%  #
[        5.518406,        18.179525)  2626   87.4%   94.1%  #########
[       18.179525,       117.645016)   176    5.9%  100.0%  #
[      117.645016,              inf)     1    0.0%  100.0%

@dfawley (Member) left a comment

FYI: you could also do buffer re-use in the serverStream (server.go) with presumably similar benefit.

@coocood (Contributor, Author) commented Jan 27, 2018

@dfawley
OK, I'll try to do it.

Since some server applications need to maintain thousands of idle client connections, keeping a buffer around would dramatically increase memory usage in that case, so we need to add a server option to disable it.
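A minimal sketch of the kind of opt-out being described, using the functional-options pattern that grpc-go's server options follow; the option name WithRecvBufferReuse and the serverOptions field are hypothetical, not part of this PR or the gRPC API:

package main

import "fmt"

// serverOptions stands in for grpc-go's internal server options struct;
// the recvBufferReuse field is hypothetical.
type serverOptions struct {
	recvBufferReuse bool
}

// ServerOption mirrors the functional-options pattern used by grpc.NewServer.
type ServerOption func(*serverOptions)

// WithRecvBufferReuse is a hypothetical option letting servers that hold
// thousands of mostly idle connections opt out of keeping a per-stream
// receive buffer alive between messages.
func WithRecvBufferReuse(enabled bool) ServerOption {
	return func(o *serverOptions) { o.recvBufferReuse = enabled }
}

func main() {
	opts := serverOptions{recvBufferReuse: true} // default: reuse enabled
	for _, opt := range []ServerOption{WithRecvBufferReuse(false)} {
		opt(&opts)
	}
	fmt.Println("buffer reuse enabled:", opts.recvBufferReuse)
}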

coocood changed the title from "Reuse message buffer for ClientStream" to "Reuse message buffer for Streaming" on Jan 30, 2018
coocood changed the title from "Reuse message buffer for Streaming" to "Reuse message buffer for streaming" on Jan 30, 2018
@coocood (Contributor, Author) commented Jan 30, 2018

@dfawley
I enabled the server-side buffer; here are the benchmark results.

Client-side buffer only:

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_65536B-respSize_1024B-Compressor_false:
50_Latency: 875.3420 µs 	90_Latency: 1349.9730 µs 	99_Latency: 103020.3270 µs 	Avg latency: 3807.4580 µs 	Count: 10594 	444244 Bytes/op	84 Allocs/op
Histogram (unit: µs)
Count: 10594  Min: 302.8  Max: 105551.6  Avg: 3807.46
------------------------------------------------------------
[      302.849000,       302.850000)      1    0.0%    0.0%
[      302.850000,       302.856787)      0    0.0%    0.0%
[      302.856787,       302.909634)      0    0.0%    0.0%
[      302.909634,       303.321142)      0    0.0%    0.0%
[      303.321142,       306.525459)      0    0.0%    0.0%
[      306.525459,       331.476744)      2    0.0%    0.0%
[      331.476744,       525.766696)   1602   15.1%   15.2%  ##
[      525.766696,      2038.658137)   8594   81.1%   96.3%  ########
[     2038.658137,     13819.197919)     91    0.9%   97.1%
[    13819.197919,    105551.568000)    303    2.9%  100.0%
[   105551.568000,              inf)      1    0.0%  100.0%

Client-side and server-side buffers enabled:

Stream-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_4-reqSize_65536B-respSize_1024B-Compressor_false:
50_Latency: 695.1510 µs 	90_Latency: 1253.2250 µs 	99_Latency: 102783.2740 µs 	Avg latency: 3156.0240 µs 	Count: 12752 	377945 Bytes/op	83 Allocs/op
Histogram (unit: µs)
Count: 12752  Min: 260.8  Max: 106054.7  Avg: 3156.02
------------------------------------------------------------
[      260.767000,       260.768000)      1    0.0%    0.0%
[      260.768000,       260.774791)      0    0.0%    0.0%
[      260.774791,       260.827703)      0    0.0%    0.0%
[      260.827703,       261.239955)      0    0.0%    0.0%
[      261.239955,       264.451911)      0    0.0%    0.0%
[      264.451911,       289.477036)      2    0.0%    0.0%
[      289.477036,       484.453870)   2218   17.4%   17.4%  ##
[      484.453870,      2003.565754)  10113   79.3%   96.7%  ########
[     2003.565754,     13839.335561)    122    1.0%   97.7%
[    13839.335561,    106054.694000)    295    2.3%  100.0%
[   106054.694000,              inf)      1    0.0%  100.0%

@MakMukhi (Contributor) commented Feb 6, 2018

Hey @coocood, we recently came up with an idea to get rid of this buffer entirely. Currently, the transport reads data into individual buffers and puts them on recvBuffer (which is effectively an unbounded buffer). The gRPC layer looks at the data header to see how big the message is, allocates a buffer for it, and asks the transport to fill it up.
We are thinking about making the transport do all of that work: look at the header, figure out the message size, and allocate memory for the whole message instead of small data chunks. The gRPC layer will then accept a byte slice from the transport instead of asking the transport to fill one.
This will save multiple allocations and one copy for every message. However, it also requires updating the stream-level flow control.
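A rough sketch of that direction (not the eventual implementation), under the assumption that the transport parses the standard 5-byte gRPC message prefix (1-byte compression flag, 4-byte big-endian length) and allocates one buffer per message; readWholeMessage and all other names are illustrative:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readWholeMessage sketches the proposed transport-side behavior: read the
// 5-byte gRPC message prefix, allocate a single buffer sized to the whole
// message, fill it, and hand that slice up to the gRPC layer instead of
// passing up many small data chunks that must be copied into a second buffer.
func readWholeMessage(r io.Reader) (compressed bool, msg []byte, err error) {
	var hdr [5]byte
	if _, err = io.ReadFull(r, hdr[:]); err != nil {
		return false, nil, err
	}
	length := binary.BigEndian.Uint32(hdr[1:])
	msg = make([]byte, length) // one allocation for the whole message
	if _, err = io.ReadFull(r, msg); err != nil {
		return false, nil, err
	}
	return hdr[0] == 1, msg, nil
}

func main() {
	// Frame a 3-byte payload with the gRPC message prefix for demonstration.
	payload := []byte{0xA, 0xB, 0xC}
	var frame bytes.Buffer
	frame.WriteByte(0) // not compressed
	binary.Write(&frame, binary.BigEndian, uint32(len(payload)))
	frame.Write(payload)

	compressed, msg, err := readWholeMessage(&frame)
	fmt.Println(compressed, msg, err) // false [10 11 12] <nil>
}

The sketch leaves out the stream-level flow-control changes mentioned above, which are the harder part of the plan.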

In light of this new plan, I'm going to go ahead and close this PR. We greatly appreciate the time and effort you've put in.

MakMukhi closed this on Feb 6, 2018
@coocood (Contributor, Author) commented Feb 7, 2018

@MakMukhi
The new design sounds good.
How long will it take to implement?

@MakMukhi (Contributor) commented Feb 9, 2018 via email

The lock bot locked this conversation as resolved and limited it to collaborators on Jan 18, 2019.