minor perf: reduce fine grained buffer access when encoding HTTP1 headers #19115
Conversation
…uest Signed-off-by: wbpcode <[email protected]>
Signed-off-by: wbpcode <[email protected]>
Nice! Can you add both unit tests and a microbenchmark test to this PR? You can see a nearby microbenchmark test in test/common/http/header_map_impl_speed_test.cc, bazel target //test/common/http:header_map_impl_speed_test, which you could copy into the http1 subdir to capture the performance improvement directly. Thanks!
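For readers who want to reproduce the comparison outside Envoy, here is a minimal sketch of the kind of microbenchmark being asked for. It uses plain Google Benchmark against std::string rather than Envoy's Buffer::OwnedImpl, is not wired into any bazel target, and all benchmark names are illustrative only:

#include <string>

#include "benchmark/benchmark.h"

// One append call per header fragment, the pattern this PR tries to avoid.
static void BM_PerFragmentAppend(benchmark::State& state) {
  for (auto _ : state) {
    std::string out;
    for (int i = 0; i < 100; ++i) {
      out.append("x-header-name").append(": ").append("some-value").append("\r\n");
    }
    benchmark::DoNotOptimize(out);
  }
}
BENCHMARK(BM_PerFragmentAppend);

// Reserve the full size once, then copy the same fragments: the batched pattern.
static void BM_ReserveThenAppend(benchmark::State& state) {
  for (auto _ : state) {
    std::string out;
    out.reserve(100 * 27); // 100 lines of "x-header-name: some-value\r\n", 27 bytes each
    for (int i = 0; i < 100; ++i) {
      out.append("x-header-name").append(": ").append("some-value").append("\r\n");
    }
    benchmark::DoNotOptimize(out);
  }
}
BENCHMARK(BM_ReserveThenAppend);

BENCHMARK_MAIN();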
Get it. And thanks very much for all your kind and detailed suggestions. It's very helpful. (^_^)
const uint64_t basic_header_size = method->value().size() + (host ? host->value().size() : 0) +
                                   (path ? path->value().size() : 0);

conn_buffer_helper.reserveBuffer(4096 + basic_header_size);
Is there somewhere that ensures the size of the headers is less than 4096 bytes?
We don't need to ensure it. We will always reserve enough space for encoding.
void StreamEncoderImpl::encodeHeader(absl::string_view key, absl::string_view value) {
ASSERT(!key.empty());
const uint64_t header_size = key.size() + value.size() + 4;
auto& conn_buffer_helper = connection_.bufferHelper();
conn_buffer_helper.reserveBuffer(header_size); // <-- reserve enough space for every header.
conn_buffer_helper.writeToBuffer(key);
conn_buffer_helper.writeToBuffer(':');
conn_buffer_helper.writeToBuffer(' ');
conn_buffer_helper.writeToBuffer(value);
conn_buffer_helper.writeToBuffer(CRLF);
bytes_meter_->addHeaderBytesSent(header_size);
}
Reading through the history of this PR and the one you linked to, it sounds like the biggest contributor to the performance degradation is the repeated checking of watermarks, correct? If that's the case, an alternative that would keep the code-cleanup of #9825 would be to add an API for temporarily disabling those checks (startTransaction()/stopTransaction()?). If we choose to keep going down this path, we'll need to validate that this use of the buffer reservation API isn't resulting in a worse set of Slices in the buffer, which could also negatively affect performance. The call to reserve 4096 bytes will most likely always create a new Slice, and I'm not sure what the state of the buffer likely is before that call.
I hadn't been checking out the flame graphs; I assumed it was fragmented allocations, not checking watermark bounds. That should be pretty cheap, or something we can address without adding back all the complexity. @wbpcode can you clue me in on what causes the gains?
Another thought is to add an API to add an ordered collection of string_views to a Buffer::Instance. That could iterate through the collection twice: once to get the total size so we can reserve enough contiguous space for it, the other to append all the chunks without need for intermediate checking. I would use some sort of templated method if possible (annoying because it's a virtual class...) so that the same code can work whether it is supplied a C array, std::vector, std::array, or whatever, meaning at the call site you could call buffer_->appendStrings({header, ": ", value, "\r"}); without having to allocate a temp vector. You might need a non-virtual wrapper with templated helpers.
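To make the proposed shape concrete, here is a standalone sketch against a toy std::string-backed buffer rather than the real Buffer::Instance; appendStrings and every other name here are hypothetical:

#include <cstdint>
#include <initializer_list>
#include <iostream>
#include <string>

#include "absl/strings/string_view.h"

class SketchBuffer {
public:
  // Non-virtual templated wrapper: works for C arrays, std::array, std::vector, etc.
  template <class Container> void appendStrings(const Container& fragments) {
    uint64_t total = 0;
    for (absl::string_view f : fragments) {
      total += f.size(); // pass 1: aggregate size, so one reservation suffices
    }
    data_.reserve(data_.size() + total); // single reservation / watermark-style check
    for (absl::string_view f : fragments) {
      data_.append(f.data(), f.size()); // pass 2: copy without intermediate checks
    }
  }
  // Overload so brace-enclosed lists work without allocating a temp vector.
  void appendStrings(std::initializer_list<absl::string_view> fragments) {
    appendStrings<std::initializer_list<absl::string_view>>(fragments);
  }
  const std::string& data() const { return data_; }

private:
  std::string data_;
};

int main() {
  SketchBuffer buffer;
  buffer.appendStrings({"content-length", ": ", "42", "\r\n"});
  std::cout << buffer.data();
}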
/wait
also I want to re-emphasize that it would be really helpful to get a repro of this issue using Nighthawk or even just a microbenchmark so we could make sure Envoy is operating in its sweet-spot, and that the proposed changes are making a realistic case go faster. I think this change probably is worth it but it'd be better to repro it.
@ggreenway Thanks, and here are some replies.
Nope. The repeated checking of watermarks brings some overhead, but from the callgrind data, the repeated fine-grained buffer writes are the bigger contributor.
We always maintain sequential writing. In other words, in the worst case, a slice is created at the end of the buffer and submitted. And unless the existing slice does not have enough space, we will not submit and create a new slice. Judging from the test results, this PR's effect is good, and the throughput has been increased by about 4%. (Same config and conditions as in #19103.)
@alyssawilk Thanks, and here are some replies. Yes, a single watermark check is very cheap. But for 100,000 requests, it would be 6,200,000 watermark checks. And the repeated (6,200,000) fine-grained buffer writes bring even more overhead.
I have tested the latest binary of this PR with wrk and the throughput has increased by about 4%. I can run some tests with Nighthawk this weekend. In addition, I will update this PR before the weekend.
This makes sense; there's a lot of per-string-segment gook even in simple OwnedImpl::add().
Hm, yeah. While I wouldn't block this change, I'd be more inclined to address this by improving the buffer add functionality to be more efficient for small writes than by wrapping the HTTP/1.1 code this way. I feel like if we improve add we'll improve perf for the cases discussed, but also for things like a trickle attack, where I think the current fix wouldn't improve things?
FWIW, the reason the code was originally the way it was is exactly because of this type of benchmark. Way back in the day I measured the same thing and noticed the same type of perf delta by doing this type of batching at the codec level. Whether it's actually worth it or not for real-world use cases to go back to some variant of what we had before is a different question.
It would be great work. If necessary, I can also help at that time; I am now interested in most performance optimization issues. 😄 But having read the source code of OwnedImpl, I think this job is not easy. The current OwnedImpl::addImpl is implemented very well, and it will be very difficult to obtain a larger performance improvement. This is not because OwnedImpl is not good, but because the work undertaken by OwnedImpl itself is very complicated. In addition, this PR can also alleviate the overhead introduced by the watermark check. Although it's not too big, it shouldn't be ignored given how frequently the check is executed.
I need to look at this more closely, but it seems that the story is that high-watermark checks add measurable overhead. I wonder if a mitigation would be to perform header writes to a new buffer and then move its contents to the output buffer. This way we eliminate all the high-watermark checks in the critical section. The change as is does appear very risky, as indicated by the risk assessment in the PR description. #9825 attempted to eliminate crashes due to incorrect reservation sizes and this PR re-introduces some of those human-error risks. As mentioned in earlier comments, it would be good to have a continuous loadtest that shows the performance benefits of a change like this one and allows us to compare those benefits with alternate implementations, like writing to a temporary OwnedImpl buffer and then moving its contents to the output buffer.
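To make that alternative concrete, here is a self-contained sketch with toy types standing in for Envoy's watermarked output buffer and a plain scratch buffer; every name below is illustrative, not the real API:

#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins; the real Envoy buffer classes are far more involved.
struct ScratchBuffer {
  std::string data;
  void add(const std::string& s) { data += s; } // no watermark bookkeeping here
};

struct WatermarkedBuffer {
  std::string data;
  void checkHighWatermark() { /* watermark callbacks would fire here */ }
  void move(ScratchBuffer& other) { // absorb the scratch contents in one step
    data += other.data;
    other.data.clear();
    checkHighWatermark(); // a single check for the whole header block
  }
};

int main() {
  const std::vector<std::pair<std::string, std::string>> headers = {
      {"host", "example.com"}, {"content-length", "0"}};
  ScratchBuffer scratch;
  for (const auto& header : headers) {
    scratch.add(header.first);
    scratch.add(": ");
    scratch.add(header.second);
    scratch.add("\r\n");
  }
  scratch.add("\r\n"); // end of the header block
  WatermarkedBuffer output;
  output.move(scratch); // watermark checked once instead of once per fragment
  std::cout << output.data;
}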
This PR tries to alleviate the fine-grained buffer write problem introduced by HTTP header encoding. I think this problem exists in all HTTP1 requests. Of course, in different scenarios, the effect of this PR may be different. For example, in some very complex scenarios, the effect of this PR will not be obvious, because other modules may become the bottleneck. But this does not mean that this PR is useless, and it will at least not make things worse.
Regarding the meaning of this sentence, I am a little confused. 🤔
@anatolebeuzon Thanks for this comment.
I think this won't help, because the frequent fine-grained buffer writes would still happen, just against the temporary buffer.
🤔 Can you give some more info about this problem? Then we can try to solve it in this PR. #9825 has no detailed description of the problem.
PR #9825 is a defense-in-depth fix for GHSA-gxvv-x4p2-rppp, which was previously fixed by b3f42a4. I can try to implement the idea that I proposed above, but I'll need some help on the profiles front. I assume we don't have a micro benchmark that covers serialization of HTTP/1 request/response to output buffers in cases with a small/medium/large number of headers. At least part of the problem is the WatermarkBuffer implementation of checkHighAndOverflowWatermarks being called on every add operation; your benchmark shows this.
Yes. But this part is not as significant as the problems caused by the frequent fine-grained buffer writes. However, if we can optimize the entire OwnedImpl, the value should be far greater than this PR, because then all modules that use OwnedImpl could benefit from it, not just the codec.
I want to create a simple test set for continuous tracking of #19103, but it needs to wait until this weekend. At that time, I can share it with you. 😄
I look forward to more data. I have some optimizations in mind, but work on them has been blocked on there being more repeatable Envoy OSS loadtests.
Signed-off-by: wbpcode <[email protected]>
Sorry for the delayed response, my holiday vacation was busier than expected.
envoy/buffer/buffer.h (outdated)

  *dst += v.size();
};

total_size_to_write = (absl::string_view(args).size() + ...);
I think that the multiple conversions of args to string_view on lines 478 and 482 may have some performance impact in the case where args contains "const char*". I think that it may be possible to address this issue by changing the addFragments signature to:
template size_t addFragments(absl::string_view&&... args) {
Variable-length template parameters do not seem to support this. Otherwise, we would need to modify our interface to accept a container, but there may be additional memory allocation and copying. 🤔
Yeah that might make sense. It would be a little less syntactically convenient, but you don't need to use dynamic arrays. You can just use a begin/end ptr/size at the interface, and write things like:
const absl::string_view args[] = {name, ": ", value, "\r\n"};
addFragments(args, args + arraysize(args));
or
addFragments(args, arraysize(args));
this would also work for std::vector if that's more convenient, at the cost of some allocs:
std::vector<absl::string_view> args{name, ": ", value, "\r\n"};
addFragments(&args[0], &args[0] + args.size());
or
addFragments(&args[0], args.size());
I can do a quick test for this, because we need to consider the cost of building the container itself. 🤔
Or we can just ensure all args passed to the original addFragments must have data() and size() methods. It would be a little less syntactically convenient because the caller needs to convert char* to absl::string_view by itself.
I don't think you'll need to allocate any memory for the first option, a C Array like that will just extend the stack.
WDYT of having the model for HTTP1 where you collect a single vector of Spans for all the fragments of all the headers, and use a single addSpans call that does the watermark check on the aggregated size? Does that simplify things?
Then we need to create a vector to store all the key/value string views. Although frequent watermark checks bring some overhead, I think copying all the string views to the heap (64 bytes for a single header) may bring more overhead.
In addition, I haven't figured out how to deal with the situation where a large number of fragments are written at the same time in a single addFragments. If the back slice overflows, then we downgrade to using addImpl. Although our performance is definitely not worse than repeated add(), there is not much performance gain when overflow occurs.
RE copying string-views; that's pretty small (16 bytes per).
RE heap: we have a limited number of headers, right? You could put a string_view[MAX_HEADERS] on the stack; no need to heap it.
RE add vs addFragments: as discussed in previous comments I'm suggesting we not have the policy of bailing into add mode when you spill out of the back slice. Instead, with the giant array of fragments, you can figure out before adding anything exactly how many bytes are needed, set up the slices as required, and do the watermarking once. I think it will be a cleaner/faster model.
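As a concrete illustration of that suggestion, here is a self-contained sketch that gathers the fragments of a whole header block into a stack array and hands them to a toy span-based buffer in one call; all names, including the kMaxFragments cap, are illustrative only:

#include <cstdint>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

#include "absl/strings/string_view.h"
#include "absl/types/span.h"

// Toy span-based buffer; only the call shape matters for this sketch.
class FakeBuffer {
public:
  void addFragments(absl::Span<const absl::string_view> fragments) {
    uint64_t total = 0;
    for (absl::string_view f : fragments) total += f.size();
    data_.reserve(data_.size() + total); // total size known up front, one reservation
    for (absl::string_view f : fragments) data_.append(f.data(), f.size());
  }
  const std::string& data() const { return data_; }

private:
  std::string data_;
};

int main() {
  const std::vector<std::pair<absl::string_view, absl::string_view>> headers = {
      {"host", "example.com"}, {"content-length", "0"}};

  // Collect the fragments of every header line on the stack (4 per header plus
  // the terminating CRLF), then hand them all to the buffer in a single call.
  constexpr size_t kMaxFragments = 64; // illustrative cap, not Envoy's real limit
  absl::string_view fragments[kMaxFragments];
  size_t n = 0;
  for (const auto& header : headers) {
    fragments[n++] = header.first;
    fragments[n++] = ": ";
    fragments[n++] = header.second;
    fragments[n++] = "\r\n";
  }
  fragments[n++] = "\r\n";

  FakeBuffer buffer;
  buffer.addFragments(absl::MakeConstSpan(fragments, n)); // one watermark-style check
  std::cout << buffer.data();
}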
For HTTP1, if we modify the codec to the mode you mentioned, the benefits may not be particularly significant. The overhead of the watermark check is not very large, and in the current implementation we have already reduced it a lot (the current overhead is about 1/4 of the original). And for HTTP1, back slice overflow does not happen frequently.
Instead, with the giant array of fragments, you can figure out before adding anything exactly how many bytes are needed, set up the slices as required, and do the watermarking once. I think it will be a cleaner/faster model.
This suggestion is valuable. When the back slice does not have enough contiguous memory, the remaining memory in the back slice can be filled first. After that, for the large amount of still-unwritten data, contiguous memory can be allocated. In this way addFragments can handle a large amount of fragmented data. 🤔
I think we can add a TODO for this, and then I will optimize it in subsequent PRs. This PR has really taken too long. orz
It's not just an optimization tweak -- it's a much simpler model, isn't it?
With the API that you have, the caller has to do addFragments calls and then end with an add(). That's potentially error-prone, and I believe @ggreenway has commented about this.
Moreover the current API you have, as @antoniovicente observed, may have excess conversions from const char* to string_view which would result in extra calls to strlen.
With the Span API you can make the interface much simpler and more performant. So although I know this PR has gone on for a while, we iterated quite a bit and learned a lot and got some really good data. We are trying to make Envoy fast and safe, and it's very useful to do this well.
@jmarantz I am a little confused. 🤣 I have updated this PR to use the new absl::Span<absl::string_view> API. Because, as you said, it's cleaner, simpler, and even faster. So the problems raised by @ggreenway and @antoniovicente have been solved. 🤣 Did you notice the latest commit?
The optimization I mentioned above refers to the special case where the back slice has no contiguous memory left. We can still do some more perf improvements in that case.
source/common/buffer/buffer_impl.h (outdated)

if (slices_.empty()) {
  slices_.emplace_back(Slice(size, account_));
}
length_ += size;
slices_.back().reservable_ += size;
The "length_ += size;" and "slices_.back().reservable_ += size;" in here are a bit odd. inlineReserve is not only reserving, but also committing bytes to the buffer before the caller of inlineReserve has written the bytes to the appropriate buffer.
I take it that this is an optimization. I think that the method name should be changed to reflect this reserve+commit semantic that your implementation depends on.
get it.
Signed-off-by: wbpcode <[email protected]>
Signed-off-by: wbpcode <[email protected]>
this looks great! Sorry I missed the last awesome commit.
Just some small nits but I think this is ready for Alyssa.
  slices_.emplace_back(Slice(total_size_to_write, account_));
}

Slice& back = slices_.back();
Add a TODO to improve performance as discussed in the comments, rather than the downgrade strategy we have here? You can also include here the detail about contiguous memory that you provided in that comment thread.
get it.
This looks absolutely fantastic! Thanks for all your patience and iteration @wbpcode and reviewers!
couple of questions from me below, and could you take a look at the PR description and see if it's still up to date?
@jmarantz : as you've done most of the reviewing I'd appreciate you weighing in on your comfort level landing this before we cut the release next week
@@ -231,7 +227,7 @@ void StreamEncoderImpl::encodeHeadersBase(const RequestOrResponseHeaderMap& head
   }
 }

-  connection_.addToBuffer(CRLF);
+  connection_.buffer().add(CRLF);
Is this newly added comment still true? Reading the code it looks like watermarks are applied now
I think this deserves an approval from @antoniovicente &/or @KBaichoo as they have worked more in the buffer code than I do. At a high level this direction makes sense to me but this code is at the heart of things and deserves another pass from them.
Thanks for the patience @wbpcode
source/common/buffer/buffer_impl.h (outdated)

class OwnedImpl;

private:
  friend OwnedImpl;
Do we need to make OwnedImpl a friend? I think we can get away with using the public interface via reservation, commit to the slice in https://github.com/envoyproxy/envoy/pull/19115/files#diff-0fdef7bd42dfa48ed5a015ddb85be158a514d61a82a5a17c1cfa34c6d744e5f8R604-R610
Can this be removed now that we use the public interface? Thanks!
source/common/buffer/buffer_impl.cc (outdated)

uint8_t* mem = back.base_ + back.reservable_;
for (const auto& fragment : fragments) {
  memcpy(mem, fragment.data(), fragment.size()); // NOLINT(safe-memcpy)
  mem += fragment.size();
}
length_ += total_size_to_write;
back.reservable_ += total_size_to_write;
Can we just use the existing API on Slice to create a reservation via reserve(), memcpy the fragments into that, and then do back.commit(reservation)?
Then there may be some more unnecessary checks. 🤔 It is a minor update, and I can try updating it and check the benchmark result.
Looks like it is OK.
@@ -231,7 +227,7 @@ void StreamEncoderImpl::encodeHeadersBase(const RequestOrResponseHeaderMap& head
   }
 }

-  connection_.addToBuffer(CRLF);
+  connection_.buffer().add(CRLF);
This is no longer true and should be reverted
Signed-off-by: wbpcode <[email protected]>
/retest
Retrying Azure Pipelines:
@KBaichoo Hi, I have updated this PR, can you take a look again when you have free time? 😄
Looks great otherwise :)
source/common/buffer/buffer_impl.h (outdated)

class OwnedImpl;

private:
  friend OwnedImpl;
Can this be removed now that we use the public interface? Thanks!
source/common/buffer/buffer_impl.cc (outdated)

  back.commit<false>(reservation);
  length_ += total_size_to_copy;
} else {
  // Downgrade to using `addImpl` for not enough memory in the back slice.
s/for/if
test/common/buffer/buffer_test.cc (outdated)

auto slice_vec = buffer.getRawSlices();

EXPECT_EQ(slice_vec.size(), 5);
likewise here as in https://github.com/envoyproxy/envoy/pull/19115/files#r780540013
get it.
source/common/buffer/buffer_impl.h (outdated)

@@ -17,6 +17,8 @@
 namespace Envoy {
 namespace Buffer {
+
+class OwnedImpl;
Can remove the forward declaration since we no longer need to friend the class.
right.
Signed-off-by: wbpcode <[email protected]>
/retest
Retrying Azure Pipelines:
/retest
Retrying Azure Pipelines:
…-update-for-perf-3
LGTM
Thanks for the optimization.
Awesome! I'm inclined to wait until after the release tomorrow to merge unless someone objects, as it's a pretty major change to some very core code.
…ders (envoyproxy#19115)
Commit Message: minor perf: reduce fine grained buffer access when encoding HTTP1 headers
Additional Description: The HTTP1 codec performs a large number of buffer API calls during the encoding of HTTP headers, which introduces an additional CPU overhead (3-4%). This PR implements a buffer helper as a cache to reduce direct buffer writes and improve the overall performance. Check envoyproxy#19103 (comment) for some more related info. This PR reverts envoyproxy#9825. At the same time, I also try my best to make the code simpler and easier to maintain.
Risk Level: High.
Testing: Waiting.
Docs Changes: N/A.
Release Notes: N/A.
Signed-off-by: wbpcode <[email protected]>
Signed-off-by: Josh Perry <[email protected]>
Commit Message: minor perf: reduce fine grained buffer access when encoding HTTP1 headers
Additional Description:
The HTTP1 codec performs a large number of buffer API calls during the encoding of HTTP headers, which introduces an additional CPU overhead (3-4%). This PR implements a buffer helper as a cache to reduce direct buffer writes and improve the overall performance.
Check this #19103 (comment) for some more related info.
This PR reverts #9825. At the same time, I also try my best to make the code simpler and easier to maintain.
Risk Level: High.
Testing: Waiting.
Docs Changes: N/A.
Release Notes: N/A.
Some benchmark results:
The total amount of data written in each test is the same, 64 MB. The figure shows the performance of using different APIs and different basic write unit sizes.
In addFrags/{n}, n represents the number of units written by the addFragments API in one batch. For example, when the write unit is 16 bytes, addFrags/3 means using addFragments to write 3 strings of 16 bytes at one time.
Y-axis: number of nanoseconds to complete the 64 MB of data writes (higher is slower).
X-axis: basic write unit size in bytes.
add can only write one basic unit at a time, while addFragments can write multiple units at a time. From the graph, addFrags/5 gives the best performance in almost all cases.