
perf: reduce the CPU usage of HTTP/1.1 header parser #2880

Closed
brian-pane opened this issue Mar 22, 2018 · 12 comments

Comments

@brian-pane
Contributor

Description:
From a callgraph profile (perf record -g) of an Internet-facing Envoy instance serving mostly HTTP/1.1 traffic, I found that the parsing of request headers and the building of the internal HeaderMap structure are somewhat CPU-intensive.

Here is the relevant part of the perf report --sort parent output, showing the percentage of total non-idle clock cycles spent in function+children:

-   86.12%    86.12%  [other]
   - 80.94% start_thread
      - _ZZN5Envoy6Thread6ThreadC4ESt8functionIFvvEEENUlPvE_4_FUNES5_
         - 80.89% Envoy::Server::WorkerImpl::threadRoutine
            - 80.73% event_base_loop
               - 75.65% event_process_active_single_queue.isra.29
                  - 62.25% Envoy::Event::FileEventImpl::assignEvents(unsigned int)::{lambda(int, short, void*)#1}::_FUN
                     - 62.04% Envoy::Network::ConnectionImpl::onFileEvent
                        - 41.59% Envoy::Network::ConnectionImpl::onReadReady
                           - 34.33% Envoy::Network::FilterManagerImpl::onContinueReading
                              - 19.49% Envoy::Http::ConnectionManagerImpl::onData
                                 - 18.05% Envoy::Http::Http1::ConnectionImpl::dispatch
                                    - 17.62% Envoy::Http::Http1::ConnectionImpl::dispatchSlice
                                       - 17.35% http_parser_execute
                                          - 8.54% Envoy::Http::Http1::ConnectionImpl::onHeadersCompleteBase
                                             - 8.44% Envoy::Http::Http1::ServerConnectionImpl::onHeadersComplete
                                                - 8.01% Envoy::Http::ConnectionManagerImpl::ActiveStream::decodeHeaders
                                                   - 6.30% Envoy::Http::ConnectionManagerImpl::ActiveStream::decodeHeaders
                                                      - 4.54% Envoy::Router::Filter::decodeHeaders
                                                         - 2.35% Envoy::Http::Http1::ConnPoolImpl::newStream
                                                            - 1.82% Envoy::Http::Http1::ConnPoolImpl::attachRequestToClient
                                                               - 1.17% Envoy::Router::Filter::UpstreamRequest::onPoolReady
                                                                  - 1.12% Envoy::Http::StreamEncoderWrapper::encodeHeaders
                                                                       0.91% Envoy::Http::Http1::RequestStreamEncoderImpl::encodeHeaders
                                                        0.69% Envoy::Http::IpTaggingFilter::decodeHeaders
                                                     0.75% Envoy::Http::ConnectionManagerUtility::mutateRequestHeaders
                                                     0.53% Envoy::Http::ConnectionManagerImpl::ActiveStream::refreshCachedRoute
                                          - 3.52% Envoy::Http::Http1::ConnectionImpl::{lambda(http_parser*)#7}::_FUN
                                             - Envoy::Http::Http1::ServerConnectionImpl::onMessageComplete
                                                - 2.51% Envoy::Http::ConnectionManagerImpl::ActiveStream::decodeHeaders
                                                   - 1.82% Envoy::Http::ConnectionManagerImpl::ActiveStream::decodeHeaders
                                                      - 1.23% Envoy::Router::Filter::decodeHeaders
                                                           0.53% Envoy::Http::Http1::ConnPoolImpl::newStream
                                                - 0.75% Envoy::Http::ConnectionManagerImpl::ActiveStream::decodeData
                                                   - 0.64% Envoy::Router::Filter::decodeData
                                                            0.53% Envoy::Router::Filter::onRequestComplete
                                          - 1.33% Envoy::Http::Http1::ConnectionImpl::{lambda(http_parser*, char const*, unsigned long)#3}::_FUN
                                             - Envoy::Http::Http1::ConnectionImpl::onHeaderField
                                                - 1.17% Envoy::Http::Http1::ConnectionImpl::completeLastHeader
                                                     0.69% Envoy::Http::HeaderMapImpl::insertByKey
                                          - 1.33% Envoy::Http::Http1::ConnectionImpl::{lambda(http_parser*)#1}::_FUN
                                             - 1.28% Envoy::Http::Http1::ServerConnectionImpl::onMessageBegin
                                                - Envoy::Http::ConnectionManagerImpl::newStream
                                                     0.69% Envoy::Server::Configuration::HttpConnectionManagerConfig::createFilterChain
                                          - 1.12% Envoy::Http::Http1::ConnectionImpl::{lambda(http_parser*, char const*, unsigned long)#6}::_FUN
                                             - Envoy::Http::Http1::ServerConnectionImpl::onBody
                                                  0.64% Envoy::Http::ConnectionManagerImpl::ActiveStream::decodeData
@mattklein123
Member

FYI I have spent considerable time heavily optimizing this code path. There is unlikely to be any low-hanging fruit here -- HTTP proxy servers end up spending an enormous amount of time dealing with headers. With that said, I would love to have someone else take a look to see if they can find any wins.

@brian-pane
Contributor Author

brian-pane commented Mar 23, 2018

I'll sign up to study the performance of http_parser.
I'm looking at a source-line-level profile of the http_parser library now. It might be possible to speed up the parser by vectorizing more of it with memchr. Conveniently, that library comes with a benchmark program, which will make it easy to validate potential optimizations.
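As a rough illustration of the memchr idea (the helper below is a hypothetical sketch, not http_parser's actual code): let libc's memchr, which most implementations vectorize internally, jump straight to the CR that terminates a header line instead of testing each byte in a branchy per-character loop.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper, not part of http_parser: scan a header-value span
// with memchr (typically SIMD-accelerated inside libc) rather than a
// byte-at-a-time loop with a branch per character.
static const char* find_value_end(const char* p, const char* end) {
  const void* cr = memchr(p, '\r', static_cast<size_t>(end - p));
  return cr ? static_cast<const char*>(cr) : end;
}
```

The win comes from moving the per-byte branches out of the parser's hot loop and into a routine the C library has already tuned for the target CPU.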

@mattklein123
Member

@brian-pane FYI I didn't look at http_parser at all. Yes it's possible that there might be some vectorizing wins. I only optimized the layer on top (custom header string, custom data structure for storing headers, and reducing as many copies as possible).

@brian-pane
Contributor Author

An idea for anybody who has time to look into HeaderMapImpl: determine whether shrinking that struct, and/or grouping the commonly accessed O(1) header fields so they share a cache line, improves performance through better L1 cache hit rates. (The approach we used in Proxygen might also be of interest if cache footprint turns out to be an issue: https://github.com/facebook/proxygen/blob/master/proxygen/lib/http/HTTPHeaders.h)
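A minimal layout sketch of the cache-line grouping idea (field names are illustrative, not Envoy's actual HeaderMapImpl): keep the inline entries for the hottest O(1) headers adjacent, so a request's common lookups touch one 64-byte cache line instead of several.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layout sketch: eight pointers to the hottest headers grouped
// together. On a 64-bit target this is exactly 64 bytes, i.e. one typical
// cache line, so touching any of these fields pulls in all of them.
struct HotHeaderRefs {
  void* host;
  void* path;
  void* method;
  void* content_length;
  void* connection;
  void* transfer_encoding;
  void* content_type;
  void* date;
};
static_assert(sizeof(HotHeaderRefs) <= 64,
              "hot headers should fit in one cache line on 64-bit targets");
```

A static_assert like this makes the layout intent enforceable, so a later refactor that bloats the struct fails at compile time rather than silently regressing L1 behavior.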

@mattklein123
Member

@brian-pane yeah agreed, would love some analysis on the size of the struct, the size of the header string interning buffer, cache alignment, etc. Have never had time to do any of that.

@alyssawilk
Contributor

Other potential wins from our heavily optimized non-Envoy header parsing: we take the raw header buffer and simply annotate header locations with a string-view equivalent (zero-copy header parsing), and we use hash-map lookups for referencing headers (better than list walking for the non-O(1) headers).
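The zero-copy scheme described above might look roughly like this (types and names are hypothetical, not the actual implementation): record string_views into the raw read buffer instead of copying each header out, and index them in a hash map so lookups of non-O(1) headers avoid a linear list walk.

```cpp
#include <cassert>
#include <string_view>
#include <unordered_map>

// Hypothetical sketch of a zero-copy header index. The string_views borrow
// from the connection's read buffer, so the buffer must outlive the index --
// valid while the request is in flight, which is when lookups happen.
struct HeaderIndex {
  std::unordered_map<std::string_view, std::string_view> by_name;

  // Called once per parsed header: no allocation or copy of the bytes,
  // just two (pointer, length) pairs recorded in the map.
  void annotate(std::string_view name, std::string_view value) {
    by_name.emplace(name, value);
  }

  // O(1) average lookup; returns an empty view when the header is absent.
  std::string_view get(std::string_view name) const {
    auto it = by_name.find(name);
    return it == by_name.end() ? std::string_view{} : it->second;
  }
};
```

The trade-off is lifetime coupling: the views dangle once the read buffer is recycled, so this only works when header access is confined to the request's processing window.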

@brian-pane
Contributor Author

The PR that I just submitted to the http_parser project should help a bit with the CPU usage. I wasn't able to vectorize the parsing code, but I was able to reduce the average number of conditional branches per input character and also replace some memory reads/writes with register operations.

The parsing code might still be vectorizable. A common operation is to find the next instance of any of a small set of characters. I can think of a way to do that on x86 using N+1 SSE registers, where N is the cardinality of the set, but I'd rather avoid adding any assembly language to the codebase if possible.
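For what it's worth, the N+1-register scheme can be expressed with compiler intrinsics rather than raw assembly. A sketch for the delimiter set {'\r', '\n', ':'} (so N = 3; the function name and exact delimiter set are illustrative assumptions): one register holds 16 input bytes, each of the other three holds one broadcast delimiter, and the compare masks are ORed together.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics -- baseline on x86-64, no inline asm

// Hypothetical sketch: return the offset of the first '\r', '\n', or ':'
// within the next 16 bytes of p, or 16 if none is present. Caller must
// guarantee at least 16 readable bytes at p.
static int find_delim16(const char* p) {
  const __m128i data  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
  const __m128i cr    = _mm_set1_epi8('\r');   // one register per delimiter
  const __m128i lf    = _mm_set1_epi8('\n');
  const __m128i colon = _mm_set1_epi8(':');
  const __m128i hits  = _mm_or_si128(
      _mm_cmpeq_epi8(data, cr),
      _mm_or_si128(_mm_cmpeq_epi8(data, lf), _mm_cmpeq_epi8(data, colon)));
  const int mask = _mm_movemask_epi8(hits);    // one bit per matching byte
  // __builtin_ctz is GCC/Clang-specific; MSVC would use _BitScanForward.
  return mask ? __builtin_ctz(mask) : 16;
}
```

This replaces up to 16 per-byte branch decisions with three vector compares and a single conditional, at the cost of x86-specific code behind an #ifdef.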

@georgi-d
Contributor

One other option to look at is the picohttpparser library, which, according to the benchmark on its project page, should be about 3-5x faster than http_parser.

@brian-pane
Contributor Author

Thanks for the pointer; I'll take a look at picohttpparser.

@stale

stale bot commented Jul 25, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

The stale bot added the "stale" label on Jul 25, 2018.
@ggreenway
Contributor

@brian-pane Is there still more work you'd like to do on this issue, or should we consider it fixed with #3505?

The stale bot removed the "stale" label on Jul 25, 2018.
@brian-pane
Contributor Author

There's still room for improvement in the HTTP/1 parser speed, but I don't think I'll have time to work on it in the near future. So I'm ok with closing this issue.
