some performance results of envoy's different versions #19103
Comments
Have you ever profiled Envoy with tools like perf, so that we can find some hotspots from the result? |
I will do it this weekend. 😄 |
This feels like a duplicate of #13412. |
Yep. Looks like the degradation never stops. |
Here are some flame graphs for v1.17.4, v1.12.2, and v1.20.x. |
There are no obvious hotspots from the flame graph, just a very uniform slowdown. More encapsulation and abstraction gradually reduces the performance of Envoy. |
I'm not sure if I buy that incremental changes in encapsulation and abstraction are likely to cause that much slowdown. I glanced at one of the flame-graphs and it's hard to know what to improve there. It might help to capture some perf-graphs from an instrumented binary that can provide more detail on what's going on. The purely sampled view we get from these flame-graphs might (for example, as a guess) hide some effects of changes in the way we make the networking system calls. E.g. we spend a lot of time in the 1.20 flamegraph in the kernel. Are we making larger numbers of calls for smaller chunks of data, for any reason? It would be nice to get a controlled repro of this, using Nighthawk preferably, and then run envoy under |
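For concreteness, here is a minimal sketch of capturing a sampled profile of a running Envoy and rendering it as a flame graph. It assumes Linux perf and Brendan Gregg's FlameGraph scripts are available on the host and that the binary has symbols; the sampling rate, duration, and file names are placeholders.

```sh
# Sample the running Envoy with call stacks for 60 seconds (99 Hz to limit overhead).
perf record -F 99 -g --call-graph dwarf -p "$(pgrep -o envoy)" -- sleep 60

# Fold the stacks and render an interactive SVG flame graph.
perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > envoy_flamegraph.svg
```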
I will try to do more investigation. |
Related to @jmarantz's comment, over time we have generally moved to secure-by-default configuration. This is almost universally at odds with "horse race" benchmarks. So we will need to very carefully tease apart any differences. The work is useful but it's a lot more complicated than just looking at flame graphs. I also agree with @jmarantz that trying to run a reproducer under something like cachegrind will be a lot more informative. |
Just to point this out: Matt suggested cachegrind and I suggested callgrind; they are related and both very useful:
|
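For reference, a sketch of how Envoy might be run under each tool; the binary name, config file, and flags are illustrative and assume a debug-symbol build, and both tools slow execution down enormously, so the workload should be short and fixed.

```sh
# callgrind: builds an instruction-level call graph.
valgrind --tool=callgrind --callgrind-out-file=callgrind.out.%p \
  ./envoy-static -c envoy.yaml --concurrency 1

# cachegrind: simulates cache and branch behavior for the same workload.
valgrind --tool=cachegrind --cachegrind-out-file=cachegrind.out.%p \
  ./envoy-static -c envoy.yaml --concurrency 1
```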
@mattklein123 @jmarantz Thanks very much for all your suggestions. I will put more effort into tracking and investigating this issue in the near future. 🌷 At present, I still subjectively think that it may be caused by a large number of small adjustments. For example, #13423 introduces a small update to replace all white space chars in the response code details, which brings some minor new overhead. Assuming there are many similar PRs, the accumulation of these small overheads will still have a large enough impact. And because they are very scattered, it may also be difficult to locate. But this is just my personal guess. We still need more research to identify the problem and try to solve it. Of course, if it is just because of some more secure default configuration, then at most only the documentation needs to be updated. |
2021/11/27 callgrind.out files of v1.12.2 vs v1.20.x (100,000 HTTP/1.1 requests with wrk). https://drive.google.com/drive/folders/1EWTjixvN43O8u24a_rJF8S6A4ePvFwC1?usp=sharing First point: |
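For anyone else opening these files, the standard valgrind tooling can be used to inspect and compare them; the file names below are placeholders for whatever callgrind actually produced.

```sh
# Text summary of the hottest functions (inclusive and self cost).
callgrind_annotate callgrind.out.v1.12.2 | head -n 40
callgrind_annotate callgrind.out.v1.20.x | head -n 40

# Or browse interactively and sort by "Self" to spot hotspots.
kcachegrind callgrind.out.v1.20.x
```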
Great -- would you be able to supply the parameters passed to Nighthawk (preferably) or whatever tool you were using to generate the load for these traces? Thanks! |
This data is great! It really shows the profiles look very different. Just sorting by 'self' and comparing the two views provides a ton of possible issues to go explore. Did you use an "OptDebug" build or was this a stock build or something? |
I used simple
|
I had not heard of |
I used the non-stripped binary that was built with |
Yes, I am generally used to using wrk, but it seems that wrk cannot generate a fixed amount of load. |
Nighthawk (https://github.com/envoyproxy/nighthawk) is really what we want to converge to, as you get explicit control of open/closed loop, ability to generate http2, async vs concurrent, an indication of how many requests succeeded/failed, etc. By "fixed amount of load" does this mean a finite number of requests? Or a fixed rate of requests? The docker images are definitely not in OptDebug mode :) They are probably simply optimized, which is OK, but we'll have a lot less details on call-stack, where in functions time is being spent, etc. |
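As a concrete example, a closed-loop, fixed-rate Nighthawk run might look like the sketch below; the RPS, duration, connection count, and target URL are placeholders chosen to roughly mirror the wrk setup used earlier.

```sh
# 1000 requests/second for 180 seconds over HTTP/1.1, with one client worker
# and 200 connections, against the proxy under test.
nighthawk_client \
  --rps 1000 \
  --duration 180 \
  --connections 200 \
  --concurrency 1 \
  http://localhost:9090/
```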
Thanks, I will try it in the coming test.
I see. Thanks. 🌷 I will try to build a new binary with your suggested compile args and do some more investigation. But recently I only have enough time on weekends. |
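For what it's worth, an "OptDebug"-style binary (optimized but with symbols and no stripping) can typically be produced along these lines; the exact Bazel options and target may vary across Envoy versions, so treat this as a sketch.

```sh
# Optimized build that keeps debug info so profilers can resolve symbols.
bazel build -c opt --copt=-g --strip=never //source/exe:envoy-static
```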
No worries -- the data you supplied is great. We'll definitely want to repro with NH though so we understand what we are comparing :) My suspicion is that you've found something real, and based on the traces I looked at, there were changes in either tcmalloc's implementation, the way we allocate buffers, or both. What did you have running on localhost:8080? An Apache server or something? |
A multi-process Nginx that returns a simple short string directly. |
In fact, nope. Nevertheless, the new version of I've created a PR #19115 to try to solve this problem. |
@ztgoto What HTTP protocol is being used? HTTP/1, or HTTP/2? In particular for HTTP2 in https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/protocol.proto#config-core-v3-http2protocoloptions |
@jmarantz I use envoyproxy/envoy:v1.22.2 (--network host) for testing. The config file is as stated above; I don't know if there is anything wrong. envoy
nginx
|
Hi, Flame graph: Envoy config:
Thanks in advance. |
Update from 2024:
|
2024: envoy

```yaml
static_resources:
  listeners:
  - name: listener
    address:
      socket_address: {address: 0.0.0.0, port_value: 9101}
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: http_test
          codec_type: AUTO
          generate_request_id: false
          route_config:
            name: route
            virtual_hosts:
            - name: test
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                direct_response:
                  status: 200
                  body:
                    inline_string: "{\"message\":\"hello\"}"
                response_headers_to_add:
                - header:
                    key: "Content-Type"
                    value: "application/json"
                # route:
                #   cluster: auth
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
              dynamic_stats: false
stats_config:
  stats_matcher:
    reject_all: true
```

nginx
result: envoy:
nginx:
|
I have benchmarked Envoy with |
@zhxie |
Hi @howardjohn, what is the |
A few notes:
One thing I noticed is that you specify concurrency=1 for Envoy, so it will allocate only one worker thread. In contrast, your nginx config specifies 1 process. I am a little unclear on nginx's default behavior, but I think it does have a threadpool. Both nginx and envoy will use async i/o to multiplex many requests over a single thread, but I'm not sure how many threads nginx is using in your benchmark. Envoy will mostly use 1 (in addition to the 'main' thread for handling admin requests and config updates). |
|
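A quick, illustrative way to check how many workers each proxy actually runs during the benchmark (the exact Envoy thread names may differ by version):

```sh
# Envoy: list threads; with --concurrency 1, expect a single worker thread
# (named something like wrk:worker_0) plus the main thread and a few helpers.
ps -T -p "$(pgrep -o envoy)"

# Nginx: with worker_processes 1, expect a single worker process under the master.
ps -ef | grep '[n]ginx'
```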
My understanding is that if Nginx used multi-threading when worker_processes is set to 1, then Linux CPU monitoring should show it exceeding 100%. Of course, I'm not sure if my understanding is correct. |
Nginx uses the process as its unit of concurrency. So, 1 process == 1 worker for nginx. cc @jmarantz |
We don't really need to compare to Nginx though -- Envoy from a few years back has the same threading/process model and is ~2x faster... |
FWIW my team's product does our own benchmarking of our deployment and we have not seen that degradation. |
I'm also wondering about test methodology. Can we repro the older test results from saved docker packages? |
Yes -- see #19103 (comment). I reproduced it again today as well and see the same results. Note config.yaml is the one from the original issue. Here are some more results with Nighthawk as a client; FWIW, similar trend |
(note: fully acknowledge there are more robust ways to measure performance, but at the differences we are talking about I think it's fair to say there is a substantial change) |
This is a feature I never used. haha |
There's no doubt the regression exists, and it's pretty hard to optimize it back. Orz. |
Something else I've noticed about Envoy (v1.30) is that it tends to completely collapse under high load (i.e. the success rate suddenly drops to 0% and oscillates between 100% and 0% every 10 minutes), whereas Nginx will have an increased response time but the success rate will be little impacted. We had to completely roll back our edge proxy adoption of Envoy for this reason. Another issue we noticed is that in some locations with poor network performance, Envoy's average CPU utilization would only reach 85% before it began to collapse. This does not happen with Nginx. We are actually OK with Envoy potentially using 10-20% more CPU than Nginx, but its tendency to completely collapse under high load made us give up our Envoy adoption for our edge footprint. It's very tough for us to debug what causes Envoy to collapse like this because a flamegraph does not show why Envoy does not fully utilize all of its allocated CPU cores. |
@triplewy would you be able to take some CPU flame-graphs to see where the system is bottlenecking? Have you looked at watchdog timeout stats, or epoll histograms? One possible fail-point in Envoy -- this is true of nginx also IIUC -- is that threads are precious, and if anything being called in the data plane unexpectedly blocks, that can cause problems. One thing that comes to mind is access logs. Do you have those enabled? What format are you using? |
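The watchdog counters and event-loop stats can be pulled from the admin endpoint; this sketch assumes the default admin port 9901 is reachable, and note that the dispatcher histograms are only populated if dispatcher stats were enabled at startup (they add overhead).

```sh
# Watchdog miss counters: sustained growth suggests something is blocking a worker's event loop.
curl -s 'http://localhost:9901/stats?filter=watchdog'

# Event-loop duration/poll-delay histograms, if dispatcher stats are enabled.
curl -s 'http://localhost:9901/stats?filter=dispatcher'
```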
How is the memory situation under load? Just checking: do you have Load Shed Points enabled (https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/overload_manager/overload_manager#load-shed-points)? Debug/trace level logs can also give insight into the reason for a request failure.
|
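For the memory and logging questions above, the admin interface can answer both; again this assumes the default admin port 9901, and raising the log level is expensive, so it should only be done briefly while reproducing the failure.

```sh
# Allocator view of heap usage under load.
curl -s 'http://localhost:9901/memory'

# Overload manager / load-shed activity, if configured.
curl -s 'http://localhost:9901/stats?filter=overload'

# Temporarily raise verbosity to capture why requests are failing.
curl -s -X POST 'http://localhost:9901/logging?level=debug'
```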
@jmarantz I re-deployed Envoy to a few of our edge clusters and was able to reproduce the oscillation behavior. What seems to be happening is:
During all of this, the event dispatcher and watchdog stats are relatively stable at normal levels. The concerning aspects of this are:
If we can figure out why upstream connections are immediately closed in these situations, we may be able to completely prevent Envoy from collapsing. Below is a CPU flamegraph of our Envoy instance during a busy period. We have access logs enabled, but so does our Nginx setup. envoy_perf.svg |
I did some simple tests today and found that Envoy's performance seems to be getting worse and worse. I know features always come at some cost, but those costs seem to be too high.
I also know that performance is not Envoy's first goal, but with its continuous enhancement, performance seems to degrade too quickly. Here are some simple results with a single Envoy worker:
I think we should at least prevent further deterioration, and at the same time find ways to optimize it.
Here is my config yaml:
Back-end: a multi-process Nginx that returns a simple short string.
Client command: wrk -c 200 -t 2 -d 180s http://localhost:9090/