some performance results of envoy's different versions #19103

Open
wbpcode opened this issue Nov 25, 2021 · 66 comments
Labels
area/perf · help wanted (Needs help!) · investigate (Potential bug that needs verification)

Comments

@wbpcode
Member

wbpcode commented Nov 25, 2021

I ran some simple tests today and found that Envoy's performance seems to be getting steadily worse. I know features always come at some cost, but these costs seem too high.

I also know that performance is not Envoy's first goal, but as Envoy keeps gaining features its performance seems to degrade too quickly. Here are some simple results with a single Envoy worker:

Version QPS
v1.12.4 26921.48
v1.13.3 25182.18
v1.14.7 23732.31
v1.15.5 21010.66
v1.16.5 19116.81
v1.17.4 17804.78
v1.18.4 16953.67
v1.19.1 16046.59
v1.20.0 15949.65

I think we should at least prevent it from deteriorating further, and at the same time look for ways to optimize it.

Here is my config yaml:

static_resources:
  listeners:
    - address:
        socket_address:
          address: 0.0.0.0
          port_value: 9090
      filter_chains:
        - filters:
            - name: envoy.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                codec_type: auto
                stat_prefix: ingress_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: host-one # prefix route
                      domains:
                        - "*"
                      routes:
                        - match:
                            prefix: "/"
                          route:
                            cluster: httpbin
                http_filters:
                  - name: envoy.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                      dynamic_stats: false
                      suppress_envoy_headers: true
  clusters:
    - name: httpbin
      connect_timeout: 5s
      type: strict_dns
      lb_policy: round_robin
      load_assignment:
        cluster_name: httpbin
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: localhost
                      port_value: 8080

admin:
  access_log_path: "/dev/null"
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 20000

Back end: a multi-process Nginx that returns a simple short string.

Client command: wrk -c 200 -t 2 -d 180s http://localhost:9090/

@wbpcode added the triage (Issue requires triage) label Nov 25, 2021
@wbpcode
Member Author

wbpcode commented Nov 25, 2021

cc @mattklein123 @rojkov

@zhxie
Contributor

zhxie commented Nov 25, 2021

Have you ever profiled Envoy with tools like perf, so that we can find some hotspots from the result?

@wbpcode
Member Author

wbpcode commented Nov 25, 2021

Have you ever profiled Envoy with tools like perf, so that we can find some hotspots from the result?

I will do it this weekend. 😄

@rojkov added the area/perf and investigate (Potential bug that needs verification) labels and removed the triage (Issue requires triage) label Nov 25, 2021
@rojkov
Member

rojkov commented Nov 25, 2021

This feels like a duplicate of #13412.

@wbpcode
Member Author

wbpcode commented Nov 25, 2021

This feels like a duplicate of #13412.

Yep. Looks like the degradation never stops.

@wbpcode
Member Author

wbpcode commented Nov 25, 2021

Here are some flame graphs. I only have two different binaries with code symbols, but considering the huge performance gap between them, I think they are enough as a reference.

v1.17.4:
https://drive.google.com/file/d/1aZuC54PmIXsQu7k88jBeU-HENOEiQXgT/view?usp=sharing

v1.12.2:
https://drive.google.com/file/d/1o6kV8T2J5nCs3m7lHaXSdXaudjzNJ9Sv/view?usp=sharing

v1.20.x
https://drive.google.com/file/d/1lU539aRFzOCrR16EFTjKwu0MtzV03_kr/view?usp=sharing

cc @mattklein123

@wbpcode
Member Author

wbpcode commented Nov 26, 2021

There are no obvious hotspots in the flame graphs, just a very uniform slowdown. It looks like more encapsulation and abstraction has gradually reduced Envoy's performance.

@jmarantz
Contributor

jmarantz commented Nov 26, 2021

I'm not sure if I buy that incremental changes in encapsulation and abstraction are likely to cause that much slowdown. I glanced at one of the flame-graphs and it's hard to know what to improve there.

It might help to capture some perf-graphs from an instrumented binary that can provide more detail on what's going on. The purely sampled view we get from these flame-graphs might (for example, as a guess) hide some effects of changes in the way we make the networking system calls. E.g. we spend a lot of time in the 1.20 flamegraph in the kernel. Are we making larger numbers of calls for smaller chunks of data, for any reason?

It would be nice to get a controlled repro of this, preferably using Nighthawk, and then run Envoy under callgrind or compiled with various perf tools, so we can see call-counts for various functions. Using this we could compare how we are structuring our system calls to the earlier version of Envoy.

@wbpcode
Member Author

wbpcode commented Nov 26, 2021

I'm not sure if I buy that incremental changes in encapsulation and abstraction are likely to cause that much slowdown. I glanced at one of the flame-graphs and it's hard to know what to improve there.

It might help to capture some perf-graphs from an instrumented binary that can provide more detail on what's going on. The purely sampled view we get from these flame-graphs might (for example, as a guess) hide some effects of changes in the way we make the networking system calls. E.g. we spend a lot of time in the 1.20 flamegraph in the kernel. Are we making larger numbers of calls for smaller chunks of data, for any reason?

It would be nice to get a controlled repro of this, preferably using Nighthawk, and then run Envoy under callgrind or compiled with various perf tools, so we can see call-counts for various functions. Using this we could compare how we are structuring our system calls to the earlier version of Envoy.

I will try to do more investigation.

@mattklein123
Member

Related to @jmarantz's comment, over time we have generally moved to secure-by-default configuration. This is almost universally at odds with "horse race" benchmarks. So we will need to very carefully tease apart any differences. The work is useful but it's a lot more complicated than just looking at flame graphs. I also agree with @jmarantz that trying to run a reproducer under something like cachegrind will be a lot more informative.

@jmarantz
Contributor

Just to point this out: Matt suggests cachegrind and I suggested callgrind: they are related and both very useful:

  • they both are generated using the valgrind infrastructure (--tool=callgrind vs --tool=cachegrind)
  • cachegrind has a lot more detail of the simulated processor cache effects of each line of code
  • callgrind has more detail about time spent and call-counts at a function-level
  • they both benefit from an OptDebug compilation. I use --compilation_mode=opt --cxxopt=-g --cxxopt=-ggdb3 but those options might be dated as I haven't been able to do much direct technical work lately
  • they both generate reports that can be visualized with kcachegrind
  • they both run much slower than real time and I wouldn't run production traffic through them. But it should be no problem for nighthawk or any other synthetic load tool
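
Roughly, the workflow looks like this (just a sketch; the exact Bazel target and flags may need adjusting, and I haven't re-verified them recently):

# OptDebug-style build (flags as above; the usual static binary target is assumed)
bazel build --compilation_mode=opt --cxxopt=-g --cxxopt=-ggdb3 //source/exe:envoy-static

# run Envoy under callgrind (or --tool=cachegrind), single worker
valgrind --tool=callgrind ./bazel-bin/source/exe/envoy-static -c envoy.yaml --concurrency 1

# drive load with Nighthawk/wrk/hey against the listener, stop Envoy, then inspect the output
kcachegrind callgrind.out.<pid>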

@wbpcode
Member Author

wbpcode commented Nov 27, 2021

@mattklein123 @jmarantz Thanks very much for all your suggestions. I will put more effort into tracking and investigating this issue over the coming days. 🌷

At present, I still subjectively think it may be caused by a large number of small changes. For example, #13423 introduced a small update that replaces all whitespace characters in the response code details, which adds some minor new overhead. If there are many similar PRs, the accumulation of these small overheads can still have a large enough impact, and because they are so scattered, they may also be difficult to locate.

But this is just my personal guess. We still need more research to identify the problem and try to solve it.

Of course, if it is just because of some more secure default configuration, then at most only the documentation needs to be updated.

@wbpcode
Member Author

wbpcode commented Nov 27, 2021

2021/11/27: callgrind.out files of v1.12.2 vs v1.20.x (100,000 HTTP/1.1 requests with wrk).

https://drive.google.com/drive/folders/1EWTjixvN43O8u24a_rJF8S6A4ePvFwC1?usp=sharing

v1.20.x

v1.12.2

First point: encodeFormattedHeader introduced ~3% external CPU overhead.
Root cause: too much fine-grained buffer API access.
Related PR: #9825

@jmarantz
Contributor

Great -- would you be able to supply the parameters passed to Nighthawk (preferably) or whatever tool you were using to generate the load for these traces?

Thanks!

@jmarantz
Contributor

This data is great! It really shows the profiles look very different. Just sorting by 'self' and comparing the two views provides a ton of possible issues to go explore.

Did you use an "OptDebug" build or was this a stock build or something?

@wbpcode
Member Author

wbpcode commented Nov 29, 2021

I used the simple tool hey to generate 100,000 HTTP/1 requests. Here is the command:

hey -n 100000 -c 20 http://localhost:9090

@jmarantz
Contributor

I had not heard of hey. Is that this? https://github.com/rakyll/hey

@wbpcode
Member Author

wbpcode commented Nov 29, 2021

Did you use an "OptDebug" build or was this a stock build or something?

I used the unstripped binary built with ci/do_ci.sh bazel.release. It finally worked, but I'm not sure whether it is an OptDebug build.

@wbpcode
Member Author

wbpcode commented Nov 29, 2021

I had not heard of hey. Is that this? https://github.com/rakyll/hey

Yes. I generally use wrk, but it seems that wrk cannot generate a fixed amount of load.

@jmarantz
Contributor

Nighthawk (https://github.com/envoyproxy/nighthawk) is really what we want to converge to, as you get explicit control of open/closed loop, ability to generate http2, async vs concurrent, an indication of how many requests succeeded/failed, etc.

By "fixed amount of load" does this mean a finite number of requests? Or a fixed rate of requests?

The docker images are definitely not in OptDebug mode :) They are probably simply optimized, which is OK, but we'll have a lot less details on call-stack, where in functions time is being spent, etc.
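
For example, an invocation along these lines (flag names from memory, so please double-check against nighthawk_client --help):

nighthawk_client --rps 10000 --duration 60 --connections 20 --concurrency 1 http://localhost:9090/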

@wbpcode
Member Author

wbpcode commented Nov 29, 2021

Nighthawk (https://github.com/envoyproxy/nighthawk) is really what we want to converge to, as you get explicit control of open/closed loop, ability to generate http2, async vs concurrent, an indication of how many requests succeeded/failed, etc.

Thanks, I will try it in the coming test.

The docker images are definitely not in OptDebug mode :) They are probably simply optimized, which is OK, but we'll have a lot less details on call-stack, where in functions time is being spent, etc.

I see, thanks. 🌷 I will try to build a new binary with the compile args you suggested and do some more investigation. But recently I only have enough time on weekends.

@jmarantz
Contributor

No worries -- the data you supplied is great. We'll definitely want to repro with NH though so we understand what we are comparing :)

My suspicion is that you've found something real, and based on the traces I looked at, there were changes in either tcmalloc's implementation, the way we allocate buffers, or both. The encodeFormattedHeader hot-spot probably deserves a quick look also as that is a relatively recent feature. Did you wind up turning that on?

What did you have running on localhost:8080? An Apache server or something?

@wbpcode
Member Author

wbpcode commented Nov 29, 2021

What did you have running on localhost:8080? An Apache server or something?

A multi-process Nginx that returns a simple short string directly.

Did you wind up turning that on?

In fact, no. Nevertheless, the new version of encodeFormatedHeader/encodeHeader still introduces more overhead. The reason is that encodeHeaders calls encodeFormatedHeader several times and writes fine-grained pieces of data (a character, a space, a header name, etc.) to the (watermark) buffer directly.
The new version does not use a local cache to speed up the writing of these fine-grained pieces.
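
To illustrate the shape of the problem, here is a rough sketch. The Buffer type below is made up for illustration; it is not the real Envoy buffer API, it just contrasts many tiny buffer writes with one batched write:

// Illustrative sketch only -- a stand-in Buffer, not Envoy's Buffer::Instance.
#include <iostream>
#include <string>
#include <string_view>

struct Buffer {
  std::string data;
  size_t add_calls = 0;
  void add(std::string_view chunk) { data.append(chunk); ++add_calls; }
};

// Old shape: every key, separator and value is a separate buffer write.
void encodeHeaderFineGrained(Buffer& out, std::string_view key, std::string_view value) {
  out.add(key);
  out.add(": ");
  out.add(value);
  out.add("\r\n");
}

// New shape (the idea behind #19115): accumulate into a local scratch string
// and flush the whole header block with a single buffer write.
void encodeHeaderBatched(std::string& scratch, std::string_view key, std::string_view value) {
  scratch.append(key).append(": ").append(value).append("\r\n");
}

int main() {
  Buffer fine, batched;
  std::string scratch;
  for (int i = 0; i < 8; ++i) {
    encodeHeaderFineGrained(fine, "x-header", "value");
    encodeHeaderBatched(scratch, "x-header", "value");
  }
  batched.add(scratch);  // one write for the whole header block
  std::cout << fine.add_calls << " vs " << batched.add_calls << " buffer writes\n";  // 32 vs 1
}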

I've created a PR #19115 to try to solve this problem.

@mattklein123 added the help wanted (Needs help!) label Nov 29, 2021
@hobbytp

hobbytp commented Nov 30, 2021

@wbpcode do you mean the issue only exists in HTTP/1, or did you only test with HTTP/1? I ask because in your PR #19115 you only fix the HTTP/1 code. Thanks for clarifying.

@KBaichoo
Contributor

KBaichoo commented Sep 1, 2022

@ztgoto What HTTP protocol is being used, HTTP/1 or HTTP/2? In particular, for HTTP/2, per https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/protocol.proto#config-core-v3-http2protocoloptions, max_concurrent_streams defaults to 2147483647, which I've seen skew some benchmarks before.
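
If HTTP/2 is in play, something like the following inside the HttpConnectionManager typed_config would cap it for the benchmark (the value here is only illustrative):

http2_protocol_options:
  max_concurrent_streams: 100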

@ztgoto

ztgoto commented Sep 2, 2022

@ztgoto PTAL @ https://www.envoyproxy.io/docs/envoy/latest/faq/performance/how_to_benchmark_envoy

In particular, can you disable circuit-breaking and re-run your benchmark (per recommendations in the benchmarking doc above). It's not clear to me if that's a bottleneck in your case, but you have not configured circuit-breaking on the nginx side so it seems appropriate to keep the behavior consistent.

Also, did you build your own Envoy from source? Or use a pre-built package? If you built it yourself did you use -c opt?

@jmarantz I used envoyproxy/envoy:v1.22.2 (--network host) for testing. The config file is as stated above; I don't know if anything is wrong with it.

envoy

wrk -t 8 -c 32  -d 60s --latency 'http://127.0.0.1:9104/hello'
Running 1m test @ http://127.0.0.1:9104/hello
  8 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.76ms  378.73us   6.53ms   66.32%
    Req/Sec     1.45k    65.16     2.12k    71.77%
  Latency Distribution
     50%    2.69ms
     75%    3.02ms
     90%    3.33ms
     99%    3.67ms
  693601 requests in 1.00m, 156.77MB read
Requests/sec:  11553.84
Transfer/sec:      2.61MB


   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                          
160976 101       20   0 2247104  54512  19432 S 100.3  0.1   2:13.85 envoy                                                            
 32221 root      20   0  706612   8384   1160 S  32.8  0.0   0:04.63 wrk  

nginx

wrk -t 8 -c 32  -d 60s --latency 'http://127.0.0.1:9103/hello'
Running 1m test @ http://127.0.0.1:9103/hello
  8 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.12ms  107.53us  13.65ms   78.66%
    Req/Sec     3.56k   202.05     4.44k    76.00%
  Latency Distribution
     50%    1.12ms
     75%    1.18ms
     90%    1.24ms
     99%    1.37ms
  1701221 requests in 1.00m, 379.64MB read
Requests/sec:  28351.42
Transfer/sec:      6.33MB

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                          
 58685 nobody    20   0   46588   2752   1116 R 100.0  0.0   0:10.66 nginx                                                            
 60538 root      20   0  701960   3728   1136 S  89.4  0.0   0:09.88 wrk

@scheruku-in

Hi,
We have upgraded our Envoy, which was on a ~5-year-old version, to the latest v1.25.3. We ran a few perf tests and noticed what appears to be a 60% increase in CPU and a 5x increase in memory compared to the earlier version of Envoy with the same test. CPU went up to 400m and memory was ~120M. Would someone please review the flame graph and share findings on whether this can be further optimized? Thanks in advance.

Flame graph:
https://github.com/scheruku-in/Envoy_Perf_Test/blob/main/envoy_highcpu.svg

Envoy config:

{
  "listeners": [
    {
      "address": "tcp://0.0.0.0:17600",
      "ssl_context": {
        "ca_cert_file": "envoycacert.pem",
        "cert_chain_file": "cacertchain.pem",
        "private_key_file": "key.pem",
        "alpn_protocols": "h2,http/1.1",
        "alt_alpn_protocols": "http/1.1"
      }, 
      "filters": [
        {
          "type": "read",
          "name": "http_connection_manager",
          "config": {
            "access_log": [
              {
                "format": "[%START_TIME%]  \"%REQ(X-FORWARDED-FOR)%\" - \"%REQ(USER-AGENT)%\"  \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %BYTES_RECEIVED% %BYTES_SENT%  x-request-id = \"%REQ(X-REQUEST-ID)%\" x-global-transaction-id = \"%REQ(X-GLOBAL-TRANSACTION-ID)%\" \"%REQ(:AUTHORITY)%\" \"%UPSTREAM_HOST%\" \"%UPSTREAM_CLUSTER%\"    rt=\"%DURATION%\" uct=\"%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%\" \n",
                "path": "/dev/stdout"
              }],
            "generate_request_id": true,
            "codec_type": "auto",
            "idle_timeout_s": 120,
            "stat_prefix": "ingress_http",
            "route_config": {
              "virtual_hosts": [
                {
                  "name": "service",
                  "domains": ["*"],
                  "require_ssl": "all",
                  "routes": [
                    {
                      "timeout_ms": 120000,
                      "retry_policy": {
                          "retry_on": "gateway-error,connect-failure",
                          "num_retries": 120
                          },
     		          "prefix": "/",
     		          "cluster_header" : "<cluster-header>"
                    }
                  ]
                }
              ]
            },
            "filters": [
              {
                "type": "decoder",
                "name": "router",
                "config": {}
              }
            ]
          }
        }
      ]
    }
  ],
  "admin": {
    "access_log_path": "/dev/stdout",
    "address": "tcp://127.0.0.1:8001"
  },
  "cluster_manager": {
    "clusters": [
    ],
    "cds": {
      "cluster": {
        "name": "cds",
        "connect_timeout_ms": 120000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://localhost:8081"
          }
        ]
      },
      "refresh_delay_ms": 100
    },
    "outlier_detection": {
      "event_log_path": "/dev/stdout"
    }
  }
}

Thanks in advance.

@howardjohn
Contributor

Update from 2024:

Version QPS
1.13 65741
1.14 60302
1.15 52966
1.16 49947
1.17 45257
1.18 42121
1.19 41800
1.20 39946
1.21 39649
1.22 47829
1.23 45302
1.24 42135
1.25 41102
1.26 39932
1.27 41539
1.28 36995
1.29 37273
1.30 37962
# fetch a single file out of a container image using crane
crane-get-file () {
  crane export $1 - | tar -Oxf - $2
}

# pull the Envoy binary out of each release image (v1.13 .. v1.30)
for i in {13..30}; do
  crane-get-file envoyproxy/envoy:v1.$i-latest usr/local/bin/envoy > envoy-$i
  chmod +x envoy-$i
done

# run each binary with a single worker and drive load against it
for i in {13..30}; do
  echo "STARTING $i"
  ./envoy-$i -c config.yaml --concurrency 1 --disable-hot-restart -l off &
  p=$!
  sleep .1
  benchtool -q 0 -d 10 localhost:9090#envoy-1.$i >> res
  kill -9 $p
  echo "ENDING $i"
done

@ztgoto

ztgoto commented Jul 9, 2024

2024 results:
Envoy returns the response directly (direct_response), and the stress testing tool is wrk.
Versions:
envoy: 1.30.4
nginx: 1.26.1
Config:

envoy
./envoy --concurrency 1 -c ./envoy.yaml

static_resources:
  listeners:
  - name: listener
    address:
      socket_address: {address: 0.0.0.0, port_value: 9101}
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: http_test
          codec_type: AUTO
          generate_request_id: false
          route_config:
            name: route
            virtual_hosts:
            - name: test
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                direct_response:
                  status: 200
                  body:
                    inline_string: "{\"message\":\"hello\"}"
                response_headers_to_add:
                - header:
                    key: "Content-Type"
                    value: "application/json"
              #  route:
              #    cluster: auth
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
              dynamic_stats: false

stats_config:
  stats_matcher:
    reject_all: true

nginx

worker_processes  1;

events {
    worker_connections  1024;
}


http {
    include       mime.types;
    default_type  application/octet-stream;

    access_log off;

    sendfile        on;

    keepalive_timeout  65;


    server {
        listen       9101;
        server_name  127.0.0.1;


        location / {
            default_type application/json;
            return 200 '{"message":"hello"}';
        }

        location = /50x.html {
            root   html;
        }

    }


}

result:

envoy:

./wrk -t 8 -c 1000 -d 60s 'http://127.0.0.1:9101'
Running 1m test @ http://127.0.0.1:9101
  8 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    69.34ms    7.74ms 123.77ms   64.42%
    Req/Sec     1.81k   521.16     4.62k    38.42%
  864304 requests in 1.00m, 117.05MB read
Requests/sec:  14382.08
Transfer/sec:      1.95MB

nginx:

./wrk -t 8 -c 1000 -d 60s 'http://127.0.0.1:9101'
Running 1m test @ http://127.0.0.1:9101
  8 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    39.28ms   99.20ms   2.00s    95.30%
    Req/Sec     6.09k   505.22     7.36k    72.21%
  2908817 requests in 1.00m, 479.91MB read
  Socket errors: connect 0, read 4672, write 0, timeout 13
Requests/sec:  48404.51
Transfer/sec:      7.99MB

Flame graph
envoy:
envoy-1.30.4

nginx:
nginx-1.26.1

@zhxie
Contributor

zhxie commented Jul 9, 2024

2024 results: Envoy returns the response directly (direct_response), the stress testing tool is wrk; envoy 1.30.4, nginx 1.26.1; config: …

I have benchmarked Envoy with direct_response once with Fortio with keep-alive disabled. I noticed that Envoy does not close the connection immediately after sending the response, which leads to performance degradation. Your scenario differs from mine, but direct_response isn't commonly used and may not be fully optimized. I suggest testing Envoy in its typical workload as a router.

@ztgoto

ztgoto commented Jul 9, 2024

@zhxie
I've done tests in routing scenarios before and posted the results above; the performance gap is still relatively large.

@philippeboyd

Hi @howardjohn, what is the benchtool CLI that you used? Not sure which program it's referring to.

@jmarantz
Contributor

jmarantz commented Jan 8, 2025

A few notes:

One thing I noticed is that you specify concurrency=1 for Envoy, so it will allocate only one worker thread. In contrast your nginx config specifies 1 process. I am a little unclear on nginx's default behavior, but I think it does have a threadpool.

https://serverfault.com/questions/1098107/what-is-the-ideal-value-for-threads-on-thread-pool-in-nginx-config

Both nginx and envoy will use async i/o to multiplex many requests over a single thread, but I'm not sure how many threads nginx is using in your benchmark. Envoy will mostly use 1 (in addition to the 'main' thread for handling admin requests and config updates).

@howardjohn
Contributor

Hi @howardjohn, what is the benchtool CLI that you used? Not sure which program it's referring to.

docker run --rm --init -it --network=host howardjohn/benchtool

It's just a small wrapper around fortio basically, nothing too fancy.

@ztgoto

ztgoto commented Jan 9, 2025

My understanding is that if Nginx used multiple threads with worker_processes set to 1, then Linux CPU monitoring should show it exceeding 100%. Of course, I'm not sure whether my understanding is correct.

@wbpcode
Member Author

wbpcode commented Jan 9, 2025

One thing I noticed is that you specify concurrency=1 for Envoy, so it will allocate only one worker thread. In contrast your nginx config specifies 1 process. I am a little unclear on nginx's default behavior, but I think it does have a threadpool.

Nginx uses process as the unit of concurrency. So, 1 process == 1 worker for nginx.

cc @jmarantz

@jmarantz
Contributor

jmarantz commented Jan 9, 2025

https://www.f5.com/company/blog/nginx/thread-pools-boost-performance-9x

@howardjohn
Contributor

We don't really need to compare to Nginx though -- Envoy from a few years back has the same threading/process model and is ~2x faster...

@jmarantz
Contributor

jmarantz commented Jan 9, 2025

FWIW my team's product does its own benchmarking of our deployment and we have not seen that degradation.

@jmarantz
Contributor

jmarantz commented Jan 9, 2025

I'm also wondering about test methodology. Can we repro the older test results from saved docker packages?

@howardjohn
Contributor

howardjohn commented Jan 9, 2025

I'm also wondering about test methodology. Can we repro the older test results from saved docker packages?

Yes -- see #19103 (comment). I reproduced it today again as well and see the same results.

Note config.yaml is the one from the original issue.

Here are some more results with nighthawk as a client FWIW, similar trend

DEST        CLIENT     QPS    CONS  DUR  PAYLOAD  SUCCESS  THROUGHPUT   P50      P90      P99
# 30k fixed QPS
envoy-1.13  nighthawk  30000  16    10   0        299999   30000.00qps  0.029ms  0.065ms  0.475ms
envoy-1.32  nighthawk  30000  16    10   0        299997   30000.00qps  0.090ms  0.277ms  0.661ms
# 60k fixed QPS. Note we fail to hit the target with 1.32 so latency should be ignored
envoy-1.13  nighthawk  60000  16    10   0        599997   59999.93qps  0.040ms  0.222ms  0.389ms
envoy-1.32  nighthawk  60000  16    10   0        438224   43823.96qps  0.338ms  0.408ms  0.638ms
# 20k fixed QPS with larger payloads
envoy-1.13  nighthawk  20000  16    10   1024     199999   19999.99qps  0.029ms  0.049ms  0.354ms
envoy-1.32  nighthawk  20000  16    10   1024     199999   19999.98qps  0.056ms  0.143ms  0.793ms

(note: I fully acknowledge there are more robust ways to measure performance, but at the differences we are talking about I think it's fair to say there is a substantial change)

@wbpcode
Member Author

wbpcode commented Jan 13, 2025

https://www.f5.com/company/blog/nginx/thread-pools-boost-performance-9x

This is a feature I never used. haha

@wbpcode
Member Author

wbpcode commented Jan 13, 2025

I'm also wondering about test methodology. Can we repro the older test results from saved docker packages?

Yes -- see #19103 (comment). I reproduced it today again as well and see the same results.

Note config.yaml is the one from the original issue.

Here are some more results with nighthawk as a client FWIW, similar trend

DEST        CLIENT     QPS    CONS  DUR  PAYLOAD  SUCCESS  THROUGHPUT   P50      P90      P99
# 30k fixed QPS
envoy-1.13  nighthawk  30000  16    10   0        299999   30000.00qps  0.029ms  0.065ms  0.475ms
envoy-1.32  nighthawk  30000  16    10   0        299997   30000.00qps  0.090ms  0.277ms  0.661ms
# 60k fixed QPS. Note we fail to hit the target with 1.32 so latency should be ignored
envoy-1.13  nighthawk  60000  16    10   0        599997   59999.93qps  0.040ms  0.222ms  0.389ms
envoy-1.32  nighthawk  60000  16    10   0        438224   43823.96qps  0.338ms  0.408ms  0.638ms
# 20k fixed QPS with larger payloads
envoy-1.13  nighthawk  20000  16    10   1024     199999   19999.99qps  0.029ms  0.049ms  0.354ms
envoy-1.32  nighthawk  20000  16    10   1024     199999   19999.98qps  0.056ms  0.143ms  0.793ms

(note: I fully acknowledge there are more robust ways to measure performance, but at the differences we are talking about I think it's fair to say there is a substantial change)

There's no doubt the change exists, and it's pretty hard to optimize it back. Orz.

@triplewy

triplewy commented Feb 2, 2025

Something else I've noticed about Envoy (v1.30) is that it tends to completely collapse under high load (i.e. the success rate suddenly drops to 0% and oscillates between 100% and 0% every 10 minutes), whereas Nginx will have an increased response time but the success rate will be little affected. We had to completely roll back our edge proxy adoption of Envoy for this reason.

Another issue we noticed is that in some locations with poor network performance, Envoy's average CPU utilization would only reach 85% before it began to collapse. This does not happen with Nginx.

We are actually ok with Envoy potentially using 10-20% more CPU than Nginx but its tendency to completely collapse under high load made us give up our Envoy adoption for our edge footprint.

It's very tough for us to debug what causes Envoy to collapse like this because a flamegraph does not show why Envoy does not fully utilize all of its allocated CPU cores.

@jmarantz
Contributor

jmarantz commented Feb 2, 2025

@triplewy would you be able to take some CPU flame-graphs to see where the system is bottlenecking? Have you looked at watchdog timeout stats, or epoll histograms?
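
If it helps, the watchdog counters (and the dispatcher stats, if you have them enabled) are visible on the admin endpoint, e.g. (adjust the address and port to your admin config):

curl -s http://127.0.0.1:8001/stats | grep -E 'watchdog|dispatcher'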

One possible fail-point in Envoy -- this is true of nginx also IIUC -- is that threads are precious and if anything being called in the data plane unexpectedly blocks that can cause problems.

One thing that comes to mind is access logs. Do you have those enabled? What format are you using?

@krajshiva
Contributor

krajshiva commented Feb 2, 2025 via email

@triplewy

triplewy commented Feb 4, 2025

@jmarantz I re-deployed Envoy to a few of our edge clusters and was able to reproduce the oscillation behavior. What seems to be happening is:

  1. Significant spike in downstream active connections causes CPU spike
  2. Higher CPU load leads to higher upstream response time
  3. Higher upstream response time leads to upstream pending requests
  4. Upstream pending requests leads to spike in new upstream connections (We only use H1 for upstream)
  5. Upstream connections are immediately closed after they are created
  6. Upstream connection churn leads to failed connections and continued increasing upstream RT
  7. Elevated upstream RT causes downstream clients to send resets, leading to closed upstream connections and drop in success rate.
  8. Downstream active connections drops, Envoy recovers, and process repeats.

Image

During all of this, the event dispatcher and watchdog stats are relatively stable at normal levels. The concerning aspects of this are:

  1. Envoy's CPU utilization is unable to remain consistently high, preventing our auto-scaler from increasing the instance count. We could use another stat to autoscale, but that would not make Envoy fully utilize its allocated resources.
  2. Envoy tends to completely collapse rather than slow down processing time. The driving force behind this seems to be the immediate closing of new upstream connections but we are unsure why this happens. Using a circuit breaker to prevent excessive upstream connections simply exacerbates the problem since requests are still pending.

If we can figure out why upstream connections are immediately closed in these situations, we may be able to completely prevent Envoy from collapsing.

Below is a CPU flamegraph of our Envoy instance during busy time. We have access logs enabled but so does our Nginx setup. envoy_perf.svg
