some performance results of envoy's different versions #19103

Open
wbpcode opened this issue Nov 25, 2021 · 66 comments
Labels
area/perf · help wanted (Needs help!) · investigate (Potential bug that needs verification)

Comments

@wbpcode
Member

wbpcode commented Nov 25, 2021

I ran some simple tests today and found that Envoy's performance seems to be getting steadily worse. I know features always come at some cost, but these costs seem too high.

I also know that performance is not Envoy's first goal, but as Envoy keeps gaining features its performance seems to degrade too quickly. Here are some simple results with a single Envoy worker:

Version QPS
v1.12.4 26921.48
v1.13.3 25182.18
v1.14.7 23732.31
v1.15.5 21010.66
v1.16.5 19116.81
v1.17.4 17804.78
v1.18.4 16953.67
v1.19.1 16046.59
v1.20.0 15949.65

I think we should at least prevent it from deteriorating further, and at the same time look for ways to optimize it.

Here is my config yaml:

static_resources:
  listeners:
    - address:
        socket_address:
          address: 0.0.0.0
          port_value: 9090
      filter_chains:
        - filters:
            - name: envoy.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                codec_type: auto
                stat_prefix: ingress_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: host-one # prefix route
                      domains:
                        - "*"
                      routes:
                        - match:
                            prefix: "/"
                          route:
                            cluster: httpbin
                http_filters:
                  - name: envoy.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                      dynamic_stats: false
                      suppress_envoy_headers: true
  clusters:
    - name: httpbin
      connect_timeout: 5s
      type: strict_dns
      lb_policy: round_robin
      load_assignment:
        cluster_name: httpbin
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: localhost
                      port_value: 8080

admin:
  access_log_path: "/dev/null"
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 20000

Back end: a multi-process Nginx that returns a simple short string.

Client command: wrk -c 200 -t 2 -d 180s http://localhost:9090/

@wbpcode added the triage (Issue requires triage) label Nov 25, 2021
@wbpcode
Member Author

wbpcode commented Nov 25, 2021

cc @mattklein123 @rojkov

@zhxie
Contributor

zhxie commented Nov 25, 2021

Have you ever profiled Envoy with tools like perf, so that we can find some hotspots from the result?

@wbpcode
Member Author

wbpcode commented Nov 25, 2021

Have you ever profiled Envoy with tools like perf, so that we can find some hotspots from the result?

I will do it this weekend. 😄

@rojkov added the area/perf and investigate (Potential bug that needs verification) labels and removed the triage (Issue requires triage) label Nov 25, 2021
@rojkov
Member

rojkov commented Nov 25, 2021

This feels like a duplicate of #13412.

@wbpcode
Member Author

wbpcode commented Nov 25, 2021

This feels like a duplicate of #13412.

Yep. Looks like the degradation never stops.

@wbpcode
Member Author

wbpcode commented Nov 25, 2021

Here are some flame graphs. I only have two different binaries with code symbols, but considering the huge performance gap between them, I think they are enough as a reference.

v1.17.4:
https://drive.google.com/file/d/1aZuC54PmIXsQu7k88jBeU-HENOEiQXgT/view?usp=sharing

v1.12.2:
https://drive.google.com/file/d/1o6kV8T2J5nCs3m7lHaXSdXaudjzNJ9Sv/view?usp=sharing

v1.20.x
https://drive.google.com/file/d/1lU539aRFzOCrR16EFTjKwu0MtzV03_kr/view?usp=sharing

cc @mattklein123

@wbpcode
Member Author

wbpcode commented Nov 26, 2021

There are no obvious hotspots in the flame graphs, just a very uniform slowdown. It looks like more encapsulation and abstraction has gradually reduced Envoy's performance.

@jmarantz
Contributor

jmarantz commented Nov 26, 2021

I'm not sure if I buy that incremental changes in encapsulation and abstraction are likely to cause that much slowdown. I glanced at one of the flame-graphs and it's hard to know what to improve there.

It might help to capture some perf-graphs from an instrumented binary that can provide more detail on what's going on. The purely sampled view we get from these flame-graphs might (for example, as a guess) hide some effects of changes in the way we make the networking system calls. E.g. we spend a lot of time in the 1.20 flamegraph in the kernel. Are we making larger numbers of calls for smaller chunks of data, for any reason?

It would be nice to get a controlled repro of this, preferably using Nighthawk, and then run Envoy under callgrind or compiled with various perf tools, so we can see call-counts for various functions. Using this we could compare how we are structuring our system calls to the earlier version of Envoy.

@wbpcode
Member Author

wbpcode commented Nov 26, 2021

I'm not sure if I buy that incremental changes in encapsulation and abstraction are likely to cause that much slowdown. I glanced at one of the flame-graphs and it's hard to know what to improve there.

It might help to capture some perf-graphs from an instrumented binary that can provide more detail on what's going on. The purely sampled view we get from these flame-graphs might (for example, as a guess) hide some effects of changes in the way we make the networking system calls. E.g. we spend a lot of time in the 1.20 flamegraph in the kernel. Are we making larger numbers of calls for smaller chunks of data, for any reason?

It would be nice to get a controlled repro of this, preferably using Nighthawk, and then run Envoy under callgrind or compiled with various perf tools, so we can see call-counts for various functions. Using this we could compare how we are structuring our system calls to the earlier version of Envoy.

I will try to do more investigation.

@mattklein123
Member

Related to @jmarantz's comment, over time we have generally moved to secure-by-default configuration. This is almost universally at odds with "horse race" benchmarks. So we will need to very carefully tease apart any differences. The work is useful but it's a lot more complicated than just looking at flame graphs. I also agree with @jmarantz that trying to run a reproducer under something like cachegrind will be a lot more informative.

@jmarantz
Contributor

Just to point this out: Matt suggests cachegrind and I suggested callgrind: they are related and both very useful:

  • they both are generated using the valgrind infrastructure (--tool=callgrind vs --tool=cachegrind)
  • cachegrind has a lot more detail of the simulated processor cache effects of each line of code
  • callgrind has more detail about time spent and call-counts at a function-level
  • they both benefit from an OptDebug compilation. I use --compilation_mode=opt --cxxopt=-g --cxxopt=-ggdb3 but those options might be dated as I haven't been able to do much direct technical work lately
  • they both generate reports that can be visualized with kcachegrind
  • they both run much slower than real time and I wouldn't run production traffic through them. But it should be no problem for nighthawk or any other synthetic load tool
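
Roughly, the workflow looks like this (just a sketch; the exact Bazel target and flags may need adjusting, and I haven't re-verified them recently):

# OptDebug-style build (flags as above; the usual static binary target is assumed)
bazel build --compilation_mode=opt --cxxopt=-g --cxxopt=-ggdb3 //source/exe:envoy-static

# run Envoy under callgrind (or --tool=cachegrind), single worker
valgrind --tool=callgrind ./bazel-bin/source/exe/envoy-static -c envoy.yaml --concurrency 1

# drive load with Nighthawk/wrk/hey against the listener, stop Envoy, then inspect the output
kcachegrind callgrind.out.<pid>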

@wbpcode
Member Author

wbpcode commented Nov 27, 2021

@mattklein123 @jmarantz Thanks very much for all your suggestions. I will put more effort into tracking and investigating this issue over the coming days. 🌷

At present, I still subjectively think it may be caused by a large number of small changes. For example, #13423 introduced a small update that replaces all whitespace characters in the response code details, which adds some minor new overhead. If there are many similar PRs, the accumulation of these small overheads can still have a large enough impact, and because they are so scattered, they may also be difficult to locate.

But this is just my personal guess. We still need more research to identify the problem and try to solve it.

Of course, if it is just because of some more secure default configuration, then at most only the documentation needs to be updated.

@wbpcode
Member Author

wbpcode commented Nov 27, 2021

2021/11/27: callgrind.out files of v1.12.2 vs v1.20.x (100,000 HTTP/1.1 requests with wrk).

https://drive.google.com/drive/folders/1EWTjixvN43O8u24a_rJF8S6A4ePvFwC1?usp=sharing

v1.20.x

v1.12.2

First point: encodeFormattedHeader introduced ~3% external CPU overhead.
Root cause: too much fine-grained buffer API access.
Related PR: #9825

@jmarantz
Contributor

Great -- would you be able to supply the parameters passed to Nighthawk (preferably) or whatever tool you were using to generate the load for these traces?

Thanks!

@jmarantz
Contributor

This data is great! It really shows the profiles look very different. Just sorting by 'self' and comparing the two views provides a ton of possible issues to go explore.

Did you use an "OptDebug" build or was this a stock build or something?

@wbpcode
Member Author

wbpcode commented Nov 29, 2021

I used the simple tool hey to generate 100,000 HTTP/1 requests. Here is the command:

hey -n 100000 -c 20 http://localhost:9090

@jmarantz
Contributor

I had not heard of hey. Is that this? https://github.com/rakyll/hey

@wbpcode
Member Author

wbpcode commented Nov 29, 2021

Did you use an "OptDebug" build or was this a stock build or something?

I used the unstripped binary built with ci/do_ci.sh bazel.release. It finally worked, but I'm not sure whether it is an OptDebug build.

@wbpcode
Member Author

wbpcode commented Nov 29, 2021

I had not heard of hey. Is that this? https://github.com/rakyll/hey

Yes. I generally use wrk, but it seems that wrk cannot generate a fixed amount of load.

@jmarantz
Contributor

Nighthawk (https://github.com/envoyproxy/nighthawk) is really what we want to converge to, as you get explicit control of open/closed loop, ability to generate http2, async vs concurrent, an indication of how many requests succeeded/failed, etc.

By "fixed amount of load" does this mean a finite number of requests? Or a fixed rate of requests?

The docker images are definitely not in OptDebug mode :) They are probably simply optimized, which is OK, but we'll have a lot less details on call-stack, where in functions time is being spent, etc.
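
For example, an invocation along these lines (flag names from memory, so please double-check against nighthawk_client --help):

nighthawk_client --rps 10000 --duration 60 --connections 20 --concurrency 1 http://localhost:9090/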

@wbpcode
Member Author

wbpcode commented Nov 29, 2021

Nighthawk (https://github.com/envoyproxy/nighthawk) is really what we want to converge to, as you get explicit control of open/closed loop, ability to generate http2, async vs concurrent, an indication of how many requests succeeded/failed, etc.

Thanks, I will try it in the coming test.

The docker images are definitely not in OptDebug mode :) They are probably simply optimized, which is OK, but we'll have a lot less details on call-stack, where in functions time is being spent, etc.

I see, thanks. 🌷 I will try to build a new binary with the compile args you suggested and do some more investigation. But recently I only have enough time on weekends.

@jmarantz
Contributor

No worries -- the data you supplied is great. We'll definitely want to repro with NH though so we understand what we are comparing :)

My suspicion is that you've found something real, and based on the traces I looked at, there were changes in either tcmalloc's implementation, the way we allocate buffers, or both. The encodeFormattedHeader hot-spot probably deserves a quick look also as that is a relatively recent feature. Did you wind up turning that on?

What did you have running on localhost:8080? An Apache server or something?

@wbpcode
Member Author

wbpcode commented Nov 29, 2021

What did you have running on localhost:8080? An Apache server or something?

A multi-process Nginx that returns a simple short string directly.

Did you wind up turning that on?

In fact, no. Nevertheless, the new version of encodeFormatedHeader/encodeHeader still introduces more overhead. The reason is that encodeHeaders calls encodeFormatedHeader several times and writes fine-grained pieces of data (a character, a space, a header name, etc.) to the (watermark) buffer directly.
The new version does not use a local cache to speed up the writing of these fine-grained pieces.
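
To illustrate the shape of the problem, here is a rough sketch. The Buffer type below is made up for illustration; it is not the real Envoy buffer API, it just contrasts many tiny buffer writes with one batched write:

// Illustrative sketch only -- a stand-in Buffer, not Envoy's Buffer::Instance.
#include <iostream>
#include <string>
#include <string_view>

struct Buffer {
  std::string data;
  size_t add_calls = 0;
  void add(std::string_view chunk) { data.append(chunk); ++add_calls; }
};

// Old shape: every key, separator and value is a separate buffer write.
void encodeHeaderFineGrained(Buffer& out, std::string_view key, std::string_view value) {
  out.add(key);
  out.add(": ");
  out.add(value);
  out.add("\r\n");
}

// New shape (the idea behind #19115): accumulate into a local scratch string
// and flush the whole header block with a single buffer write.
void encodeHeaderBatched(std::string& scratch, std::string_view key, std::string_view value) {
  scratch.append(key).append(": ").append(value).append("\r\n");
}

int main() {
  Buffer fine, batched;
  std::string scratch;
  for (int i = 0; i < 8; ++i) {
    encodeHeaderFineGrained(fine, "x-header", "value");
    encodeHeaderBatched(scratch, "x-header", "value");
  }
  batched.add(scratch);  // one write for the whole header block
  std::cout << fine.add_calls << " vs " << batched.add_calls << " buffer writes\n";  // 32 vs 1
}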

I've created a PR #19115 to try to solve this problem.

@mattklein123 added the help wanted (Needs help!) label Nov 29, 2021
@hobbytp

hobbytp commented Nov 30, 2021

@wbpcode do you mean the issue only exists in HTTP/1, or did you only test with HTTP/1? I ask because in your PR #19115 you only fix the HTTP/1 code. Thanks for clarifying.

@KBaichoo
Contributor

KBaichoo commented Sep 1, 2022

@ztgoto What HTTP protocol is being used, HTTP/1 or HTTP/2? In particular, for HTTP/2, per https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/protocol.proto#config-core-v3-http2protocoloptions, max_concurrent_streams defaults to 2147483647, which I've seen skew some benchmarks before.
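
If HTTP/2 is in play, something like the following inside the HttpConnectionManager typed_config would cap it for the benchmark (the value here is only illustrative):

http2_protocol_options:
  max_concurrent_streams: 100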

@ztgoto

ztgoto commented Sep 2, 2022

@ztgoto PTAL @ https://www.envoyproxy.io/docs/envoy/latest/faq/performance/how_to_benchmark_envoy

In particular, can you disable circuit-breaking and re-run your benchmark (per recommendations in the benchmarking doc above). It's not clear to me if that's a bottleneck in your case, but you have not configured circuit-breaking on the nginx side so it seems appropriate to keep the behavior consistent.

Also, did you build your own Envoy from source? Or use a pre-built package? If you built it yourself did you use -c opt?

@jmarantz I used envoyproxy/envoy:v1.22.2 (--network host) for testing. The config file is as stated above; I don't know if anything is wrong with it.

envoy

wrk -t 8 -c 32  -d 60s --latency 'http://127.0.0.1:9104/hello'
Running 1m test @ http://127.0.0.1:9104/hello
  8 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.76ms  378.73us   6.53ms   66.32%
    Req/Sec     1.45k    65.16     2.12k    71.77%
  Latency Distribution
     50%    2.69ms
     75%    3.02ms
     90%    3.33ms
     99%    3.67ms
  693601 requests in 1.00m, 156.77MB read
Requests/sec:  11553.84
Transfer/sec:      2.61MB


   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                          
160976 101       20   0 2247104  54512  19432 S 100.3  0.1   2:13.85 envoy                                                            
 32221 root      20   0  706612   8384   1160 S  32.8  0.0   0:04.63 wrk  

nginx

wrk -t 8 -c 32  -d 60s --latency 'http://127.0.0.1:9103/hello'
Running 1m test @ http://127.0.0.1:9103/hello
  8 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.12ms  107.53us  13.65ms   78.66%
    Req/Sec     3.56k   202.05     4.44k    76.00%
  Latency Distribution
     50%    1.12ms
     75%    1.18ms
     90%    1.24ms
     99%    1.37ms
  1701221 requests in 1.00m, 379.64MB read
Requests/sec:  28351.42
Transfer/sec:      6.33MB

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                          
 58685 nobody    20   0   46588   2752   1116 R 100.0  0.0   0:10.66 nginx                                                            
 60538 root      20   0  701960   3728   1136 S  89.4  0.0   0:09.88 wrk

@scheruku-in

Hi,
We have upgraded our Envoy, which was on a ~5-year-old version, to the latest v1.25.3. We ran a few perf tests and noticed what appears to be a 60% increase in CPU and a 5x increase in memory compared to the earlier version of Envoy with the same test. CPU went up to 400m and memory was ~120M. Would someone please review the flame graph and share findings on whether this can be further optimized? Thanks in advance.

Flame graph:
https://github.com/scheruku-in/Envoy_Perf_Test/blob/main/envoy_highcpu.svg

Envoy config:

{
  "listeners": [
    {
      "address": "tcp://0.0.0.0:17600",
      "ssl_context": {
        "ca_cert_file": "envoycacert.pem",
        "cert_chain_file": "cacertchain.pem",
        "private_key_file": "key.pem",
        "alpn_protocols": "h2,http/1.1",
        "alt_alpn_protocols": "http/1.1"
      }, 
      "filters": [
        {
          "type": "read",
          "name": "http_connection_manager",
          "config": {
            "access_log": [
              {
                "format": "[%START_TIME%]  \"%REQ(X-FORWARDED-FOR)%\" - \"%REQ(USER-AGENT)%\"  \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %BYTES_RECEIVED% %BYTES_SENT%  x-request-id = \"%REQ(X-REQUEST-ID)%\" x-global-transaction-id = \"%REQ(X-GLOBAL-TRANSACTION-ID)%\" \"%REQ(:AUTHORITY)%\" \"%UPSTREAM_HOST%\" \"%UPSTREAM_CLUSTER%\"    rt=\"%DURATION%\" uct=\"%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%\" \n",
                "path": "/dev/stdout"
              }],
            "generate_request_id": true,
            "codec_type": "auto",
            "idle_timeout_s": 120,
            "stat_prefix": "ingress_http",
            "route_config": {
              "virtual_hosts": [
                {
                  "name": "service",
                  "domains": ["*"],
                  "require_ssl": "all",
                  "routes": [
                    {
                      "timeout_ms": 120000,
                      "retry_policy": {
                          "retry_on": "gateway-error,connect-failure",
                          "num_retries": 120
                          },
     		          "prefix": "/",
     		          "cluster_header" : "<cluster-header>"
                    }
                  ]
                }
              ]
            },
            "filters": [
              {
                "type": "decoder",
                "name": "router",
                "config": {}
              }
            ]
          }
        }
      ]
    }
  ],
  "admin": {
    "access_log_path": "/dev/stdout",
    "address": "tcp://127.0.0.1:8001"
  },
  "cluster_manager": {
    "clusters": [
    ],
    "cds": {
      "cluster": {
        "name": "cds",
        "connect_timeout_ms": 120000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://localhost:8081"
          }
        ]
      },
      "refresh_delay_ms": 100
    },
    "outlier_detection": {
      "event_log_path": "/dev/stdout"
    }
  }
}

Thanks in advance.

@howardjohn
Contributor

Update from 2024:

Version QPS
1.13 65741
1.14 60302
1.15 52966
1.16 49947
1.17 45257
1.18 42121
1.19 41800
1.20 39946
1.21 39649
1.22 47829
1.23 45302
1.24 42135
1.25 41102
1.26 39932
1.27 41539
1.28 36995
1.29 37273
1.30 37962
# fetch a single file out of a container image using crane
crane-get-file () {
  crane export $1 - | tar -Oxf - $2
}

# pull the Envoy binary out of each release image (v1.13 .. v1.30)
for i in {13..30}; do
  crane-get-file envoyproxy/envoy:v1.$i-latest usr/local/bin/envoy > envoy-$i
  chmod +x envoy-$i
done

# run each binary with a single worker and drive load against it
for i in {13..30}; do
  echo "STARTING $i"
  ./envoy-$i -c config.yaml --concurrency 1 --disable-hot-restart -l off &
  p=$!
  sleep .1
  benchtool -q 0 -d 10 localhost:9090#envoy-1.$i >> res
  kill -9 $p
  echo "ENDING $i"
done

@ztgoto

ztgoto commented Jul 9, 2024

2024 results:
Envoy returns the response directly (direct_response), and the stress testing tool is wrk.
Versions:
envoy: 1.30.4
nginx: 1.26.1
Config:

envoy
./envoy --concurrency 1 -c ./envoy.yaml

static_resources:
  listeners:
  - name: listener
    address:
      socket_address: {address: 0.0.0.0, port_value: 9101}
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: http_test
          codec_type: AUTO
          generate_request_id: false
          route_config:
            name: route
            virtual_hosts:
            - name: test
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                direct_response:
                  status: 200
                  body:
                    inline_string: "{\"message\":\"hello\"}"
                response_headers_to_add:
                - header:
                    key: "Content-Type"
                    value: "application/json"
              #  route:
              #    cluster: auth
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
              dynamic_stats: false

stats_config:
  stats_matcher:
    reject_all: true

nginx

worker_processes  1;

events {
    worker_connections  1024;
}


http {
    include       mime.types;
    default_type  application/octet-stream;

    access_log off;

    sendfile        on;

    keepalive_timeout  65;


    server {
        listen       9101;
        server_name  127.0.0.1;


        location / {
            default_type application/json;
            return 200 '{"message":"hello"}';
        }

        location = /50x.html {
            root   html;
        }

    }


}

result:

envoy:

./wrk -t 8 -c 1000 -d 60s 'http://127.0.0.1:9101'
Running 1m test @ http://127.0.0.1:9101
  8 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    69.34ms    7.74ms 123.77ms   64.42%
    Req/Sec     1.81k   521.16     4.62k    38.42%
  864304 requests in 1.00m, 117.05MB read
Requests/sec:  14382.08
Transfer/sec:      1.95MB

nginx:

./wrk -t 8 -c 1000 -d 60s 'http://127.0.0.1:9101'
Running 1m test @ http://127.0.0.1:9101
  8 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    39.28ms   99.20ms   2.00s    95.30%
    Req/Sec     6.09k   505.22     7.36k    72.21%
  2908817 requests in 1.00m, 479.91MB read
  Socket errors: connect 0, read 4672, write 0, timeout 13
Requests/sec:  48404.51
Transfer/sec:      7.99MB

Flame graph
envoy:
envoy-1.30.4

nginx:
nginx-1.26.1

@zhxie
Contributor

zhxie commented Jul 9, 2024

2024 results: Envoy returns the response directly (direct_response), the stress testing tool is wrk; envoy 1.30.4, nginx 1.26.1; config: …

I have benchmarked Envoy with direct_response once with Fortio with keep-alive disabled. I noticed that Envoy does not close the connection immediately after sending the response, which leads to performance degradation. Your scenario differs from mine, but direct_response isn't commonly used and may not be fully optimized. I suggest testing Envoy in its typical workload as a router.

@ztgoto

ztgoto commented Jul 9, 2024

@zhxie
I've done tests in routing scenarios before and posted the results above; the performance gap is still relatively large.

@philippeboyd

Hi @howardjohn, what is the benchtool CLI that you used? Not sure which program it's referring to.

@jmarantz
Contributor

jmarantz commented Jan 8, 2025

A few notes:

One thing I noticed is that you specify concurrency=1 for Envoy, so it will allocate only one worker thread. In contrast your nginx config specifies 1 process. I am a little unclear on nginx's default behavior, but I think it does have a threadpool.

https://serverfault.com/questions/1098107/what-is-the-ideal-value-for-threads-on-thread-pool-in-nginx-config

Both nginx and envoy will use async i/o to multiplex many requests over a single thread, but I'm not sure how many threads nginx is using in your benchmark. Envoy will mostly use 1 (in addition to the 'main' thread for handling admin requests and config updates).

@howardjohn
Contributor

Hi @howardjohn, what is the benchtool CLI that you used? Not sure which program it's referring to.

docker run --rm --init -it --network=host howardjohn/benchtool

It's just a small wrapper around fortio basically, nothing too fancy.

@ztgoto

ztgoto commented Jan 9, 2025

My understanding is that if Nginx used multiple threads with worker_processes set to 1, then Linux CPU monitoring should show it exceeding 100%. Of course, I'm not sure whether my understanding is correct.

@wbpcode
Member Author

wbpcode commented Jan 9, 2025

One thing I noticed is that you specify concurrency=1 for Envoy, so it will allocate only one worker thread. In contrast your nginx config specifies 1 process. I am a little unclear on nginx's default behavior, but I think it does have a threadpool.

Nginx uses process as the unit of concurrency. So, 1 process == 1 worker for nginx.

cc @jmarantz

@jmarantz
Contributor

jmarantz commented Jan 9, 2025

https://www.f5.com/company/blog/nginx/thread-pools-boost-performance-9x

@howardjohn
Contributor

We don't really need to compare to Nginx though -- Envoy from a few years back has the same threading/process model and is ~2x faster...

@jmarantz
Contributor

jmarantz commented Jan 9, 2025

FWIW my team's product does its own benchmarking of our deployment and we have not seen that degradation.

@jmarantz
Contributor

jmarantz commented Jan 9, 2025

I'm also wondering about test methodology. Can we repro the older test results from saved docker packages?

@howardjohn
Contributor

howardjohn commented Jan 9, 2025

I'm also wondering about test methodology. Can we repro the older test results from saved docker packages?

Yes -- see #19103 (comment). I reproduced it today again as well and see the same results.

Note config.yaml is the one from the original issue.

Here are some more results with nighthawk as a client FWIW, similar trend

DEST        CLIENT     QPS    CONS  DUR  PAYLOAD  SUCCESS  THROUGHPUT   P50      P90      P99
# 30k fixed QPS
envoy-1.13  nighthawk  30000  16    10   0        299999   30000.00qps  0.029ms  0.065ms  0.475ms
envoy-1.32  nighthawk  30000  16    10   0        299997   30000.00qps  0.090ms  0.277ms  0.661ms
# 60k fixed QPS. Note we fail to hit the target with 1.32 so latency should be ignored
envoy-1.13  nighthawk  60000  16    10   0        599997   59999.93qps  0.040ms  0.222ms  0.389ms
envoy-1.32  nighthawk  60000  16    10   0        438224   43823.96qps  0.338ms  0.408ms  0.638ms
# 20k fixed QPS with larger payloads
envoy-1.13  nighthawk  20000  16    10   1024     199999   19999.99qps  0.029ms  0.049ms  0.354ms
envoy-1.32  nighthawk  20000  16    10   1024     199999   19999.98qps  0.056ms  0.143ms  0.793ms

(note: I fully acknowledge there are more robust ways to measure performance, but at the differences we are talking about I think it's fair to say there is a substantial change)

@wbpcode
Member Author

wbpcode commented Jan 13, 2025

https://www.f5.com/company/blog/nginx/thread-pools-boost-performance-9x

This is a feature I never used. haha

@wbpcode
Member Author

wbpcode commented Jan 13, 2025

I'm also wondering about test methodology. Can we repro the older test results from saved docker packages?

Yes -- see #19103 (comment). I reproduced it today again as well and see the same results.

Note config.yaml is the one from the original issue.

Here are some more results with nighthawk as a client FWIW, similar trend

DEST        CLIENT     QPS    CONS  DUR  PAYLOAD  SUCCESS  THROUGHPUT   P50      P90      P99
# 30k fixed QPS
envoy-1.13  nighthawk  30000  16    10   0        299999   30000.00qps  0.029ms  0.065ms  0.475ms
envoy-1.32  nighthawk  30000  16    10   0        299997   30000.00qps  0.090ms  0.277ms  0.661ms
# 60k fixed QPS. Note we fail to hit the target with 1.32 so latency should be ignored
envoy-1.13  nighthawk  60000  16    10   0        599997   59999.93qps  0.040ms  0.222ms  0.389ms
envoy-1.32  nighthawk  60000  16    10   0        438224   43823.96qps  0.338ms  0.408ms  0.638ms
# 20k fixed QPS with larger payloads
envoy-1.13  nighthawk  20000  16    10   1024     199999   19999.99qps  0.029ms  0.049ms  0.354ms
envoy-1.32  nighthawk  20000  16    10   1024     199999   19999.98qps  0.056ms  0.143ms  0.793ms

(note: I fully acknowledge there are more robust ways to measure performance, but at the differences we are talking about I think it's fair to say there is a substantial change)

There's no doubt the change exists, and it's pretty hard to optimize it back. Orz.

@triplewy

triplewy commented Feb 2, 2025

Something else I've noticed about Envoy (v1.30) is that it tends to completely collapse under high load (i.e. the success rate suddenly drops to 0% and oscillates between 100% and 0% every 10 minutes), whereas Nginx will have an increased response time but the success rate will be little affected. We had to completely roll back our edge proxy adoption of Envoy for this reason.

Another issue we noticed is that in some locations with poor network performance, Envoy's average CPU utilization would only reach 85% before it began to collapse. This does not happen with Nginx.

We are actually ok with Envoy potentially using 10-20% more CPU than Nginx but its tendency to completely collapse under high load made us give up our Envoy adoption for our edge footprint.

It's very tough for us to debug what causes Envoy to collapse like this because a flamegraph does not show why Envoy does not fully utilize all of its allocated CPU cores.

@jmarantz
Contributor

jmarantz commented Feb 2, 2025

@triplewy would you be able to take some CPU flame-graphs to see where the system is bottlenecking? Have you looked at watchdog timeout stats, or epoll histograms?
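
If it helps, the watchdog counters (and the dispatcher stats, if you have them enabled) are visible on the admin endpoint, e.g. (adjust the address and port to your admin config):

curl -s http://127.0.0.1:8001/stats | grep -E 'watchdog|dispatcher'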

One possible fail-point in Envoy -- this is true of nginx also IIUC -- is that threads are precious and if anything being called in the data plane unexpectedly blocks that can cause problems.

One thing that comes to mind is access logs. Do you have those enabled? What format are you using?

@krajshiva
Contributor

krajshiva commented Feb 2, 2025 via email

@triplewy

triplewy commented Feb 4, 2025

@jmarantz I re-deployed Envoy to a few of our edge clusters and was able to reproduce the oscillation behavior. What seems to be happening is:

  1. Significant spike in downstream active connections causes CPU spike
  2. Higher CPU load leads to higher upstream response time
  3. Higher upstream response time leads to upstream pending requests
  4. Upstream pending requests leads to spike in new upstream connections (We only use H1 for upstream)
  5. Upstream connections are immediately closed after they are created
  6. Upstream connection churn leads to failed connections and continued increasing upstream RT
  7. Elevated upstream RT causes downstream clients to send resets, leading to closed upstream connections and drop in success rate.
  8. Downstream active connections drops, Envoy recovers, and process repeats.

Image

During all of this, the event dispatcher and watchdog stats are relatively stable at normal levels. The concerning aspects of this are:

  1. Envoy's CPU utilization is unable to remain consistently high, preventing our auto-scaler from increasing the instance count. We could use another stat to autoscale, but that would not make Envoy fully utilize its allocated resources.
  2. Envoy tends to completely collapse rather than slow down processing time. The driving force behind this seems to be the immediate closing of new upstream connections but we are unsure why this happens. Using a circuit breaker to prevent excessive upstream connections simply exacerbates the problem since requests are still pending.

If we can figure out why upstream connections are immediately closed in these situations, we may be able to completely prevent Envoy from collapsing.

Below is a CPU flamegraph of our Envoy instance during busy time. We have access logs enabled but so does our Nginx setup. envoy_perf.svg
