Poor RPC performance #5821
Comments
The issue name is misleading. We need to compile the wasm code before we can call into it. Not sure what is expected to happen here. You can run with …
I'd expect less variance in response times, even in the presence of WASM code compilation on every request. Anyways, I'll try adjusting the …
This is expected. If an RPC call requires a call into the runtime then we need to have that runtime compiled and ready for instantiation (otherwise we can't call into it). So we need to compile it, and that takes time.

One improvement we could make here is to enable the executor's on-disk cache, which would allow wasmtime to skip compiling a given runtime, without having to increase the in-memory runtime cache size, as long as it has previously compiled that runtime at least once before. (So the RPC call would still take a long time the very first time you make it, but any subsequent call for that runtime, even if the node is restarted, should be relatively fast, since it could then reload the compiled runtime from disk.) We have the code to do that, but IIRC it is currently hardcoded to essentially always be disabled.
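For illustration, a minimal sketch of what enabling wasmtime's on-disk compilation cache looks like at the wasmtime API level. This is not the Substrate executor's actual wiring, just the underlying mechanism; using the default cache configuration location is an assumption:

```rust
use wasmtime::{Config, Engine};

fn engine_with_disk_cache() -> anyhow::Result<Engine> {
    let mut config = Config::new();
    // Load wasmtime's cache configuration from its default location
    // (requires the "cache" feature of the wasmtime crate). With this
    // enabled, a module compiled once is reloaded from the on-disk
    // cache on later compilations instead of being recompiled.
    config.cache_config_load_default()?;
    Engine::new(&config)
}
```

With something like this enabled, the first call for a given runtime still pays the full compilation cost, but later calls, even across node restarts, could reload the compiled artifact from disk, which matches the behaviour described above.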
I'm running a Westend RPC node that exhibits poor performance serving RPC calls (both WS and HTTP).
RPC call execution time varies greatly depending on whether it is the first or a subsequent call. The most notable example is the `state_getMetadata` method, which can take anywhere from 8 ms to 18 s. I recorded two profiles using `perf` during the execution of the `state_getMetadata` method and rendered flamegraphs:

Slow (first) RPC call

Source: https://transfer.sh/e52S3b/perf-slow.svg
Fast (subsequent) RPC call

Source: https://transfer.sh/RQXIUs/perf-fast.svg
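To reproduce the timing difference, here is a minimal sketch that times repeated `state_getMetadata` calls against the node's HTTP RPC endpoint (the endpoint URL/port and the use of the `ureq` crate are assumptions; any JSON-RPC client would do):

```rust
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Standard JSON-RPC 2.0 request for state_getMetadata.
    let request = r#"{"jsonrpc":"2.0","id":1,"method":"state_getMetadata","params":[]}"#;

    for i in 0..5 {
        let start = Instant::now();
        let response = ureq::post("http://127.0.0.1:9933") // assumed local HTTP RPC port
            .set("Content-Type", "application/json")
            .send_string(request)?;
        let _body = response.into_string()?;
        // On a freshly started node the first call triggers runtime
        // compilation and is much slower than the subsequent ones.
        println!("call {}: {:?}", i, start.elapsed());
    }
    Ok(())
}
```

With a freshly started node, the first iteration corresponds to the "slow" profile above and the later iterations to the "fast" one.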
On the first flamegraph, most of the CPU time was spent compiling the WASM code and only 2.39% actually executing it. During the WASM compilation I observe high CPU utilization (100% per vCPU, depending on the number of in-flight requests).
I plotted a graph of the median distribution of RPC call execution times over the past two days (y-axis in microseconds).

The `state_getMetadata` method alone is executed 1.23 times per second on average. Each time a method execution falls into the "slow RPC call" bucket and requires WASM compilation, the node uses 100% of a vCPU. Since other methods are being called as well, this significantly increases resource usage and impacts end-user experience. It also opens up the possibility of a DoS by exhausting all available resources.

The above behavior is observed both on a GCP VM instance and inside a container running in a K8s cluster.
System info 1 (VM):
Debian 10
Linux kernel: 4.19.0-18-cloud-amd64
Polkadot binary v0.9.26
CLI parameters:
polkadot --detailed-log-output --name westend-rpc-1 --unsafe-ws-external --rpc-methods Safe --rpc-cors * --chain westend --listen-addr=/ip4/0.0.0.0/tcp/30333 --public-addr=/ip4/<redacted>/tcp/30333 --in-peers 25 --out-peers 25 --db-cache 512 --telemetry-url wss://<redacted> --pruning=archive -lsync=trace,rpc_metrics=debug --prometheus-external --prometheus-port 9615 --ws-port 9944 --ws-max-connections 5000 --rpc-port 9933
System info 2 (container):
Google Kubernetes Engine: v1.22.10-gke.600
Linux kernel 5.10.109+
Polkadot image v0.9.26
CLI parameters:
polkadot --name=westend-rpc-node-0 --base-path=/data/ --chain=westend --database=rocksdb --pruning=archive --prometheus-external --prometheus-port 9615 --unsafe-rpc-external --unsafe-ws-external --rpc-methods=safe --rpc-cors=all --ws-max-connections 5000 --telemetry-url wss://<redacted> --db-cache 512 --in-peers 25 --out-peers 25 -lrpc_metrics=debug --listen-addr=/ip4/0.0.0.0/tcp/30333