Increased latency in eth calls #12281

Open
stefan-stojanovic-s opened this issue Oct 11, 2024 · 27 comments
@stefan-stojanovic-s

System information

Erigon version: 2.60.8

OS & Version: Alpine Linux v3.20 in K8s

Erigon Command (with flags/config):

erigon
--chain=bor-mainnet
--datadir=/home/erigon/persistence/data
--http.addr=0.0.0.0
--rpc.accessList=/home/erigon/acl-config/acl.json
--rpc.batch.limit=10000
--rpc.txfeecap=100
--http.api=eth,erigon,web3,net,debug,txpool,trace
--http.vhosts=*
--http.corsdomain=null
--http
--ws
--db.pagesize=16384
--ethash.dagdir=/home/erigon/persistence/dag
--maxpeers=100
--private.api.addr=0.0.0.0:9090
--private.api.ratelimit=640000
--rpc.returndata.limit=500000
--metrics
--metrics.addr=0.0.0.0
--healthcheck
--port=30303
--db.size.limit=16TB
--torrent.download.rate=512mb
--torrent.download.slots=5
--bor.milestone=true
--staticpeers
--http.timeouts.read=300s
--state.cache=8192
--pprof
--pprof.addr="0.0.0.0"
--pprof.port=6061
--database.verbosity=4
--http.trace
--verbosity=4
--log.dir.verbosity=4
--bor.heimdall=<url>

Chain/Network: Ethereum / Polygon / BSC

Expected behaviour

No spikes in eth calls

Actual behaviour

For a while now we’ve been witnessing random latency spikes in calls on:

  • eth_getBlockByNumber
  • eth_getBlockReceipts
  • eth_blockNumber
  • eth_getStorageAt
  • eth_call
  • eth_getCode
  • eth_getBalance
  • eth_getTransactionReceipt

We are trying to understand why these calls take up to 20 seconds.

eth_blockNumber usually returns in less than a second.

This happens on Ethereum, Polygon, and BSC mainnets, and it happens on archive-mode runs.

The graphs below are from an archive Erigon node running Polygon over the last 28 days.

[graph: eth_getBlockByNumber latency]

[graph: eth_call latency]

[graph: eth_blockNumber latency]

Our initial guess was that this happens because of high traffic, but that doesn't seem to be the case.
We've had one peak in this range, but that's it.

[graph: traffic]

We can confirm network and I/O are not bottlenecks in this case, as we've thoroughly checked, and other non-Erigon nodes are operational.

Steps to reproduce the behaviour

To reproduce this, one can set up https://github.com/louislam/uptime-kuma, create a monitor, and point it at one of the nodes, POSTing eth_blockNumber periodically. There should be random spikes on the graph from time to time.
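If uptime-kuma is not handy, the same probe can be scripted. A minimal stdlib-Python sketch (the endpoint URL is a placeholder for one of your nodes):

```python
import json
import time
import urllib.request

def probe_block_number(url, timeout=30.0):
    """POST one eth_blockNumber call and return (latency_seconds, block_number)."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    ).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    latency = time.monotonic() - start
    # The node returns the block number as a hex string, e.g. "0x3bead45".
    return latency, int(body["result"], 16)

# Example: poll every 5 seconds and flag slow responses (placeholder endpoint).
# while True:
#     lat, block = probe_block_number("http://127.0.0.1:8545")
#     print(f"block {block} in {lat:.3f}s" + ("  <-- SPIKE" if lat > 1 else ""))
#     time.sleep(5)
```

Logging timestamps on the client side makes it easier to correlate a spike with the `Served ... t=...` entries in the Erigon logs.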

Backtrace

Example for eth_blockNumber when a spike happens in the logs.
From our monitoring, the request was initiated at 14:15:27, but as you can see from the logs it was answered at 14:15:35.

[WARN] [10-09|14:15:20.647] [rpc] served                             conn=10.42.12.108:54318 method=eth_getBlockByNumber reqid=safe t=72.98µs err="Unknown block"
[INFO] [10-09|14:15:20.647] Served                                   conn=10.42.12.108:54318 t=136.71µs method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|14:15:20.647] Served                                   conn=10.42.12.108:54318 t=151.45µs method=eth_getBlockByNumber reqid=latest params="[\"latest\",false]"
[INFO] [10-09|14:15:20.647] Served                                   conn=10.42.12.108:54318 t=264.19µs method=eth_getBlockByNumber reqid=finalized params="[\"finalized\",false]"
[INFO] [10-09|14:15:20.841] Served                                   conn=10.42.12.247:56092 t=50.086419ms method=trace_replayBlockTransactions reqid=1728483320660 params="[\"0x3BEAD45\",[\"trace\"]]"
[INFO] [10-09|14:15:21.279] Served                                   conn=10.42.21.225:57404 t=128.86µs method=eth_getCode reqid=4 params="[\"0xd3082872F8B06073A021b4602e022d5A070d7cfC\",\"latest\"]"
[INFO] [10-09|14:15:26.952] Served                                   conn=[::1]:44863 t=48.18µs method=net_version reqid=67 params=[]
[INFO] [10-09|14:15:27.015] Served                                   conn=[::1]:43791 t=37.88µs method=net_version reqid=67 params=[]
[DBUG] [10-09|14:15:31.449] [bor.heimdall] Got new milestone         start=62827807 end=62827828
[INFO] [10-09|14:15:35.469] Served                                   conn=10.42.23.156:44640 t=5.41151045s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|14:15:35.469] Served                                   conn=10.42.24.232:45902 t=9.714510998s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|14:15:35.469] Served                                   conn=10.42.23.156:46988 t=13.782515824s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|14:15:35.469] Served                                   conn=10.42.26.189:45034 t=4.040338563s method=eth_getBlockByNumber reqid=1 params="[\"latest\",false]"
[INFO] [10-09|14:15:35.469] Served                                   conn=10.42.27.86:41460 t=2.945821208s method=eth_blockNumber reqid=1 params=[]
[WARN] [10-09|14:15:35.469] [rpc] served                             conn=10.42.21.225:47324 method=trace_callMany reqid=1 t=6.805521594s err=stopped
[INFO] [10-09|14:15:35.469] Served                                   conn=10.42.27.86:41454 t=4.235557869s method=eth_blockNumber reqid=1 params=[]
[WARN] [10-09|14:15:35.469] [rpc] served                             conn=10.42.21.225:57392 method=trace_callMany reqid=1 t=479.222164ms err=stopped

Another occurrence

Initiated at :27, answered at :34.

[DBUG] [10-09|11:12:19.449] [bor.heimdall] Got new milestone         start=62822670 end=62822690
[WARN] [10-09|11:12:20.204] [rpc] served                             conn=10.42.21.225:60922 method=eth_getBlockByNumber reqid=safe t=68.21µs err="Unknown block"
[INFO] [10-09|11:12:20.204] Served                                   conn=10.42.21.225:60922 t=117.21µs method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|11:12:20.205] Served                                   conn=10.42.21.225:60922 t=521.749µs method=eth_getBlockByNumber reqid=finalized params="[\"finalized\",false]"
[INFO] [10-09|11:12:20.205] Served                                   conn=10.42.21.225:60922 t=1.229418ms method=eth_getBlockByNumber reqid=latest params="[\"latest\",false]"
[WARN] [10-09|11:12:20.291] [rpc] served                             conn=10.42.12.108:54318 method=eth_getBlockByNumber reqid=safe t=48.77µs err="Unknown block"
[INFO] [10-09|11:12:20.291] Served                                   conn=10.42.12.108:54318 t=86.75µs method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|11:12:20.291] Served                                   conn=10.42.12.108:54318 t=135.049µs method=eth_getBlockByNumber reqid=latest params="[\"latest\",false]"
[INFO] [10-09|11:12:20.292] Served                                   conn=10.42.12.108:54318 t=537.749µs method=eth_getBlockByNumber reqid=finalized params="[\"finalized\",false]"
[INFO] [10-09|11:12:21.791] Served                                   conn=10.42.21.225:54890 t=1.507088ms method=eth_getBlockByNumber reqid=2646 params="[\"0x3be993c\",true]"
[INFO] [10-09|11:12:26.914] Served                                   conn=[::1]:41687 t=41.68µs method=net_version reqid=67 params=[]
[INFO] [10-09|11:12:26.978] Served                                   conn=[::1]:37025 t=40.17µs method=net_version reqid=67 params=[]
[DBUG] [10-09|11:12:31.450] [bor.heimdall] Got new milestone         start=62822670 end=62822690
[WARN] [10-09|11:12:34.795] [rpc] served                             conn=10.42.21.225:60922 method=eth_getBlockByNumber reqid=safe t=9.58801108s err="Unknown block"
[INFO] [10-09|11:12:34.795] Served                                   conn=10.42.21.225:60922 t=9.58806791s method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|11:12:34.795] Served                                   conn=10.42.24.237:40436 t=9.453629739s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|11:12:34.795] Served                                   conn=10.42.24.232:56466 t=3.607566971s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|11:12:34.796] Served                                   conn=[::1]:35809 t=7.813439057s method=eth_getBlockByNumber reqid=1 params="[\"latest\", false]"
[INFO] [10-09|11:12:34.795] Served                                   conn=10.42.21.225:54254 t=2.443219605s method=eth_blockNumber reqid=1 params=[]
@AskAlexSharov
Collaborator

try --db.read.concurrency=9000

@stefan-stojanovic-s
Author

We've increased it before to --db.read.concurrency=16384, but removed the flag as there were no significant results from it.
For reference, I will set it to 9000 and post the results here in 1-2 days.

@AskAlexSharov
Collaborator

Then add --pprof:

go tool pprof -png "http://127.0.0.1:6060/debug/pprof/profile?seconds=60" > cpu.png

go tool pprof -alloc_objects -png "http://127.0.0.1:6060/debug/pprof/heap?seconds=20" > mem.png

@stefan-stojanovic-s
Author

FYI: still observing spikes with --db.read.concurrency=9000.

Here is the pprof output:

[image: CPU profile]

[image: memory profile]

@AskAlexSharov
Collaborator

AskAlexSharov commented Oct 15, 2024

I see. It's probably happening when a new file gets created: in that case the RPCDaemon currently re-opens files, and that's not really necessary. We are also working on it in #12299, and heap profiling was somewhat optimized in #11710.
So we will fix it in Erigon3 first, then decide whether it's easy to backport to Erigon2 or not.

@stefan-stojanovic-s
Author

Good to know it's been spotted.
Would you suggest any workarounds in the meantime until this gets resolved?
Can we downgrade the nodes to a version before "ReOpen" was introduced?

@AskAlexSharov
Collaborator

I think it's very old behavior. Do you see it after a recent upgrade?

@AskAlexSharov
Collaborator

@JkLondon hi.

  • Can you try to port your KnownCfg optimisation to Erigon2?

@stefan-stojanovic-s
Author

We did a node upgrade to 2.60.6 on the 28th of August, so it doesn't look like it's related to bumping the version.

I can't say whether it happens on versions before 2.60, but it has been happening for at least 3 months. It seems to have gotten a bit more spiky in the second half of September. Not sure what these large spikes were, to be honest, but even if we ignore them it averages 7-10 seconds per eth_blockNumber call.


However, I've noticed in one of the Discord threads that someone was complaining in May about a lot of latency in calls after bumping from 2.58.2 to 2.60.0, and that he had to roll back to resolve it; hence my question about downgrading.


@AskAlexSharov
Collaborator

Thank you for finding it. It seems that support request was never solved. We will work on it.

@AskAlexSharov AskAlexSharov added the imp1 High importance label Oct 16, 2024
@JkLondon
Member

@JkLondon hi.

  • Can you try to port your KnownCfg optimisation to Erigon2?

#12351 implements this optimisation.

@AskAlexSharov
Collaborator

I will work on removing the need to re-open files (on E3 first, then will port to E2): #12332

@avinashbo

@AskAlexSharov Can we expect a release with the fix anytime soon?

@VBulikov VBulikov added this to the 2.60.9-fixes milestone Oct 25, 2024
@AskAlexSharov
Collaborator

2.60.9 release happened

@stefan-stojanovic-s
Author

Will test whether the issue persists and get back here.

@VBulikov VBulikov modified the milestones: 2.60.9-fixes, 2.60.10-fixes Oct 30, 2024
@stefan-stojanovic-s
Author

Sorry guys, this is taking longer than anticipated to test, since the underlying base image also changed and we have to incorporate that. Will have some info on this by EOW.

@VBulikov VBulikov modified the milestones: 2.60.10-fixes, 2.60.11-fixes Nov 6, 2024
@AskAlexSharov
Collaborator

Thank you

@stefan-stojanovic-s
Author

Hey guys

We've basically been testing whether 2.60.10 reduces the latency across regions, and here is what we found out and what we are battling currently:

  • One set of our Ethereum nodes sees the spikes in eth_ calls reduced. These nodes have things in common: RPS usually averages under 100, and their chaindata is ~2.3TB on average.

  • The other set of Ethereum nodes doesn't see any difference in spikes for eth_ calls. What these nodes have in common is that RPS averages higher than 100, and their chaindata is 5.3TB+.

  • None of the Polygon nodes sees a difference in latency; be it higher or lower RPS, it stays the same.

Honestly, this points to blockchain size being connected to the issue, as Polygon also has big ledgers on our end (~9.9TB).

We were thinking of resyncing the nodes to bring the latency down, but honestly it's not a small amount of work, and we want to see whether that would be the correct step forward.

Would appreciate ideas or suggestions.

...and sorry for the holdup.

@AskAlexSharov
Collaborator

  1. You can copy the 2.3TB chaindata to the nodes which have 5.3TB chaindata; it's easier than resyncing.
  2. I see you have --rpc.batch.limit=10000. Maybe you can try increasing/decreasing --rpc.batch.concurrency?
  3. Are your metrics produced by Erigon, or do you measure them outside of Erigon?
  4. You can try running a separate rpcdaemon (on the same datadir): https://github.com/erigontech/erigon/tree/main/cmd/rpcdaemon#getting-started - it will reduce the impact of other components (txpool, chain-tip sync, etc.) on RPC (it will have its own limits, etc.).
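For point 2, it helps to compare batched and unbatched calls: a JSON-RPC batch is a single HTTP POST whose body is an array of call objects. A minimal stdlib-Python sketch (the endpoint is a placeholder for one of your nodes):

```python
import json
import urllib.request

def batch_call(url, calls, timeout=30.0):
    """Send several JSON-RPC calls in one HTTP POST (a batch).

    `calls` is a list of (method, params) tuples; returns a dict
    mapping request id -> result (None if that call errored).
    """
    payload = json.dumps(
        [
            {"jsonrpc": "2.0", "method": method, "params": params, "id": i}
            for i, (method, params) in enumerate(calls)
        ]
    ).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return {item["id"]: item.get("result") for item in json.load(resp)}

# Example (placeholder endpoint):
# results = batch_call("http://127.0.0.1:8545", [
#     ("eth_blockNumber", []),
#     ("eth_getBlockByNumber", ["latest", False]),
# ])
```

Timing such a batch against the same methods sent as individual requests shows whether batch handling (and thus --rpc.batch.concurrency) is the limiting factor.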

@stefan-stojanovic-s
Author

Thanks for the suggestions and ideas, Alex.

Dedicating time to the first point currently. We have no Polygon nodes with "normal" ledger sizes; all have grown to ~9TB of data. For Ethereum, I'll definitely copy over.

Metrics are produced by a proxy service. They just measure the time it took to get an answer from Erigon. The proxies are deployed in the same network as the Erigon nodes, so network latency is minimal.

For --rpc.batch.concurrency:
We have a metric that lets me filter by batch requests, i.e. check whether a request was in a batch or not. The response time is the same even if the request is not in a batch.

For rpcdaemon:
Will take a look.

@stefan-stojanovic-s
Author

We've noticed improvements when replacing the Ethereum ledgers; still not at its best compared to geth, but it's a lot better.

Wondering what causes these bloated ledgers? Is it debug calls? Does this mean that each time we get a ledger bigger than the average size, we have to download a new snapshot version or replace it with a non-bloated one to improve latency?

@AskAlexSharov
Collaborator

Maybe 2 reasons:

  1. A long read transaction (debug_call is not long enough) doesn't allow free space to be re-used; for example a tx leak: Txpool: db unlimited grow fix #8596
  2. Commits without fsync: we also faced this in TxPool (can't find the PR now), but not in chaindata.

To catch slow RPC calls: --rpc.slow=10m
To catch slow/leaked read-only transactions: SLOW_TX=10m

"Download a new snapshot" - the only way to compact the db is to copy the data to a new db, for example by: integration mdbx_to_mdbx

Also can:

  • --sync.loop.prune.limit=10

Can also try Erigon3 - chaindata size there is 20gb. https://github.com/erigontech/erigon?tab=readme-ov-file#erigon3-changes-from-erigon2

@yperbasis yperbasis modified the milestones: 2.61.0-fixes, 2.61.1-fixes Dec 17, 2024
@stefan-stojanovic-s
Author

Thanks for the insights; I will get back to the team with this and discuss.

Small update: fresh Polygon DBs are not helping reduce the latency.

@stefan-stojanovic-s
Author

Hey @AskAlexSharov, a quick question: is Erigon3 production-ready?

@AskAlexSharov
Collaborator

  • Erigon3 is more stable for Polygon chains than Erigon2, because it's much easier for the dev team to have/debug/fix BorMainnet problems on Erigon3 (much easier to get to chain-tip or re-sync).
  • There is a higher chance of getting performance-issue support on Erigon3.
  • It supports all RPC methods. And P2P too.
  • It may greatly increase/reduce the latency of Receipt/Log-related RPC methods - depends on your use-case.
  • Several node providers have confirmed that Erigon3 works for them.
  • We also have several validators on Erigon3.
  • We are on the way to a beta release - for us this means "releases will become more backward compatible; right now some releases may advise/require a re-sync". But re-sync is fast if you have enough network bandwidth.
  • We enabled --internalcl by default. We have also worked on externalcl support, but it may still have some issues: POS header downloader of externalcl is broken #13321

Differences: https://github.com/erigontech/erigon?tab=readme-ov-file#erigon3-changes-from-erigon2
Book: https://docs.erigon.tech/

@tornadocontrib
Contributor

@AskAlexSharov #13320 should also be addressed and would be an easy fix; is there any progress on fixing the issue?

@yperbasis
Member

Let's make sure that the latency is good at least in E3.

@AskAlexSharov AskAlexSharov modified the milestones: 3.0.0-beta2, 3.0.0 Feb 3, 2025
@yperbasis yperbasis modified the milestones: 3.0.0, 3.1.0 Feb 12, 2025
7 participants