Increased latency in eth calls #12281

Open
stefan-stojanovic-s opened this issue Oct 11, 2024 · 27 comments
@stefan-stojanovic-s

System information

Erigon version: 2.60.8

OS & Version: Alpine Linux v3.20 in K8s

Erigon Command (with flags/config):

erigon
--chain=bor-mainnet
--datadir=/home/erigon/persistence/data
--http.addr=0.0.0.0
--rpc.accessList=/home/erigon/acl-config/acl.json
--rpc.batch.limit=10000
--rpc.txfeecap=100
--http.api=eth,erigon,web3,net,debug,txpool,trace
--http.vhosts=*
--http.corsdomain=null
--http
--ws
--db.pagesize=16384
--ethash.dagdir=/home/erigon/persistence/dag
--maxpeers=100
--private.api.addr=0.0.0.0:9090
--private.api.ratelimit=640000
--rpc.returndata.limit=500000
--metrics
--metrics.addr=0.0.0.0
--healthcheck
--port=30303
--db.size.limit=16TB
--torrent.download.rate=512mb
--torrent.download.slots=5
--bor.milestone=true
--staticpeers
--http.timeouts.read=300s
--state.cache=8192
--pprof
--pprof.addr="0.0.0.0"
--pprof.port=6061
--database.verbosity=4
--http.trace
--verbosity=4
--log.dir.verbosity=4
--bor.heimdall=<url>

Chain/Network: Ethereum / Polygon / BSC

Expected behaviour

No spikes in eth calls

Actual behaviour

For a while now we’ve been witnessing random latency spikes in calls on:

  • eth_getBlockByNumber
  • eth_getBlockReceipts
  • eth_blockNumber
  • eth_getStorageAt
  • eth_call
  • eth_getCode
  • eth_getBalance
  • eth_getTransactionReceipt

We are trying to understand why these calls take up to 20 seconds.

eth_blockNumber usually returns in less than a second.

This happens on Ethereum, Polygon, and BSC mainnets, and it happens on archive-mode runs.

The graphs below are from an archive Erigon node running Polygon over the last 28 days.

[graph: eth_getBlockByNumber latency]

[graph: eth_call latency]

[graph: eth_blockNumber latency]

Our initial guess was that this happens because of high traffic, but that doesn't seem to be the case.
We've had one peak in this range, but that's it.

[graph: traffic]

We can confirm network and I/O are not bottlenecks in this case, as we've thoroughly checked, and other non-Erigon nodes are operational.

Steps to reproduce the behaviour

To reproduce this, one can set up https://github.com/louislam/uptime-kuma, create a monitor, and point it at one of the nodes, POSTing eth_blockNumber periodically. There should be random spikes on the graph from time to time.
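If uptime-kuma is not handy, the same probe can be scripted. A minimal stdlib-Python sketch (the endpoint URL is a placeholder for one of your nodes):

```python
import json
import time
import urllib.request

def probe_block_number(url, timeout=30.0):
    """POST one eth_blockNumber call and return (latency_seconds, block_number)."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    ).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    latency = time.monotonic() - start
    # The node returns the block number as a hex string, e.g. "0x3bead45".
    return latency, int(body["result"], 16)

# Example: poll every 5 seconds and flag slow responses (placeholder endpoint).
# while True:
#     lat, block = probe_block_number("http://127.0.0.1:8545")
#     print(f"block {block} in {lat:.3f}s" + ("  <-- SPIKE" if lat > 1 else ""))
#     time.sleep(5)
```

Logging timestamps on the client side makes it easier to correlate a spike with the `Served ... t=...` entries in the Erigon logs.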

Backtrace

Example for eth_blockNumber when a spike happens in the logs.
From our monitoring, the request was initiated at 14:15:27, but as you can see from the logs it was answered at 14:15:35.

[WARN] [10-09|14:15:20.647] [rpc] served                             conn=10.42.12.108:54318 method=eth_getBlockByNumber reqid=safe t=72.98µs err="Unknown block"
[INFO] [10-09|14:15:20.647] Served                                   conn=10.42.12.108:54318 t=136.71µs method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|14:15:20.647] Served                                   conn=10.42.12.108:54318 t=151.45µs method=eth_getBlockByNumber reqid=latest params="[\"latest\",false]"
[INFO] [10-09|14:15:20.647] Served                                   conn=10.42.12.108:54318 t=264.19µs method=eth_getBlockByNumber reqid=finalized params="[\"finalized\",false]"
[INFO] [10-09|14:15:20.841] Served                                   conn=10.42.12.247:56092 t=50.086419ms method=trace_replayBlockTransactions reqid=1728483320660 params="[\"0x3BEAD45\",[\"trace\"]]"
[INFO] [10-09|14:15:21.279] Served                                   conn=10.42.21.225:57404 t=128.86µs method=eth_getCode reqid=4 params="[\"0xd3082872F8B06073A021b4602e022d5A070d7cfC\",\"latest\"]"
[INFO] [10-09|14:15:26.952] Served                                   conn=[::1]:44863 t=48.18µs method=net_version reqid=67 params=[]
[INFO] [10-09|14:15:27.015] Served                                   conn=[::1]:43791 t=37.88µs method=net_version reqid=67 params=[]
[DBUG] [10-09|14:15:31.449] [bor.heimdall] Got new milestone         start=62827807 end=62827828
[INFO] [10-09|14:15:35.469] Served                                   conn=10.42.23.156:44640 t=5.41151045s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|14:15:35.469] Served                                   conn=10.42.24.232:45902 t=9.714510998s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|14:15:35.469] Served                                   conn=10.42.23.156:46988 t=13.782515824s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|14:15:35.469] Served                                   conn=10.42.26.189:45034 t=4.040338563s method=eth_getBlockByNumber reqid=1 params="[\"latest\",false]"
[INFO] [10-09|14:15:35.469] Served                                   conn=10.42.27.86:41460 t=2.945821208s method=eth_blockNumber reqid=1 params=[]
[WARN] [10-09|14:15:35.469] [rpc] served                             conn=10.42.21.225:47324 method=trace_callMany reqid=1 t=6.805521594s err=stopped
[INFO] [10-09|14:15:35.469] Served                                   conn=10.42.27.86:41454 t=4.235557869s method=eth_blockNumber reqid=1 params=[]
[WARN] [10-09|14:15:35.469] [rpc] served                             conn=10.42.21.225:57392 method=trace_callMany reqid=1 t=479.222164ms err=stopped

Another occurrence

Initiated at :27, answered at :34.

[DBUG] [10-09|11:12:19.449] [bor.heimdall] Got new milestone         start=62822670 end=62822690
[WARN] [10-09|11:12:20.204] [rpc] served                             conn=10.42.21.225:60922 method=eth_getBlockByNumber reqid=safe t=68.21µs err="Unknown block"
[INFO] [10-09|11:12:20.204] Served                                   conn=10.42.21.225:60922 t=117.21µs method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|11:12:20.205] Served                                   conn=10.42.21.225:60922 t=521.749µs method=eth_getBlockByNumber reqid=finalized params="[\"finalized\",false]"
[INFO] [10-09|11:12:20.205] Served                                   conn=10.42.21.225:60922 t=1.229418ms method=eth_getBlockByNumber reqid=latest params="[\"latest\",false]"
[WARN] [10-09|11:12:20.291] [rpc] served                             conn=10.42.12.108:54318 method=eth_getBlockByNumber reqid=safe t=48.77µs err="Unknown block"
[INFO] [10-09|11:12:20.291] Served                                   conn=10.42.12.108:54318 t=86.75µs method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|11:12:20.291] Served                                   conn=10.42.12.108:54318 t=135.049µs method=eth_getBlockByNumber reqid=latest params="[\"latest\",false]"
[INFO] [10-09|11:12:20.292] Served                                   conn=10.42.12.108:54318 t=537.749µs method=eth_getBlockByNumber reqid=finalized params="[\"finalized\",false]"
[INFO] [10-09|11:12:21.791] Served                                   conn=10.42.21.225:54890 t=1.507088ms method=eth_getBlockByNumber reqid=2646 params="[\"0x3be993c\",true]"
[INFO] [10-09|11:12:26.914] Served                                   conn=[::1]:41687 t=41.68µs method=net_version reqid=67 params=[]
[INFO] [10-09|11:12:26.978] Served                                   conn=[::1]:37025 t=40.17µs method=net_version reqid=67 params=[]
[DBUG] [10-09|11:12:31.450] [bor.heimdall] Got new milestone         start=62822670 end=62822690
[WARN] [10-09|11:12:34.795] [rpc] served                             conn=10.42.21.225:60922 method=eth_getBlockByNumber reqid=safe t=9.58801108s err="Unknown block"
[INFO] [10-09|11:12:34.795] Served                                   conn=10.42.21.225:60922 t=9.58806791s method=eth_getBlockByNumber reqid=safe params="[\"safe\",false]"
[INFO] [10-09|11:12:34.795] Served                                   conn=10.42.24.237:40436 t=9.453629739s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|11:12:34.795] Served                                   conn=10.42.24.232:56466 t=3.607566971s method=eth_blockNumber reqid=1 params=[]
[INFO] [10-09|11:12:34.796] Served                                   conn=[::1]:35809 t=7.813439057s method=eth_getBlockByNumber reqid=1 params="[\"latest\", false]"
[INFO] [10-09|11:12:34.795] Served                                   conn=10.42.21.225:54254 t=2.443219605s method=eth_blockNumber reqid=1 params=[]
@AskAlexSharov
Collaborator

try --db.read.concurrency=9000

@stefan-stojanovic-s
Author

We've increased it before to --db.read.concurrency=16384, but removed the flag as there were no significant results from it.
For reference, I will set it to 9000 and post the results here in 1-2 days.

@AskAlexSharov
Collaborator

Then add --pprof:

go tool pprof -png "http://127.0.0.1:6060/debug/pprof/profile?seconds=60" > cpu.png

go tool pprof -alloc_objects -png "http://127.0.0.1:6060/debug/pprof/heap?seconds=20" > mem.png

@stefan-stojanovic-s
Author

FYI: still observing spikes with --db.read.concurrency=9000.

Here is the pprof output:

[image: CPU profile]

[image: memory profile]

@AskAlexSharov
Collaborator

AskAlexSharov commented Oct 15, 2024

I see. It's probably happening when a new file gets created: in that case the RPCDaemon currently re-opens files, and that's not really necessary. We are also working on it in #12299, and heap profiling was somewhat optimized in #11710.
So we will fix it in Erigon3 first, then decide whether it's easy to backport to Erigon2 or not.

@stefan-stojanovic-s
Author

Good to know it's been spotted.
Would you suggest any workarounds in the meantime until this gets resolved?
Can we downgrade the nodes to a version before "ReOpen" was introduced?

@AskAlexSharov
Collaborator

I think it's very old behavior. Do you see it after a recent upgrade?

@AskAlexSharov
Collaborator

@JkLondon hi.

  • Can you try to port your KnownCfg optimisation to Erigon2?

@stefan-stojanovic-s
Author

We did a node upgrade to 2.60.6 on the 28th of August, so it doesn't look like it's related to bumping the version.

I can't say whether it happens on versions before 2.60, but it has been happening for at least 3 months. It seems to have gotten a bit more spiky in the second half of September. Not sure what these large spikes were, to be honest, but even if we ignore them it averages 7-10 seconds per eth_blockNumber call.


However, I've noticed in one of the Discord threads that someone was complaining in May about a lot of latency in calls after bumping from 2.58.2 to 2.60.0, and that he had to roll back to resolve it; hence my question about downgrading.


@AskAlexSharov
Collaborator

Thank you for finding it. It seems that support request was never solved. We will work on it.

@AskAlexSharov AskAlexSharov added the imp1 High importance label Oct 16, 2024
@JkLondon
Member

@JkLondon hi.

  • Can you try to port your KnownCfg optimisation to Erigon2?

#12351 implements this optimisation.

@AskAlexSharov
Collaborator

I will work on removing the need to re-open files (on E3 first, then will port to E2): #12332

@avinashbo

@AskAlexSharov Can we expect a release with the fix anytime soon?

@VBulikov VBulikov added this to the 2.60.9-fixes milestone Oct 25, 2024
@AskAlexSharov
Collaborator

2.60.9 release happened

@stefan-stojanovic-s
Author

Will test whether the issue persists and get back here.

@VBulikov VBulikov modified the milestones: 2.60.9-fixes, 2.60.10-fixes Oct 30, 2024
@stefan-stojanovic-s
Author

Sorry guys, this is taking longer than anticipated to test, since the underlying base image also changed and we have to incorporate that. Will have some info on this by EOW.

@VBulikov VBulikov modified the milestones: 2.60.10-fixes, 2.60.11-fixes Nov 6, 2024
@AskAlexSharov
Collaborator

Thank you

@stefan-stojanovic-s
Author

Hey guys

We've basically been testing whether 2.60.10 reduces the latency across regions, and here is what we found out and what we are battling currently:

  • One set of our Ethereum nodes sees the spikes in eth_ calls reduced. These nodes have things in common: RPS usually averages under 100, and their chaindata is ~2.3TB on average.

  • The other set of Ethereum nodes doesn't see any difference in spikes for eth_ calls. What these nodes have in common is that RPS averages higher than 100, and their chaindata is 5.3TB+.

  • None of the Polygon nodes sees a difference in latency; be it higher or lower RPS, it stays the same.

Honestly, this points to blockchain size being connected to the issue, as Polygon also has big ledgers on our end (~9.9TB).

We were thinking of resyncing the nodes to bring the latency down, but honestly it's not a small amount of work, and we want to see whether that would be the correct step forward.

Would appreciate ideas or suggestions.

...and sorry for the holdup.

@AskAlexSharov
Collaborator

  1. You can copy the 2.3TB chaindata to the nodes which have 5.3TB chaindata; it's easier than resyncing.
  2. I see you have --rpc.batch.limit=10000. Maybe you can try increasing/decreasing --rpc.batch.concurrency?
  3. Are your metrics produced by Erigon, or do you measure them outside of Erigon?
  4. You can try running a separate rpcdaemon (on the same datadir): https://github.com/erigontech/erigon/tree/main/cmd/rpcdaemon#getting-started - it will reduce the impact of other components (txpool, chain-tip sync, etc.) on RPC (it will have its own limits, etc.).
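For point 2, it helps to compare batched and unbatched calls: a JSON-RPC batch is a single HTTP POST whose body is an array of call objects. A minimal stdlib-Python sketch (the endpoint is a placeholder for one of your nodes):

```python
import json
import urllib.request

def batch_call(url, calls, timeout=30.0):
    """Send several JSON-RPC calls in one HTTP POST (a batch).

    `calls` is a list of (method, params) tuples; returns a dict
    mapping request id -> result (None if that call errored).
    """
    payload = json.dumps(
        [
            {"jsonrpc": "2.0", "method": method, "params": params, "id": i}
            for i, (method, params) in enumerate(calls)
        ]
    ).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return {item["id"]: item.get("result") for item in json.load(resp)}

# Example (placeholder endpoint):
# results = batch_call("http://127.0.0.1:8545", [
#     ("eth_blockNumber", []),
#     ("eth_getBlockByNumber", ["latest", False]),
# ])
```

Timing such a batch against the same methods sent as individual requests shows whether batch handling (and thus --rpc.batch.concurrency) is the limiting factor.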

@stefan-stojanovic-s
Author

Thanks for the suggestions and ideas, Alex.

Dedicating time to the first point currently. We have no Polygon nodes with "normal" ledger sizes; all have grown to ~9TB of data. For Ethereum, I'll definitely copy over.

Metrics are produced by a proxy service. They just measure the time it took to get an answer from Erigon. The proxies are deployed in the same network as the Erigon nodes, so network latency is minimal.

For --rpc.batch.concurrency:
We have a metric that lets me filter by batch requests, i.e. check whether a request was in a batch or not. The response time is the same even if the request is not in a batch.

For rpcdaemon:
Will take a look.

@stefan-stojanovic-s
Author

We've noticed improvements when replacing the Ethereum ledgers; still not at its best compared to geth, but it's a lot better.

Wondering what causes these bloated ledgers? Is it debug calls? Does this mean that each time we get a ledger bigger than the average size, we have to download a new snapshot version or replace it with a non-bloated one to improve latency?

@AskAlexSharov
Collaborator

Maybe 2 reasons:

  1. A long read transaction (debug_call is not long enough) doesn't allow free space to be re-used; for example a tx leak: Txpool: db unlimited grow fix #8596
  2. Commits without fsync: we also faced this in TxPool (can't find the PR now), but not in chaindata.

To catch slow RPC calls: --rpc.slow=10m
To catch slow/leaked read-only transactions: SLOW_TX=10m

"Download a new snapshot" - the only way to compact the db is to copy the data to a new db, for example by: integration mdbx_to_mdbx

Also can:

  • --sync.loop.prune.limit=10

Can also try Erigon3 - chaindata size there is 20gb. https://github.com/erigontech/erigon?tab=readme-ov-file#erigon3-changes-from-erigon2

@yperbasis yperbasis modified the milestones: 2.61.0-fixes, 2.61.1-fixes Dec 17, 2024
@stefan-stojanovic-s
Author

Thanks for the insights; I will get back to the team with this and discuss.

Small update: fresh Polygon DBs are not helping reduce the latency.

@stefan-stojanovic-s
Author

Hey @AskAlexSharov, a quick question: is Erigon3 production-ready?

@AskAlexSharov
Collaborator

  • Erigon3 is more stable for Polygon chains than Erigon2, because it's much easier for the dev team to have/debug/fix BorMainnet problems on Erigon3 (much easier to get to chain-tip or re-sync).
  • There is a higher chance of getting performance-issue support on Erigon3.
  • It supports all RPC methods. And P2P too.
  • It may greatly increase/reduce the latency of Receipt/Log-related RPC methods - depends on your use-case.
  • Several node providers have confirmed that Erigon3 works for them.
  • We also have several validators on Erigon3.
  • We are on the way to a beta release - for us this means "releases will become more backward compatible; right now some releases may advise/require a re-sync". But re-sync is fast if you have enough network bandwidth.
  • We enabled --internalcl by default. We have also worked on externalcl support, but it may still have some issues: POS header downloader of externalcl is broken #13321

Differences: https://github.com/erigontech/erigon?tab=readme-ov-file#erigon3-changes-from-erigon2
Book: https://docs.erigon.tech/

@tornadocontrib
Contributor

@AskAlexSharov #13320 should also be addressed and would be an easy fix; is there any progress on fixing the issue?

@yperbasis
Member

Let's make sure that the latency is good at least in E3.

@AskAlexSharov AskAlexSharov modified the milestones: 3.0.0-beta2, 3.0.0 Feb 3, 2025
@yperbasis yperbasis modified the milestones: 3.0.0, 3.1.0 Feb 12, 2025
7 participants