Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: statesync is failing without any notable errors #23740

Closed
1 task done
beer-1 opened this issue Feb 22, 2025 · 7 comments
Closed
1 task done

[Bug]: statesync is failing without any notable errors #23740

beer-1 opened this issue Feb 22, 2025 · 7 comments
Labels

Comments

@beer-1
Copy link
Contributor

beer-1 commented Feb 22, 2025

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

I'm running quite empty chain on eks, and trying to initiate the new rpc nodes using statesync.
but sometimes (in a high posibility) it is failing to sync without errors even the statesync was successful.

Interest thing is same statesync is successful in other replica.
In failed node, when I try the below commands multiple times (sometimes failing), it makes syncing works again.
In succeed replica, when I try the below commands, it makes the succeed replica to be broken with same reason.

mv <apphome>/data/snapshots <apphome>/
<appd> comet unsafe-reset-all --home <apphome>
mv <apphome>/snapshots <apphome>/data/
<appd> snapshots restore 10000 3 --home <apphome>
<appd> comet bootstrap-state --home <apphome>
<appd> start --home <apphome>

It means installed snapshots are not problem, but something related applying snapshot problem.

12:31PM DBG Received chunk, adding to sync chunk=0 format=3 height=10000 module=statesync peer=9b421d23dcbf1d99d6633533c3d2ca91dde61bf2
12:31PM DBG Added chunk to queue chunk=0 format=3 height=10000 module=statesync
12:31PM DBG restoring snapshot module=server store=acc
12:31PM DBG restoring snapshot module=server store=auction
12:31PM DBG restoring snapshot module=server store=authz
12:31PM DBG restoring snapshot module=server store=bank
12:31PM DBG restoring snapshot module=server store=capability
12:31PM DBG restoring snapshot module=server store=consensus
12:31PM DBG restoring snapshot module=server store=crisis
12:31PM DBG restoring snapshot module=server store=feegrant
12:31PM DBG restoring snapshot module=server store=feeibc
12:31PM DBG restoring snapshot module=server store=forwarding
12:31PM DBG restoring snapshot module=server store=group
12:31PM DBG restoring snapshot module=server store=hooks-for-ibc
12:31PM DBG restoring snapshot module=server store=ibc
12:31PM DBG restoring snapshot module=server store=icacontroller
12:31PM DBG restoring snapshot module=server store=icahost
12:31PM DBG restoring snapshot module=server store=intertx
12:31PM DBG restoring snapshot module=server store=marketmap
12:31PM DBG restoring snapshot module=server store=move
12:31PM DBG restoring snapshot module=server store=nonfungibletokentransfer
12:31PM DBG restoring snapshot module=server store=opchild
12:31PM DBG restoring snapshot module=server store=oracle
12:31PM DBG restoring snapshot module=server store=packetfowardmiddleware
12:31PM DBG restoring snapshot module=server store=ratelimit
12:31PM DBG restoring snapshot module=server store=transfer
12:31PM DBG restoring snapshot module=server store=upgrade
12:31PM DBG flushing metadata height=10000 module=server
12:31PM DBG flushing metadata finished height=10000 module=server
12:31PM DBG loadVersion module=server ver=10000
12:31PM DBG loadVersion commitID hash=762ae8c1057afb65240c284e0bc4cea621c16deaa7e00219db63cf572855f2db key="KVStoreKey{0xc000e9fa30, opchild}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=d9ab5e9706a026afbc675184d13a7dc8771e2eee845c4fdda949874d04ddaa0e key="KVStoreKey{0xc000e9f7d0, upgrade}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=c267f3f2877a9d8af8c361fc205a472f1092150c6f0960badd14f8fdb77311e4 key="KVStoreKey{0xc000e9f810, authz}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=fe806d9534ebb35cf1f312d14a9ec8a7230c3f467eebade6931d31fe05ded857 key="KVStoreKey{0xc000e9fae0, marketmap}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=3736e8b69c8bca1ad696ad886c7d25f62c04d2bc4d561b3aa5629d71a1890d44 key="KVStoreKey{0xc000e9f7f0, nonfungibletokentransfer}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=03ca229e72c26eacf7faa2d05cd7dd480779422aaf7128366ca8804f9216cb58 key="KVStoreKey{0xc000e9f9f0, move}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=0acb35b08c0f45c310bc37bb744c21424a6c861626501ba74ef78f632de3131e key="KVStoreKey{0xc000e9fa70, oracle}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=cc15b6a536190f329553ad05ee9c89992facb8233eeffef7c799f3a155ff6c89 key="KVStoreKey{0xc000e9f7a0, consensus}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=1e692d53c9ac749af7f33723b86aa78791cb0d735c1a2518129b0ccfb66afbdf key="KVStoreKey{0xc000e9fa80, hooks-for-ibc}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 key="KVStoreKey{0xc000e9f9a0, intertx}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=e6399686154575bfc1838d859d1dd80d68be24b6bf5d5e5d03b1a98d4f58b3de key="KVStoreKey{0xc000e9f840, icacontroller}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=495abe922a8a4fcf67cbf5b439f9d6fe1f05d7a21bc89cfec10942997f2b1f96 key="KVStoreKey{0xc000e9f9b0, feeibc}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=27aca110413ed399d36444cc85b266e7b412fe02190b4e3555c6b19776d31f6f key="KVStoreKey{0xc000e9f830, icahost}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=38eb3cafc1b78ea8ea0179067f2a52c69bcb8c0520bbc587fbf809891966d680 key="KVStoreKey{0xc000e9f7b0, crisis}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=358685dca2c45eb4cb1fed55299dc50803961b9c98bcd77cea3b5a53b26e0b5e key="KVStoreKey{0xc000e9f7c0, ibc}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=1084b878a864df017791545fd35c3a5eb9a61eab68acf01da679b6c223c34938 key="KVStoreKey{0xc000e9fa40, auction}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=a89113c07ae262e84fa49e6d9111bf7c638679604357fa388526a976cfd021e0 key="KVStoreKey{0xc000e9f790, group}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash= key="TransientStoreKey{0xc000e9fb00, transient_forwarding}" module=server ver=10000
12:31PM DBG loadVersion commitID hash=90fdae9ef2124b88045deea846624eed6bc9e5e5422fada847b882fb66d81f17 key="KVStoreKey{0xc000e9f7e0, transfer}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=b5d6c121d07f8eb85084a00f5966a308aa770a75e164c15bae057ae881c1add5 key="KVStoreKey{0xc000e9f780, bank}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=efa7c53f1fcd1e787bf9d36e1a8ac4d668f9c71e514a4182cae006426f6d241b key="KVStoreKey{0xc000e9f820, feegrant}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=5920ca6558ccae0cc6b7f1c13c367f1fbef2942422f3bbdabe755f91e0e28b64 key="KVStoreKey{0xc000e9fac0, forwarding}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash= key="MemoryStoreKey{0xc000e9fb30, memory:capability}" module=server ver=10000
12:31PM DBG loadVersion commitID hash=28d956f6fe740f24c1b23a9e7ae9200ae73c3b9e0e426be3d6812e96dd35f91b key="KVStoreKey{0xc000e9f800, capability}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 key="KVStoreKey{0xc000e9fa60, ratelimit}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=60acf91f99ee84940cecbd0a391eecf809dbc2536689159bc8af13c69abf35ff key="KVStoreKey{0xc000e9f770, acc}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM DBG loadVersion commitID hash=a4802765c29252d63291835953abd4cf8eb0ef060a08fe015fed79845d18d522 key="KVStoreKey{0xc000e9fa50, packetfowardmiddleware}" module=server ver=10000
12:31PM DBG Finished loading IAVL tree module=server
12:31PM INF Applied snapshot chunk to ABCI app chunk=0 format=3 height=10000 module=statesync total=1
12:31PM INF Verified ABCI app appHash=FC133010B329762E53880FF7346A35231ABE06ADD4E0C0648C864B63360AC59C height=10000 module=statesync
12:31PM INF Snapshot restored format=3 hash=4A8490004ED938DEE3B3BCEADF436D13FE48E1F176C164F5E33BCEF9C800DD6D height=10000 module=statesync
12:31PM INF service start impl=BlockPool module=blocksync msg="Starting BlockPool service"
12:31PM DBG Consensus ticker inbound=0 lastHeight=10000 module=blocksync numPending=0 outbound=2 total=0
...
12:31PM ERR error in proxyAppConn.FinalizeBlock err="collections: not found: key 'no_key' of type github.com/cosmos/gogoproto/opinit.opchild.v1.Params" module=state
panic: Failed to process committed block (10001:43260E6C6B0CFCBE1683953E19FDA24DD9FA8F994B186F8CFA1CB96E6E5D5B63): collections: not found: key 'no_key' of type github.com/cosmos/gogoproto/opinit.opchild.v1.Params

Cosmos SDK Version

0.50

IAVL Version (using replace)

v1.2.4

How to reproduce?

Use [email protected]

[statesync]
# State sync rapidly bootstraps a new node by discovering, fetching, and restoring a state machine
# snapshot from peers instead of fetching and replaying historical blocks. Requires some peers in
# the network to take and serve state machine snapshots. State sync is not attempted if the node
# has any local state (LastBlockHeight > 0). The node will have a truncated block history,
# starting from the height of the snapshot.
enable = true

# RPC servers (comma-separated) for light client verification of the synced state machine and
# retrieval of state data for node bootstrapping. Also needs a trusted height and corresponding
# header hash obtained from a trusted source, and a period during which validators can be trusted.
#
# For Cosmos SDK-based chains, trust_period should usually be about 2/3 of the unbonding time (~2
# weeks) during which they can be financially punished (slashed) for misbehavior.
rpc_servers = "http://35.186.158.173:26657,http://35.198.199.232:26657"
trust_height = 10000
trust_hash = "20AB02A4DFB9F16E7A9D7E39B706F613FA49993025742CB2F5506FAD8CCA1ACF"
trust_period = "168h0m0s"

Try multiple times will lead you met this error.

@beer-1
Copy link
Contributor Author

beer-1 commented Feb 23, 2025

valid data has all data

$ iaviewer data ./application.db "s/k:opchild/"
Got version: 10000
Printing all keys with hashed values (to detect diff)
  11
    45409A4B7A5B60E9582472D7769946C180740A9D5DC7C5F52921884A7E0C67AC
  12
    CD2662154E6D76B2B2B92E70C0CAC3CCF534F9B74EB5B89819EC509083D00A50
  13
    A6CBBC2BC6224EFF02FE0D71EA8484A44EBC879823081DA85278B1188210DC0E
  14
    CD2662154E6D76B2B2B92E70C0CAC3CCF534F9B74EB5B89819EC509083D00A50
  218000000000000001
    1298A7E687BE20D63B7081F28E7DE8B3FFBF7CC2872C8017301D6398C4F26261
  218000000000000002
    506336618FCC1A0020760E56AF128E3C3AFF680C2CFBAC019ABBFE470E8CE091
  218000000000000003
...
    D0FA9A9366A124821E399998A91759C7E136E5A5BCEB6D89DF5215EAA33451BE
  31EA902A3049E97C6704D1EE92663CB270EA57133C
    52CC849FFFCB8AEAF2E8AFF5A4A3CA37052E4F90002902F20788C9F7E1B0E202
  32EA902A3049E97C6704D1EE92663CB270EA57133C
    B1202C98FF7904896DBBBB771D2AA919AA5FC5C23AD5FD4A52B0019393A1D8A2
  33EA902A3049E97C6704D1EE92663CB270EA57133C
    FBBDE3B3D477DE1904E00C7C2B168E2F5CCEA33478F9D0605252A4673EAFC7D9
Hash: 762AE8C1057AFB65240C284E0BC4CEA621C16DEAA7E00219DB63CF572855F2DB
Size: 2717

but wrong data has only Hash and Size without data entries

$ iaviewer data ./application.db "s/k:opchild/"
Got version: 10000
Printing all keys with hashed values (to detect diff)
Hash: 762AE8C1057AFB65240C284E0BC4CEA621C16DEAA7E00219DB63CF572855F2DB
Size: 2717

also checked iaviewer shape and both have same shape.

@yihuang
Copy link
Collaborator

yihuang commented Feb 23, 2025

I suspect that db write is not persisted somehow when program quit.
I've met before with pebbledb that if db.Close is not called before quit, it could lose some writes, not sure if it's case here though.

@beer-1
Copy link
Contributor Author

beer-1 commented Feb 23, 2025

This problem is sometimes still there, even I try to recover from the snapshot data with <appd> snapshots restore.
and this command was finished without error, so it is normal quit than unexpected exit.

btw I also suspect some timing issue anyway at snapshot recover.

@yihuang
Copy link
Collaborator

yihuang commented Feb 23, 2025

This problem is sometimes still there, even I try to recover from the snapshot data with <appd> snapshots restore. and this command was finished without error, so it is normal quit than unexpected exit.

btw I also suspect some timing issue anyway at snapshot recover.

try to make sure that the db.Close() is called before program exit.

@beer-1
Copy link
Contributor Author

beer-1 commented Feb 23, 2025

after seeing your comment, now I'm suspecting this part.

https://github.com/cosmos/iavl/blob/0bce70d6323dea724c8dc2e76d0a2a02567515d6/import.go#L95-L98

seems it is waiting only Write() not Close().

@beer-1
Copy link
Contributor Author

beer-1 commented Feb 23, 2025

Hey @yihuang you are right, importer was not waiting the batch close.

This commit fixed the issue, I will create iavl issue.

@beer-1
Copy link
Contributor Author

beer-1 commented Mar 9, 2025

resolved with iavl pr merge.

@beer-1 beer-1 closed this as completed Mar 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
No open projects
Status: 📋 Backlog
Development

No branches or pull requests

3 participants