Validator panicked handling RPC call with commitment single #11078
Comments
CC: #10910 (comment)
Thanks, @svenski123! @carllin and I were talking about this set of issues last night, and will be making improvements very soon.
hehe yeah, what convenient timing, here's a link to the discussion: https://discord.com/channels/428295358100013066/478692221441409024/732812908555272212
(copied from @carllin on Discord)
After voting on slot 4531, the largest confirmed root was 4488, but when we switched forks to vote on slot 4545, the largest confirmed root backtracked to 4487 (prob b/c it had fewer votes on that fork). I think here we should always take max(new_largest_confirmed_root, current_largest_confirmed_root)
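A minimal sketch of the max() suggestion above, in Rust. `RootTracker` and `update` are hypothetical names used only for illustration; they are not taken from the validator's commitment cache code, which may track the root differently.

```rust
/// Hypothetical stand-in for whatever structure tracks the largest confirmed root.
#[derive(Debug, Default)]
struct RootTracker {
    largest_confirmed_root: u64,
}

impl RootTracker {
    /// Record a newly computed root, but never let the stored value move backwards.
    /// A fork switch that yields a smaller root leaves the stored root unchanged.
    fn update(&mut self, new_largest_confirmed_root: u64) -> u64 {
        self.largest_confirmed_root = self
            .largest_confirmed_root
            .max(new_largest_confirmed_root);
        self.largest_confirmed_root
    }
}
```

With the slots from the comment above, calling `update(4488)` and then `update(4487)` would leave the stored root at 4488 instead of backtracking.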
Problem
[Found in TdS Stage 6]
solana catchup --commitment single --follow crashed the TdS validator with: thread 'http.worker110' panicked at 'called Option::unwrap() on a None value', core/src/rpc.rs:117:20.
The validator was under extreme memory pressure at the time, as evidenced by the pidstat records below (local time is UTC+2).
For commitment levels single, single gossip and max, JsonRpcRequestProcessor::bank() obtains a slot number from the block commitment cache and then looks up the corresponding bank in bank forks. The bank is presumed to exist and the lookup result is unwrapped; in this case the bank did not exist, so the unwrap panicked.
Tail end of validator log file including trace back is attached to this issue.
Proposed Solution
JsonRpcRequestProcessor::bank() should return an Err result when the bank cannot be looked up (possibly logging a warning if this situation should not occur in practice).
Review the locking order of bank forks and the commitment cache; perhaps locking the cache first can mitigate the issue. If required, consider releasing/yielding and reacquiring the locks a limited number of times if an inconsistency is observed between bank forks and the commitment cache (a short delay in processing an RPC request would likely not be noticed and may be preferable to returning a failure response).
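The two proposals above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual validator code: `Bank`, `BankForks`, `CommitmentCache`, `BankLookupError` and `bank_for_commitment` are simplified, hypothetical stand-ins, and the real types in core/src/rpc.rs have different fields and signatures.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

type Slot = u64;

// Hypothetical, heavily simplified stand-ins for the real validator types.
#[derive(Debug)]
struct Bank {
    slot: Slot,
}

#[derive(Default)]
struct BankForks {
    banks: HashMap<Slot, Arc<Bank>>,
}

#[derive(Default)]
struct CommitmentCache {
    slot: Slot, // slot the cache reports for the requested commitment level
}

#[derive(Debug, PartialEq)]
enum BankLookupError {
    BankNotFound { slot: Slot },
}

/// Look up the bank for the slot reported by the commitment cache.
/// Returns Err instead of panicking when bank forks has no such bank,
/// retrying up to `max_attempts` times in case the two structures are
/// briefly out of sync. With `max_attempts == 0` no lookup is attempted.
fn bank_for_commitment(
    cache: &RwLock<CommitmentCache>,
    forks: &RwLock<BankForks>,
    max_attempts: usize,
) -> Result<Arc<Bank>, BankLookupError> {
    let mut last_slot = 0;
    for attempt in 0..max_attempts {
        // Lock the cache first and hold it while reading bank forks.
        let cache_guard = cache.read().expect("commitment cache lock poisoned");
        last_slot = cache_guard.slot;
        if let Some(bank) = forks
            .read()
            .expect("bank forks lock poisoned")
            .banks
            .get(&last_slot)
        {
            return Ok(Arc::clone(bank));
        }
        // Release the cache lock before waiting so writers can make progress.
        drop(cache_guard);
        if attempt + 1 < max_attempts {
            thread::sleep(Duration::from_millis(1));
        }
    }
    Err(BankLookupError::BankNotFound { slot: last_slot })
}
```

An RPC handler built on this shape could map `BankNotFound` to a JSON-RPC error response (and log a warning) rather than taking down the worker thread. The retry count and the 1 ms delay are arbitrary placeholders.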
20200715-163404-solana-validator-9QxC-sgkv.log