
Validator panicked handling RPC call with commitment single #11078

Closed
svenski123 opened this issue Jul 15, 2020 · 5 comments

@svenski123
Contributor

Problem

[Found in TdS Stage 6]

Running solana catchup --commitment single --follow crashed a TdS validator: thread 'http.worker110' panicked at 'called Option::unwrap() on a None value', core/src/rpc.rs:117:20.

The validator was under extreme memory pressure at the time, as evidenced by the pidstat records below (local time is UTC+2).

For commitment levels single, singleGossip, and max, JsonRpcRequestProcessor::bank() obtains a slot number from the block commitment cache and then looks up the corresponding bank in bank forks. That bank is presumed to exist; in this case it did not.
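
A minimal, self-contained Rust sketch of that failure mode (toy stand-ins, not the actual BankForks or BlockCommitmentCache types):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Toy model of the lookup: the commitment cache reports a slot, and
// the bank is unwrapped out of bank forks, which panics if set_root
// already pruned that slot.
fn main() {
    let bank_forks: Arc<RwLock<HashMap<u64, &str>>> =
        Arc::new(RwLock::new(HashMap::from([(4488u64, "bank 4488")])));

    // Slot reported by the block commitment cache; bank_forks no
    // longer contains it because that fork was pruned.
    let slot_from_commitment_cache: u64 = 4487;

    // Shape of the failing call: panics with
    // "called `Option::unwrap()` on a `None` value".
    let bank = *bank_forks
        .read()
        .unwrap()
        .get(&slot_from_commitment_cache)
        .unwrap();
    println!("{bank}");
}
```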

The tail end of the validator log file, including the backtrace, is attached to this issue.

# Time        UID       PID    %usr %system  %guest   %wait    %CPU   CPU  minflt/s  majflt/s     VSZ     RSS   %MEM StkSize  StkRef   kB_rd/s   kB_wr/s kB_ccwr/s iodelay   cswch/s nvcswch/s threads   fd-nr prio policy  Command
06:57:18 PM  1001   4123831  444.50  138.50    0.00    0.00  583.00     6   1769.50      2.50 340883720 58663536  89.39     256      44   1414.00   2520.00      0.00       0      0.00      0.00     261     345    0 NORMAL  solana-validato
06:57:20 PM  1001   4123831  409.00  117.50    0.00    0.00  526.50     6   2031.50      3.00 340883720 58672724  89.40     256      44  17698.00   4210.00      0.00       0      0.00      0.00     261     350    0 NORMAL  solana-validato
06:57:22 PM  1001   4123831  262.50  103.50    0.00    0.00  366.00     6   4122.00     10.50 340900408 58695724  89.44     256      44     -1.00     -1.00     -1.00       0      0.00      0.00     263      -1    0 NORMAL  solana-validato

Proposed Solution

JsonRpcRequestProcessor::bank() should return an Err result when the bank cannot be looked up (possibly logging a warning, since this situation should not occur in practice).
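
A sketch of that change on the same toy model (bank_for_slot is a hypothetical name; the real method in core/src/rpc.rs returns a bank handle and its error type may differ):

```rust
use std::collections::HashMap;

// Hypothetical sketch: look the bank up fallibly and surface an RPC
// error instead of panicking the http worker thread.
fn bank_for_slot(bank_forks: &HashMap<u64, String>, slot: u64) -> Result<String, String> {
    bank_forks.get(&slot).cloned().ok_or_else(|| {
        // A warn!-level log could go here, since this should be rare.
        format!("Bank {} for RPC does not exist", slot)
    })
}

fn main() {
    let forks = HashMap::from([(4488u64, "bank 4488".to_string())]);
    match bank_for_slot(&forks, 4487) {
        Ok(bank) => println!("{bank}"),
        Err(e) => eprintln!("RPC error instead of panic: {e}"),
    }
}
```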

Review the locking order of bank forks and the commitment cache; perhaps locking the cache first can mitigate the issue. If required, consider releasing/yielding and reacquiring the locks a limited number of times when inconsistency is observed between bank forks and the commitment cache (a short delay in processing an RPC request would likely go unnoticed and may be preferable to returning a failure response).
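
A sketch of the bounded retry idea (lookup_with_retries is a hypothetical helper; the 50 ms delay and five-attempt budget are arbitrary):

```rust
use std::thread::sleep;
use std::time::Duration;

// Retry a fallible lookup a bounded number of times, yielding briefly
// between attempts, before giving up with an error.
fn lookup_with_retries<T>(
    mut try_lookup: impl FnMut() -> Option<T>,
    max_retries: u32,
) -> Result<T, String> {
    for attempt in 0..=max_retries {
        if let Some(found) = try_lookup() {
            return Ok(found);
        }
        if attempt < max_retries {
            // Short delay: likely unnoticed by the RPC caller and
            // preferable to an immediate failure response.
            sleep(Duration::from_millis(50));
        }
    }
    Err("bank not found after retries".to_string())
}

fn main() {
    // Simulate a lookup that only succeeds on the third attempt.
    let mut calls = 0;
    let result = lookup_with_retries(
        || {
            calls += 1;
            if calls >= 3 { Some("bank 4488") } else { None }
        },
        5,
    );
    println!("{result:?}"); // Ok("bank 4488")
}
```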

20200715-163404-solana-validator-9QxC-sgkv.log

@CriesofCarrots
Contributor

CC: #10910 (comment)

@CriesofCarrots
Contributor

Thanks, @svenski123! @carllin and I were talking about this set of issues last night, and will be making improvements very soon.

@carllin
Contributor

carllin commented Jul 15, 2020

hehe yeah, what convenient timing, here's a link to the discussion: https://discord.com/channels/428295358100013066/478692221441409024/732812908555272212

@carllin carllin added this to the v1.3.0 milestone Jul 21, 2020
@CriesofCarrots
Contributor

CriesofCarrots commented Jul 23, 2020

(copied from @carllin on Discord)

  1. On startup from a snapshot bank, let's say 200, the highest_confirmed_root and highest_confirmed_slot will both be something < 200, so that bank won't exist. Ideally in this case I would think anybody asking for MAX confirmation would get a None result (kind of a hassle b/c currently it's assumed JsonRpcRequestProcessor::bank() will return something). I also thought about just defaulting to the snapshot slot, but this may not be safe since at boot, you have not actually verified a supermajority has rooted the snapshot yet; you just got it from some (hopefully) trusted validator.

  2. The highest_confirmed_slot bug described above ^. I think here we should change the highest_confirmed_slot logic to find the highest such slot that descends from max(highest_confirmed_root, our_current_root). That way the highest_confirmed_slot won't be purged when set_root is called in the example above ^.

  3. Finally the crash on testnet:

[2020-07-15T03:52:13.800246694Z INFO  solana_core::replay_stage] voting: 4531 8136539729935272006
[2020-07-15T03:52:13.837111218Z INFO  solana_runtime::bank_forks] prune non root: 4493 largest_root: Some(4488)
[2020-07-15T03:52:18.480013402Z INFO  solana_core::replay_stage] voting: 4545 9205379094681472998
thread 'http.worker40' panicked at 'Bank 4487 for RPC does not exist', core/src/rpc.rs:147:13
stack backtrace:
[2020-07-15T03:52:18.507592096Z TRACE jsonrpc_core::io] Request: {"id":1,"jsonrpc":"2.0","method":"getFees","params":[{"commitment":"max"}]}.

After voting on slot 4531, the largest confirmed root was 4488, but when we switched forks to vote on slot 4545, the largest confirmed root backtracked to 4487 (probably because it had fewer votes on that fork). I think here we should always take max(new_largest_confirmed_root, current_largest_confirmed_root).
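
A minimal sketch of that monotonicity rule, with a hypothetical toy cache type (the real commitment-cache API may differ):

```rust
// Never let the largest confirmed root move backwards on a fork
// switch; always keep the max of the new and current values.
struct CommitmentCache {
    largest_confirmed_root: u64,
}

impl CommitmentCache {
    fn update_largest_confirmed_root(&mut self, new_root: u64) {
        self.largest_confirmed_root = self.largest_confirmed_root.max(new_root);
    }
}

fn main() {
    let mut cache = CommitmentCache { largest_confirmed_root: 4488 };
    // The fork switch in the log above reported a smaller root, 4487:
    cache.update_largest_confirmed_root(4487);
    assert_eq!(cache.largest_confirmed_root, 4488); // no backtracking
    println!("largest_confirmed_root = {}", cache.largest_confirmed_root);
}
```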

@CriesofCarrots
Contributor

Closed by #11390 in v1.2 and #11188 in v1.2+
