This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Calling api.rpc.chain.getFinalizedHead freezes whole WebSocket JSONRPC session #11239

Open
2 tasks done
krhougs opened this issue Apr 19, 2022 · 11 comments
Labels
J2-unconfirmed Issue might be valid, but it’s not yet known.

Comments

@krhougs

krhougs commented Apr 19, 2022

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

In some cases, calling api.rpc.chain.getFinalizedHead from Polkadot.js against the parachain fails with nothing but a timeout and no other error message. After this, the whole JSON-RPC WebSocket session is frozen, and we see health warnings indicating that the keepalive requests get no response.

It only happens on the Khala parachain; the same RPC method works just fine against a Kusama node. The issue has been reproduced by the miner community.
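
For reference, the failing call itself is trivial; a minimal sketch (the endpoint URL is a placeholder, in practice we connect to our own Khala parachain node):

```ts
import { ApiPromise, WsProvider } from '@polkadot/api';

async function main(): Promise<void> {
  // Placeholder endpoint; replace with the actual Khala node address.
  const provider = new WsProvider('ws://127.0.0.1:9944');
  const api = await ApiPromise.create({ provider });

  // On affected nodes this promise never resolves: after 30s polkadot-js logs
  // "No response received from RPC endpoint in 30s", and every later call on
  // the same connection (including the system_health keepalive) also gets no answer.
  const finalizedHash = await api.rpc.chain.getFinalizedHead();
  console.log('Finalized head:', finalizedHash.toHex());

  await api.disconnect();
}

main().catch(console.error);
```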

Related:

Steps to reproduce

No response

@github-actions github-actions bot added the J2-unconfirmed Issue might be valid, but it’s not yet known. label Apr 19, 2022
@krhougs
Author

krhougs commented Apr 19, 2022

Logs on the Polkadot.js client:

2022-04-19 08:24:24        RPC-CORE: getFinalizedHead(): BlockHash:: No response received from RPC endpoint in 30s
{"name":"prb","hostname":"prb-data-provider-2-0","pid":46,"level":50,"msg":"Failed to update target: Error: No response received from RPC endpoint in 30s\n    at WsProvider.value (/opt/app/node_modules/@polkadot/rpc-provider/ws/index.cjs:382:32)\n    at Timeout._onTimeout (/opt/app/node_modules/@polkadot/rpc-provider/ws/index.cjs:428:90)\n    at listOnTimeout (node:internal/timers:559:17)\n    at processTimers (node:internal/timers:502:7)","time":"2022-04-19T08:24:24.632Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12318998,"msg":"Fetched parent block.","time":"2022-04-19T08:24:24.862Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12318999,"msg":"Fetched parent block.","time":"2022-04-19T08:24:24.984Z","v":0}
2022-04-19 08:24:29        RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
2022-04-19 08:24:29        RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
2022-04-19 08:24:29        RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12319000,"msg":"Fetched parent block.","time":"2022-04-19T08:24:31.136Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 399, sentPerSecAvg: 79.8 }","time":"2022-04-19T08:24:34.304Z","v":0}
2022-04-19 08:24:37        RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 772, sentPerSecAvg: 154.4 }","time":"2022-04-19T08:24:39.304Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12319001,"msg":"Fetched parent block.","time":"2022-04-19T08:24:40.248Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 975, sentPerSecAvg: 195 }","time":"2022-04-19T08:24:44.305Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 837, sentPerSecAvg: 167.4 }","time":"2022-04-19T08:24:49.305Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12319002,"msg":"Fetched parent block.","time":"2022-04-19T08:24:49.366Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12319003,"msg":"Fetched parent block.","time":"2022-04-19T08:24:52.476Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 825, sentPerSecAvg: 165 }","time":"2022-04-19T08:24:54.304Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 907, sentPerSecAvg: 181.4 }","time":"2022-04-19T08:24:59.305Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":52,"level":20,"parentBlockNumber":12319004,"msg":"Fetched parent block.","time":"2022-04-19T08:25:01.615Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 839, sentPerSecAvg: 167.8 }","time":"2022-04-19T08:25:04.304Z","v":0}
{"name":"prb","hostname":"prb-data-provider-2-0","pid":13,"level":30,"msg":"Stats: ::ffff:10.42.2.63:9586 { sentPerFiveSec: 636, sentPerSecAvg: 127.2 }","time":"2022-04-19T08:25:09.305Z","v":0}
2022-04-19 08:25:09        RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
2022-04-19 08:25:09        RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s
2022-04-19 08:25:09        RPC-CORE: health(): Health:: No response received from RPC endpoint in 30s

@bkchr
Member

bkchr commented Apr 19, 2022

Do you have any more information about when this happens? Is there something happening on the parachain at that moment? Can you reproduce this locally? Can you always reproduce it at the same block?

@krhougs
Author

krhougs commented Apr 19, 2022

Background

To mine on Khala/Phala, there is a tool that synchronizes block data into the TEE enclave and verifies it there. In its synchronization loop, this tool queries api.rpc.chain.getFinalizedHead to get the current finalized height, roughly as in the sketch below.
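
An illustrative sketch of the shape of that loop (not the real prb code; `feedBlockToEnclave` is a made-up stand-in for the TEE verification step):

```ts
import { ApiPromise } from '@polkadot/api';

async function syncLoop(
  api: ApiPromise,
  feedBlockToEnclave: (block: unknown) => Promise<void>
): Promise<void> {
  let processed = 0;
  for (;;) {
    // The call that freezes the session on affected nodes.
    const finalizedHash = await api.rpc.chain.getFinalizedHead();
    const finalizedHeader = await api.rpc.chain.getHeader(finalizedHash);
    const finalizedNumber = finalizedHeader.number.toNumber();

    // Catch up to the finalized height, block by block.
    while (processed < finalizedNumber) {
      processed += 1;
      const hash = await api.rpc.chain.getBlockHash(processed);
      const block = await api.rpc.chain.getBlock(hash);
      await feedBlockToEnclave(block);
    }

    // Wait roughly one block time before polling again.
    await new Promise((resolve) => setTimeout(resolve, 6_000));
  }
}
```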

The issue

This issue was found by our mining community. Several miners are facing it, but unfortunately we cannot reproduce it in our development environment.

In this specific case (Phala-Network/runtime-bridge#83), the user started the tool when both the relay chain and the parachain were already synced to the network height. Its JSON-RPC session froze while querying getFinalizedHead at 2022-04-19T08:23:29.530Z, as it moved on to block 1503872 of the Khala chain.

Since I am not familiar with Substrate and Rust, I can only guess that querying getFinalizedHead while the parachain node is finalizing a newly arrived block causes a deadlock. Correct me if I'm wrong :)

As a workaround I can set up a watchdog on the JSON-RPC session's keepalive failures (sketched below), but this does affect the reliability of the mining toolchain and, with it, the miners' earnings.
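
Something along these lines (a sketch of the workaround idea, not the actual prb implementation):

```ts
import { ApiPromise } from '@polkadot/api';

// Reject a pending RPC promise after `ms` so a frozen call cannot hang forever.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`RPC call timed out after ${ms}ms`)), ms)
    ),
  ]);
}

// Watchdog: probe the session with system_health; if it stops answering,
// drop the frozen connection so the caller can build a fresh ApiPromise.
async function keepaliveWatchdog(api: ApiPromise, intervalMs = 10_000): Promise<void> {
  for (;;) {
    try {
      await withTimeout(api.rpc.system.health(), 30_000);
    } catch (err) {
      console.error('Keepalive failed, dropping frozen session:', err);
      await api.disconnect();
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```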

@niklasad1
Member

Hmm, so you don't have any logs from the node that you connected to?

Did you try to drop the WebSocket connection and start a new one?

My hunch is that you are running into #10672

@krhougs
Author

krhougs commented Apr 24, 2022

For this case we used RUST_LOG="trace,soketto=warn,jsonrpsee_ws_client=warn,remote-ext=warn,trie=warn,wasmtime_cranelift=warn", but found nothing valuable in the output.

Full log: https://github.com/Phala-Network/runtime-bridge/files/8511920/NODE.%2B.data_provider.zip

@niklasad1
Member

Hey again,

Thanks for the logs, but they only contain node logs from 2022-04-19 08:26 onwards, while the actual getFinalizedHead call was made around 2022-04-19 08:24. Could you provide the full logs from the node, please?

It should be sufficient to run the RPC node with RUST_LOG="rpc=debug" to begin with.

@bkchr
Member

bkchr commented Apr 25, 2022

@niklasad1 something I thought about: why isn't this "ping-pong" mechanism that polkadot-js uses baked into the protocol?

@niklasad1
Member

niklasad1 commented Apr 25, 2022

something I thought about: why isn't this "ping-pong" mechanism that polkadot-js uses baked into the protocol?

Good question, it should be baked into the WebSocket protocol. However, when I checked polkadot-js, it doesn't really use WebSocket ping/pong; instead, the warnings are printed when a call takes longer than some pre-defined timeout. But I could be wrong.

soketto supports it and ws-rs "should too"

@bkchr
Member

bkchr commented Apr 25, 2022

@jacogr why not use the "ping-pong" protocol of websockets?

@jacogr
Contributor

jacogr commented Apr 26, 2022

I honestly cannot recall why system_health is specifically used to keep the connection open in a ping/pong fashion; this is all from early 2018. All I know is that the server drops the connection if we don't do it, i.e. if the server doesn't receive anything from the client within a 60-second window it assumes the client is dead and drops the connection... which is not that useful if you have subscriptions open.

(The timeout on specific calls is something completely different: it works on a per-call basis and assumes something went wrong if a response to a query is not received within the timeout window.)
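
In code, the behaviour described above amounts to something like this sketch (illustrative only, not the actual @polkadot/api implementation): periodic system_health traffic keeps the server's 60-second idle timeout from firing, while each individual RPC call separately gets its own response timeout.

```ts
import { ApiPromise } from '@polkadot/api';

// Application-level keepalive: send system_health periodically so the server
// sees traffic and does not drop an otherwise idle connection.
function startKeepalive(api: ApiPromise, everyMs = 10_000): ReturnType<typeof setInterval> {
  return setInterval(() => {
    api.rpc.system.health().catch((err) => {
      console.warn('keepalive system_health got no answer:', err);
    });
  }, everyMs);
}
```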

We can certainly investigate WS-specific ping/pong; I'm not sure whether all the client libs support it either (there are e.g. different usages for Node.js vs the browser). I've never looked into it, so I cannot comment on server or client lib support, or on whether it gets around the underlying drops.

(Although this doesn't look promising: it mentions that ping/pong is normally initiated by the server, not the client - https://stackoverflow.com/a/50883592 - and this one mentions specifically that the browser-side JS has no control over it - https://stackoverflow.com/a/10586583 - so it would rather seem to be a server-side config.)

@niklasad1
Member

niklasad1 commented Apr 26, 2022

OK, we can add these as "server-side pings" once jsonrpsee lands in Substrate, if that works better: soketto supports it, but it needs a fix in jsonrpsee (which should be trivial).
