Firehose connections hang when more than 100 subgraphs are deployed #3879
Comments
Mystery solved on why connection pooling wasn't working: tonic uses the URI as the key, so repeating the same endpoint doesn't really work.
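A hypothetical sketch of the kind of URI-keyed pooling being described here (not tonic's or graph-node's actual code), assuming tonic's `Endpoint`/`Channel` API: if channels are stored under the endpoint URI, repeating the same endpoint collapses to a single connection.

```rust
use std::collections::HashMap;

use tonic::transport::{Channel, Endpoint};

// Hypothetical illustration only: a pool keyed by endpoint URI. Configuring
// the same firehose URL `conn_pool_size` times still collapses to one entry,
// i.e. one underlying HTTP/2 connection shared by every block stream.
async fn build_pool(
    uri: &'static str,
    conn_pool_size: usize,
) -> Result<HashMap<&'static str, Channel>, tonic::transport::Error> {
    let mut pool = HashMap::new();
    for _ in 0..conn_pool_size {
        // Every iteration re-inserts under the same key, so only one
        // channel survives no matter how large the pool was meant to be.
        pool.insert(uri, Endpoint::from_static(uri).connect().await?);
    }
    Ok(pool)
}
```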
That is super helpful, great investigation, thank you very much for sharing all this knowledge. And it's great you also found why the connection pooling was not working; weird that it's not "reported" somehow, it would have been easily catchable. Have you raised an issue on the tonic repo about that? I would do it if that's not the case. Is connection pooling active on Ethereum also? It would be good to retry a shootout to see how it behaves with real connection pooling :) I assume we shall close this issue now?
We can close this issue after getting the
See this commit for a reproduction example. It exposes a few causes that seem to be conspiring:

- Running the example with `DEBUG` logging shows that the firehose endpoint sets the http2 setting `max_concurrent_streams` to 100, thereby responding with `REFUSED_STREAM` to any additional streams. This limits the number of block streams to 100 on a single connection.
- This setting is per connection, and we were supposed to be using connection pooling, but the example seems to use a single connection no matter what `conn_pool_size` is set to, so the connection pooling implementation must either not be working or not be balancing based on the number of streams (least loaded, round-robin or random would all work).
- Tonic seems to not retry establishing the stream after receiving `REFUSED_STREAM`. On retry it logs the below and then goes silent. This might be related to Client must not expect to hear back from server when establishing bidirectional stream (hyperium/tonic#515).

On the potential fix:
If Firehose could reliably set `max_concurrent_streams` to a high value, that would be great. But AFAIK Firehose itself does not set a limit, so proxies with draconian defaults must be to blame (nginx, for example, defaults to 128). Since we don't want to add configuration pitfalls for operators or require specific proxies, we cannot rely on this being set higher than 100, which is the RFC-recommended minimum.

The tonic bug seems related, but even if tonic retried, the stream would probably be refused again.
So the most reliable fix would be to get connection pooling working with an algorithm that seeks to balance the number of streams per connection.
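As a way to make that concrete, here is a minimal sketch of a stream-balancing pool, assuming tonic's `Endpoint`/`Channel` API. The `FirehosePool`, `checkout`, and `StreamGuard` names are made up for illustration and this is not graph-node's actual implementation; each new block stream is checked out from the connection that currently carries the fewest streams.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

use tonic::transport::{Channel, Endpoint};

/// Sketch of a stream-balancing pool: every entry owns its own HTTP/2
/// connection, and each checkout picks the entry with the fewest live
/// streams, keeping every connection under the server's
/// `max_concurrent_streams` limit of 100.
pub struct FirehosePool {
    entries: Vec<(Channel, Arc<AtomicUsize>)>,
}

pub struct StreamGuard(Arc<AtomicUsize>);

impl Drop for StreamGuard {
    fn drop(&mut self) {
        // A finished or dropped block stream frees a slot on its connection.
        self.0.fetch_sub(1, Ordering::Relaxed);
    }
}

impl FirehosePool {
    /// Opens `conn_pool_size` independent connections to the same endpoint.
    pub async fn connect(
        uri: &'static str,
        conn_pool_size: usize,
    ) -> Result<Self, tonic::transport::Error> {
        let mut entries = Vec::with_capacity(conn_pool_size);
        for _ in 0..conn_pool_size {
            // Each `connect` yields a distinct Channel with its own TCP/HTTP2
            // connection, even though the URI is identical.
            let channel = Endpoint::from_static(uri).connect().await?;
            entries.push((channel, Arc::new(AtomicUsize::new(0))));
        }
        Ok(Self { entries })
    }

    /// Least-loaded selection: clone the channel that currently carries the
    /// fewest streams and return a guard that releases the slot on drop.
    pub fn checkout(&self) -> (Channel, StreamGuard) {
        let (channel, count) = self
            .entries
            .iter()
            .min_by_key(|(_, count)| count.load(Ordering::Relaxed))
            .expect("pool has at least one connection");
        count.fetch_add(1, Ordering::Relaxed);
        (channel.clone(), StreamGuard(count.clone()))
    }
}
```

Round-robin or random selection over the same set of independently opened channels would also work, as long as the pool is sized so that no single connection ever needs more than 100 concurrent block streams.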