
our TCP listener is nonblocking so we *must* set the accepted stream to blocking explicitly #3154

Merged
merged 1 commit into mimblewimble:master from explicit_blocking on Nov 30, 2019

Conversation

antiochp
Member

@antiochp commented Nov 30, 2019

A nonblocking TCP listener will accept nonblocking TCP streams by default.
This is very surprising, to put it mildly.

We were creating nonblocking streams when accepting inbound connections from peers, but blocking streams when connecting to outbound peers. This inconsistency made the problem really hard to track down.

You can see this if you add debug logging in try_break! when matching on WouldBlock. On some connections we were seeing these scroll by at a rapid rate, but not all connections.
It turns out it was only on inbound connections.

So you can only reproduce the issue where you actually accept inbound connections: on a public node, or locally with a couple of nodes on localhost.

This probably explains all kinds of weird connection related problems.

One example: if we request a txhashset from a public node, and we are an inbound connection from that node's perspective, then the nonblocking stream will prevent that node from reliably sending the full txhashset attachment back uncorrupted.

@antiochp antiochp added this to the 3.0.0 milestone Nov 30, 2019
@antiochp antiochp requested a review from hashmap November 30, 2019 22:28
@DavidBurkett
Contributor

Amazing. I would've never thought that would happen and not fail catastrophically somewhere. Awesome find!

@antiochp
Member Author

I suspect this also explains part of the slow header sync. If you request headers from an outbound peer, they send you a reasonably large chunk of bytes via a write_all, which is susceptible to a WouldBlock internally because it is running on a nonblocking stream that we assume is blocking and safe to treat that way.
The end result is that the receiving end may not get a full Headers msg back in some scenarios and is forced to time the connection out, as it is blocking on the receiving end.

@antiochp
Member Author

I have successfully sync'd against a node running this code several times in a row without any issues.

@DavidBurkett
Contributor

Once we get this merged, I'll ask everyone I know of who runs public nodes to build from master. I'm aware of dozens of people who simply can't use Grin without this.

@hashmap
Contributor

hashmap commented Nov 30, 2019

@antiochp do you have a link to the docs or source code?

I did this in the beginning a while ago and then removed it out of curiosity; nothing changed. I also run a public node and don't see any issues; moreover, I could not reproduce the txhashset issue for 2 weeks. This would also cause high CPU load, yet my node runs at 0.3-1% CPU with at least a dozen inbound peers. Having said this, Gary experienced CPU spikes and fixed them by adding a sleep after WouldBlock. The only explanation I have is that my node has a very good connection or is extremely lucky.

@hashmap
Contributor

hashmap commented Nov 30, 2019

I'm fine with the change. After reading an example in the Rust docs I agree with your idea; however, I still don't understand how it works on my machine. I have 29 inbound peers now and, even with the TUI, I'm under 5% CPU; with nonblocking IO it was at least 60%.

@hashmap merged commit 7f7d51a into mimblewimble:master Nov 30, 2019
@antiochp deleted the explicit_blocking branch December 2, 2019 10:29
@antiochp
Member Author

antiochp commented Dec 2, 2019

> @antiochp do you have a link to the docs or source code?

I actually don't, no. This is all from experimenting with nodes locally (and a lot of debug logs).

But I do think the "principle of least surprise" may apply here: 99% of the time, if you have a nonblocking listener, you will want nonblocking streams accepted from it, and it would actually be surprising if you were then required to set each stream to nonblocking yourself.

Our usage is definitely an outlier - we have a nonblocking listener but we do want blocking streams accepted.

It would be nice if there were a simple way to read back the current blocking/nonblocking state of a stream, but there does not appear to be one.

@antiochp
Member Author

antiochp commented Dec 3, 2019

Hmm.

@hashmap has been testing this under various conditions and it appears this is only an issue on macOS.
So I'm not entirely sure how much effect this fix will have in the real world on mainnet, as presumably the majority of public nodes are on Linux.

- macOS: accepts nonblocking streams by default from a nonblocking listener
- Linux: accepts blocking streams by default from a nonblocking listener
- Windows: who knows?
