Simultaneous connection opening with TCP+TLS #732

Closed

hsanjuan opened this issue Oct 25, 2019 · 7 comments

Labels
kind/bug A bug in existing code (including security flaws)

Comments

@hsanjuan
Contributor

hsanjuan commented Oct 25, 2019

I have what seems to be a race condition when opening a NewStream to a host. This only became apparent when using the TCP transport + TLS. It did not happen (at least not usually) with TCP+secio, and it does not seem to happen with QUIC (at least not as often).

We are using libp2p-0.4.0:

go list -m all | grep libp2p
github.com/lanzafame/go-libp2p-ocgorpc v0.1.1
github.com/libp2p/go-addr-util v0.0.1
github.com/libp2p/go-buffer-pool v0.0.2
github.com/libp2p/go-conn-security v0.0.1
github.com/libp2p/go-conn-security-multistream v0.1.0
github.com/libp2p/go-eventbus v0.1.0
github.com/libp2p/go-flow-metrics v0.0.1
github.com/libp2p/go-libp2p v0.4.0
github.com/libp2p/go-libp2p-autonat v0.1.0
github.com/libp2p/go-libp2p-autonat-svc v0.1.0
github.com/libp2p/go-libp2p-blankhost v0.1.4
github.com/libp2p/go-libp2p-circuit v0.1.3
github.com/libp2p/go-libp2p-connmgr v0.1.1
github.com/libp2p/go-libp2p-consensus v0.0.1
github.com/libp2p/go-libp2p-core v0.2.3
github.com/libp2p/go-libp2p-crypto v0.1.0
github.com/libp2p/go-libp2p-discovery v0.1.0
github.com/libp2p/go-libp2p-gorpc v0.1.0
github.com/libp2p/go-libp2p-gostream v0.2.0
github.com/libp2p/go-libp2p-host v0.1.0
github.com/libp2p/go-libp2p-http v0.1.4
github.com/libp2p/go-libp2p-interface-connmgr v0.0.5
github.com/libp2p/go-libp2p-interface-pnet v0.1.0
github.com/libp2p/go-libp2p-kad-dht v0.2.1
github.com/libp2p/go-libp2p-kbucket v0.2.1
github.com/libp2p/go-libp2p-loggables v0.1.0
github.com/libp2p/go-libp2p-metrics v0.0.1
github.com/libp2p/go-libp2p-mplex v0.2.1
github.com/libp2p/go-libp2p-nat v0.0.4
github.com/libp2p/go-libp2p-net v0.0.2
github.com/libp2p/go-libp2p-netutil v0.1.0
github.com/libp2p/go-libp2p-peer v0.2.0
github.com/libp2p/go-libp2p-peerstore v0.1.3
github.com/libp2p/go-libp2p-pnet v0.1.0
github.com/libp2p/go-libp2p-protocol v0.1.0
github.com/libp2p/go-libp2p-pubsub v0.1.1
github.com/libp2p/go-libp2p-quic-transport v0.1.2-0.20190830164807-17543aa14ed2
github.com/libp2p/go-libp2p-raft v0.1.3
github.com/libp2p/go-libp2p-record v0.1.1
github.com/libp2p/go-libp2p-routing v0.1.0
github.com/libp2p/go-libp2p-secio v0.2.0
github.com/libp2p/go-libp2p-swarm v0.2.2 => /home/hector/go/src/github.com/libp2p/go-libp2p-swarm
github.com/libp2p/go-libp2p-testing v0.1.0
github.com/libp2p/go-libp2p-tls v0.1.1
github.com/libp2p/go-libp2p-transport v0.0.5
github.com/libp2p/go-libp2p-transport-upgrader v0.1.1
github.com/libp2p/go-libp2p-yamux v0.2.1
github.com/libp2p/go-maddr-filter v0.0.5
github.com/libp2p/go-mplex v0.1.0
github.com/libp2p/go-msgio v0.0.4
github.com/libp2p/go-nat v0.0.3
github.com/libp2p/go-openssl v0.0.2
github.com/libp2p/go-reuseport v0.0.1
github.com/libp2p/go-reuseport-transport v0.0.2
github.com/libp2p/go-stream-muxer v0.1.0
github.com/libp2p/go-stream-muxer-multistream v0.2.0
github.com/libp2p/go-tcp-transport v0.1.1
github.com/libp2p/go-testutil v0.1.0
github.com/libp2p/go-ws-transport v0.1.2
github.com/libp2p/go-yamux v1.2.3

(go-libp2p-swarm replaced to add some logging).

The procedure is the following:

  1. Initialize a libp2p host using the default transports and TLS for the libp2p.Security option. The listen multiaddress specifies a random port (0). The identity is an RSA-2048 key.
  2. Initialize a second libp2p host with the same configuration.
  3. Import the first peer's addresses into the second peer's peerstore.
  4. Open a stream from the second peer to the first (see the sketch after this list). This usually fails with "max dial attempts exceeded".
  5. As a side note, the first peer seems to open a stream back to the second. This might be gossipsub after getting a connection notification?
  6. There is no UPnP autodiscovery, DHT, bitswap, connection manager, NAT port mapping or relay in these tests.
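
For context, a minimal sketch of steps 1-4, assuming the v0.4.0-era APIs (`libp2p.New` still took a context then); the protocol ID `/test/1.0.0` and the error handling are illustrative, not taken from the original tests:

```go
package main

import (
	"context"
	"fmt"

	libp2p "github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p-core/crypto"
	"github.com/libp2p/go-libp2p-core/host"
	"github.com/libp2p/go-libp2p-core/peerstore"
	libp2ptls "github.com/libp2p/go-libp2p-tls"
)

// newHost builds a host with an RSA-2048 identity, TLS security and a
// random TCP port, matching steps 1-2 of the procedure above.
func newHost(ctx context.Context) (host.Host, error) {
	priv, _, err := crypto.GenerateKeyPair(crypto.RSA, 2048)
	if err != nil {
		return nil, err
	}
	return libp2p.New(ctx,
		libp2p.Identity(priv),
		libp2p.ListenAddrStrings("/ip4/127.0.0.1/tcp/0"),
		libp2p.Security(libp2ptls.ID, libp2ptls.New),
	)
}

func main() {
	ctx := context.Background()
	h1, err := newHost(ctx)
	if err != nil {
		panic(err)
	}
	h2, err := newHost(ctx)
	if err != nil {
		panic(err)
	}

	// Step 3: import the first peer's addresses into the second peer's peerstore.
	h2.Peerstore().AddAddrs(h1.ID(), h1.Addrs(), peerstore.PermanentAddrTTL)

	// Step 4: open a stream from the second peer to the first.
	// "/test/1.0.0" is an illustrative protocol ID, not from the report.
	if _, err := h2.NewStream(ctx, h1.ID(), "/test/1.0.0"); err != nil {
		fmt.Println("NewStream failed:", err) // e.g. "max dial attempts exceeded"
	}
}
```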

Adding some logging to libp2p-swarm shows:

  • First peer is /ip4/127.0.0.1/tcp/38275/p2p/QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3
11:58:01.228 DEBUG     swarm2: [QmSWL2AD5ondRHYWAMaZRdkTectuGVKVaxn7c2JmG4noYr] swarm dialing peer [QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3] swarm_dial.go:193
11:58:01.228 DEBUG     swarm2: QmSWL2AD5ondRHYWAMaZRdkTectuGVKVaxn7c2JmG4noYr swarm dialing QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3 swarm_dial.go:371
11:58:01.228 DEBUG     swarm2: [limiter] adding a dial job through limiter: /ip4/127.0.0.1/tcp/38275 limiter.go:192
11:58:01.228 DEBUG     swarm2: [limiter] taking FD token: peer: QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3; addr: /ip4/127.0.0.1/tcp/38275; prev consuming: 0 limiter.go:160
11:58:01.228 DEBUG     swarm2: [limiter] executing dial; peer: QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3; addr: /ip4/127.0.0.1/tcp/38275; FD consuming: 1; waiting: 0 limiter.go:166
11:58:01.228 DEBUG     swarm2: QmSWL2AD5ondRHYWAMaZRdkTectuGVKVaxn7c2JmG4noYr swarm dialing QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3 /ip4/127.0.0.1/tcp/38275 swarm_dial.go:455
11:58:01.232 DEBUG     swarm2: swarm listener accepted connection: <stream.Conn[TCP] /ip4/127.0.0.1/tcp/38275 (QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3) <-> /ip4/127.0.0.1/tcp/41033 (QmSWL2AD5ondRHYWAMaZRdkTectuGVKVaxn7c2JmG4noYr)> swarm_listen.go:87
11:58:01.232 DEBUG     swarm2: [limiter] freeing FD token; waiting: 0; consuming: 1 limiter.go:82
11:58:01.232 DEBUG     swarm2: [limiter] freeing peer token; peer QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3; addr: /ip4/127.0.0.1/tcp/38275; active for peer: 1; waiting on peer limit: 0 limiter.go:109
11:58:01.232 DEBUG     swarm2: [QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3] opening stream to peer [QmSWL2AD5ondRHYWAMaZRdkTectuGVKVaxn7c2JmG4noYr] swarm.go:292
11:58:01.232 DEBUG     swarm2: [QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3] swarm dialing peer [QmSWL2AD5ondRHYWAMaZRdkTectuGVKVaxn7c2JmG4noYr] swarm_dial.go:193
11:58:01.232 ERROR     swarm2: connection error: failed to dial QmSWL2AD5ondRHYWAMaZRdkTectuGVKVaxn7c2JmG4noYr: no addresses swarm.go:322
11:58:01.232 DEBUG     swarm2: [limiter] clearing all peer dials: QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3 limiter.go:200
11:58:01.233 DEBUG     swarm2: [QmSWL2AD5ondRHYWAMaZRdkTectuGVKVaxn7c2JmG4noYr] opening stream to peer [QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3] swarm.go:292
11:58:01.233 DEBUG     swarm2: [QmSWL2AD5ondRHYWAMaZRdkTectuGVKVaxn7c2JmG4noYr] swarm dialing peer [QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3] swarm_dial.go:193
11:58:01.233 DEBUG     swarm2: network for QmSWL2AD5ondRHYWAMaZRdkTectuGVKVaxn7c2JmG4noYr finished dialing QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3 swarm_dial.go:228
11:58:01.233 DEBUG     swarm2: [QmSWL2AD5ondRHYWAMaZRdkTectuGVKVaxn7c2JmG4noYr] opening stream to peer [QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3] swarm.go:292
11:58:01.233 DEBUG     swarm2: [QmSWL2AD5ondRHYWAMaZRdkTectuGVKVaxn7c2JmG4noYr] swarm dialing peer [QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3] swarm_dial.go:193
11:58:01.233 DEBUG     swarm2: network for QmSWL2AD5ondRHYWAMaZRdkTectuGVKVaxn7c2JmG4noYr finished dialing QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3 swarm_dial.go:228
11:58:01.233  INFO     swarm2: s.dialPeer() produced a connection: <swarm.Conn[*tcp.TcpTransport] /ip4/127.0.0.1/tcp/41033 (QmSWL2AD5ondRHYWAMaZRdkTectuGVKVaxn7c2JmG4noYr) <-> /ip4/127.0.0.1/tcp/38275 (QmQqzvSetBMJDtzwXKoucc9SvRhRSwxxJtny2eNPyKcLd3)> swarm.go:325
11:58:01.233 ERROR     swarm2: NewStream() error: EOF swarm.go:329
11:58:01.233 ERROR     swarm2: NewStream(): connection was closed swarm.go:331
11:58:01.233 ERROR  p2p-gorpc: max dial attempts exceeded call.go:64
11:58:01.233 ERROR    cluster: max dial attempts exceeded cluster.go:902
11:58:01.233 ERROR    cluster: max dial attempts exceeded ipfscluster_test.go:355

Notice the following:

  • A "connection error: no addresses" that sometimes appears, right after peer1 tries to contact peer2 (why is it not re-using the existing connections?)
  • Multiple "finished dialing" messages, which indicate that several connections were produced
  • The stream returns an EOF error
  • Sometimes a retry happens (because a connection was returned by bestConnToPeer) and things work; sometimes things die in the retry; sometimes there is no retry because no connection existed in the first place and DialAttempts is reached. In this case, after the EOF, it seems the connection used was closed (this is extra logging added by myself).
  • As mentioned above, this was not an error we saw when using TCP+secio.

I am a bit lost here, but it seems that there should not be so many connections being opened or attempted and that streams should not get a closed connection.

@hsanjuan hsanjuan added the kind/bug A bug in existing code (including security flaws) label Oct 25, 2019
@hsanjuan
Contributor Author

I'm not sure if this is actually triggering the old libp2p/go-libp2p-swarm#79. But this would mean that the connection notifier perhaps fires while the connection is not fully set up yet and TLS is still being negotiated, so the connection is not usable, and that's why it tries to open a new one?

@Stebalien
Member

> A "connection error: no addresses" that sometimes appears, right after peer1 tries to contact peer2

That means we don't know any addresses so we're giving up.

> (why is it not re-using the existing connections?)

It generally should. Swarm.doDial calls s.bestConnToPeer when the actual dial fails. If this happens, it logs "ignoring dial error because we have a connection: ..." and returns the connection it found.
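
To illustrate that fallback, here is a self-contained toy version of it; the names mirror go-libp2p-swarm, but these stubs are hypothetical and only model the behavior described above, not the real code:

```go
package main

import (
	"errors"
	"fmt"
)

type conn struct{}

// dial simulates a dial that fails (e.g. during a simultaneous open).
func dial() (*conn, error) { return nil, errors.New("dial failed") }

// bestConnToPeer simulates finding an already-established (inbound) connection.
func bestConnToPeer() *conn { return &conn{} }

// doDial forgives a failed dial whenever some other connection to the
// peer already exists, which is why dials "generally should" reuse it.
func doDial() (*conn, error) {
	c, err := dial()
	if err != nil {
		if existing := bestConnToPeer(); existing != nil {
			fmt.Printf("ignoring dial error because we have a connection: %s\n", err)
			return existing, nil
		}
		return nil, err
	}
	return c, nil
}

func main() {
	c, err := doDial()
	fmt.Println(c != nil, err) // true <nil>
}
```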

> This usually fails with "max dial attempts exceeded".

That really is pointing to a simultaneous connection issue: https://github.com/libp2p/go-libp2p-swarm/blob/5efa2fa248976bbdc141f638ead80cbf7ab689cf/swarm.go#L291-L334.

  1. Check to see if we have a connection: we don't.
  2. Increment the "dials" count.
  3. Try dialing: succeed
  4. Try opening a stream: fail
  5. Check if the connection is closed: true.
  6. Go around the loop to see if we have an existing open connection: we don't.
  7. Consider dialing again: notice that "dials" is >= 1 and abort.

This can happen when the connection we created is killed before we get to actually use it.
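
To make the failure mode concrete, here is a self-contained toy model of that loop (hypothetical stub types, not the real swarm code) in which the freshly dialed connection is killed before the stream can be opened:

```go
package main

import (
	"errors"
	"fmt"
)

// conn is a hypothetical stand-in for a swarm connection.
type conn struct{ closed bool }

func (c *conn) newStream() error {
	if c.closed {
		return errors.New("EOF") // stream opened on a dead connection
	}
	return nil
}

type swarm struct {
	best *conn // what bestConnToPeer would return
}

// newStream mirrors steps 1-7 above: at most one dial, and a retry
// only succeeds if an existing open connection can be found.
func (s *swarm) newStream() error {
	dials := 0
	for {
		c := s.best
		if c == nil || c.closed { // steps 1/6: no usable connection
			if dials >= 1 { // step 7: DialAttempts reached, abort
				return errors.New("max dial attempts exceeded")
			}
			dials++ // step 2
			c = &conn{} // step 3: dial succeeds
			s.best = c
			// Simulate the race: the connection is killed (e.g. by the
			// simultaneous-open logic) before we get to use it.
			c.closed = true
		}
		if err := c.newStream(); err != nil { // step 4: fails
			if c.closed { // step 5
				continue // step 6: look for another connection
			}
			return err
		}
		return nil
	}
}

func main() {
	s := &swarm{}
	fmt.Println(s.newStream()) // max dial attempts exceeded
}
```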

@hsanjuan
Contributor Author

> This can happen when the connection we created is killed before we get to actually use it.

Does the Connected notification happen before the transport has been upgraded? Is there a way that a stream opened after receiving a Connected notification might end up opening a new connection on the way back?

@Stebalien
Member

No. The connection notification fires at the very end.

> Is there a way that a stream opened after receiving a Connected notification might end up opening a new connection on the way back?

This should be impossible.

@hsanjuan
Contributor Author

hsanjuan commented Nov 2, 2019

So I think the TLS problem is fixed.

The "no addresses" error was related. Since identify could not even complete on the closed connections, other streams tried to re-open a connection back to the original peer, but no addresses were known.
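
For illustration only (my assumption, not a fix proposed in this thread): pre-seeding addresses in both directions, reusing h1/h2 and the peerstore import from the repro sketch above, would keep a dial-back from hitting "no addresses" even when identify cannot complete:

```go
// Assumption, not prescribed in this thread: seed addresses in both
// directions so a dial-back never fails with "no addresses", even if
// identify could not complete on a torn-down connection.
h1.Peerstore().AddAddrs(h2.ID(), h2.Addrs(), peerstore.PermanentAddrTTL)
h2.Peerstore().AddAddrs(h1.ID(), h1.Addrs(), peerstore.PermanentAddrTTL)
```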

I need to verify things with QUIC though, as I remember having seen something similar at least.

@marten-seemann
Contributor

@hsanjuan I'm not sure I completely understand the issue here, but there's no such thing as simultaneous open in QUIC. You'll end up with two separate QUIC connections when two peers are dialing each other simultaneously.

Stebalien added commits to ipfs/kubo that referenced this issue (Jan 17-18, 2020)
ralendor pushed commits to ralendor/go-ipfs that referenced this issue (Jun 6-8, 2020)
@Stebalien
Member

This has likely been fixed.

hannahhoward pushed a commit to filecoin-project/kubo-api-client that referenced this issue Jun 19, 2023