Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

swarm: emit PeerConnectedness event from swarm instead of from hosts #1574

Merged
merged 5 commits into from
Feb 25, 2023

Conversation

marten-seemann
Copy link
Contributor

Fixes #1571.

This was surprisingly easy. Unfortunately, I can't find a way to test the deduplication (that no event is emitted when a second connection is dialed), since the swarm is too good at preventing dials to hosts we're already connected to.

@marten-seemann marten-seemann requested review from vyzo and MarcoPolo May 28, 2022 09:43
@@ -112,6 +112,9 @@ var _ host.Host = (*BasicHost)(nil)
// HostOpts holds options that can be passed to NewHost in order to
// customize construction of the *BasicHost.
type HostOpts struct {
// EventBus sets the event bus. Will construct a new event bus if omitted.
Copy link
Contributor

@vyzo vyzo May 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel uneasy about this.
Maybe we should add the EventBus method to the Network interface and forward from the host so that we cant end up with two of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really how this could happen, but we can make this change.

Can we defer this until we consolidated all the other repos as well (#1566)? It's happening in the next release.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to just make this entire thing private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of having both a Config and HostOpts. I expect us to merge these two (or refactor them) in some other way when we start using Fx for service construction (#1993). I suggest not making any big changes in this PR.

@@ -310,6 +321,13 @@ func (s *Swarm) addConn(tc transport.CapableConn, dir network.Direction) (*Conn,
})
c.notifyLk.Unlock()

if isFirstConn {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we want this outside of the notifyLk? That changes the behavior from what it used to be (inside the notifyAll).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think this needs to be inside. We need to make sure that the disconnect comes after the connect event.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Actually, I'm not sure if this lock helps us here. We need to make this guarantee across all connections.

s.conns.Unlock()

if disconnected {
s.emitter.Emit(event.EvtPeerConnectednessChanged{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason to do this here instead of near the old place where this happened: https://github.com/libp2p/go-libp2p/blob/marco/1575/p2p/net/swarm/swarm_conn.go#L85 ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably to ensure that we only emit the event on the last disconnect. But it does look like there's a bit of a race and we can emit a disconnect after a connect (looks like we already have that race, actually).

@Stebalien
Copy link
Member

So, I think the only way to sanely do this is to write something like: https://github.com/ipfs/go-bitswap/pull/565/files. We probably don't want to block some global queue while sending events.

That is:

  1. On connect/disconnect, mark the peer as connected/disconnected in some tracker. Add the peer to some "todo" queue, and tell the background process that it needs to wake up.
  2. In a background event loop, fire connect/disconnect events.

Even if this background process falls behind (someone is blocking events), that's fine. We'll just always fire the "last" event. This does mean we can get unmatched connected/disconnected events, but, IMO, that's fine.

I.e.:

  1. Connect.
  2. Disconnect.
  3. Connect.
  4. Disconnect.

May be seen as:

  1. Connect.
  2. Connect.
  3. Disconnect.

Or

  1. Disconnect.
  2. Disconnect.

Or

  1. Connect.
  2. Disconnect.

Or

  1. Disconnect.

But never:

  1. Connect.
  2. Disconnect.
  3. Connect.

Importantly, the final state is always accurate, which is what users care about. At the moment, it's possible for the last event fired to be "connect", which could cause memory leaks.

@BigLep BigLep added the P3 Low: Not priority right now label Sep 9, 2022
@marten-seemann
Copy link
Contributor Author

marten-seemann commented Feb 5, 2023

Rebased this PR on top of the current master.

I think emitting the event while holding the conns mutex should be fine:

  • It's correct since the the swarm.conns.m[peer.ID] is the source of truth: Once we're down to 0 connections in this list, we're officially disconnected from the peer.
  • It's ok performance-wise, because emitting events is cheap, as long as the subscribers consume events quickly enough. Now we could (somewhat) protect against that by introducing a worker Go routine and another tracking data structure as suggested by @Stebalien in swarm: emit PeerConnectedness event from swarm instead of from hosts #1574 (comment), but that data structure would need to be limited in size somehow, so this is no better than using a larger channel for the event bus subscriber. Using the new event bus dashboard (eventbus: add metrics #2038) it should be easy to detect a misconfiguration there.

The following sequence can show up in two different ways:

  1. Connect.
  2. Disconnect.
  3. Connect.
  4. Disconnect.

First way:

  1. Connect
  2. Disconnect
  3. Connect
  4. Disconnect

Second way (when the two connects are processed first):

  1. Connect
  2. Disconnect

I introduced the change in the last commit. @MarcoPolo, @Stebalien does this reasoning make sense to you?

@marten-seemann marten-seemann mentioned this pull request Feb 22, 2023
21 tasks
Copy link
Collaborator

@MarcoPolo MarcoPolo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good and reasoning makes sense.

Unfortunately, I can't find a way to test the deduplication (that no event is emitted when a second connection is dialed)

You could try this:
Make a new test transport that wraps an existing transport. Put two multiaddrs for this test transport (they might have to be slightly different if we're deduping these somewhere). In the test transport have them synchronize with a waitgroup to return a dial at the same time.

p2p/net/swarm/swarm_event_test.go Outdated Show resolved Hide resolved
@marten-seemann
Copy link
Contributor Author

marten-seemann commented Feb 24, 2023

Make a new test transport that wraps an existing transport. Put two multiaddrs for this test transport (they might have to be slightly different if we're deduping these somewhere). In the test transport have them synchronize with a waitgroup to return a dial at the same time.

That doesn't work either. The dial worker loop somewhere deduplicates. I have no intention of diving into that mess to figure out where.

@marten-seemann marten-seemann merged commit 581a015 into master Feb 25, 2023
Wondertan added a commit to celestiaorg/celestia-node that referenced this pull request Apr 27, 2023
This reverts commit 82fbba8.

The update breaks the event system. It likely happened after this libp2p/go-libp2p#1574, but we don't need to debug it right now, as a simple revert fixes this
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P3 Low: Not priority right now
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

swarm: emit PeerConnectedness event
5 participants