Peers not always joining a topic and/or connecting #93
Thanks for the report! The old floodsub implementation is still present and being used by IPFS, though it was changed a non-trivial amount to abstract the routing layer from the rest of the implementation, potentially adding room for regression. Could you enable debug logging in the pubsub module?
How does one "enable logging in the pubsub module"? Please advise.
@bigs I'd appreciate advice on this, so I can provide more information to narrow this down and hopefully get a fix soon.
Are you using gossipsub or floodsub? As of #95, gossipsub has quite a bit of logging around internal events, including peer discovery.
@vyzo using whatever comes with the official release, via go-ipfs-dep and js-ipfs-api. Generally though, my question is how to turn on logging for go-ipfs? I used to know, but can't remember anymore and can't find up-to-date documentation on which env vars or flags to set :)
This is still floodsub, although gossipsub is coming soon to mainline ipfs as well. You can see logs with the IPFS_LOGGING environment variable, e.g. IPFS_LOGGING=debug.
Thank you @vyzo! After figuring out how to get the debug logs in ipfsd-ctl, I was able to get them on CI and finally reproduce the failure and the behaviour described above. See here: https://circleci.com/gh/orbitdb/orbit-db/1523#queue-placeholder/containers/3. There's a pubsub error in the logs. For those ending up here looking for instructions on how to get go-ipfs logs when using it via js-ipfsd-ctl and js-ipfs-api: logging needs to be turned on in ipfsd-ctl too (this is not documented and was unintuitive for me).
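For anyone embedding go-libp2p-pubsub directly rather than driving a go-ipfs daemon, the same subsystem logs can be turned on in-process. A minimal sketch, assuming the go-log package that libp2p uses for logging (the subsystem name "pubsub" is an assumption; check which loggers your version registers):

```go
package main

import (
	logging "github.com/ipfs/go-log"
)

func main() {
	// Turn on debug output for the pubsub subsystem. If the name
	// doesn't match, logging.GetSubsystems() lists what's registered.
	if err := logging.SetLogLevel("pubsub", "debug"); err != nil {
		panic(err)
	}
}
```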
The peer discovery mechanism is based on ambient connection events, with the entry point here:
The notification is handled here:
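In code, the ambient-discovery pattern described above looks roughly like the following. This is an illustrative sketch, not the library's actual internals; it assumes the network.Notifiee interface from go-libp2p-core, and the pubsubNotifee/newPeers names are invented:

```go
package sketch

import (
	"github.com/libp2p/go-libp2p-core/network"
	"github.com/libp2p/go-libp2p-core/peer"
	ma "github.com/multiformats/go-multiaddr"
)

// pubsubNotifee forwards ambient connection events to an event loop,
// which then tries to open a pubsub stream to the new peer.
type pubsubNotifee struct {
	newPeers chan peer.ID
}

// Connected fires for every new swarm connection; each remote peer is
// treated as a potential pubsub peer.
func (n *pubsubNotifee) Connected(_ network.Network, c network.Conn) {
	select {
	case n.newPeers <- c.RemotePeer():
	default: // event loop is busy; the peer will be picked up on reconnect
	}
}

// The remaining Notifiee callbacks are no-ops in this sketch.
func (n *pubsubNotifee) Disconnected(network.Network, network.Conn)   {}
func (n *pubsubNotifee) Listen(network.Network, ma.Multiaddr)         {}
func (n *pubsubNotifee) ListenClose(network.Network, ma.Multiaddr)    {}
func (n *pubsubNotifee) OpenedStream(network.Network, network.Stream) {}
func (n *pubsubNotifee) ClosedStream(network.Network, network.Stream) {}
```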
Hello guys, we use the lib in our project and we've got the same problem with the peer announcements we get. We're trying to use the same example code to connect the peers to each other.
Hrm... @haadcode are you calling connect from both peers at the same time? That is, are you telling peer A to connect to peer B and telling peer B to connect to peer A? If you are, then I think this might be an issue with the ordering of connection notifications. cc @Stebalien
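For reference, the dial-from-both-sides pattern being asked about boils down to something like this hypothetical repro sketch using the go-libp2p Host API (the constructor signature varies across go-libp2p versions; the OrbitDB tests do the equivalent through js-ipfs-api):

```go
package main

import (
	"context"
	"fmt"

	libp2p "github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p-core/peer"
)

func main() {
	ctx := context.Background()
	hostA, _ := libp2p.New(ctx)
	hostB, _ := libp2p.New(ctx)

	infoA := peer.AddrInfo{ID: hostA.ID(), Addrs: hostA.Addrs()}
	infoB := peer.AddrInfo{ID: hostB.ID(), Addrs: hostB.Addrs()}

	// Dial from both sides at once. This is the suspected trigger:
	// each side gets a notification for the inbound connection as well
	// as its own outbound one, and may cancel the wrong stream.
	errs := make(chan error, 2)
	go func() { errs <- hostA.Connect(ctx, infoB) }()
	go func() { errs <- hostB.Connect(ctx, infoA) }()
	fmt.Println(<-errs, <-errs)
}
```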
Apologies for the delays; going to dig into this today and get back to you ASAP.
@haadcode @whyrusleeping I think that might be on the money... after spelunking on this, my guess is it's a bizarre situation where the connect notification from A dialing out to B is overridden by the notification of B dialing in to A, with the same happening on the other side. This would cause both sides to cancel the stream that THEY opened in favor of the one the other side opened. I think a solution could be to serialize the notification process at the swarm level, OR to make the floodsub logic robust enough to compensate (keep track of whether you've initialized a stream for every peer, only allowing re-initialization when there is a disconnect between connects). Thoughts @whyrusleeping? I can make this patch tonight or tomorrow.
Edit: on second thought, we maintain a list of connections to a peer for a reason (I'm guessing), so I'll shore up floodsub and see if that does it.
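The "track whether you've initialized a stream" option mentioned above could look something like this. All names here are hypothetical; it's a sketch of the guard, not the actual floodsub code:

```go
package sketch

import (
	"sync"

	"github.com/libp2p/go-libp2p-core/peer"
)

// streamTracker remembers which peers already have an outbound pubsub
// stream, so a second connect notification doesn't clobber the first.
type streamTracker struct {
	mu     sync.Mutex
	active map[peer.ID]bool
}

func newStreamTracker() *streamTracker {
	return &streamTracker{active: make(map[peer.ID]bool)}
}

// claim reports whether the caller should open a new stream to pid.
// It returns false if a stream was already initialized and not yet
// released by a disconnect.
func (t *streamTracker) claim(pid peer.ID) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.active[pid] {
		return false
	}
	t.active[pid] = true
	return true
}

// release is called on disconnect, allowing re-initialization later.
func (t *streamTracker) release(pid peer.ID) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.active, pid)
}
```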
Thank you @whyrusleeping and @bigs! I was able to make the error go away by not doing what @whyrusleeping described, i.e. by not manually connecting the peers. Wherever this gets fixed, my two cents: it should be fixed in a way that one sub-system doesn't affect the behaviour of another from the user's perspective, i.e. calling connect manually shouldn't change how pubsub behaves.
@haadcode quick update... after some discussion on our end, we've got a better fix planned. First attempt will be done today.
@haadcode another update... this is actually a bug in our muxing protocol, triggered when a TCP simultaneous open occurs between two hosts. There are a few options for fixes, but it will take some time before we settle on something to ship.
@bigs thanks for the update! Wow, didn't realize this goes so deep, but I'm glad to hear this has led to the lower levels and the bug was caught :) Is there an open issue somewhere else to follow the discussion/progress?
@haadcode with any luck, today or tomorrow I'll be sketching a fix for yamux!
Friendly bump, any updates @bigs?
@haadcode @code4trade I think the best way forward is for me to push a fork of yamux and register it with a new protocol string; then you could use this fork and avoid this issue. You can also force your transports to use mplex instead of yamux.
A fork doesn't work for us unfortunately, as we want to be able to use the latest official go-ipfs releases.
How would one force the transports to use a specific muxer? Is this a configuration flag somewhere?
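For an embedded go-libp2p node, pinning the muxer can be done with a constructor option. A sketch, assuming the libp2p.Muxer option and the go-libp2p-mplex transport (a go-ipfs daemon would need the equivalent configuration instead, and constructor signatures vary across go-libp2p versions):

```go
package main

import (
	"context"
	"fmt"

	libp2p "github.com/libp2p/go-libp2p"
	mplex "github.com/libp2p/go-libp2p-mplex"
)

func main() {
	ctx := context.Background()
	// Register mplex as the only stream muxer, so yamux (and its
	// simultaneous-open bug) is never negotiated on new connections.
	h, err := libp2p.New(ctx, libp2p.Muxer("/mplex/6.7.0", mplex.DefaultTransport))
	if err != nil {
		panic(err)
	}
	defer h.Close()
	fmt.Println("host up with mplex only:", h.ID().Pretty())
}
```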
Hey guys, can you try this again to verify that the problem still exists?
Hello @vyzo, I made a quick update of the lib and for me it works like a charm. I'll keep testing. Thanks for the update.
While using go-libp2p-kad-dht for peer discovery everything works like a charm; with the go-libp2p/p2p/discovery/mdns discovery mechanism, however, the issues described above still appear: peers have connections, but querying the pubsub subsystem for peers subscribed to a topic shows bizarre results. Out of 3 connected peers, one peer "sees" the 2 other peers, but the other 2 peers see only one subscription to that particular topic. This results in "lonely peers" that do not get messages from other peers. It has happened to me a couple of times; the failure does not appear on every test run.
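The asymmetry described above can be made visible by comparing each node's swarm-level peers against its pubsub view of the topic. A small sketch, assuming go-libp2p-pubsub's ListPeers API (checkTopicView is a hypothetical helper name):

```go
package sketch

import (
	"fmt"

	"github.com/libp2p/go-libp2p-core/host"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// checkTopicView compares a node's swarm-level peer set with the peers
// its pubsub router believes are subscribed to a topic. A "lonely peer"
// is one whose topic list is missing peers present in the swarm list.
func checkTopicView(h host.Host, ps *pubsub.PubSub, topic string) {
	swarmPeers := h.Network().Peers()
	topicPeers := ps.ListPeers(topic)
	fmt.Printf("%s: %d swarm peers, %d topic peers\n",
		h.ID().Pretty(), len(swarmPeers), len(topicPeers))
}
```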
Fixed by #132.
We've been running into some hiccups in OrbitDB recently as we've been upgrading from go-ipfs 0.4.13 to 0.4.16 (and now 0.4.17).
We've observed a behaviour where pubsub peers don't always connect to each other. Detailed information on the behaviour and what we've discovered so far is in orbitdb/orbitdb#417. To summarize: the tests are run with two local go-ipfs nodes, both subscribe to the same pubsub topic, but sometimes (about 50% of test runs) they don't seem to connect to each other at the pubsub level even though they're connected at the swarm level (see this comment: orbitdb/orbitdb#417 (comment)).
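Distilled to an in-process check, the failing scenario looks roughly like this hypothetical sketch using go-libp2p-pubsub's floodsub (the real tests drive two go-ipfs daemons via js-ipfs-api; constructor signatures vary across versions, and error handling is elided for brevity):

```go
package main

import (
	"context"
	"fmt"
	"time"

	libp2p "github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p-core/peer"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

func main() {
	ctx := context.Background()
	hostA, _ := libp2p.New(ctx)
	hostB, _ := libp2p.New(ctx)

	psA, _ := pubsub.NewFloodSub(ctx, hostA)
	psB, _ := pubsub.NewFloodSub(ctx, hostB)

	// Both sides subscribe to the same topic...
	subA, _ := psA.Subscribe("orbitdb-test")
	subB, _ := psB.Subscribe("orbitdb-test")
	defer subA.Cancel()
	defer subB.Cancel()

	// ...and connect at the swarm level.
	if err := hostA.Connect(ctx, peer.AddrInfo{ID: hostB.ID(), Addrs: hostB.Addrs()}); err != nil {
		panic(err)
	}
	time.Sleep(2 * time.Second) // allow subscriptions to propagate

	// Expected: each side lists the other as a topic peer. In the
	// failing runs, one or both of these lists comes back empty.
	fmt.Println("A sees:", psA.ListPeers("orbitdb-test"))
	fmt.Println("B sees:", psB.ListPeers("orbitdb-test"))
}
```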
Looking at the commit log in this repo, I noticed that the pubsub implementation was changed some time ago (yay, great to see the gossip pubsub coming together!), so most likely this is a regression. I believe 0.4.13 was working fine, but we haven't verified which version after that stopped working.
If anyone has an idea why this is happening and could work on a fix, it would be highly appreciated. We're (finally) adding back support for go-ipfs in OrbitDB, but the tests are not running reliably due to this bug, so we're hesitant to make a new release until it's fixed (alternatively we could instruct users to use <= 0.4.13, but that's no fun).
If there's any information I can provide to solve this, please let me know!