Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Peers not always joining a topic and/or connecting #93

Closed
haadcode opened this issue Aug 1, 2018 · 29 comments
Closed

Peers not always joining a topic and/or connecting #93

haadcode opened this issue Aug 1, 2018 · 29 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@haadcode
Copy link

haadcode commented Aug 1, 2018

We've been running into some hickups in OrbitDB recently as we've been upgrading from go-ipfs 0.4.13 to 0.4.16 (and now 0.4.17).

We've observed a behaviour where pubsub peers don't always connect to each other. Detailed information on the behaviour and what we've discovered so far in orbitdb/orbitdb#417. To summarize: the tests are run with two local go-ipfs nodes, both subscribe to the same pubsub topic, but sometimes (about 50% of test runs) they don't seem to connect to each other on pubsub level even though they're connected on swarm level (see this comment orbitdb/orbitdb#417 (comment)).

Looking at the commit log in this repo, I noticed that the pubsub implementation was changed (yay, great to see the gossip pubsub coming together!) some time ago, so most likely this is regression. I believe 0.4.13 was working fine, but we haven't verified which version after that doesn't work anymore.

If anyone has an idea why this is happening and work on a fix, would highly appreciate it. We're (finally) adding back the support for go-ipfs in OrbitDB, but the tests are not running reliably at all times due to this bug, so we're hesitant to make a new release until this is fixed (alternatively we could instruct user to use =< 0.4.13 but that's no fun).

If there's any information I can provide to solve this, please let me know!

@bigs
Copy link

bigs commented Aug 1, 2018

thanks for the report! the old floodsub implementation is still present and being used by IPFS, though it was changed a non-trivial amount to abstract the routing layer from the rest of the implementation, potentially adding room for regression. could you enable debug logging in the pubsub module and run the tests a few times to see if we can get some extra information?

@haadcode
Copy link
Author

haadcode commented Aug 9, 2018

How does one "enable logging in the pubsub module"? Please advice.

@haadcode
Copy link
Author

@bigs would appreciate advice on this to be able to provide more information to narrow this down and possibly have a fix soon.

@vyzo
Copy link
Collaborator

vyzo commented Aug 16, 2018

Are you using gossipsub or floodsub? as of #95 gossipsub has quite a bit of logging around internal events, and this includes logging peer discovery.

@haadcode
Copy link
Author

@vyzo using whatever comes with the official release and using it through go-ipfs-dep and js-ipfs-api.

Generally though my question is how to turn on logging for go-ipfs? I used to know it, but can't remember anymore and can't find up-to-date documentation what env vars or flags I should set :)

@vyzo
Copy link
Collaborator

vyzo commented Aug 16, 2018

This is still floodsub, although gossipsub is coming soon to mainline ipfs as well.

You can see logs with the IPFS_LOGGING env var -- eg you can set it to DEBUG for lots of verbosity.

@haadcode
Copy link
Author

Thank you @vyzo!

After figuring out how to get the debug logs in ipfsd-ctl, was able to get them on CI and finally reproduce the failure and the behaviour described above. See here https://circleci.com/gh/orbitdb/orbit-db/1523#queue-placeholder/containers/3.

There's a pubsub error pubsub: already have connection to peer: <peer.ID bUcKci> pubsub.go:205 and a second one a little below in the logs. As previously described, this happens only some times but that error doesn't happen in successful test runs. Hopefully this will help!

For those ending up here looking for instructions on how to get go-ipfs logs when used via js-ipfsd-ctl and js-ipfs-api, logging need to be turned on in ipfsd-ctl too (this is not documented and was unintuitive for me):
IPFS_LOGGING=DEBUG DEBUG=ipfsd-ctl:daemon:stderr node your-program.js

@haadcode
Copy link
Author

haadcode commented Sep 5, 2018

Ping. @vyzo @bigs any pointers on this? Where in the code should we start looking if we wanted to dig deeper?

@vyzo
Copy link
Collaborator

vyzo commented Sep 5, 2018

The peer discovery mechanism is based on ambient connection events, with entry point here:
https://github.com/libp2p/go-floodsub/blob/master/notify.go#L18

The notification is handled here:
https://github.com/libp2p/go-floodsub/blob/master/pubsub.go#L201

@thiagozs
Copy link

thiagozs commented Sep 5, 2018

Hello guys, we use the lib in our project and we got the same problem, The announce of peers we got

INFO[0001] Start new node                               
INFO[0001] Register peer...                             
INFO[0001] bootstrapping...                             
INFO[0006] bootstrapping done                           
INFO[0009] Announce, searching for other peers          
INFO[0009] Register Channel                              name=Transactions
11:50:52.660 ERROR     pubsub: already have connection to peer:  <peer.ID byL35h> asm_amd64.s:2361
11:50:52.661 ERROR     pubsub: already have connection to peer:  <peer.ID byL35h> asm_amd64.s:2361

We are try use the same examples of code for connection each others.

@whyrusleeping
Copy link
Contributor

Hrm... @haadcode are you calling connect from both peers at the same time? That is, are you telling peer A to connect to peer B and telling peer B to connect to peer A? If you are, then I think this might be an issue with the ordering of connection notifications. cc @Stebalien

@bigs
Copy link

bigs commented Sep 13, 2018

apologies for the delays, going to dig into this today and get back to you ASAP

@bigs
Copy link

bigs commented Sep 13, 2018

@haadcode @whyrusleeping i think that might be on the money... after spelunking on this, my guess is it's a bizarre situation where the connect notification from A dialing out to B is overridden by the notification of B dialing in to A, with the same happening on the other side. this would cause both sides to cancel the stream that THEY opened in favor of the one the other opened.

i think a solution could be to serialize the notification process on the swarm level OR to make the floodsub logic more robust to make up for this (keep track of whether you've initialized a stream for every peer, only allowing re-initialization when there is a disconnect between connects)

thoughts @whyrusleeping? i can make this patch tonight or tomorrow.

@bigs
Copy link

bigs commented Sep 13, 2018

my gut is the more correct answer lies in swarm so i will take a go of it

edit: on second thought, we maintain a list of connections to a peer for a reason (i'm guessing) so i'll shore up flood sub and see if that does it.

@whyrusleeping
Copy link
Contributor

@bigs yeah, give it a shot. See if you can reproduce the issue that @haadcode is describing in isolation (a test) first.

@haadcode
Copy link
Author

Thank you @whyrusleeping and @bigs! I was able to make the error go away by not doing what @whyrusleeping was describing, ie. manually connect the peers. By not calling swarm.connect at all (we do it in the failing tests as described: both call each other) the problem goes away. Iirc, calling it from just one still results in the same erroneous behaviour.

Wherever this is fixed, my two cents would be that I feel this should be fixed in a way that one system doesn't effect the behaviour of another sub-system from the user's perspective. Ie. calling swarm.connect shouldn't effect pubsub (they may or may not be connected internally, but for the user it shouldn't matter).

@bigs
Copy link

bigs commented Sep 14, 2018

failing test and subsequent fix over at #105. feel free to test out that commit, @haadcode

@bigs
Copy link

bigs commented Sep 17, 2018

@haadcode quick update... after some discussion on our end, we've got a better fix planned. first attempt will be done today

@bigs
Copy link

bigs commented Sep 26, 2018

@haadcode another update... this is actually a bug in our muxing protocol triggered when a tcp simultaneous open occurs between two hosts. there are a few options for fixes, but it will take some time before we settle on something to ship.

@haadcode
Copy link
Author

haadcode commented Oct 1, 2018

@bigs thanks for the update! Wow, didn't realize this goes so deep, but I'm glad to hear this has lead to the lower levels and the bug was caught :) Is there an open issue somewhere else to follow the discussion/progress?

@bigs
Copy link

bigs commented Oct 4, 2018

@haadcode with any luck today or tomorrow i'll be sketching a fix for yamux!

@code4trade
Copy link

@bigs @haadcode
Do you have any news about this problem?
We are commenting a small part of pubsub.go, is there any other alternative to work around this problem?

@haadcode
Copy link
Author

haadcode commented Dec 7, 2018

Friendly bump, any updates @bigs?

@bigs
Copy link

bigs commented Dec 10, 2018

@haadcode @code4trade i think the best way forward is for me to push a fork of yamux and register it with a new protocol string. then you could use this fork and avoid this issue. you can also force your transports to use mplex which doesn't have the issue

@haadcode
Copy link
Author

haadcode commented Jan 3, 2019

i think the best way forward is for me to push a fork of yamux and register it with a new protocol string. then you could use this fork and avoid this issue.

A fork doesn't work for us unfortunately as we want to be able to use the latest from dist.ipfs.io, ie. official release (as that's what our users will use). What's blocking you to fix it in yamux and pulling in the fixed version to the official release?

you can also force your transports to use mplex which doesn't have the issue

How would one do that? Is this a configuration flag somewhere?

@vyzo
Copy link
Collaborator

vyzo commented Jan 3, 2019

Hey guys, can you try this again to verify that the problem still exists?
We did fix the multiple connections issue in #132.

@thiagozs
Copy link

thiagozs commented Jan 3, 2019

Hello @vyzo I made a quick update on the lib and for me work like a charm. I keeping make a test. Thanks for the update.

@iulianpascalau
Copy link
Contributor

While using go-libp2p-kad-dht for peer discovery it works like a charm, using the go-libp2p/p2p/discovery/mdns discovery mechanism, issues as described above still appear: peers have connections but querying the pubsub subsystem for peers subscribed to a topic shows bizarre results: out of 3 connected peers, one peer "sees" 2 other peers but the other 2 peers see only one subscription to that particular topic. This fact results in having "lonely peers" that do no get messages from other peers. It happened to me a couple of times, this failure does not appear every time I run my tests.

@daviddias daviddias added the kind/bug A bug in existing code (including security flaws) label Mar 25, 2020
@vyzo
Copy link
Collaborator

vyzo commented Mar 25, 2020

fixed by #132.

@vyzo vyzo closed this as completed Mar 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws)
Projects
None yet
8 participants