Peers not always joining a topic and/or connecting #93
Thanks for the report! The old floodsub implementation is still present and being used by IPFS, though it was changed a non-trivial amount to abstract the routing layer from the rest of the implementation, potentially adding room for regression. Could you enable debug logging in the pubsub module?
How does one "enable logging in the pubsub module"? Please advise.
@bigs I'd appreciate advice on this, so I can provide more information to narrow this down and hopefully get a fix soon.
Are you using gossipsub or floodsub? As of #95, gossipsub has quite a bit of logging around internal events, including peer discovery.
@vyzo using whatever comes with the official release, via go-ipfs-dep and js-ipfs-api. Generally though, my question is how to turn on logging for go-ipfs? I used to know, but can't remember anymore and can't find up-to-date documentation on which env vars or flags to set :)
This is still floodsub, although gossipsub is coming soon to mainline ipfs as well. You can see logs with the IPFS_LOGGING environment variable, e.g. IPFS_LOGGING=debug.
Thank you @vyzo! After figuring out how to get the debug logs in ipfsd-ctl, I was able to get them on CI and finally reproduce the failure and the behaviour described above. See here: https://circleci.com/gh/orbitdb/orbit-db/1523#queue-placeholder/containers/3. There's a pubsub error in the logs. For those ending up here looking for instructions on how to get go-ipfs logs when using it via js-ipfsd-ctl and js-ipfs-api: logging needs to be turned on in ipfsd-ctl too (this is not documented and was unintuitive for me).
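For anyone embedding go-libp2p-pubsub directly rather than driving a go-ipfs daemon, the same subsystem logs can be turned on in-process. A minimal sketch, assuming the go-log package that libp2p uses for logging (the subsystem name "pubsub" is an assumption; check which loggers your version registers):

```go
package main

import (
	logging "github.com/ipfs/go-log"
)

func main() {
	// Turn on debug output for the pubsub subsystem. If the name
	// doesn't match, logging.GetSubsystems() lists what's registered.
	if err := logging.SetLogLevel("pubsub", "debug"); err != nil {
		panic(err)
	}
}
```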
The peer discovery mechanism is based on ambient connection events, with the entry point here:
The notification is handled here:
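In code, the ambient-discovery pattern described above looks roughly like the following. This is an illustrative sketch, not the library's actual internals; it assumes the network.Notifiee interface from go-libp2p-core, and the pubsubNotifee/newPeers names are invented:

```go
package sketch

import (
	"github.com/libp2p/go-libp2p-core/network"
	"github.com/libp2p/go-libp2p-core/peer"
	ma "github.com/multiformats/go-multiaddr"
)

// pubsubNotifee forwards ambient connection events to an event loop,
// which then tries to open a pubsub stream to the new peer.
type pubsubNotifee struct {
	newPeers chan peer.ID
}

// Connected fires for every new swarm connection; each remote peer is
// treated as a potential pubsub peer.
func (n *pubsubNotifee) Connected(_ network.Network, c network.Conn) {
	select {
	case n.newPeers <- c.RemotePeer():
	default: // event loop is busy; the peer will be picked up on reconnect
	}
}

// The remaining Notifiee callbacks are no-ops in this sketch.
func (n *pubsubNotifee) Disconnected(network.Network, network.Conn)   {}
func (n *pubsubNotifee) Listen(network.Network, ma.Multiaddr)         {}
func (n *pubsubNotifee) ListenClose(network.Network, ma.Multiaddr)    {}
func (n *pubsubNotifee) OpenedStream(network.Network, network.Stream) {}
func (n *pubsubNotifee) ClosedStream(network.Network, network.Stream) {}
```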
Hello guys, we use the lib in our project and we've got the same problem with the peer announcements we get. We're trying to use the same example code to connect the peers to each other.
Hrm... @haadcode are you calling connect from both peers at the same time? That is, are you telling peer A to connect to peer B and telling peer B to connect to peer A? If you are, then I think this might be an issue with the ordering of connection notifications. cc @Stebalien
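For reference, the dial-from-both-sides pattern being asked about boils down to something like this hypothetical repro sketch using the go-libp2p Host API (the constructor signature varies across go-libp2p versions; the OrbitDB tests do the equivalent through js-ipfs-api):

```go
package main

import (
	"context"
	"fmt"

	libp2p "github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p-core/peer"
)

func main() {
	ctx := context.Background()
	hostA, _ := libp2p.New(ctx)
	hostB, _ := libp2p.New(ctx)

	infoA := peer.AddrInfo{ID: hostA.ID(), Addrs: hostA.Addrs()}
	infoB := peer.AddrInfo{ID: hostB.ID(), Addrs: hostB.Addrs()}

	// Dial from both sides at once. This is the suspected trigger:
	// each side gets a notification for the inbound connection as well
	// as its own outbound one, and may cancel the wrong stream.
	errs := make(chan error, 2)
	go func() { errs <- hostA.Connect(ctx, infoB) }()
	go func() { errs <- hostB.Connect(ctx, infoA) }()
	fmt.Println(<-errs, <-errs)
}
```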
Apologies for the delays; going to dig into this today and get back to you ASAP.
@haadcode @whyrusleeping I think that might be on the money... after spelunking on this, my guess is it's a bizarre situation where the connect notification from A dialing out to B is overridden by the notification of B dialing in to A, with the same happening on the other side. This would cause both sides to cancel the stream that THEY opened in favor of the one the other side opened. I think a solution could be to serialize the notification process at the swarm level, OR to make the floodsub logic robust enough to compensate (keep track of whether you've initialized a stream for every peer, only allowing re-initialization when there is a disconnect between connects). Thoughts @whyrusleeping? I can make this patch tonight or tomorrow.
Edit: on second thought, we maintain a list of connections to a peer for a reason (I'm guessing), so I'll shore up floodsub and see if that does it.
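The "track whether you've initialized a stream" option mentioned above could look something like this. All names here are hypothetical; it's a sketch of the guard, not the actual floodsub code:

```go
package sketch

import (
	"sync"

	"github.com/libp2p/go-libp2p-core/peer"
)

// streamTracker remembers which peers already have an outbound pubsub
// stream, so a second connect notification doesn't clobber the first.
type streamTracker struct {
	mu     sync.Mutex
	active map[peer.ID]bool
}

func newStreamTracker() *streamTracker {
	return &streamTracker{active: make(map[peer.ID]bool)}
}

// claim reports whether the caller should open a new stream to pid.
// It returns false if a stream was already initialized and not yet
// released by a disconnect.
func (t *streamTracker) claim(pid peer.ID) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.active[pid] {
		return false
	}
	t.active[pid] = true
	return true
}

// release is called on disconnect, allowing re-initialization later.
func (t *streamTracker) release(pid peer.ID) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.active, pid)
}
```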
Thank you @whyrusleeping and @bigs! I was able to make the error go away by not doing what @whyrusleeping described, i.e. by not manually connecting the peers. Wherever this gets fixed, my two cents: it should be fixed in a way that one sub-system doesn't affect the behaviour of another from the user's perspective, i.e. calling connect manually shouldn't change how pubsub behaves.
@haadcode quick update... after some discussion on our end, we've got a better fix planned. First attempt will be done today.
@haadcode another update... this is actually a bug in our muxing protocol, triggered when a TCP simultaneous open occurs between two hosts. There are a few options for fixes, but it will take some time before we settle on something to ship.
@bigs thanks for the update! Wow, didn't realize this goes so deep, but I'm glad to hear this has led to the lower levels and the bug was caught :) Is there an open issue somewhere else to follow the discussion/progress?
@haadcode with any luck, today or tomorrow I'll be sketching a fix for yamux!
Friendly bump, any updates @bigs?
@haadcode @code4trade I think the best way forward is for me to push a fork of yamux and register it with a new protocol string; then you could use this fork and avoid this issue. You can also force your transports to use mplex instead of yamux.
A fork doesn't work for us unfortunately, as we want to be able to use the latest official go-ipfs releases.
How would one force the transports to use a specific muxer? Is this a configuration flag somewhere?
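For an embedded go-libp2p node, pinning the muxer can be done with a constructor option. A sketch, assuming the libp2p.Muxer option and the go-libp2p-mplex transport (a go-ipfs daemon would need the equivalent configuration instead, and constructor signatures vary across go-libp2p versions):

```go
package main

import (
	"context"
	"fmt"

	libp2p "github.com/libp2p/go-libp2p"
	mplex "github.com/libp2p/go-libp2p-mplex"
)

func main() {
	ctx := context.Background()
	// Register mplex as the only stream muxer, so yamux (and its
	// simultaneous-open bug) is never negotiated on new connections.
	h, err := libp2p.New(ctx, libp2p.Muxer("/mplex/6.7.0", mplex.DefaultTransport))
	if err != nil {
		panic(err)
	}
	defer h.Close()
	fmt.Println("host up with mplex only:", h.ID().Pretty())
}
```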
Hey guys, can you try this again to verify that the problem still exists?
Hello @vyzo, I made a quick update of the lib and for me it works like a charm. I'll keep testing. Thanks for the update.
While using go-libp2p-kad-dht for peer discovery everything works like a charm; with the go-libp2p/p2p/discovery/mdns discovery mechanism, however, the issues described above still appear: peers have connections, but querying the pubsub subsystem for peers subscribed to a topic shows bizarre results. Out of 3 connected peers, one peer "sees" the 2 other peers, but the other 2 peers see only one subscription to that particular topic. This results in "lonely peers" that do not get messages from other peers. It has happened to me a couple of times; the failure does not appear on every test run.
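The asymmetry described above can be made visible by comparing each node's swarm-level peers against its pubsub view of the topic. A small sketch, assuming go-libp2p-pubsub's ListPeers API (checkTopicView is a hypothetical helper name):

```go
package sketch

import (
	"fmt"

	"github.com/libp2p/go-libp2p-core/host"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// checkTopicView compares a node's swarm-level peer set with the peers
// its pubsub router believes are subscribed to a topic. A "lonely peer"
// is one whose topic list is missing peers present in the swarm list.
func checkTopicView(h host.Host, ps *pubsub.PubSub, topic string) {
	swarmPeers := h.Network().Peers()
	topicPeers := ps.ListPeers(topic)
	fmt.Printf("%s: %d swarm peers, %d topic peers\n",
		h.ID().Pretty(), len(swarmPeers), len(topicPeers))
}
```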
Fixed by #132.
We've been running into some hiccups in OrbitDB recently as we've been upgrading from go-ipfs 0.4.13 to 0.4.16 (and now 0.4.17).
We've observed a behaviour where pubsub peers don't always connect to each other. Detailed information on the behaviour and what we've discovered so far is in orbitdb/orbitdb#417. To summarize: the tests are run with two local go-ipfs nodes, both subscribe to the same pubsub topic, but sometimes (about 50% of test runs) they don't seem to connect to each other at the pubsub level even though they're connected at the swarm level (see this comment: orbitdb/orbitdb#417 (comment)).
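Distilled to an in-process check, the failing scenario looks roughly like this hypothetical sketch using go-libp2p-pubsub's floodsub (the real tests drive two go-ipfs daemons via js-ipfs-api; constructor signatures vary across versions, and error handling is elided for brevity):

```go
package main

import (
	"context"
	"fmt"
	"time"

	libp2p "github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p-core/peer"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

func main() {
	ctx := context.Background()
	hostA, _ := libp2p.New(ctx)
	hostB, _ := libp2p.New(ctx)

	psA, _ := pubsub.NewFloodSub(ctx, hostA)
	psB, _ := pubsub.NewFloodSub(ctx, hostB)

	// Both sides subscribe to the same topic...
	subA, _ := psA.Subscribe("orbitdb-test")
	subB, _ := psB.Subscribe("orbitdb-test")
	defer subA.Cancel()
	defer subB.Cancel()

	// ...and connect at the swarm level.
	if err := hostA.Connect(ctx, peer.AddrInfo{ID: hostB.ID(), Addrs: hostB.Addrs()}); err != nil {
		panic(err)
	}
	time.Sleep(2 * time.Second) // allow subscriptions to propagate

	// Expected: each side lists the other as a topic peer. In the
	// failing runs, one or both of these lists comes back empty.
	fmt.Println("A sees:", psA.ListPeers("orbitdb-test"))
	fmt.Println("B sees:", psB.ListPeers("orbitdb-test"))
}
```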
Looking at the commit log in this repo, I noticed that the pubsub implementation was changed some time ago (yay, great to see the gossip pubsub coming together!), so most likely this is a regression. I believe 0.4.13 was working fine, but we haven't verified which version after that stopped working.
If anyone has an idea why this is happening and could work on a fix, it would be highly appreciated. We're (finally) adding back support for go-ipfs in OrbitDB, but the tests are not running reliably due to this bug, so we're hesitant to make a new release until it's fixed (alternatively we could instruct users to use <= 0.4.13, but that's no fun).
If there's any information I can provide to solve this, please let me know!