
CRITICAL: IPFS Companion exposes issue where Slate Gateway URLs are not resolving when the IPFS Companion Extension is enabled. #342

Closed
jimmylee opened this issue Sep 29, 2020 · 28 comments
Assignees
Labels
Bug Something we want to fix.

Comments

@jimmylee
Contributor

jimmylee commented Sep 29, 2020

This issue was reported by @momack2. She reported that with the IPFS Companion extension enabled on Google Chrome, you could not view any of the image assets on https://slate.host.

The original screenshot in the report included this:

[screenshot from the original report]

  • I was able to deduce that the issue is not Metamask related.
  • a1.jpg is a reference to an old default avatar we had that we have since removed.

Reproduction

I was able to reproduce this bug by using Google Chrome with the IPFS Companion extension.

Upon further investigation I was able to deduce that:

  • Slate is still functional.
  • All IPFS gateway URLs fail to resolve.

You don't need to be running https://slate.host to reproduce this issue; you can just try to visit these URLs:

@jimmylee jimmylee added the Bug Something we want to fix. label Sep 29, 2020
@jimmylee jimmylee self-assigned this Sep 29, 2020
@sanderpick

Hey @jimmylee - My guess is that IPFS Companion naively parses outbound GET requests, snagging any that contain /ipfs.

@jimmylee
Contributor Author

Thanks @sanderpick, pinging @lidel here in case that helps with the diagnosis

@olizilla

olizilla commented Sep 29, 2020

IPFS Companion will redirect URLs that are valid IPFS addresses to the local IPFS daemon. That's the core feature. Are the images announced to the DHT? Are there any discoverable provider records?
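
(For reference, provider records can be checked from any machine with go-ipfs installed. A rough sketch, using one of the failing CIDs mentioned later in this thread:)

# ask the DHT which peers claim to provide this CID
$ ipfs dht findprovs bafkreibp4qw5qq3bzgx5fbcz3bvznyc2xyjeevn3hhbjav35dl5fy7ew54
# no output after a minute or two suggests no provider records are discoverable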

@lidel

lidel commented Sep 29, 2020

I played with Slate a bit and it works fine with Companion and my local go-ipfs 0.7.0. 💚 If I upload a well-known file, the content is read from IPFS instead of slate.textile.io/ipfs/.. and I get content integrity guarantees for free:

[screenshot: 2020-09-29--21-26-27]

💔 Uploading unique stuff that's not on the IPFS network already just hangs.

IIUC all this is expected behavior: you have IPFS Companion enabled, and by default it will load content from your local IPFS node. If the content is not there yet, it may take time for your node to find it and fetch it.

So.. it sounds like the problem you are experiencing is slow content discovery when using a local IPFS node?
@jimmylee what type of local node do you have? Are you on a slow network?
@sanderpick is Textile's gateway running go-ipfs 0.7.0? Are you providing records to the network?

@jimmylee
Contributor Author

(1) @momack2 was the first to report this issue; hopefully she can provide more details if her bandwidth permits.

(2) I'm on Wi-Fi with this speed:

[screenshot: Wi-Fi speed test]

Tagging @carsonfarmer so he might be able to provide details if bandwidth permits.

@jimmylee
Contributor Author

Node Type:

[screenshots: IPFS Companion node type settings]

I am using the out-of-the-box config for IPFS Companion. No custom settings, @lidel.

@lidel

lidel commented Sep 29, 2020

Ok, after a longer look, I believe this is not a problem with Companion, but with the discoverability of data on Textile's nodes.

Not only is my local node unable to find the CIDs you provided in the first comment:

$ ipfs refs -r bafkreibp4qw5qq3bzgx5fbcz3bvznyc2xyjeevn3hhbjav35dl5fy7ew54

But none of the public gateways can find them either, for example:

They hang forever.

To confirm it's a content discovery issue, you can download this file and import it to Slate – you will see it works fine, even with Companion and a local node.

@olizilla

@lidel an idle thought: Companion could look up a dnsaddr record for the domain when it encounters IPFS URLs. So slate.host could have a TXT record like /dnsaddr/slate.host/tcp/4001/ipfs/QmNodeyNodeNode, and while Companion redirects the requests to the local daemon, it could also try to connect to the suggested peer, to help with situations where it is difficult to publish all the provider records to the DHT.
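
(For illustration, the existing dnsaddr convention puts the multiaddr in a DNS TXT record under a _dnsaddr subdomain. A hedged sketch, reusing the hypothetical peer ID from above and a placeholder public IP:)

# hypothetical TXT record value: "dnsaddr=/ip4/<public-ip>/tcp/4001/p2p/QmNodeyNodeNode"
$ dig +short TXT _dnsaddr.slate.host
# Companion (or a user) could then connect to the suggested peer directly:
$ ipfs swarm connect /ip4/<public-ip>/tcp/4001/p2p/QmNodeyNodeNode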

@jimmylee
Contributor Author

@lidel @olizilla thank you for the next level debugging 🌸 I appreciate it!

@carsonfarmer

@lidel they appear to work perfectly fine for me? Even "fresh" Slate CIDs.

@jimmylee
Contributor Author

[screenshot: Slate images loading]

A few other images started working 👀

@sanderpick

@lidel we're still on ipfs/go-ipfs:v0.6.0. The node is announcing records, but is currently NATed... which may account for the long discovery times. We're going to attach a public IP routed to that node's swarm port.

@jimmylee jimmylee changed the title CRITICAL: IPFS Companion Prevents Slate Gateway URLs from working. CRITICAL: IPFS Companion exposes issue where Slate Gateway URLs are not resolving when the IPFS Companion Extension is enabled. Sep 29, 2020
@lidel

lidel commented Sep 29, 2020

@olizilla yes.. that would be really elegant. It won't be an easy fix, as there is no Web API for dnsaddr lookup, nor do we expose it in IPFS APIs. I created ipfs/ipfs-companion#925 to track this idea.

@carsonfarmer @jimmylee yeah, now I see them too. Looks like a really slow discovery for some reason.

If I add a unique CID to Slate, my local node is unable to find it unless I execute a preload call to https://node1.preload.ipfs.io/api/v0/refs?r=true&arg=<cid>. I suspect the preload nodes are peered with Textile, and that is how I was able to get the files onto the network.
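
(For anyone reproducing this: the preload call is just an HTTP request against the preload node's API, roughly as follows.)

# ask the preload node to fetch the whole DAG for <cid>; depending on the API version
# this may need to be sent as a POST rather than a GET
$ curl "https://node1.preload.ipfs.io/api/v0/refs?r=true&arg=<cid>"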

@sanderpick ah.. yeah, that would explain the above observation. Those nodes need to be publicly dialable for other nodes behind NAT to be able to reach them.

@sanderpick

Sounds good! It's now publicly visible (telnet 40.76.153.74 4001) but not announcing there yet. We'll schedule some downtime tomorrow morning to get that going and update the node to ipfs/go-ipfs:v0.7.0.

@sanderpick

Reporting back here. That node is now running ipfs/go-ipfs:v0.7.0 and is announcing a public IP:

Swarm announcing /ip4/40.76.153.74/tcp/4001

@lidel

lidel commented Oct 1, 2020

I've added https://slate.textile.io/ipfs/bafkreid5w43amr736etsba7jkpyqlh4tb5powe3ftssd7f5ws5dkfeekqe but my local node is unable to find it via DHT (it has been looking for over 10 minutes).

@sanderpick what is the PeerID of that machine?
I want to confirm it is dialable from behind NAT (and that ipfs dht findpeer returns the address you mentioned).

@sanderpick

PeerID: QmR69wtWUMm1TWnmuD4JqC1TWLZcc8iR2KrTenfZZbiztd.

From my local node,

⋊> ~ ipfs dht findpeer QmR69wtWUMm1TWnmuD4JqC1TWLZcc8iR2KrTenfZZbiztd
/ip4/40.76.153.74/tcp/4001
⋊> ~ ipfs swarm connect /ip4/40.76.153.74/tcp/4001/p2p/QmR69wtWUMm1TWnmuD4JqC1TWLZcc8iR2KrTenfZZbiztd
connect QmR69wtWUMm1TWnmuD4JqC1TWLZcc8iR2KrTenfZZbiztd success

But even after connecting, ipfs get /ipfs/bafkreid5w43amr736etsba7jkpyqlh4tb5powe3ftssd7f5ws5dkfeekqe hangs and hangs. More investigation needed.

@jimmylee
Contributor Author

I need to double check if this is still a problem, I'll verify and ping the necessary parties again.

@lidel

lidel commented Nov 12, 2020

@jimmylee For what it's worth, I still experience the problem with content discovery of newly added content 😿

For example, when redirects in IPFS Companion are enabled, my local IPFS node is unable to find content from:

What is concerning is that it fails to find the content even if I manually connect to the peer provided by @sanderpick:

$ ipfs swarm connect /p2p/QmR69wtWUMm1TWnmuD4JqC1TWLZcc8iR2KrTenfZZbiztd
connect QmR69wtWUMm1TWnmuD4JqC1TWLZcc8iR2KrTenfZZbiztd success

A quick fix is to disable IPFS integration for the slate.host website:

[screenshot: disabling IPFS integration for slate.host in Companion]

..but we really need to figure out the content discovery problem.
Without this working, Slate is effectively just a centralized website.

Note that loading content from a local IPFS node works fine on other websites that use content-addressed assets, such as Audius. @jessicaschilling did a case study on Audius and can put you in touch with them if you'd like to compare notes on the backend setup related to the way content is provided to the IPFS network.

@sanderpick

Hey @lidel @jimmylee sorry to have dropped the ball here. I just popped open this node's logs and saw a number of the errors shown in this issue:

2020-11-04T23:58:33.322Z	ERROR	dht	ignoring incoming dht message while not in server mode

However, the config is definitely not set up to be running in client mode. Maybe it's related to high CPU, as mentioned in the issue above.

In any case, I also noticed that "DisableNatPortMap" was set to true. Out of curiosity, I flipped that to false. After restarting the node, I can ipfs get Slate CIDs and browse the site with IPFS Companion on. This node runs in a Kubernetes pod, so maybe something special is going on with the networking that requires the NAT port map. Another possibility is that the node just needed a restart to reduce CPU... in which case this probably isn't a permanent fix.
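
(For reference, that flag lives under the Swarm section of the go-ipfs config and can be flipped with something like the following; a restart of the daemon is needed for it to take effect:)

# re-enable NAT port mapping (UPnP/NAT-PMP), then restart the daemon
$ ipfs config --json Swarm.DisableNatPortMap false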

@lidel

lidel commented Nov 13, 2020

I get mixed results. I can confirm the content routing issue seems to be gone for the links I posted (content was found fast and loads fine from my local node), however other ones (e.g. https://slate.host/bitgraves/september) still struggle to find the content (ipfs dht findprovs bafybeiensbi2qyx2fpyjhl32deplv264ewf2ceo5duypfmx6ykraf7nc3u returns nothing).

Gut feeling: perhaps it only works when you are directly connected to the node in the pod?
If your local node was connected to the node in the pod, then you were able to cache the content from my links, and then your local node started providing it to the network. This would explain why common links posted here work for me (pod + your laptop), but not the other stuff (only pod).

@sanderpick some ideas to try:

  • See if Reprovider.Strategy and Reprovider.Interval are set to all and 12h or something else (a rough sketch of the relevant commands follows after this list)
  • If you are running behind NAT
    • Try.. not doing that on the server, if possible :-)
      (not familiar with Kubernetes enough to tell if it's feasible).

    • If you need to run behind NAT,

      • Set Routing.Type to dhtclient to avoid ignoring incoming dht message while not in server mode in logs (it sounds like Kubernetes NAT causes your node to switch into client mode anyway).
      • Try to manually forward swarm ports, then try adding the public IP+port to the Addresses.Announce list to ensure a publicly dialable address is published to the DHT. This should help if your network topology is too complex for go-ipfs to infer its own publicly dialable address for some reason.
  • Did you apply server profile?
  • Mind sharing full config? Perhaps something sticks out.
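
A rough sketch of the commands behind those suggestions (hedged; exact values depend on the deployment, and the announce address below is just the public IP mentioned earlier in this thread):

# check the current reprovider settings (defaults are "all" and "12h")
$ ipfs config Reprovider.Strategy
$ ipfs config Reprovider.Interval
# run the DHT in client mode to avoid the "not in server mode" errors behind NAT
$ ipfs config Routing.Type dhtclient
# publish a publicly dialable address to the DHT
$ ipfs config --json Addresses.Announce '["/ip4/40.76.153.74/tcp/4001"]'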

@sanderpick

Gut feeling: perhaps it only works when you are directly connected to the node in the pod?

After a restart (low CPU), I was able to browse Slate pages with a fresh local node (no direct connection and no caching).

* See if `Reprovider.Strategy` and `Reprovider.Interval` are set to `all` and `12h` or something else

Yep, not changed from the default.

* If you are running behind NAT

No.

  * If you need to run behind NAT,      
    * Set [Routing.Type](https://github.com/ipfs/go-ipfs/blob/master/docs/config.md#routing) to `dhtclient` to avoid `ignoring incoming dht message while not in server mode` in logs (it sounds like Kubernetes NAT causes your node to switch into client mode anyway).

It appears that when the node comes under very high load (providing a huge number of CIDs), the connectivity suffers, and we see these ignoring incoming dht message while not in server mode errors. I'll compare results with a staging setup that has many fewer CIDs.

    * Try to [manually forward swarm ports](https://kubernetes.io/docs/tutorials/stateless-application/expose-external-ip-address/), then try [adding public IP+port to `Addresses.Announce` list](https://discuss.ipfs.io/t/how-to-add-external-ip-to-ipfs-swarm-announcing-list/4647/3?u=lidel) to ensure publicly dialable address is published to DHT. This should help if your network topology is too complex for go-ipfs to infer its own publicly dialable address for some reason.

The public IP is included in Addresses.Announce as mentioned above.

* Did you apply `server` profile?

Partially. We can't use the address filters with Kubernetes. I can manually pluck the filters that will work, but I doubt that will solve the connectivity issue. DisableNatPortMap had previously been set to true, but as mentioned above I flipped it... though again I doubt this is related.

* Mind sharing full config? Perhaps something sticks out.

Sure, https://gist.github.com/sanderpick/7bd7eb045b31f17e14a80a408e4a1b10

@sanderpick

Idea from @jsign: since this machine is only pinning buckets, would it be possible to only provide recursively pinned CIDs? That may reduce the load, and since a connection should then exist (at least intermittently), the other CIDs will be indirectly discoverable.

In any case, I think we need some config tweaking to handle huge numbers of CIDs. Any recommendations @lidel, @aschmahmann, @hsanjuan?
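
(If that direction is taken, the reprovider strategy itself is a one-line config change; for example, the roots strategy that @jsign experiments with below announces only the root CIDs of pins:)

# announce only pin roots instead of every block, to reduce provide load
$ ipfs config Reprovider.Strategy roots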

@jsign

jsign commented Dec 15, 2020

Here are some experiments testing the theory that DHT reproviding is the cause of the issue.

Using the pinned reproviding strategy:
[CPU usage graph]
Approximately 35 hours to reach the CPU limit.

Using the roots reproviding strategy:
[CPU usage graph]
Some days to roughly reach the CPU limit.

So it seems that using roots alleviates the issue, but we still hit the limit.

While still running with roots, I took a CPU pprof profile; the hot path was:
[CPU profile screenshot]
So it looks like most of the CPU usage is related to querying peers in the DHT, which might also confirm that it is related to reproviding?
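
(For reproducibility: a profile like this can be captured from the daemon's pprof endpoint on the API port, assuming the default port 5001:)

# capture a 30-second CPU profile from the go-ipfs API's pprof endpoint
$ curl -o cpu.pprof "http://127.0.0.1:5001/debug/pprof/profile?seconds=30"
# inspect the hottest functions
$ go tool pprof -top cpu.pprof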

Some extra facts about this IPFS node below.

Stats:

NumObjects: 6163728
RepoSize:   1059479615925
StorageMax: 1000000000000
RepoPath:   /data/ipfs
Version:    fs-repo@10

The number of pins with --type=recursive is 122138.
As shown in the CPU usage history, the node has a vCPU limit (are these enough resources for an IPFS node of this size?).
The config is quite default-ish (apart from the reproviding strategy changes), so it might be non-optimal; don't assume an optimized one.

@jsign

jsign commented Dec 15, 2020

@ribasushi recommended trying out a quite reasonable config change: disabling QUIC.
We did, and now we'll wait some hours; I'll report back here on how it worked out.
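
(A sketch of that change, hedged since the exact knob depends on the go-ipfs version:)

# on go-ipfs versions that have the Swarm.Transports config, QUIC can be toggled directly
$ ipfs config --json Swarm.Transports.Network.QUIC false
# on older versions, remove the /quic entries from Addresses.Swarm instead, then restart the daemon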

@momack2

momack2 commented Dec 17, 2020

@aschmahmann and @jacobheun as an FYI for the roots pinning profiles (these are gold, @jsign!)

@aschmahmann

aschmahmann commented Dec 18, 2020

@momack2 we're in touch. I'm pretty sure those profiles have very little to do with providing/advertising since we are "finding providers" in that profile.

My guess is that this has to do with Textile having many IPNS over PubSub channels and periodically querying the DHT to find out if anyone new has joined the channel. I suspect Textile is pushing this feature harder than most, and with hundreds of topics per node it is doing a lot of crawling. They're running some tests now where the periodic search for new PubSub peers is disabled, to see if that's really the issue. Once we've confirmed that, we can discuss what the options are.

Note: I don't currently have a good guess as to why switching from providing pins to roots would be any less work for this machine since I don't think they're likely to be able to reprovide even 100k pin roots within the default 12 hr period.

@kuzdogan

kuzdogan commented Jan 28, 2022

Hey @lidel @jimmylee sorry to have dropped the ball here. I just popped open this node's logs and saw a number of the errors shown in this issue:

2020-11-04T23:58:33.322Z	ERROR	dht	ignoring incoming dht message while not in server mode

However, the config is definitely not set up to be running in client mode. Maybe it's related to high CPU, as mentioned in the issue above.

In any case, I also noticed that "DisableNatPortMap" was set to true. Out of curiosity, I flipped that to false. After restarting the node, I can ipfs get Slate CIDs and browse the site with IPFS Companion on. This node runs in a Kubernetes pod, so maybe something special is going on with the networking that requires the NAT port map. Another possibility is that the node just needed a restart to reduce CPU... in which case this probably isn't a permanent fix.

We are experiencing massive resource consumption on our nodes. It seemed to increase when we started getting the error above. We tried running with the server profile via ipfs init --profile server, but it didn't help. We are also using the pinned strategy.

Anyone found a solution to this?

Our config: https://gist.github.com/kuzdogan/c1d69dabafc8286f31afc8bb988099b8
