Duplicate Data increases with the number of nodes serving the file. #4588

Closed

natewalck opened this issue Jan 17, 2018 · 7 comments

@natewalck

Version information:

ipfs version --all
go-ipfs version: 0.4.13-
Repo version: 6
System version: amd64/linux
Golang version: go1.9.2

Type: Bug

Severity: Medium

Description:

When running ipfs get for a given file, the amount of duplicate data the test node receives grows with the number of nodes that have the file and supply it. This leads to lower performance and a massive amount of wasted bandwidth.

Test Setup:
Testing was done on EC2 medium instances that all lived on the same subnet. The bootstrap list was updated to ensure all nodes could find each other correctly.

ipfs swarm peers was used to confirm that the nodes were connected before doing ipfs get on the test file.

Before testing each file size, the ipfs daemon was stopped, the .ipfs repo was deleted, and the node was re-provisioned using ipfs init and the ipfs bootstrap add command. This ensured no data was cached and that the stats covered only the test file in question.
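
A minimal sketch of that reset, assuming the daemon is launched directly from a shell; the bootstrap multiaddr and peer ID are placeholders for Node 1's actual values:

```sh
# Stop the daemon (however it was started) and wipe the repo so nothing is left cached
killall ipfs
rm -rf ~/.ipfs

# Re-provision the node and point it at the test swarm
ipfs init
ipfs bootstrap add /ip4/10.0.0.1/tcp/4001/ipfs/QmNode1PeerIDPlaceholder   # placeholder address for Node 1
ipfs daemon &

# Confirm connectivity to the other test nodes before running ipfs get
ipfs swarm peers
```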

Test files used were as follows:
5.1GB - sintel_4k.mov - QmWntgau1qWJh7hos91e6CqEzSWfaSn7permky8A3WJEnS
1.1GB - Sintel.2010.1080p.mkv - QmUwZFGPptdF5ZG58EdozjDSXYugPsxe1MwPZFQ4vZmAsb
649MB - Sintel.2010.720p.mkv - QmcdSfr63CHZ3sJkubrozeRmT4bo2DqpD8DKPFfhNby4FB

Test files can be found here: http://download.blender.org/durian/movies/

Replication procedure:

  1. Configure a fresh ipfs node
  2. Add the test file to the node (Node 1)
  3. Configure another ipfs node (Node N)
  4. Run ipfs get HASHHERE on Node N
  5. Record output of ipfs stats bw and ipfs bitswap stat
  6. Repeat steps 3-5 until Node 6 is retrieving the test file from Nodes 1-5 (which have each done an ipfs get on the test file over the course of this testing); the commands for steps 4-5 are sketched below.
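
For concreteness, steps 4 and 5 on each new node look roughly like this (using the 1.1GB test file's hash from the list above):

```sh
# Step 4: fetch the test file (hash of the 1.1GB Sintel.2010.1080p.mkv listed above)
ipfs get QmUwZFGPptdF5ZG58EdozjDSXYugPsxe1MwPZFQ4vZmAsb

# Step 5: record bandwidth and bitswap statistics
ipfs stats bw       # TotalIn / TotalOut for the node
ipfs bitswap stat   # blocks received, duplicate blocks, duplicate data
```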

See full dataset here: https://gist.github.com/natewalck/c739b57b1e90dfe2092344f78bf7de78

The duplicate data received grew linearly with the number of nodes serving the test file. For instance, if Node 3 retrieved the test file from Nodes 1 and 2, the duplicate data could be expected to be about 100% of the file size. If Node 4 retrieved data from Nodes 1, 2 and 3, the expected duplicate data was around 200% of the file size.

iftop was used to validate that the actual traffic coming into the node matched the TotalIn value reported by ipfs stats bw.
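
The cross-check was along these lines; the interface name is an assumption about the test hosts, and 4001 is the default IPFS swarm port:

```sh
# What the node itself reports
ipfs stats bw

# Independent view of traffic on the wire (interface name will vary per host)
sudo iftop -i eth0 -f "port 4001"
```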

The chart below plots duplicate data received against the number of serving nodes for each of the test files. As you can see, the relationship is almost perfectly linear.

[Chart: duplicate data received vs. number of serving nodes, for each of the three test files]

I'm not sure if the situation is better for small files (it probably impacts the transfer of smaller files to a lesser degree due to their size), but this seems like a rather large issue for big files.

One use case for IPFS is a distributed yum/software repo. With the current bitswap/wantlist performance, it would be difficult to host RPMs and serve them out to clients in a performant fashion.

I'm not sure where to start looking to optimize this, but I wanted to investigate it and provide some data. Is it possible this is caused by a node requesting the same blocks from every node on its wantlist, receiving those blocks from all of the nodes at nearly the same time, and then requesting the next blocks from all of the nodes again, and so on?

Thanks for all the work you are doing on IPFS, it is a fantastic project! :)

@jessepeterson

I'm working with Nate on this problem as well. As a (possibly) related observation from looking into this: I still seem to get duplicated data when I change maxProvidersPerRequest from its default of 3 down to 1 and recompile. I'm probably misunderstanding something, but shouldn't that permit only one potential provider per block and thus completely rule out duplicate blocks?

@leerspace
Contributor

leerspace commented Jan 17, 2018

Related to (or duplicate of?) #3802, #3786 and #1750

edit: added #3802 to this list

@natewalck
Author

@leerspace I agree. It would be nice if these 4 issues could be summarized and squashed into one issue for each underlying improvement. It seems like this issue has been around for a bit, but might have gotten buried over time.

@kvm2116
Contributor

kvm2116 commented Mar 19, 2018

Is there any update here? Has anyone been able to fix the duplicate blocks issue?
I am building an application on top of IPFS and I am running into the duplicate blocks issue (leading to poor performance for the application).

If this issue has been fixed, which release should I download?
If not, how can I contribute?

@momack2
Contributor

momack2 commented May 29, 2020

We've made a lot of improvements to Bitswap that rolled out in the go-ipfs 0.5 release (#6782), addressing this exact "duplicate data" performance challenge, so I'm closing this issue to redirect the conversation/evaluation there. =]

@momack2 momack2 closed this as completed May 29, 2020
@christroutner

@momack2 were those improvements ever pushed to js-ipfs?

@momack2
Contributor

momack2 commented Dec 31, 2021

Not sure - @dirkmc @achingbrain - do you know? If not, is there an open issue for someone to pick this up?
