Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

investigate low performance and CPU hotspot transferring 4GiB over loopback #166

Open
raulk opened this issue Mar 19, 2021 · 14 comments
Open

Comments

@raulk
Copy link
Member

raulk commented Mar 19, 2021

This is an initial observation from running the Whitenoise tests with a perfect network and no interruption rate, over the loopback interface. In other words, these are ideal conditions for a data transfer.

The total payload size that was transfered was 4GiB.

The transfer stack was go-data-transfer with graphsync, conducting a push from the sender to the receiver.

The hardware used was a MBP 16", 2,3 GHz 8-Core Intel Core i9, 16 GB 2667 MHz DDR4.

  • The import of the random 4GiB payload into the badger store took 22109ms.
  • The transfer as perceived from the client took 117000ms (almost 2 minutes).
  • The average data transfer rate is 35MB/s -- this is very poor for loopback.

Attached is the run result. IMPORTANT: this is really a gzip, but GitHub only allows uploading zips: change the extension to gzip!

c1ah45gi7qkqo2qcbicg.zip

Of special interest are the events files, which capture the subscription stream from go-data-transfer from both the sender and receiver. Very useful for debugging.

There are heap profiles and CPU profiles included too (using Testground's new profiling feature) for further digging.

So far, I found what appears to be a large hotspot on the receiving side. Addressing this hotspot could allow the receiver to ACK data sooner, and thus speed up the data flow.

The hotspot seems to be in the "CID lists" feature, which I'm not familiar with.

profile001

NOTE: If you run the test, make sure you use the feat/tmp-dir branch of Testground, as Whitenoise depends on the new temporary directory feature. This is also noted in the Whitenoise README.

@dirkmc
Copy link
Contributor

dirkmc commented Mar 22, 2021

When go-data-transfer receives a block, it adds the CID to a list.
If the data transfer is restarted, eg because the connection went down, the receiving side sends a list of all CIDs that it has received to the sender, so that the sender can skip over these CID and only resend missing CIDs to the receiver.

I'm not sure why this is showing up in the profile. Are you running multiple tests, with a new datastore each time? The CID list is kept on disk, so it may be that each time you do a test you're writing more and more CIDs to the list, and each time you start the run you're reading the CIDs from all previous tests from the file.

Thinking about CID lists we may want to consider storing the list as a bloom filter. Then if there's a restart during the transfer, we wouldn't need to send all the individual CIDs as a list, we could just send the bloom filter.

@raulk
Copy link
Member Author

raulk commented Mar 22, 2021

This is just a single test, nothing fancy. Things should be configured the way we set them up in Lotus. Could you try running the basic composition in the whitenoise tests and see if you can reproduce?

@raulk
Copy link
Member Author

raulk commented Mar 22, 2021

Yeah, and sending all CIDs scales very poorly. We should never do that!

@hannahhoward
Copy link
Collaborator

hannahhoward commented Jun 2, 2021

Hi @raulk -- you're decoding the list of received CIDs every time you get a transfer event. For now, can you remove this line and run again? https://github.com/raulk/whitenoise/blob/master/testplan/main.go#L128. I think our hotspot here may not reflect our regular use. I agree the cid lists need attention.

@hannahhoward
Copy link
Collaborator

hannahhoward commented Jun 2, 2021

that's a file i/o decode and read on every event

@hannahhoward
Copy link
Collaborator

btw, alternatives to storing CID lists could be as simple as skip N blocks if were just trying to resume.

@dirkmc
Copy link
Contributor

dirkmc commented Jun 4, 2021

raulk/whitenoise#1 updates whitenoise to point at #217 which eliminates the received CIDs bottleneck.

It seems that the transfer is still relatively slow on my machine:
1m41s to transfer 4GB: 40MB/s

My machine specs:

  • MacBook Pro (13-inch, 2019, Four Thunderbolt 3 ports)
  • 2.8 GHz Intel Core i7
  • 16 GB 2133 MHz LPDDR3

The bottleneck is no longer coming from reading received CIDs. It seems likely it's from hashing.

Next steps:

  • Update to graphsync 0.8
  • Profile and compare

CPU profile:
pprof001
prof-rcvd-cids-fix.zip

@aarshkshah1992
Copy link
Collaborator

@dirkmc I re-ran your branch on my machine and the hashing bottle-neck dosen't really show up on the sender or the reciever.

However, I will re-run this with GS v0.8.0 tomorrow to see if it makes a meaningful difference.
gstest.zip

@dirkmc
Copy link
Contributor

dirkmc commented Jun 8, 2021

@aarshkshah1992 your suspicions seem to be correct, updating to graphsync v0.8.0 doesn't seem to make much difference. I still get a transfer speed of about 40 MB/s. The CPU profile also looks similar:

graphsync-v0 8

badger-gs-v0.8.zip

@dirkmc
Copy link
Contributor

dirkmc commented Jun 8, 2021

Still with graphsync v0.8 but now with an in-memory datastore:

  • The transfer is significantly faster: 4GB takes 45s: 95MB/s
  • The major bottleneck seems to be hashing:

pprof-in-mem-gs-v0 8

in-mem-gsv0.8.zip

@dirkmc
Copy link
Contributor

dirkmc commented Jun 10, 2021

Results for graphsync v0.7 with an in-memory datastore:

The transfer time is comparable to graphsync v0.8: 4GB takes 45s: 95MB/s
The CPU profile also looks similar:

gs-v0 7-cpu-pprof

gs-0.7-inmem-prof.zip

@hannahhoward
Copy link
Collaborator

hannahhoward commented Jun 15, 2021

So folks, I'm just going to leave my results here.

Here are my three runs. There were run on a Mac Minin M1 with 16GB of RAM & SSD. I closed most of my apps before running. I used Badger only.

Right off @raulk's master branch, running v0.8.0 (v0.8.1-rc), with batcher. Total time: 8.005984208s for 4GB. Comes in at about 500MB/S. Full testground output:
c34ihmimvdvm12r7rd2g.zip

Dirks v0.7.0 branch, using Badger. Total time: 23.006475083s for 4GB. Comes in at about 173MB/S. Full testground output: :

c34ii6amvdvm12r7rd30.zip

My own branch in PR raulk/whitenoise#4, with Badger. Total time: 8.004791959s, again 4GB transferred. Again at 500MB/S. Actually makes no difference, but resolves the CPU bottleneck

c34ig0qmvdvm12r7rd10.zip

So, both 0.8.1 (unmodified, master branch of this repo) and my branch with SHA fix show a near three fold increase on my machine.

The big difference between versions appeared after I closed all the apps on my machine -- I suspect even with badger we are consuming a lot of memory and going to swap space
Running with memory stores on both client and provider saturates my available RAM -- this makes things run much slower for me

The only thing I can offer is if you are running locally close your other apps. This made a huge difference for me.

Happy to post heap profile images thought hey are in the ZIP files.

@dirkmc
Copy link
Contributor

dirkmc commented Jun 16, 2021

To try to remove any noise from other processes from running on a laptop I ran the profiles on a cloud machine:
AWS c5d.12xlarge: 48vCPU / 96 GB RAM

I ran

I've attached

@hannahhoward
Copy link
Collaborator

Some thoughts on the discrepancy between your linux node and my node:

  • the CPU profiles for go-graphsync on the Sender side in your last comment for v0.8 show hashing about around 50% of v0.7 for both memory and badger.
  • I suspect if you were to run with the branch in Use custom graphsync with PR outstanding raulk/whitenoise#4 you would find the same for the receiver.
  • For whatever reason, this makes a difference on M1 but not Intel?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants