-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
investigate low performance and CPU hotspot transferring 4GiB over loopback #166
Comments
When go-data-transfer receives a block, it adds the CID to a list. I'm not sure why this is showing up in the profile. Are you running multiple tests, with a new datastore each time? The CID list is kept on disk, so it may be that each time you do a test you're writing more and more CIDs to the list, and each time you start the run you're reading the CIDs from all previous tests from the file. Thinking about CID lists we may want to consider storing the list as a bloom filter. Then if there's a restart during the transfer, we wouldn't need to send all the individual CIDs as a list, we could just send the bloom filter. |
This is just a single test, nothing fancy. Things should be configured the way we set them up in Lotus. Could you try running the basic composition in the whitenoise tests and see if you can reproduce? |
Yeah, and sending all CIDs scales very poorly. We should never do that! |
Hi @raulk -- you're decoding the list of received CIDs every time you get a transfer event. For now, can you remove this line and run again? https://github.com/raulk/whitenoise/blob/master/testplan/main.go#L128. I think our hotspot here may not reflect our regular use. I agree the cid lists need attention. |
that's a file i/o decode and read on every event |
btw, alternatives to storing CID lists could be as simple as skip N blocks if were just trying to resume. |
raulk/whitenoise#1 updates whitenoise to point at #217 which eliminates the received CIDs bottleneck. It seems that the transfer is still relatively slow on my machine: My machine specs:
The bottleneck is no longer coming from reading received CIDs. It seems likely it's from hashing. Next steps:
CPU profile: |
@dirkmc I re-ran your branch on my machine and the hashing bottle-neck dosen't really show up on the sender or the reciever. However, I will re-run this with GS v0.8.0 tomorrow to see if it makes a meaningful difference. |
@aarshkshah1992 your suspicions seem to be correct, updating to graphsync v0.8.0 doesn't seem to make much difference. I still get a transfer speed of about 40 MB/s. The CPU profile also looks similar: |
Still with graphsync v0.8 but now with an in-memory datastore:
|
Results for graphsync v0.7 with an in-memory datastore: The transfer time is comparable to graphsync v0.8: 4GB takes 45s: 95MB/s |
So folks, I'm just going to leave my results here. Here are my three runs. There were run on a Mac Minin M1 with 16GB of RAM & SSD. I closed most of my apps before running. I used Badger only. Right off @raulk's master branch, running v0.8.0 (v0.8.1-rc), with batcher. Total time: 8.005984208s for 4GB. Comes in at about 500MB/S. Full testground output: Dirks v0.7.0 branch, using Badger. Total time: 23.006475083s for 4GB. Comes in at about 173MB/S. Full testground output: : My own branch in PR raulk/whitenoise#4, with Badger. Total time: 8.004791959s, again 4GB transferred. Again at 500MB/S. Actually makes no difference, but resolves the CPU bottleneck So, both 0.8.1 (unmodified, master branch of this repo) and my branch with SHA fix show a near three fold increase on my machine. The big difference between versions appeared after I closed all the apps on my machine -- I suspect even with badger we are consuming a lot of memory and going to swap space The only thing I can offer is if you are running locally close your other apps. This made a huge difference for me. Happy to post heap profile images thought hey are in the ZIP files. |
To try to remove any noise from other processes from running on a laptop I ran the profiles on a cloud machine: I ran
I've attached
|
Some thoughts on the discrepancy between your linux node and my node:
|
This is an initial observation from running the Whitenoise tests with a perfect network and no interruption rate, over the loopback interface. In other words, these are ideal conditions for a data transfer.
The total payload size that was transfered was 4GiB.
The transfer stack was go-data-transfer with graphsync, conducting a push from the sender to the receiver.
The hardware used was a MBP 16", 2,3 GHz 8-Core Intel Core i9, 16 GB 2667 MHz DDR4.
Attached is the run result. IMPORTANT: this is really a gzip, but GitHub only allows uploading zips: change the extension to gzip!
c1ah45gi7qkqo2qcbicg.zip
Of special interest are the
events
files, which capture the subscription stream from go-data-transfer from both the sender and receiver. Very useful for debugging.There are heap profiles and CPU profiles included too (using Testground's new profiling feature) for further digging.
So far, I found what appears to be a large hotspot on the receiving side. Addressing this hotspot could allow the receiver to ACK data sooner, and thus speed up the data flow.
The hotspot seems to be in the "CID lists" feature, which I'm not familiar with.
NOTE: If you run the test, make sure you use the
feat/tmp-dir
branch of Testground, as Whitenoise depends on the new temporary directory feature. This is also noted in the Whitenoise README.The text was updated successfully, but these errors were encountered: