Improve performance of data onboarding #888

Merged
merged 23 commits into main from faster-online-add
Apr 3, 2025

Conversation

@gammazero (Contributor) commented Mar 19, 2025

Adding a large directory of data to IPFS can take a long time. A significant amount of that time is spent writing to and reading from the persisted provider queue, depending on the underlying datastore. This time can be greatly reduced by writing queued items to the datastore in batches. In addition to batched writes, queued items can be dequeued directly from the input buffer when there are none in the datastore. This allows in-memory-only operation as long as the number of queued items stays below the batch size and the queue is not idle for too long.
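
A minimal sketch of the batched-write idea, assuming a hypothetical queue type backed by go-datastore's Batching interface; batchSize, the key scheme, and the field names are illustrative rather than the PR's actual identifiers, and locking is omitted:

```go
package queue

import (
	"context"
	"fmt"

	"github.com/ipfs/go-cid"
	"github.com/ipfs/go-datastore"
)

// batchSize is illustrative; the real queue picks its own threshold.
const batchSize = 16 * 1024

type queue struct {
	ds      datastore.Batching // persisted provider queue
	buf     []cid.Cid          // in-memory input buffer
	nextKey uint64             // monotonic counter so persisted entries keep FIFO order
}

// enqueue appends to the in-memory buffer and only touches the
// datastore once enough items have accumulated.
func (q *queue) enqueue(ctx context.Context, c cid.Cid) error {
	q.buf = append(q.buf, c)
	if len(q.buf) < batchSize {
		return nil // stay in memory; no datastore write yet
	}
	return q.flushBatch(ctx)
}

// flushBatch writes every buffered CID to the datastore in a single
// batch commit instead of issuing one Put per item.
func (q *queue) flushBatch(ctx context.Context) error {
	if len(q.buf) == 0 {
		return nil
	}
	b, err := q.ds.Batch(ctx)
	if err != nil {
		return err
	}
	for _, c := range q.buf {
		key := datastore.NewKey(fmt.Sprintf("%020d", q.nextKey))
		q.nextKey++
		if err := b.Put(ctx, key, c.Bytes()); err != nil {
			return err
		}
	}
	if err := b.Commit(ctx); err != nil {
		return err
	}
	q.buf = q.buf[:0]
	return nil
}
```

One Commit per batchSize items replaces batchSize individual writes, which is where most of the saved time comes from on slower datastores.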

kubo PR: ipfs/kubo#10758
part of: https://github.com/ipshipyard/roadmaps/issues/6 / https://github.com/ipshipyard/roadmaps/issues/7

Time Comparison

Using a freshly initialized kubo repository each time. All files in the test data were 1 MiB and filled with random data.

Before PR (No Buffer)

> time kubo/cmd/ipfs/ipfs add -r /tmp/testfiles-100M > /dev/null
 100.00 MiB / 100.00 MiB [=====================================================================] 100.00%
real	0m6.464s

> time kubo/cmd/ipfs/ipfs add -r /tmp/testfiles-1G > /dev/null
 1000.00 MiB / 1000.00 MiB [===================================================================] 100.00%
real	1m10.542s

> time kubo/cmd/ipfs/ipfs add -r /tmp/testfiles-10G > /dev/null
 10.00 GiB / 10.00 GiB [=======================================================================] 100.00%
real	24m5.744s

After PR (With Buffer)

> time kubo/cmd/ipfs/ipfs add -r /tmp/testfiles-100M > /dev/null
 100.00 MiB / 100.00 MiB [=====================================================================] 100.00%
real	0m0.323s

> time kubo/cmd/ipfs/ipfs add -r /tmp/testfiles-1G > /dev/null
 1.00 GiB / 1.00 GiB [=========================================================================] 100.00%
real	0m2.721s

> time kubo/cmd/ipfs/ipfs add -r /tmp/testfiles-10G > /dev/null
 10.00 GiB / 10.00 GiB [=======================================================================] 100.00%
real	0m35.042s

Summary

Data | Before PR | After PR
-----|-----------|---------
100M | 6.5s      | 0.32s
1G   | 70.5s     | 2.7s
10G  | 1446s     | 35s

@gammazero gammazero requested a review from a team as a code owner March 19, 2025 12:35
@gammazero gammazero force-pushed the faster-online-add branch 2 times, most recently from 7c2bfb9 to e107f99 on March 19, 2025 17:59
@hsanjuan (Contributor) left a comment

I'm lukewarm about the approach:

  • We should probably batch datastore writes when we have a Batching datastore and see if that solves most of the problem: batching improves write performance while still acting as buffering and introducing back-pressure, like the approach here.
  • I would log a warning message on Close saying "We are storing x CIDs before shutting down, waiting for 5 seconds at most". That helps surface that things are not going well on that front, even if the flush does finish before 5 seconds (see the sketch after this list).
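
A rough sketch of that Close-time behavior, continuing the hypothetical queue from the earlier sketch; the standard-library log and time packages stand in for the project's own logger, and the 5-second bound is only illustrative:

```go
// Close persists whatever is still buffered in memory before shutdown,
// warning first so a slow datastore is visible in the logs.
func (q *queue) Close() error {
	if n := len(q.buf); n > 0 {
		log.Printf("provider queue: persisting %d CIDs before shutdown, waiting at most 5s", n)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return q.flushBatch(ctx)
}
```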

@gammazero gammazero requested a review from hsanjuan March 25, 2025 09:46
@gammazero gammazero changed the title from "Improve the perceived performance of data onboarding" to "Improve performance of data onboarding" Mar 25, 2025
@hsanjuan (Contributor) left a comment

I'm good with the approach. My main comment is to batch-commit and batch-delete, which can probably make a difference with large queue capacity.
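
The batch-delete half of that suggestion could look roughly like this, continuing the same hypothetical queue; the keys slice of already-dequeued entries is an assumption made for the example:

```go
// deleteBatch removes dequeued entries from the datastore in one batch
// rather than issuing one Delete call per entry.
func (q *queue) deleteBatch(ctx context.Context, keys []datastore.Key) error {
	b, err := q.ds.Batch(ctx)
	if err != nil {
		return err
	}
	for _, k := range keys {
		if err := b.Delete(ctx, k); err != nil {
			return err
		}
	}
	return b.Commit(ctx)
}
```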

@gammazero gammazero marked this pull request as draft March 26, 2025 08:18
@gammazero gammazero marked this pull request as ready for review March 27, 2025 18:25
@gammazero gammazero requested a review from hsanjuan March 27, 2025 20:33
@hsanjuan (Contributor) left a comment

Complexity has increased a bit!

@hsanjuan (Contributor) left a comment

Additional comment:

  • We enqueue 1 element
  • We write to batch (but not commit yet)
  • getQueueHead() returns nothing because nothing has been committed yet (uncommitted batch)

This means we need to wait for the batchCommitInterval to trigger a commit, so if we add fewer than 16*1024 CIDs, we will only start providing after 5 seconds, right?

This is OK, I think, just making sure I understand the flow.
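
For illustration of the flow being described here (the design under review at this point, not the final one), a minimal sketch, assuming the hypothetical queue above gained a pending datastore.Batch field and a batchCommitInterval constant; locking between this loop and enqueue is omitted:

```go
// commitLoop periodically commits the pending write batch. Until Commit
// runs, freshly enqueued CIDs are not readable from the datastore, so a
// small add only becomes visible to getQueueHead() once the interval fires.
func (q *queue) commitLoop(ctx context.Context) {
	ticker := time.NewTicker(batchCommitInterval) // e.g. 5 * time.Second
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if q.pending == nil {
				continue // nothing enqueued since the last commit
			}
			if err := q.pending.Commit(ctx); err != nil {
				log.Printf("provider queue: committing batch: %v", err)
			}
			q.pending = nil // the next enqueue opens a fresh batch
		case <-ctx.Done():
			return
		}
	}
}
```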

@gammazero gammazero requested a review from hsanjuan March 28, 2025 15:09
@gammazero (Contributor, Author) commented Apr 2, 2025

Revised with improved solution:

Input to the queue is buffered in memory. The contents of the buffer are written to the datastore when the input buffer contains batchSize items, or when the queue has been idle (no input or output) for long enough. CIDs to dequeue are read, in order, from the input buffer when there are none in the datastore; otherwise they are read from the datastore.

Reading queued items from the input buffer means items can be dequeued without first being written to the datastore. CIDs can thus remain in memory, since the queue stays busy and reading items from the input buffer keeps its size down. When the queue is closed, any items remaining in memory are written to the datastore.

This gives us in-memory behavior until there are too many CIDs in memory, or until items have been sitting idle in memory too long.
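
A sketch of that dequeue path, again on the hypothetical queue from the first sketch; the getQueueHead signature and the (cid, ok, error) return convention are assumptions, not the PR's actual API:

```go
// dequeue returns the next CID, preferring entries already persisted to
// the datastore (they are older) and otherwise serving straight from the
// in-memory input buffer. The bool is false when the queue is empty.
func (q *queue) dequeue(ctx context.Context) (cid.Cid, bool, error) {
	// Persisted entries first, so previously flushed CIDs keep their order.
	if c, ok, err := q.getQueueHead(ctx); err != nil || ok {
		return c, ok, err
	}
	// Datastore is empty: read from memory. This is what keeps the queue
	// fully in-memory while it stays busy and under batchSize items.
	if len(q.buf) > 0 {
		c := q.buf[0]
		q.buf = q.buf[1:]
		return c, true, nil
	}
	return cid.Undef, false, nil
}
```

On Close, anything still in buf would be flushed with flushBatch, matching the "remaining items are written to the datastore" behavior described above.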

Adding a large directory of data to IPFS can take a long time. A significant amount of that time is spent writing to and reading from the persisted provider queue, depending on the underlying datastore. For slower datastores, buffering input to the persisted queue in memory allows data to be ingested much more quickly. Overall performance also improves because writes are not blocked (up to a point) while the queue persists data to the datastore.
@gammazero gammazero force-pushed the faster-online-add branch from cc75b92 to 44c013d on April 2, 2025 22:21
@gammazero gammazero merged commit 79294e7 into main Apr 3, 2025
20 of 22 checks passed
@gammazero gammazero deleted the faster-online-add branch April 3, 2025 01:38