Improve performance of data onboarding #888

Merged
merged 23 commits into main from faster-online-add
Apr 3, 2025

Conversation

@gammazero (Contributor) commented Mar 19, 2025

Adding a large directory of data to IPFS can take a long time. A significant amount of that time is spent writing to and reading from the persisted provider queue, depending on the underlying datastore. This time can be greatly reduced by writing queued items to the datastore in batches. In addition to batched writes, queued items can be dequeued directly from the input buffer when there are none in the datastore. This allows in-memory-only operation as long as the number of queued items stays below the batch size and the queue is not idle for too long.
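
A minimal sketch of the batched-write idea, assuming a hypothetical queue type backed by go-datastore's Batching interface; batchSize, the key scheme, and the field names are illustrative rather than the PR's actual identifiers, and locking is omitted:

```go
package queue

import (
	"context"
	"fmt"

	"github.com/ipfs/go-cid"
	"github.com/ipfs/go-datastore"
)

// batchSize is illustrative; the real queue picks its own threshold.
const batchSize = 16 * 1024

type queue struct {
	ds      datastore.Batching // persisted provider queue
	buf     []cid.Cid          // in-memory input buffer
	nextKey uint64             // monotonic counter so persisted entries keep FIFO order
}

// enqueue appends to the in-memory buffer and only touches the
// datastore once enough items have accumulated.
func (q *queue) enqueue(ctx context.Context, c cid.Cid) error {
	q.buf = append(q.buf, c)
	if len(q.buf) < batchSize {
		return nil // stay in memory; no datastore write yet
	}
	return q.flushBatch(ctx)
}

// flushBatch writes every buffered CID to the datastore in a single
// batch commit instead of issuing one Put per item.
func (q *queue) flushBatch(ctx context.Context) error {
	if len(q.buf) == 0 {
		return nil
	}
	b, err := q.ds.Batch(ctx)
	if err != nil {
		return err
	}
	for _, c := range q.buf {
		key := datastore.NewKey(fmt.Sprintf("%020d", q.nextKey))
		q.nextKey++
		if err := b.Put(ctx, key, c.Bytes()); err != nil {
			return err
		}
	}
	if err := b.Commit(ctx); err != nil {
		return err
	}
	q.buf = q.buf[:0]
	return nil
}
```

One Commit per batchSize items replaces batchSize individual writes, which is where most of the saved time comes from on slower datastores.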

kubo PR: ipfs/kubo#10758
part of: https://github.com/ipshipyard/roadmaps/issues/6 / https://github.com/ipshipyard/roadmaps/issues/7

Time Comparison

Using a freshly initialized kubo repository each time. All files in the test data were 1 MiB and filled with random data.

Before PR (No Buffer)

> time kubo/cmd/ipfs/ipfs add -r /tmp/testfiles-100M > /dev/null
 100.00 MiB / 100.00 MiB [=====================================================================] 100.00%
real	0m6.464s

> time kubo/cmd/ipfs/ipfs add -r /tmp/testfiles-1G > /dev/null
 1000.00 MiB / 1000.00 MiB [===================================================================] 100.00%
real	1m10.542s

> time kubo/cmd/ipfs/ipfs add -r /tmp/testfiles-10G > /dev/null
 10.00 GiB / 10.00 GiB [=======================================================================] 100.00%
real	24m5.744s

After PR (With Buffer)

> time kubo/cmd/ipfs/ipfs add -r /tmp/testfiles-100M > /dev/null
 100.00 MiB / 100.00 MiB [=====================================================================] 100.00%
real	0m0.323s

> time kubo/cmd/ipfs/ipfs add -r /tmp/testfiles-1G > /dev/null
 1.00 GiB / 1.00 GiB [=========================================================================] 100.00%
real	0m2.721s

> time kubo/cmd/ipfs/ipfs add -r /tmp/testfiles-10G > /dev/null
 10.00 GiB / 10.00 GiB [=======================================================================] 100.00%
real	0m35.042s

Summary

Data | Before PR | After PR
-----|-----------|---------
100M | 6.5s      | 0.32s
1G   | 70.5s     | 2.7s
10G  | 1446s     | 35s

@gammazero gammazero requested a review from a team as a code owner March 19, 2025 12:35
@gammazero gammazero force-pushed the faster-online-add branch 2 times, most recently from 7c2bfb9 to e107f99 on March 19, 2025 17:59
@hsanjuan (Contributor) left a comment

I'm lukewarm about the approach:

  • We should probably batch datastore writes when we have a Batching datastore and see if that solves most of the problem: batching improves write performance while still acting as buffering and introducing back-pressure, like the approach here.
  • I would log a warning message on Close saying "We are storing x CIDs before shutting down, waiting for 5 seconds at most". That helps surface that things are not going well on that front, even if the flush does finish before 5 seconds (see the sketch after this list).
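
A rough sketch of that Close-time behavior, continuing the hypothetical queue from the earlier sketch; the standard-library log and time packages stand in for the project's own logger, and the 5-second bound is only illustrative:

```go
// Close persists whatever is still buffered in memory before shutdown,
// warning first so a slow datastore is visible in the logs.
func (q *queue) Close() error {
	if n := len(q.buf); n > 0 {
		log.Printf("provider queue: persisting %d CIDs before shutdown, waiting at most 5s", n)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return q.flushBatch(ctx)
}
```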

@gammazero gammazero requested a review from hsanjuan March 25, 2025 09:46
@gammazero gammazero changed the title from "Improve the perceived performance of data onboarding" to "Improve performance of data onboarding" Mar 25, 2025
@hsanjuan (Contributor) left a comment

I'm good with the approach. My main comment is to batch-commit and batch-delete, which can probably make a difference with large queue capacity.
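
The batch-delete half of that suggestion could look roughly like this, continuing the same hypothetical queue; the keys slice of already-dequeued entries is an assumption made for the example:

```go
// deleteBatch removes dequeued entries from the datastore in one batch
// rather than issuing one Delete call per entry.
func (q *queue) deleteBatch(ctx context.Context, keys []datastore.Key) error {
	b, err := q.ds.Batch(ctx)
	if err != nil {
		return err
	}
	for _, k := range keys {
		if err := b.Delete(ctx, k); err != nil {
			return err
		}
	}
	return b.Commit(ctx)
}
```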

@gammazero gammazero marked this pull request as draft March 26, 2025 08:18
@gammazero gammazero marked this pull request as ready for review March 27, 2025 18:25
@gammazero gammazero requested a review from hsanjuan March 27, 2025 20:33
@hsanjuan (Contributor) left a comment

Complexity has increased a bit!

@hsanjuan (Contributor) left a comment

Additional comment:

  • We enqueue 1 element
  • We write to batch (but not commit yet)
  • getQueueHead() returns nothing because nothing has been committed yet (uncommitted batch)

This means we need to wait for the batchCommitInterval to trigger a commit, so if we add fewer than 16*1024 CIDs, we will only start providing after 5 seconds, right?

This is OK, I think, just making sure I understand the flow.
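
For illustration of the flow being described here (the design under review at this point, not the final one), a minimal sketch, assuming the hypothetical queue above gained a pending datastore.Batch field and a batchCommitInterval constant; locking between this loop and enqueue is omitted:

```go
// commitLoop periodically commits the pending write batch. Until Commit
// runs, freshly enqueued CIDs are not readable from the datastore, so a
// small add only becomes visible to getQueueHead() once the interval fires.
func (q *queue) commitLoop(ctx context.Context) {
	ticker := time.NewTicker(batchCommitInterval) // e.g. 5 * time.Second
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if q.pending == nil {
				continue // nothing enqueued since the last commit
			}
			if err := q.pending.Commit(ctx); err != nil {
				log.Printf("provider queue: committing batch: %v", err)
			}
			q.pending = nil // the next enqueue opens a fresh batch
		case <-ctx.Done():
			return
		}
	}
}
```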

@gammazero gammazero requested a review from hsanjuan March 28, 2025 15:09
@gammazero (Contributor, Author) commented Apr 2, 2025

Revised with improved solution:

Input to the queue is buffered in memory. The contents of the buffer are written to the datastore when the input buffer contains batchSize items, or when the queue has been idle (no input or output) for long enough. CIDs to dequeue are read, in order, from the input buffer when there are none in the datastore; otherwise they are read from the datastore.

Reading queued items from the input buffer means items can be dequeued without first being written to the datastore. CIDs can thus remain in memory, since the queue stays busy and reading items from the input buffer keeps its size down. When the queue is closed, any items remaining in memory are written to the datastore.

This gives us in-memory behavior until there are too many CIDs in memory, or until items have been sitting idle in memory too long.
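
A sketch of that dequeue path, again on the hypothetical queue from the first sketch; the getQueueHead signature and the (cid, ok, error) return convention are assumptions, not the PR's actual API:

```go
// dequeue returns the next CID, preferring entries already persisted to
// the datastore (they are older) and otherwise serving straight from the
// in-memory input buffer. The bool is false when the queue is empty.
func (q *queue) dequeue(ctx context.Context) (cid.Cid, bool, error) {
	// Persisted entries first, so previously flushed CIDs keep their order.
	if c, ok, err := q.getQueueHead(ctx); err != nil || ok {
		return c, ok, err
	}
	// Datastore is empty: read from memory. This is what keeps the queue
	// fully in-memory while it stays busy and under batchSize items.
	if len(q.buf) > 0 {
		c := q.buf[0]
		q.buf = q.buf[1:]
		return c, true, nil
	}
	return cid.Undef, false, nil
}
```

On Close, anything still in buf would be flushed with flushBatch, matching the "remaining items are written to the datastore" behavior described above.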

Adding a large directory of data to IPFS can take a long time. A significant amount of that time is spent writing to and reading from the persisted provider queue, depending on the underlying datastore. For slower datastores, buffering input to the persisted queue in memory allows data to be ingested much more quickly. Overall performance also improves because writes are not blocked (up to a point) while the queue persists data to the datastore.
@gammazero gammazero force-pushed the faster-online-add branch from cc75b92 to 44c013d on April 2, 2025 22:21
@gammazero gammazero merged commit 79294e7 into main Apr 3, 2025
20 of 22 checks passed
@gammazero gammazero deleted the faster-online-add branch April 3, 2025 01:38