Improve performance of data onboarding #888
Conversation
I'm lukewarm about the approach:
- We should probably batch datastore writes when we have a Batching datastore and see if that solves most of the problem, as it can improve write performance while also acting as a buffer and introducing back-pressure, like here.
- I would log a warning message on Close saying "We are storing x CIDs before shutting down, waiting for 5 seconds at most". This helps surface that things are not going well on that front, even if it does finish before 5 seconds.
I'm good with the approach. My main comment is to batch-commit and batch-delete, which can probably make a difference with large queue capacity.
Complexity has increased a bit!
Additional comment:
- We enqueue 1 element
- We write to batch (but not commit yet)
- `getQueueHead()` returns nothing because nothing was written (uncommitted batch)

This means we need to wait for the batchCommitInterval to trigger a write, so if we provide something below 16*1024 CIDs, we will only start providing after 5 seconds, right?
This is ok, I think, just making sure I understand the flow.
Revised with improved solution: Input to the queue is buffered in memory. The contents of the buffer are written to the datastore when the input buffer contains batchSize items, or when the queue has been idle (no input or output) for long enough. CIDs to dequeue are read, in order, from the input buffer when there are none in the datastore; otherwise they are read from the datastore. Reading queued items from the input buffer means items can be dequeued without first being written to the datastore. This way CIDs can remain in memory, since the queue is not idle and the input buffer size is reduced by reading items from it. When the queue is closed, any items remaining in memory are written to the datastore. This gives us in-memory behavior until there are too many CIDs in memory, or until items have been sitting idle in memory too long.
Adding a large directory of data to IPFS can take a long time. A significant amount of this time is spent writing to and reading from the persisted provider queue, depending on the underlying datastore. For slower datastores, buffering input to the persisted queue in memory allows data to be ingested much more quickly. Overall performance is somewhat increased since writes are not blocked (up to a point) while waiting for the queue to persist data to the datastore.
Adding a large directory of data to IPFS can take a long time. A significant amount of this time is spent writing to and reading from the persisted provider queue, depending on the underlying datastore. The time spent doing this can be greatly reduced by writing queued items to the datastore in batches. In addition to batched writes, queued items can be read directly from the input buffer when there are no queued items in the datastore. This allows in-memory-only operation when the number of queued items is below the batch size and the queue has not been idle for too long.
kubo PR: ipfs/kubo#10758
part of: https://github.com/ipshipyard/roadmaps/issues/6 / https://github.com/ipshipyard/roadmaps/issues/7
Time Comparison
Using a freshly initialized kubo each time. All files in the test data were 1 MiB and filled with random data.
Before PR (No Buffer)
After PR (With Buffer)
Summary