-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add async dmu support #10377
add async dmu support #10377
Conversation
Codecov Report
@@ Coverage Diff @@
## master #10377 +/- ##
==========================================
- Coverage 76.82% 76.67% -0.16%
==========================================
Files 400 400
Lines 128040 129298 +1258
==========================================
+ Hits 98366 99133 +767
- Misses 29674 30165 +491
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
4a6a99e
to
7fbf8ae
Compare
I'm getting plenty of these quite easily with this branch (7fbf8ae):
|
@adamdmoss Yup. It appears that in the refactoring that I lost a completion path. It shows up on all platforms in the CI so should be easy to reproduce. |
7fbf8ae
to
151e9b3
Compare
I moved the dbuf completion callback execution out from underneath the dbuf lock to avoid holding it across copyout and potentially recursing. This created a race because the db_changed cv would get signalled before the refcount was dropped. I added a cv to indicate buf set completion to plug the race where it was hit. |
151e9b3
to
3f85a69
Compare
Cheers, definitely looking much better @ 3f85a69 |
Curiously this appears to have significantly raised latency on l2arc hits in my usual bevvy of casual tests. I'll verify this properly later. |
Confirmed, alternating with/without this patchset over 8 reboots I see this strong pattern (l2arc is not a significant factor): Wallclock times without patchset ( Wallclock times with patchset ( A = reading 15GB set of files from NVME, not backed by l2arc (FYI it's a set of 523 files with a variety of sizes, dominated by ~12GB of files in the 1-3GB range. The files are spread across 65 directories. The files are read with a script which does Curiously, this branch's read bandwidth according to Dumb speculation:
Let me know if I can provide further info. |
@adamdmoss My last bug fix introduced a second |
3f85a69
to
bcec065
Compare
@adamdmoss bcec065 limits us to one sleep / wakeup per synchronous read. I see a 1:17 reduction in run time for the |
Never mind. I see this even with a 20G pool in virtualbox. I have ZoF and FreeBSD git checkouts on it using 3.1G:
I did
|
master: milliseconds off-cpu
async_dmu_baseline:
|
c1f2c61
to
5d29fed
Compare
I lost the implied prefetch in VFS ops with the API flag conversion.
|
5d29fed
to
04f6220
Compare
Great - I'll give the latest version a try-out. |
Confirmed - read performance is back within normal ranges. Thanks for investigating! |
FYI module/os/freebsd/zfs/zvol_os.c has considerable conflicts with master now - I couldn't resolve them with shallow knowledge. :) |
In zfsonfreebsd:testing/async_dmu_baseline I've rebased/resolved that file. |
Thanks Ryan. When merging that version to master, it appears conflict-free and it builds okay, but it creates a DKMS module which doesn't build okay:
|
be33c31
to
262ffb2
Compare
Verified that |
262ffb2
to
7e92fb0
Compare
Async dmu adds reference counting for dbufs, with the completion routine being called when the count goes to zero. CFA refactor This is the fix: |
So to clarify, this should be safe to use against 2.0.1? |
65b7360
to
affe4bd
Compare
Should, yes. |
8076f48
to
8813d4d
Compare
- Integrate with zvol and vnops - Add asynchronous rangelock acquisition. o Make rangelock acquisition strictly FCFS while eliminating potential recursion when executing acquisition callbacks o Add debug support for dumping held rangelocks. - Add async read write to ztest - add taskq ctor and dtor callbacks - Add arc_watch support to FreeBSD kmod. Initially based on old code from SpectraLogic. Co-authored-by: Will Andrews <[email protected]> Co-authored-by: Matt Macy <[email protected]> Signed-off-by: Matt Macy <[email protected]>
8813d4d
to
fbf26c2
Compare
FWIW I've been using this PR (again!) relative to master for the last few weeks day-to-day on my personal/dev linux box. That's not exactly a strenuous test but it appears to have behaved well so far. |
@mattmacy - with all the force-pushes, this has become a little bit confusing to merge back atop the 2.0.2 tag. Any chance you have the whole commit history in a branch somewhere so i could rebase into our 2.0.2 branch? |
@sempervictus - I'm sorry, no I haven't. I've just been folding in changes as I fix issues exposed in the CI. It's very difficult to maintain any amount of history when regularly rebasing against master. |
I've been using this PR for yet another month, though it's becoming increasingly laborious to reconcile with master. Still seems good, but the PR is still marked as Work-In-Progress; is there really more to do or is it ready for review? |
Closing at the request of the author. |
@freqlabs: is this superseded or are there issues/concerns with the implementation? |
@sempervictus neither. @mattmacy has taken a position at a different company and is no longer working on ZFS or FreeBSD. We hope someone will pick up this work and move it across the finish line. |
That presupposes that it's remotely clear what its state was and what it'd take to move it across the finish line; goodness knows I've asked. |
I think it did ship in Truenas, but there were bugs. |
If anyone here can give me a quote to wrap this and get it into a current stable tag (or at least compatible with it once its in master), i'm willing to consider paying a qualified community member to complete the work... ZFS performance issues are taking it out of the mainstream. |
@behlendorf: this looks like this useful functionality about to be lost in the churn. Is there any way to pin this as "useful but currently unmaintained" to avoid losing all of the work and tag it for pickup in the PR queue by contributors? |
In fact that's a nice idea - either pin closed PR's or may create a new project or pinned discussion and link some closed/not merged PR's. That way its easier for people/dev's to pickup some unfinished work. For e.g.: PR #10308, PR #10943, PR #10377, PR #5095, etc. (just to note some PRs.. I got a bigger, complete list) |
It would be nice to more easily track these PR which haven't been merged but we do want to see completed. For the moment I've tagged it as "Status: Inactive" which we've used before for this kind of thing. We could certainly either create a new tag or project if that's preferable and we have a concise list of these kinds of tasks. Or even just curate what's currently tagged as inactive a bit more. |
Trying to revive this in my branch atop 2.1-rc5 and not looking too great. Already master has gone quite far ahead and there's per-commit digging required to see how some macros and constants changed. Plus it looks like there's new taskq semantics in the dmu. |
Rebase 2.1 rc6 atop fbf26c2 (openzfs#10377), including updates for: 668115f98f1 e330514ad08 ece24c1 The rebase was executed skipping the following commits to permit testing while requesting assistance from appropriate contributors: 64e0fe1 - ping @amotin for assistance e439ee8 - ping @behlendorf for assistance 336bb3662b - ping @amotin for assistance Testing: Built into 5.10.41-grsec (with grsec 2.1 ZFS patch applied). Zloop execution for 4h with no crashes. FIO and bonnie++ tests in a VM against zvol over a loopback file inside a qcow2 atop a zpool (on 2.1 without this) on an nvme drive. FIO runs atop 3 ~1GB/s ceph pool's RBDs in a raidz as an 8k block size ZVOL.
Rebase 2.1 rc6 atop fbf26c2 (openzfs#10377), including updates for: 668115f98f1 e330514ad08 ece24c1 The rebase was executed skipping the following commits to permit testing while requesting assistance from appropriate contributors: 64e0fe1 - ping @amotin for assistance e439ee8 - ping @behlendorf for assistance 336bb3662b - ping @amotin for assistance DO NOT MERGE THIS - IT IS A DIFF OF A REBASE WHICH HAS SKIPPED COMMITS, the commits above *MUST* be resolved before this can be applied to a current branch. Testing: Built into 5.10.41-grsec (with grsec 2.1 ZFS patch applied). Zloop execution for 4h with no crashes. FIO and bonnie++ tests in a VM against zvol over a loopback file inside a qcow2 atop a zpool (on 2.1 without this) on an nvme drive. FIO runs atop 3 ~1GB/s ceph pool's RBDs in a raidz as an 8k block size ZVOL.
This supersedes #10303 and #10317
Exposed to user space by way of zvol and vnops. Can be exercised by way of
aio
on FreeBSD andio_uring
on Linux.The only user visible change for existing code is the removal of
DMU_READ_NO_PREFETCH
and the replacement of
DMU_READ_PREFETCH
andDMU_READ_NO_DECRYPT
There are two new data structures:
dmu_ctx_t
maintains dmu context during operationsdmu_buf_set_t
maintains references to dbufs and is used for dbuf read completions.For most purposes the new functions of interest are:
Two helper functions have been added to simplify the most straightforward case.
These two functions are used to exercise the async completion logic in ztest.
Read and write in zvol on both FreeBSD and Linux no longer require a dedicated thread for async semantics. This could be later extended to support delete.
Platforms with VFS interfaces that accept completion routines can use the new interfaces to eliminate the need for a dedicated kernel thread to support posix AIO on ZFS.
Some further implementation details:
Internally,
dmu_buf_set_t
is used for managing state and tracking completions. Thedbs_holds
field is used as a count of pending read operations that need to be completed before the data transfer callback can be executed. The completions are handled bydmu_buf_set_rele
whendbs_holds
goes to zero it calls the data transfer callback. The initial set up for this occurs indbuf_hold_impl
. Whendbuf_hold_impl
is called indmu_buf_set_setup_buffers
, we now also pass it admu_buf_set_t
. If the dbuf in question is notDB_CACHED
it will add it to a list attached to the dbuf fordmu_buf_set_rele
to be called at read completion time, otherwise it release the hold immediately.This change further extends the SpectraLogic work by generalizing the dbuf_read completion callbacks and deferring their processing until the lock has been dropped to avoid cases of recursive deadlocks.
NB:
dmu_ctx_setup_tx
still issues synchronous reads and blocks on the write throttle. So only reads are truly asynchronous.2020.06.09
An async variant of
zil_commit
zil_commit_async
allows the async paths in zvol to avoid block on sync.A non-blocking variant of
zfs_rangelock_enter
zfs_rangelock_tryenter
allows async code paths in zvol to avoid blocking on rangelock acquisition. The callback passed will resume the operation when the range becomes available. To avoid asynchronous waiters starving synchronous waiters or vice versa, all deferred handling has been consolidated in to a single list, guaranteeing FCFS behavior. All waiters acquire their desired locked range before being signaled or having their callbacks executed (wakeups are simply handled as a callback on a stack local kcondvar).Co-authored-by: Will Andrews [email protected]
Co-authored-by: Matt Macy [email protected]
Signed-off-by: Matt Macy [email protected]
Motivation and Context
Description
How Has This Been Tested?
Types of changes
Checklist:
Signed-off-by
.