Added Mark Maybee's changes to allow for Direct IO with read/writes. … #5

bwatkinson · 2019-08-28T21:24:01Z

…I also added kernel compatibility checks for get_user_pages_unlocked.

Adding Direct IO support for Read/Writes

This is the first commit of Mark Maybee's code for implementing Direct IO for read and write in ZFS 0.8.x. Direct IO allows for the ARC to be bypassed for read and writes reducing the number of memory copies ZFS.

Motivation and Context

Currently ZFS Zpool's created using NVMe SSD's as the underlying physical VDEV's were experiencing poor performance characteristics for both asynchronous writes and synchronous reads. By adding Direct IO, we are able to bypass the ARC avoiding the overhead of memory copies that were restricting NVMe performance.

Description

This is the first go at getting Direct IO implemented. Other modifications will need to be added in order for this to eventually be pulled into master. The main idea behind the code is splicing the user buffers into abd's and mapping the user pages into kernel space.

How Has This Been Tested?

Currently testing is still ongoing to observe performance gains from using Direct IO with NVMe SSD Zpools, but verification between kernel's 3.10, 4.8, and 4.20 have been tested for compilation.

Types of changes

Bug fix (non-breaking change which fixes an issue)
[ x] New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (a change to man pages or other documentation)

Checklist:

[ x] My code follows the ZFS on Linux code style requirements.
I have updated the documentation accordingly.
[ x] I have read the contributing document.
I have added tests to cover my changes.
All new and existing tests passed.
All commit messages are properly formatted and contain Signed-off-by.

The call to txg_wait_synced in zfsvfs_teardown should be made conditional on the objset having dirty data. This can prevent unnecessary txg_wait_synced during some unmount operations. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Zuchowski <[email protected]> Closes openzfs#9115

This patch introduces an assertion that can catch pitfalls in development where there is a mismatch between the size of reads and writes between a *_phys structure and its respective in-core structure when bonus buffers are used. This debugging-aid should be complementary to the verification done by ztest in ztest_verify_dnode_bt(). A side to this patch is that we now clear out any extra bytes past a bonus buffer's new size when the buffer is shrinking. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> Closes openzfs#8348

In zfs_log_write(), we can use dmu_read_by_dnode() rather than dmu_read() thus avoiding unnecessary dnode_hold() calls. We get a 2-5% performance gain for large sequential_writes tests, >=128K writes to files with recordsize=8K. Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on VMware ESX. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Closes openzfs#9156

Even though the bug's writeup (Github issue openzfs#9136) is very detailed, we still don't know exactly how we got to that state, thus I wasn't able to reproduce the bug. That said, we can make an educated guess combining the information on filled issue with the code. From the fact that `dp_dirty_total` was 0 (which is less than `zfs_dirty_data_max`) we know that there was one thread that set it to 0 and then signaled one of the waiters of `dp_spaceavail_cv` [see `dsl_pool_dirty_delta()` which is also the only place that `dp_dirty_total` is changed]. Thus, the only logical explaination then for the bug being hit is that the waiter that just got awaken didn't go through `dsl_pool_dirty_data()`. Given that this function is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()` I can only think of two possible ways of the above scenario happening: [1] The waiter didn't call into any of the two functions - which I find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin with?). [2] The waiter did call in one of the above function but it passed 0 as the space/delta to be dirtied (or undirtied) and then the callee returned immediately (e.g both `dsl_pool_dirty_space()` and `dsl_pool_undirty_space()` return immediately when space is 0). In any case and no matter how we got there, the easy fix would be to just broadcast to all waiters whenever `dp_dirty_total` hits 0. That said and given that we've never hit this before, it would make sense to think more on why the above situation occured. Attempting to mimic what Prakash was doing in the issue filed, I created a dataset with `sync=always` and started doing contiguous writes in a file within that dataset. I observed with DTrace that even though we update the pool's dirty data accounting when we would dirty stuff, the accounting wouldn't be decremented incrementally as we were done with the ZIOs of those writes (the reason being that `dbuf_write_physdone()` isn't be called as we go through the override code paths, and thus `dsl_pool_undirty_space()` is never called). As a result we'd have to wait until we get to `dsl_pool_sync()` where we zero out all dirty data accounting for the pool and the current TXG's metadata. In addition, as Matt noted and I later verified, the same issue would arise when using dedup. In both cases (sync & dedup) we shouldn't have to wait until `dsl_pool_sync()` zeros out the accounting data. According to the comment in that part of the code, the reasons why we do the zeroing, have nothing to do with what we observe: ```` /* * We have written all of the accounted dirty data, so our * dp_space_towrite should now be zero. However, some seldom-used * code paths do not adhere to this (e.g. dbuf_undirty(), also * rounding error in dbuf_write_physdone). * Shore up the accounting of any dirtied space now. */ dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg); ```` Ideally what we want to do is to undirty in the accounting exactly what we dirty (I use the word ideally as we can still have rounding errors). This would make the behavior of the system more clear and predictable. Another interesting issue that I observed with DTrace was that we wouldn't update any of the pool's dirty data accounting whenever we would dirty and/or undirty MOS data. In addition, every time we would change the size of a dbuf through `dbuf_new_size()` we wouldn't update the accounted space dirtied in the appropriate dirty record, so when ZIOs are done we would undirty less that we dirtied from the pool's accounting point of view. For the first two issues observed (sync & dedup) this patch ensures that we still update the pool's accounting when we undirty data, regardless of the write being physical or not. For changes in the MOS, we first ensure to zero out the pool's dirty data accounting in `dsl_pool_sync()` after we synced the MOS. Then we can go ahead and enable the update of the pool's dirty data accounting wheneve we change MOS data. Another fix is that we now update the accounting explicitly for counting errors in `dbuf_write_done()`. Finally, `dbuf_new_size()` updates the accounted space of the appropriate dirty record correctly now. The problem is that we still don't know how the bug came up in the issue filled. That said the issues fixed seem to be very relevant, so instead of going with the broadcasting solution right away, I decided to leave this patch as is. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Signed-off-by: Serapheim Dimitropoulos <[email protected]> External-issue: DLPX-47285 Closes openzfs#9137

* contrib/initramfs: include /etc/default/zfs and /etc/zfs/zfs-functions At least debian needs /etc/default/zfs and /etc/zfs/zfs-functions for its initramfs. Include both in build when initramfs is configured. * contrib/initramfs: include 60-zvol.rules and zvol_id Include 60-zvol.rules and zvol_id and set udev as predependency instead of debians zdev. This makes debians additional zdev hook unneeded. * Correct initconfdir substitution for some distros Not every Linux distro is using @sysconfdir@/default but @initconfdir@ which is already determined by configure. Let's use it. * systemd: prevent possible conflict between systemd and sysvinit Systemd will not load a sysvinit service if a unit exists with the same name. This prevents conflicts between sysvinit and systemd. In ZFS there is one sysvinit service that does not have a systemd service but a target counterpart, zfs-import.target. Usually it does not make any sense to install both but it is possisble. Let's prevent any conflict by masking zfs-import.service by default. This does not harm even if init.d/zfs-import does not exist. Reviewed-by: Chris Wedgwood <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Tested-by: Alex Ingram <[email protected]> Tested-by: Dreamcat4 <[email protected]> Signed-off-by: Michael Niewöhner <[email protected]> Closes openzfs#7904 Closes openzfs#9089

On systems with large amounts of storage and high fragmentation, a huge amount of space can be used by storing metaslab range trees. Since metaslabs are only unloaded during a txg sync, and only if they have been inactive for 8 txgs, it is possible to get into a state where all of the system's memory is consumed by range trees and metaslabs, and txgs cannot sync. While ZFS knows how to evict ARC data when needed, it has no such mechanism for range tree data. This can result in boot hangs for some system configurations. First, we add the ability to unload metaslabs outside of syncing context. Second, we store a multilist of all loaded metaslabs, sorted by their selection txg, so we can quickly identify the oldest metaslabs. We use a multilist to reduce lock contention during heavy write workloads. Finally, we add logic that will unload a metaslab when we're loading a new metaslab, if we're using more than a certain fraction of the available memory on range trees. Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Sebastien Roy <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes openzfs#9128

It used to be possible for zfs receive (and other operations related to clone swap) to bypass refquotas. This can cause a number of issues, and there should be an automated test for it. Added tests for rollback and receive not overriding refquota. Reviewed-by: Pavel Zakharov <[email protected]> Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes openzfs#9139

Existing zfs initramfs script logic will attempt to set the 'noop' scheduler if it's available on the vdev block devices. Newer kernels have the similar 'none' scheduler on multiqueue devices; this change alters the initramfs script logic to also attempt to set this scheduler if it's available. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Garrett Fields <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Colm Buckley <[email protected]> Closes openzfs#9042

Uses obj-m instead, due to kernel changes. See LKML: Masahiro Yamada, Tue, 6 Aug 2019 19:03:23 +0900 Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Dominic Pearson <[email protected]> Closes openzfs#9169

There are two different deadlock scenarios, but they share a common link, which is thread 1 holding sa_lock and trying to get zap->zap_rwlock: zap_lockdir_impl+0x858/0x16c0 [zfs] zap_lockdir+0xd2/0x100 [zfs] zap_lookup_norm+0x7f/0x100 [zfs] zap_lookup+0x12/0x20 [zfs] sa_setup+0x902/0x1380 [zfs] zfsvfs_init+0x3d6/0xb20 [zfs] zfsvfs_create+0x5dd/0x900 [zfs] zfs_domount+0xa3/0xe20 [zfs] and thread 2 trying to get sa_lock, either in sa_setup: sa_setup+0x742/0x1380 [zfs] zfsvfs_init+0x3d6/0xb20 [zfs] zfsvfs_create+0x5dd/0x900 [zfs] zfs_domount+0xa3/0xe20 [zfs] or in sa_build_index: sa_build_index+0x13d/0x790 [zfs] sa_handle_get_from_db+0x368/0x500 [zfs] zfs_znode_sa_init.isra.0+0x24b/0x330 [zfs] zfs_znode_alloc+0x3da/0x1a40 [zfs] zfs_zget+0x39a/0x6e0 [zfs] zfs_root+0x101/0x160 [zfs] zfs_domount+0x91f/0xea0 [zfs] From there, there are different locking paths back to something holding zap->zap_rwlock. The deadlock scenarios involve multiple different ZFS filesystems being mounted. sa_lock is common to these scenarios, and the sa struct involved is private to a mount. Therefore, these must be referring to different sa_lock instances and these deadlocks can't occur in practice. The fix, from Brian Behlendorf, is to remove sa_lock from lockdep coverage by initializing it with MUTEX_NOLOCKDEP. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jeff Dike <[email protected]> Closes openzfs#9110

When there are many snapshots, calls to zfs_ioc_space_snaps() (e.g. from `zfs destroy -nv pool/fs@snap1%snap10000`) can be very slow, resulting in poor performance because we are holding the dp_config_rwlock the entire time, blocking spa_sync() from continuing. With around ten thousand snapshots, we've seen up to 500 seconds in this ioctl, iterating over up to 50,000,000 bpobjs, ~99% of which are the empty bpobj. By creating a fast path for zfs_ioc_space_snaps() handling of the empty_bpobj, we can achieve a ~5x performance improvement of this ioctl (when there are many snapshots, and the deadlist is mostly empty_bpobj's). Reviewed-by: Pavel Zakharov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> External-issue: DLPX-58348 Closes openzfs#8744

Automake can perform program name transformations at install time. However, arc_summary has its own name transformation taking place, which interferes with the automake transforms. The automake transforms must be taken into account in order to resolve the conflict. Signed-off-by: Ryan Moeller <[email protected]>

$fs used with the wrong sed command where should be $mntpnt instead to match a variable exported by read_mtab() The fix is mostly to reuse the sed command found in read_mtab() Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Michael Niewöhner <[email protected]> Signed-off-by: Alexey Smirnoff <[email protected]> Closes openzfs#9168

Split long lines where adding license info to dist archive. Remove extra colon from target line. Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes openzfs#9189

Fix some switch() fall-though compiler errors: abd.c:1504:9: error: this statement may fall through Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes openzfs#9170

The ancient version of blkid (v2.17.2) used in CentOS 6 will not detect the newly created pool unless it has been written to. Force a pool sync so `zpool import` will detect the newly created pool. Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#9199

When checking ZFS_IOC_* numbers, print which numbers are wrong rather than silently failing. Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes openzfs#9187

Reuse enum value ZFS_IOC_BASE for `('Z' << 8)`. This is helpful on FreeBSD where ZFS_IOC_BASE has a different value and `('Z' << 8)` is wrong. Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes openzfs#9188

Document the ZFS_DKMS_ENABLE_DEBUGINFO option in the userland configuration file, as done with the other ZFS_DKMS_* options. It has been introduced with commit e45c173 ("dkms: Enable debuginfo option to be set with zfs sysconfig file") but isn't mentioned anywhere other than the 'dkms.conf' file (generated). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Mauricio Faria de Oliveira <[email protected]> Closes openzfs#9191

Automake can perform program name transformations at install time. However, arc_summary has its own name transformation taking place, which interferes with the automake transforms. The automake transforms must be taken into account in order to resolve the conflict. Signed-off-by: Ryan Moeller <[email protected]>

The mdb_set_uint32 function requires that the values passed in be decimal. This was overlooked initially because the matching Linux function accepts both decimal and hexadecimal values. Reviewed-by: John Kennedy <[email protected]> Reviewed by: Sara Hartse <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Igor Kozhukhov <[email protected]> Closes openzfs#9125 Closes openzfs#9195

Signed-off-by: Paul Dagnelie <[email protected]>

Reviewed-by: Antonio Russo <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Yuxuan Shui <[email protected]> Closes openzfs#9174

The slog tests fail when attempting to create pools using file vdevs that already exist from previous test runs. Remove these files in the setup for the test. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes openzfs#9194

Signed-off-by: Paul Dagnelie <[email protected]>

Commit a887d65 updated the dbufstats such that escalated privileges are required. Since all tests under cli_user are run with normal privileges move this test case to a location where it will be run required privileges. Reviewed-by: John Kennedy <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Reviewed-by: Michael Niewöhner <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#9118 Closes openzfs#9196

Split the arguments for ${TEST_RUNNER} across multiple lines for clarity. Also added quotes in the message to match the invoked command. Unquoted variables in argument lists are subject to splitting. In this particular case we can't quote the variable because it is an optional argument. Use the method suggested in the description linked below, instead. The technique is to use an unquoted variable with an alternate value. https://github.com/koalaman/shellcheck/wiki/SC2086 Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Giuseppe Di Natale <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes openzfs#9212

Other than this test, zpool list -p is not well tested by any of the automated tests. Add a test for zpool list -p. Reviewed-by: Prakash Surya <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes openzfs#9134

The double-colon looked like a typo, but it's actually an obscure feature. Rules with :: may appear multiple times and are run independently of one another in the order they appear. The use of :: for distclean-local was conventional, not accidental. Add comments to indicate the intentional use of double-colon rules. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes openzfs#9210

Currently, the 'zfs rollback' code can end up deadlocked due to the way the kernel handles unreferenced inodes on a suspended fs. Essentially, the zfs_resume_fs() code path may cause zfs to spawn new threads as it reinstantiates the suspended fs's zil. When a new thread is spawned, the kernel may attempt to free memory for that thread by freeing some unreferenced inodes. If it happens to select inodes that are a a part of the suspended fs a deadlock will occur because freeing inodes requires holding the fs's z_teardown_inactive_lock which is still held from the suspend. This patch corrects this issue by adding an additional reference to all inodes that are still present when a suspend is initiated. This prevents them from being freed by the kernel for any reason. Reviewed-by: Alek Pinchuk <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tom Caputi <[email protected]> Closes openzfs#9203

mmaybee

This looks pretty reasonable, however I have a couple of comments:

I think we should leave out the "user-space" version of the interface until such time as it will actually be consumed. At the moment it is just going to be dead code.
The naming: abd_get_from_pages/abd_free_from_pages violates normal naming conventions. I think we should either embed the extra code from abd_free_from_pages in abd_put() (leveraging ABD_FLAG_DIO_PAGE) or we should rename abd_get_from_pages to abd_alloc_from_pages. I lean toward the former, as it reduces the number of interfaces.

bwatkinson · 2019-09-06T04:17:55Z

@mmaybee

I am pretty neutral about this. I am fine with removing the user space version as it is not necessary at this point. I just figured it might be easier for us to implement it than someone else having to maybe do it at a later time. @behlendorf do you have an opinion on whether to remove the user space version?
I actually agree with you on this. I am prefer the former as well. I was originally going to go this route anyways. I just need to check for the ABD_FLAG_DIO_PAGE in abd_put() so we do not run the abd_verify() function there. If we were to call verify after removing the sg_table that would lead to segfaults (I know because I saw it happen lol). It is any easy enough change though and I will go ahead and update that code and push the changes Monday.

@mmaybee and @behlendorf I think we could be ready for an official pull release against the zfsonlinux master branch once we get the idea of how do we only limit read and write for Direct IO to the PAGE_SIZE limitation. I have been really thinking about how to go about this, but it definitely is tricky. No matter what, I believe we are kinda stuck in some ways on dn->datablksz check due to checksum verifies for reads. I think a bigger issue also comes when we think about a Direct IO request (that is within the PAGE_SIZE limitation), but the offset tiggers having to pull in 2 datablksz chunks to perform the checksum verifies (AKA the request is in between both). We are limited to the data buffer size the user gives us, so we can not grab and write more data into the user buffer than what is provided. I was thinking we would have to eventually having to perform a data copy again at some point... Also, what do we do about compression? Technically the semantics of O_DIRECT should lead to no data copies; however, with compression an implicit data copy is required. I may just be overthinking the datablksz restriction and the compression as well. It is also possible that I do not have a complete understanding of the issue, so please correct me if I am wrong on either of these thoughts. I am still learning the intricacies of the ZFS software stack, so I might just be out in left field on these thoughts : ). I think till we iron out the right way to only restrict to to the PAGE_SIZE limitation, we should hold off on an official pull request on master. Thoughts or opinions on this?

mmaybee · 2019-09-06T16:53:39Z

For data services like compression and encryption, my opinion is that "Direct IO" is really just a "reduced data copy" interface. That is, we perform the normal IO pipeline processing (with the necessary data copies for the data transforms) but avoid the final "copy out" because we have a mapped buffer from the user. The only alternative I could think of would be to return the "raw" data when doing direct IO so that we return a compressed and/or encrypted block... but I don't think this is really useful.

For relaxing the alignment constraints to PAGE_SIZE: my approach here would be to use the "multi list" functionality that I added to the abd interface to allow one to combine together multiple abd buffers into a single logical buffer (I did this for the aggregation functionality in vdev_queue). If we have a read that does not cover an entire block then we "add in" an abd buffer to the beginning or end of the one created from the users buffer to construct one the size of the data block, and then perform a normal read of the block. For the write side, we would create a similar abd buffer, but then would also need to to a read/modify/write for the partial blocks.

behlendorf · 2019-09-06T17:54:23Z

1 I think we should leave out the "user-space" version

That's fine with me, I agree if it's just going to be dead code let's not include it until it's needed.

my opinion is that "Direct IO" is really just a "reduced data copy" interface.

I agree completely. The behavior of Direct IO has always been described as a best effort to "minimize cache effects of the I/O to and from this file". When things like compression and encryption are involved there's realistically only so much which can be done.

For relaxing the alignment constraints to PAGE_SIZE: my approach here would be to use the "multi list" functionality that I added to the abd interface to allow one to combine together multiple abd buffers into a single logical buffer

This makes the most sense to me. @mmaybee if you prefer, this could be pushed as a stand alone PR to improve the aggregation logic, or folded in to this larger change. Either way is OK with me.

mmaybee · 2019-09-06T18:28:10Z

Let me try to fold my stuff into this code and make sure that it actually works :-)

bwatkinson · 2019-09-06T18:53:57Z

I will go ahead and remove the user space interface and also update the get_from_user_pages to use abd_put as well. I will get those changes pushed up on Monday.

Makes sense on the “reduce data copy” with Direct IO.

@mmaybee, I am guessing you are already working on the “add in” abd buffer stuff for loosing the constraints to only PAGE_SIZE for Direct IO? I am definitely willing to help out with this, so please don’t hesitate to push some work my way if you have already started on this. Otherwise, I can start working on modifications to allow for “add in” buffers.

Also, I definitely don’t mind continuing to integrate the changes into this branch as we get prepared for an official PR. The current process of me migrating code from the Slack Channel into this branch works for me, but if either of you think it would be more beneficial to just push changes directly to this branch just let me know.

mmaybee · 2019-09-06T19:22:51Z

@bwatkinson, yea, I already have an implementation of the abd extension. I am now folding it into your bits. Once I have done a sanity check I will toss it over to you and we can work on applying it to the direct IO path.

bwatkinson · 2019-09-06T20:29:55Z

@mmaybee sounds good!

… updated the abd code so now abd_put is only used to free up the sg_table instead of having to call abd_free.

bwatkinson · 2019-09-09T17:17:00Z

@bwatkinson, yea, I already have an implementation of the abd extension. I am now folding it into your bits. Once I have done a sanity check I will toss it over to you and we can work on applying it to the direct IO path.

@mmaybee I have pushed up the changes we talked about with only using abd_put() with the Direct IO path. I also removed the user space version of abd_get_from_user_pages(). Just let me know when you have the abd extension ready (or you would like me to help out with anything else) and we can work on applying it to the Direct IO pathl.

behlendorf · 2019-09-09T17:38:20Z

@bwatkinson in order to open a zfsonlinux PR we're going to want to get this rebased on the current version of the master branch. It should be pretty straight forward, though there may be a couple conflicts to resolve.

bwatkinson · 2019-09-09T17:45:21Z

@bwatkinson in order to open a zfsonlinux PR we're going to want to get this rebased on the current version of the master branch. It should be pretty straight forward, though there may be a couple conflicts to resolve.

@behlendorf I did a rebase, I believe, last week, so hopefully there will not be too many conflicts when going to rebase again for the PR. I did notice that some of the organization has changed with the module/os/linux/(spl/zfs), but hopefully the conflicts will not be too out of control : ). I figured I would do the rebase after we integrated @mmaybee's abd extension code.

bwatkinson · 2019-09-09T23:13:22Z

@behlendorf and @mmaybee, so I have actually found another bug, which I was hoping you both might have some insights into. So, if we write using Direct IO and immediately delete file, we get to a NULL pointer dereference in arc_release(). I attached a quick stack trace, but it definitely appears to be a race condition. If I wait 1 second and then delete the file, after using Direct IO to write the file, the null pointer deference is no longer an issue.

Also, as a quick test, I also tried setting primarycache=none and the issue still occurs.

I have traced the issue down to the following section of code in dbuf_free_range():

We are currently calling dbuf_unoverride. What strange is we are marking each of the dbuf's as DB_NOFILL in dmu_write_direct() and DB_UNCACHED in dmu_write_direct_done(), so at some point we are getting to another dbuf that falls down to the db->db_last_dirty != NULL check. Any ideas why we would be meeting the if condition to call dbuf_unoverride here if we do not wait a certain amount of time before deleting the file just written using Direct IO?

mmaybee · 2019-09-10T17:58:46Z

The call do dbuf_unoverride() is correct if we try to "free" this block in the same txg that we originally wrote it in. We use the override logic to commit the direct IO block pointer to disk (this is why we "abuse" the dbuf state with DB_NOFILL and DB_UNCACHED). A direct IO dbuf is never "filled" with data and so has no cached content... yet it is dirty and has a block pointer that must be written out in syncing context.

I suspect that you may be looking at the wrong call to dbuf_unoverride() though. You are correct that we should not get to the call from dbuf_free_range(). But there is also a call from dbuf_undirty(). This is only protected by a DB_NOFILL check, and by the time we get here our state should be DB_UNCACHED. We need logic here and in dbuf_unoverride() that handles the direct IO case. (i.e., dirty, but uncached => no dr_data).

bwatkinson · 2019-09-10T19:54:46Z

The call do dbuf_unoverride() is correct if we try to "free" this block in the same txg that we originally wrote it in. We use the override logic to commit the direct IO block pointer to disk (this is why we "abuse" the dbuf state with DB_NOFILL and DB_UNCACHED). A direct IO dbuf is never "filled" with data and so has no cached content... yet it is dirty and has a block pointer that must be written out in syncing context.

I suspect that you may be looking at the wrong call to dbuf_unoverride() though. You are correct that we should not get to the call from dbuf_free_range(). But there is also a call from dbuf_undirty(). This is only protected by a DB_NOFILL check, and by the time we get here our state should be DB_UNCACHED. We need logic here and in dbuf_unoverride() that handles the direct IO case. (i.e., dirty, but uncached => no dr_data).

@mmaybee, yeah the stack trace was leading me to believe that we were hitting dbuf_unoverride() from dbuf_free_range(), but very well could be happening in dbuf_undirty(). I noticed the call to dbuf_unoverride() in there as well, but the stack trace got me all screwed up lol. In any case, I still need to add the logic in both spots anyways. I will let you know if this fixes the issue. On a side note, I noticed we might have mistake in dmu_write_direct_done() with an ASSERT. I believe we need to have ASSERT(dr->dt.dl.dr_data == NULL) correct?

mmaybee · 2019-09-10T20:18:16Z

@bwatkinson yup, correct... would have shown up when we did a debug compile! :-)
BTW, I think, probably, in both dbuf_undirty() and dbuf_unoverride() you just need to put an: if (db->db_state != DB_UNCACHED) around the calls to the arc functions trying to de-reference dr_data.

bwatkinson · 2019-09-10T20:33:31Z

@bwatkinson yup, correct... would have shown up when we did a debug compile! :-)
BTW, I think, probably, in both dbuf_undirty() and dbuf_unoverride() you just need to put an: if (db->db_state != DB_UNCACHED) around the calls to the arc functions trying to de-reference dr_data.

@mmaybee, so here is the modification I made to dbuf_undirty():

I just added the extra check for DB_UNCACHED. Also, below is the modifications I made to dbuf_unoverride():

I believe we are okay just bailing at this point in dbuf_unoverride(); however, I was not certain if I should move the check down to the arc_release() and only call it if db->db_state != DB_UNCACHED.

mmaybee · 2019-09-10T20:42:40Z

@bwatkinson For the dbuf_unoverride() case, we still need to free the block that was written out in dmu_write_direct(). So we need to get to the call to zio_free(). Otherwise we will leak the block. I would just put the if-statement before the call to arc_release().
For dbuf_undirty(), we need to make sure we call dbuf_unoverride() (for the reasons I stated above). So I would create a new if-statement to check for UNCACHED prior to the ASSERTs and call to arc_buf_destroy().

bwatkinson · 2019-09-10T20:46:04Z

@bwatkinson For the dbuf_unoverride() case, we still need to free the block that was written out in dmu_write_direct(). So we need to get to the call to zio_free(). Otherwise we will leak the block. I would just put the if-statement before the call to arc_release().
For dbuf_undirty(), we need to make sure we call dbuf_unoverride() (for the reasons I stated above). So I would create a new if-statement to check for UNCACHED prior to the ASSERTs and call to arc_buf_destroy().

@mmaybee, yeah I just thought about that before you replied... Sorry, I went brain dead there for a bit lol. I will update the code correctly, check that everything works, and push up the changes. Sorry for all the confusion on this.

…ould cause a NULL pointer derefence. Just needed to check if the dbuf was in a DB_UNCACHED state, and if so, not call any of the ARC related functions in dbuf_unoverride and dbuf_undirty.

After spa_vdev_remove_aux() is called, the config nvlist is no longer valid, as it's been replaced by the new one (with the specified device removed). Therefore any pointers into the nvlist are no longer valid. So we can't save the result of `fnvlist_lookup_string(nv, ZPOOL_CONFIG_PATH)` (in vd_path) across the call to spa_vdev_remove_aux(). Instead, use spa_strdup() to save a copy of the string before calling spa_vdev_remove_aux. Found by AddressSanitizer: ERROR: AddressSanitizer: heap-use-after-free on address ... READ of size 34 at 0x608000a1fcd0 thread T686 #0 0x7fe88b0c166d (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x5166d) #1 0x7fe88a5acd6e in spa_strdup spa_misc.c:1447 #2 0x7fe88a688034 in spa_vdev_remove vdev_removal.c:2259 #3 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229 #4 0x55ffbc769fba in ztest_execute ztest.c:6714 #5 0x55ffbc779a90 in ztest_thread ztest.c:6761 #6 0x7fe889cbc6da in start_thread openzfs#7 0x7fe8899e588e in __clone 0x608000a1fcd0 is located 48 bytes inside of 88-byte region freed by thread T686 here: #0 0x7fe88b14e7b8 in __interceptor_free #1 0x7fe88ae541c5 in nvlist_free nvpair.c:874 #2 0x7fe88ae543ba in nvpair_free nvpair.c:844 #3 0x7fe88ae57400 in nvlist_remove_nvpair nvpair.c:978 #4 0x7fe88a683c81 in spa_vdev_remove_aux vdev_removal.c:185 #5 0x7fe88a68857c in spa_vdev_remove vdev_removal.c:2221 #6 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229 openzfs#7 0x55ffbc769fba in ztest_execute ztest.c:6714 openzfs#8 0x55ffbc779a90 in ztest_thread ztest.c:6761 openzfs#9 0x7fe889cbc6da in start_thread Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes openzfs#9706

bghira · 2020-02-18T22:52:17Z

openzfs#10018

After spa_vdev_remove_aux() is called, the config nvlist is no longer valid, as it's been replaced by the new one (with the specified device removed). Therefore any pointers into the nvlist are no longer valid. So we can't save the result of `fnvlist_lookup_string(nv, ZPOOL_CONFIG_PATH)` (in vd_path) across the call to spa_vdev_remove_aux(). Instead, use spa_strdup() to save a copy of the string before calling spa_vdev_remove_aux. Found by AddressSanitizer: ERROR: AddressSanitizer: heap-use-after-free on address ... READ of size 34 at 0x608000a1fcd0 thread T686 #0 0x7fe88b0c166d (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x5166d) #1 0x7fe88a5acd6e in spa_strdup spa_misc.c:1447 #2 0x7fe88a688034 in spa_vdev_remove vdev_removal.c:2259 #3 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229 #4 0x55ffbc769fba in ztest_execute ztest.c:6714 #5 0x55ffbc779a90 in ztest_thread ztest.c:6761 #6 0x7fe889cbc6da in start_thread openzfs#7 0x7fe8899e588e in __clone 0x608000a1fcd0 is located 48 bytes inside of 88-byte region freed by thread T686 here: #0 0x7fe88b14e7b8 in __interceptor_free #1 0x7fe88ae541c5 in nvlist_free nvpair.c:874 #2 0x7fe88ae543ba in nvpair_free nvpair.c:844 #3 0x7fe88ae57400 in nvlist_remove_nvpair nvpair.c:978 #4 0x7fe88a683c81 in spa_vdev_remove_aux vdev_removal.c:185 #5 0x7fe88a68857c in spa_vdev_remove vdev_removal.c:2221 #6 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229 openzfs#7 0x55ffbc769fba in ztest_execute ztest.c:6714 openzfs#8 0x55ffbc779a90 in ztest_thread ztest.c:6761 openzfs#9 0x7fe889cbc6da in start_thread Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes openzfs#9706

Under certain loads, the following panic is hit: panic: page fault KDB: stack backtrace: #0 0xffffffff805db025 at kdb_backtrace+0x65 #1 0xffffffff8058e86f at vpanic+0x17f #2 0xffffffff8058e6e3 at panic+0x43 #3 0xffffffff808adc15 at trap_fatal+0x385 #4 0xffffffff808adc6f at trap_pfault+0x4f #5 0xffffffff80886da8 at calltrap+0x8 #6 0xffffffff80669186 at vgonel+0x186 openzfs#7 0xffffffff80669841 at vgone+0x31 openzfs#8 0xffffffff8065806d at vfs_hash_insert+0x26d openzfs#9 0xffffffff81a39069 at sfs_vgetx+0x149 openzfs#10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4 openzfs#11 0xffffffff8065a28c at lookup+0x45c openzfs#12 0xffffffff806594b9 at namei+0x259 openzfs#13 0xffffffff80676a33 at kern_statat+0xf3 openzfs#14 0xffffffff8067712f at sys_fstatat+0x2f openzfs#15 0xffffffff808ae50c at amd64_syscall+0x10c openzfs#16 0xffffffff808876bb at fast_syscall_common+0xf8 The page fault occurs because vgonel() will call VOP_CLOSE() for active vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While here, define vop_open for consistency. After adding the necessary vop, the bug progresses to the following panic: panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1) cpuid = 17 KDB: stack backtrace: #0 0xffffffff805e29c5 at kdb_backtrace+0x65 #1 0xffffffff8059620f at vpanic+0x17f #2 0xffffffff81a27f4a at spl_panic+0x3a #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40 #4 0xffffffff8066fdee at vinactivef+0xde #5 0xffffffff80670b8a at vgonel+0x1ea #6 0xffffffff806711e1 at vgone+0x31 openzfs#7 0xffffffff8065fa0d at vfs_hash_insert+0x26d openzfs#8 0xffffffff81a39069 at sfs_vgetx+0x149 openzfs#9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4 openzfs#10 0xffffffff80661c2c at lookup+0x45c openzfs#11 0xffffffff80660e59 at namei+0x259 openzfs#12 0xffffffff8067e3d3 at kern_statat+0xf3 openzfs#13 0xffffffff8067eacf at sys_fstatat+0x2f openzfs#14 0xffffffff808b5ecc at amd64_syscall+0x10c openzfs#15 0xffffffff8088f07b at fast_syscall_common+0xf8 This is caused by a race condition that can occur when allocating a new vnode and adding that vnode to the vfs hash. If the newly created vnode loses the race when being inserted into the vfs hash, it will not be recycled as its usecount is greater than zero, hitting the above assertion. Fix this by dropping the assertion. FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700 Reviewed-by: Andriy Gapon <[email protected]> Reviewed-by: Mateusz Guzik <[email protected]> Reviewed-by: Alek Pinchuk <[email protected]> Reviewed-by: Ryan Moeller <[email protected]> Signed-off-by: Rob Wing <[email protected]> Co-authored-by: Rob Wing <[email protected]> Submitted-by: Klara, Inc. Sponsored-by: rsync.net Closes openzfs#14501

PaulZ-98 and others added 30 commits August 15, 2019 08:27

Linux 5.3: Fix switch() fall though compiler errors

a9ebdfd

Fix some switch() fall-though compiler errors: abd.c:1504:9: error: this statement may fall through Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes openzfs#9170

Enhance ioctl number checks

f591a58

When checking ZFS_IOC_* numbers, print which numbers are wrong rather than silently failing. Reviewed-by: Chris Dunlop <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ryan Moeller <[email protected]> Closes openzfs#9187

Fix install error introduced by openzfs#9089

d1484fb

Signed-off-by: Paul Dagnelie <[email protected]>

zfs-mount-genrator: dependencies should be space-separated

19d61d6

Reviewed-by: Antonio Russo <[email protected]> Reviewed-by: Richard Laager <[email protected]> Signed-off-by: Yuxuan Shui <[email protected]> Closes openzfs#9174

Fix install error introduced by openzfs#9089 (openzfs#9205)

d1d1f8c

Signed-off-by: Paul Dagnelie <[email protected]>

Merge branch 'direct' of github.com:bwatkinson/zfs into direct

e5afaee

mmaybee suggested changes Sep 5, 2019

View reviewed changes

I removed the user space implementation of abd_get_from_pages. I also…

2137ae6

… updated the abd code so now abd_put is only used to free up the sg_table instead of having to call abd_free.

Fixed bug where immediately deleting a file written using Direct IO w…

e550898

…ould cause a NULL pointer derefence. Just needed to check if the dbuf was in a DB_UNCACHED state, and if so, not call any of the ARC related functions in dbuf_unoverride and dbuf_undirty.

bwatkinson force-pushed the direct branch from 9ef93e2 to e550898 Compare September 10, 2019 22:18

behlendorf closed this Feb 18, 2020

bwatkinson deleted the direct branch January 22, 2021 02:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added Mark Maybee's changes to allow for Direct IO with read/writes. … #5

Added Mark Maybee's changes to allow for Direct IO with read/writes. … #5

bwatkinson commented Aug 28, 2019

mmaybee left a comment

bwatkinson commented Sep 6, 2019 •

edited

Loading

mmaybee commented Sep 6, 2019

behlendorf commented Sep 6, 2019

mmaybee commented Sep 6, 2019

bwatkinson commented Sep 6, 2019

mmaybee commented Sep 6, 2019

bwatkinson commented Sep 6, 2019

bwatkinson commented Sep 9, 2019

behlendorf commented Sep 9, 2019

bwatkinson commented Sep 9, 2019

bwatkinson commented Sep 9, 2019 •

edited

Loading

mmaybee commented Sep 10, 2019

bwatkinson commented Sep 10, 2019

mmaybee commented Sep 10, 2019

bwatkinson commented Sep 10, 2019 •

edited

Loading

mmaybee commented Sep 10, 2019

bwatkinson commented Sep 10, 2019

bghira commented Feb 18, 2020

Added Mark Maybee's changes to allow for Direct IO with read/writes. … #5

Added Mark Maybee's changes to allow for Direct IO with read/writes. … #5

Conversation

bwatkinson commented Aug 28, 2019

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

mmaybee left a comment

Choose a reason for hiding this comment

bwatkinson commented Sep 6, 2019 • edited Loading

mmaybee commented Sep 6, 2019

behlendorf commented Sep 6, 2019

mmaybee commented Sep 6, 2019

bwatkinson commented Sep 6, 2019

mmaybee commented Sep 6, 2019

bwatkinson commented Sep 6, 2019

bwatkinson commented Sep 9, 2019

behlendorf commented Sep 9, 2019

bwatkinson commented Sep 9, 2019

bwatkinson commented Sep 9, 2019 • edited Loading

mmaybee commented Sep 10, 2019

bwatkinson commented Sep 10, 2019

mmaybee commented Sep 10, 2019

bwatkinson commented Sep 10, 2019 • edited Loading

mmaybee commented Sep 10, 2019

bwatkinson commented Sep 10, 2019

bghira commented Feb 18, 2020

bwatkinson commented Sep 6, 2019 •

edited

Loading

bwatkinson commented Sep 9, 2019 •

edited

Loading

bwatkinson commented Sep 10, 2019 •

edited

Loading