
Improve ZVOL queue behavior #389

Closed · wants to merge 17 commits
Conversation

@dechamps (Contributor) commented Sep 5, 2011

The Linux block device queue subsystem exposes a number of configurable settings, described in Linux block/blk-settings.c. The defaults for these settings are tuned for hard drives and are not optimized for ZVOLs. Proper configuration of these options allows upper layers (e.g. the I/O scheduler) to make better decisions about write merging and ordering.

Detailed rationale (a sketch of the corresponding queue setup follows this list):

  • max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to handle writes of any size, so there's no reason to impose a limit. Let the upper layer decide.
  • max_segments and max_segment_size are set to unlimited. zvol_write() will copy the requests' contents into a dbuf anyway, so the number and size of the segments are irrelevant. Let the upper layer decide.
  • physical_block_size and io_opt are set to the ZVOL's block size. This has the potential to somewhat alleviate issue openzfs#361 ("Sequential file rewrite outside of block boundaries is dead slow") for ZVOLs, by warning the upper layers that writes smaller than the volume's block size will be slow.
  • The NONROT flag is set to indicate this isn't a rotational device. Although the backing zpool might be composed of rotational devices, the resulting ZVOL often doesn't exhibit the same behavior due to the COW mechanisms used by ZFS. Setting this flag will prevent upper layers from making useless decisions (such as reordering writes) based on incorrect assumptions about the behavior of the ZVOL.
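Roughly, the resulting queue setup could look like the sketch below. The helper name zvol_queue_tune() is hypothetical; zv, zv_queue and zv_volblocksize follow the existing zvol.c naming, and older kernels would need the compatibility wrappers discussed further down.

    /*
     * Sketch only: configure the ZVOL request queue as described above.
     * Assumes the headers already pulled in by zvol.c.
     */
    static void
    zvol_queue_tune(zvol_state_t *zv)
    {
            struct request_queue *q = zv->zv_queue;

            /* No driver-side limit on request size or segmentation:
             * zvol_write() copies the data into a dbuf regardless. */
            blk_queue_max_hw_sectors(q, UINT_MAX);
            blk_queue_max_segments(q, UINT16_MAX); /* API takes a short */
            blk_queue_max_segment_size(q, UINT_MAX);

            /* Advertise the volume block size so upper layers can avoid
             * sub-blocksize writes (see openzfs#361). */
            blk_queue_physical_block_size(q, zv->zv_volblocksize);
            blk_queue_io_opt(q, zv->zv_volblocksize);

            /* COW makes rotational heuristics meaningless here. */
            queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
    }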

Currently the "sync=always" property works for regular ZFS datasets, but not
for ZVOLs. This patch remedies that.

Fixes openzfs#374.
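The gist of the fix, as a hedged sketch (the helper name zvol_write_is_sync() is hypothetical; os_sync and ZFS_SYNC_ALWAYS are the existing ZFS names):

    /*
     * Sketch only: a ZVOL write must be committed to the ZIL when the
     * block layer marks the request synchronous *or* when the dataset
     * property sync=always is in effect.
     */
    static boolean_t
    zvol_write_is_sync(zvol_state_t *zv, struct request *req)
    {
            return (rq_is_sync(req) ||
                zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS);
    }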
The truncate_range operation allows "hole punching" in ZFS files. On Solaris this is done via the VOP_SPACE() vnode operation, which maps to the zfs_space() function, so we just need to write zpl_truncate_range() as a wrapper around zfs_space().

Note that this only works for regular files, not ZVOLs.

This is currently an insecure implementation without permission checking,
although this isn't that big of a deal since truncate_range() isn't even
callable from userspace.

This is related to issue openzfs#334.
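A sketch of such a wrapper, assuming the zfs_space()/flock64_t calling convention used elsewhere in the tree (the actual patch may differ in details such as the inclusive/exclusive treatment of end):

    static void
    zpl_truncate_range(struct inode *ip, loff_t start, loff_t end)
    {
            cred_t *cr = CRED();
            flock64_t bf;

            /* Hole punching is only supported on regular files. */
            if (!S_ISREG(ip->i_mode))
                    return;

            crhold(cr);
            bf.l_type = F_WRLCK;
            bf.l_whence = 0;
            bf.l_start = start;
            bf.l_len = end - start;
            bf.l_pid = 0;
            (void) zfs_space(ip, F_FREESP, &bf, FWRITE, start, cr);
            crfree(cr);
    }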
On Solaris no permission check is needed here, because zfs_space() can only be reached through an open file handle. Since the addition of zpl_truncate_range() that is no longer the case, so we need to enforce access rights in zfs_space() itself.

This is related to issue openzfs#334.
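The check itself is small; a hypothetical sketch of its placement near the top of zfs_space(), using the existing zfs_zaccess() access-control primitive (zp, cr and zsb follow the surrounding zfs_space() code):

    /* Sketch only: reject callers lacking write permission before
     * punching any holes. */
    if ((error = zfs_zaccess(zp, ACE_WRITE_DATA, 0, B_FALSE, cr))) {
            ZFS_EXIT(zsb);
            return (error);
    }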
Currently only one flag combination, FALLOC_FL_PUNCH_HOLE on its own, is supported, since it's the only one that matches the behavior of zfs_space(). This makes zpl_fallocate() pretty much useless in its current form, but it's a start.

To support other flag combinations we would need to modify zfs_space()
to make it more flexible, or emulate the desired functionality in
zpl_fallocate().

This is related to issue openzfs#334.
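A sketch of what the hook could look like (the structure is hypothetical; the #ifdef reflects that FALLOC_FL_PUNCH_HOLE only exists on recent kernels):

    static long
    zpl_fallocate(struct file *filp, int mode, loff_t offset, loff_t len)
    {
    #ifdef FALLOC_FL_PUNCH_HOLE
            struct inode *ip = filp->f_path.dentry->d_inode;
            cred_t *cr = CRED();
            flock64_t bf;
            long error;

            /* Only hole punching maps onto zfs_space() for now. */
            if (mode != FALLOC_FL_PUNCH_HOLE)
                    return (-EOPNOTSUPP);

            crhold(cr);
            bf.l_type = F_WRLCK;
            bf.l_whence = 0;
            bf.l_start = offset;
            bf.l_len = len;
            bf.l_pid = 0;
            error = -zfs_space(ip, F_FREESP, &bf, FWRITE, offset, cr);
            crfree(cr);

            return (error);
    #else
            return (-EOPNOTSUPP);
    #endif
    }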
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning. It allows ZVOL clients to discard (unmap, trim) block ranges from a ZVOL, optimizing disk space usage by allowing a ZVOL to shrink instead of only growing.

We can't use zfs_space() or zfs_freesp() here, since these functions only
work on regular files, not volumes. Fortunately we can use the low-level
function dmu_free_long_range() which does exactly what we want.

Currently the discard operation is not added to the log. That's not a big deal, since losing discard requests cannot result in data corruption; it would, however, leave disk space usage higher than it should be. Adding log support to zvol_discard() is thus a good candidate for a future improvement.
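A sketch of the handler: the sector-to-byte conversion and request completion follow the surrounding zvol.c code, and dmu_free_long_range() is the existing DMU entry point; the exact dispatch (direct call vs. taskq) may differ in the actual patch.

    static void
    zvol_discard(struct request *req)
    {
            zvol_state_t *zv = req->q->queuedata;
            uint64_t offset = blk_rq_pos(req) << 9; /* sectors to bytes */
            uint64_t size = blk_rq_bytes(req);
            int error;

            error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ,
                offset, size);

            /* Not logged to the ZIL yet: losing a discard cannot corrupt
             * data, it can only leave space allocated that could have
             * been freed. */
            blk_end_request(req, -error, size);
    }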
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:

    WRITE:       A normal async write. Device will be plugged.
    WRITE_SYNC:  Synchronous write. Identical to WRITE, but passes down
                 the hint that someone will be waiting on this IO
                 shortly.
    WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
    WRITE_FUA:   Like WRITE_SYNC but data is guaranteed to be on
                 non-volatile media on completion.

In other words, SYNC does *not* mean that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.

The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.

Also, we must inform the block layer that we actually support FLUSH and FUA.
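Sketched out, the resulting semantics look like this (flag names are those of ~2.6.37 kernels and would need compatibility handling elsewhere):

    /* At queue creation time: advertise FLUSH and FUA support. */
    blk_queue_flush(zv->zv_queue, REQ_FLUSH | REQ_FUA);

    /* In zvol_write(): */
    if (req->cmd_flags & REQ_FLUSH)         /* flush *before* the write */
            zil_commit(zv->zv_zilog, ZVOL_OBJ);

    /* ... perform the write itself ... */

    if (req->cmd_flags & REQ_FUA)           /* stable on completion */
            zil_commit(zv->zv_zilog, ZVOL_OBJ);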
@behlendorf (Contributor)
I'd love to just apply this, but once again compatibility problems have cropped up. It turns out blk_queue_max_segments() wasn't introduced until 2.6.33. I added compatibility code to handle that, which you can find in my zvol branch, only to discover that 2.6.26 kernels don't have any of blk_queue_max_hw_sectors(), blk_queue_physical_block_size(), or blk_queue_io_opt(). So some more compatibility code will unfortunately be needed... supporting Debian Lenny is becoming a real pain.

https://github.com/behlendorf/zfs/tree/zvol
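A hypothetical example of such a compatibility shim, assuming the autoconf-style HAVE_* detection the build system uses elsewhere (the real wrappers live in the branch above):

    /* Pre-2.6.33 kernels split the segment limit into phys/hw
     * variants; map the newer API onto both when it is missing. */
    #ifndef HAVE_BLK_QUEUE_MAX_SEGMENTS
    #define blk_queue_max_segments(q, seg)                          \
            do {                                                    \
                    blk_queue_max_phys_segments(q, seg);            \
                    blk_queue_max_hw_segments(q, seg);              \
            } while (0)
    #endif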

@dechamps (Contributor, Author) commented Feb 3, 2012

I made a new, clean pull request to fix this mess. See #554.

@dechamps closed this Feb 3, 2012