
Improve ZVOL queue behavior #389

Closed · wants to merge 17 commits
Conversation

@dechamps (Contributor) commented Sep 5, 2011

The Linux block device queue subsystem exposes a number of configurable settings, described in Linux block/blk-settings.c. The defaults for these settings are tuned for hard drives and are not optimized for ZVOLs. Proper configuration of these options allows upper layers (e.g. the I/O scheduler) to make better decisions about write merging and ordering.

Detailed rationale (a sketch of the corresponding queue setup follows this list):

  • max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to handle writes of any size, so there's no reason to impose a limit. Let the upper layer decide.
  • max_segments and max_segment_size are set to unlimited. zvol_write() will copy the requests' contents into a dbuf anyway, so the number and size of the segments are irrelevant. Let the upper layer decide.
  • physical_block_size and io_opt are set to the ZVOL's block size. This has the potential to somewhat alleviate issue openzfs#361 ("Sequential file rewrite outside of block boundaries is dead slow") for ZVOLs, by warning the upper layers that writes smaller than the volume's block size will be slow.
  • The NONROT flag is set to indicate this isn't a rotational device. Although the backing zpool might be composed of rotational devices, the resulting ZVOL often doesn't exhibit the same behavior due to the COW mechanisms used by ZFS. Setting this flag will prevent upper layers from making useless decisions (such as reordering writes) based on incorrect assumptions about the behavior of the ZVOL.
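Roughly, the resulting queue setup could look like the sketch below. The helper name zvol_queue_tune() is hypothetical; zv, zv_queue and zv_volblocksize follow the existing zvol.c naming, and older kernels would need the compatibility wrappers discussed further down.

    /*
     * Sketch only: configure the ZVOL request queue as described above.
     * Assumes the headers already pulled in by zvol.c.
     */
    static void
    zvol_queue_tune(zvol_state_t *zv)
    {
            struct request_queue *q = zv->zv_queue;

            /* No driver-side limit on request size or segmentation:
             * zvol_write() copies the data into a dbuf regardless. */
            blk_queue_max_hw_sectors(q, UINT_MAX);
            blk_queue_max_segments(q, UINT16_MAX); /* API takes a short */
            blk_queue_max_segment_size(q, UINT_MAX);

            /* Advertise the volume block size so upper layers can avoid
             * sub-blocksize writes (see openzfs#361). */
            blk_queue_physical_block_size(q, zv->zv_volblocksize);
            blk_queue_io_opt(q, zv->zv_volblocksize);

            /* COW makes rotational heuristics meaningless here. */
            queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
    }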

Currently the "sync=always" property works for regular ZFS datasets, but not
for ZVOLs. This patch remedies that.

Fixes openzfs#374.
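The gist of the fix, as a hedged sketch (the helper name zvol_write_is_sync() is hypothetical; os_sync and ZFS_SYNC_ALWAYS are the existing ZFS names):

    /*
     * Sketch only: a ZVOL write must be committed to the ZIL when the
     * block layer marks the request synchronous *or* when the dataset
     * property sync=always is in effect.
     */
    static boolean_t
    zvol_write_is_sync(zvol_state_t *zv, struct request *req)
    {
            return (rq_is_sync(req) ||
                zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS);
    }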
The truncate_range operation allows "hole punching" in ZFS files. On Solaris this is done via the VOP_SPACE() vnode operation, which maps to the zfs_space() function, so we just need to write zpl_truncate_range() as a wrapper around zfs_space().

Note that this only works for regular files, not ZVOLs.

This is currently an insecure implementation without permission checking,
although this isn't that big of a deal since truncate_range() isn't even
callable from userspace.

This is related to issue openzfs#334.
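A sketch of such a wrapper, assuming the zfs_space()/flock64_t calling convention used elsewhere in the tree (the actual patch may differ in details such as the inclusive/exclusive treatment of end):

    static void
    zpl_truncate_range(struct inode *ip, loff_t start, loff_t end)
    {
            cred_t *cr = CRED();
            flock64_t bf;

            /* Hole punching is only supported on regular files. */
            if (!S_ISREG(ip->i_mode))
                    return;

            crhold(cr);
            bf.l_type = F_WRLCK;
            bf.l_whence = 0;
            bf.l_start = start;
            bf.l_len = end - start;
            bf.l_pid = 0;
            (void) zfs_space(ip, F_FREESP, &bf, FWRITE, start, cr);
            crfree(cr);
    }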
On Solaris no permission check is needed here, because zfs_space() can only be reached through an open file handle. Since the addition of zpl_truncate_range() that is no longer the case, so we need to enforce access rights in zfs_space() itself.

This is related to issue openzfs#334.
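The check itself is small; a hypothetical sketch of its placement near the top of zfs_space(), using the existing zfs_zaccess() access-control primitive (zp, cr and zsb follow the surrounding zfs_space() code):

    /* Sketch only: reject callers lacking write permission before
     * punching any holes. */
    if ((error = zfs_zaccess(zp, ACE_WRITE_DATA, 0, B_FALSE, cr))) {
            ZFS_EXIT(zsb);
            return (error);
    }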
Currently only one flag combination, FALLOC_FL_PUNCH_HOLE on its own, is supported, since it's the only one that matches the behavior of zfs_space(). This makes zpl_fallocate() pretty much useless in its current form, but it's a start.

To support other flag combinations we would need to modify zfs_space()
to make it more flexible, or emulate the desired functionality in
zpl_fallocate().

This is related to issue openzfs#334.
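A sketch of what the hook could look like (the structure is hypothetical; the #ifdef reflects that FALLOC_FL_PUNCH_HOLE only exists on recent kernels):

    static long
    zpl_fallocate(struct file *filp, int mode, loff_t offset, loff_t len)
    {
    #ifdef FALLOC_FL_PUNCH_HOLE
            struct inode *ip = filp->f_path.dentry->d_inode;
            cred_t *cr = CRED();
            flock64_t bf;
            long error;

            /* Only hole punching maps onto zfs_space() for now. */
            if (mode != FALLOC_FL_PUNCH_HOLE)
                    return (-EOPNOTSUPP);

            crhold(cr);
            bf.l_type = F_WRLCK;
            bf.l_whence = 0;
            bf.l_start = offset;
            bf.l_len = len;
            bf.l_pid = 0;
            error = -zfs_space(ip, F_FREESP, &bf, FWRITE, offset, cr);
            crfree(cr);

            return (error);
    #else
            return (-EOPNOTSUPP);
    #endif
    }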
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning. It allows ZVOL clients to discard (unmap, trim) block ranges from a ZVOL, optimizing disk space usage by allowing a ZVOL to shrink instead of only growing.

We can't use zfs_space() or zfs_freesp() here, since these functions only
work on regular files, not volumes. Fortunately we can use the low-level
function dmu_free_long_range() which does exactly what we want.

Currently the discard operation is not added to the log. That's not a big deal, since losing discard requests cannot result in data corruption; it would, however, leave disk space usage higher than it should be. Adding log support to zvol_discard() is thus a good candidate for a future improvement.
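A sketch of the handler: the sector-to-byte conversion and request completion follow the surrounding zvol.c code, and dmu_free_long_range() is the existing DMU entry point; the exact dispatch (direct call vs. taskq) may differ in the actual patch.

    static void
    zvol_discard(struct request *req)
    {
            zvol_state_t *zv = req->q->queuedata;
            uint64_t offset = blk_rq_pos(req) << 9; /* sectors to bytes */
            uint64_t size = blk_rq_bytes(req);
            int error;

            error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ,
                offset, size);

            /* Not logged to the ZIL yet: losing a discard cannot corrupt
             * data, it can only leave space allocated that could have
             * been freed. */
            blk_end_request(req, -error, size);
    }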
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:

    WRITE:       A normal async write. Device will be plugged.
    WRITE_SYNC:  Synchronous write. Identical to WRITE, but passes down
                 the hint that someone will be waiting on this IO
                 shortly.
    WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
    WRITE_FUA:   Like WRITE_SYNC but data is guaranteed to be on
                 non-volatile media on completion.

In other words, SYNC does *not* mean that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.

The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.

Also, we must inform the block layer that we actually support FLUSH and FUA.
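Sketched out, the resulting semantics look like this (flag names are those of ~2.6.37 kernels and would need compatibility handling elsewhere):

    /* At queue creation time: advertise FLUSH and FUA support. */
    blk_queue_flush(zv->zv_queue, REQ_FLUSH | REQ_FUA);

    /* In zvol_write(): */
    if (req->cmd_flags & REQ_FLUSH)         /* flush *before* the write */
            zil_commit(zv->zv_zilog, ZVOL_OBJ);

    /* ... perform the write itself ... */

    if (req->cmd_flags & REQ_FUA)           /* stable on completion */
            zil_commit(zv->zv_zilog, ZVOL_OBJ);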
@behlendorf (Contributor)
I'd love to just apply this, but once again compatibility problems have cropped up. It turns out blk_queue_max_segments() wasn't introduced until 2.6.33. I added compatibility code to handle that, which you can find in my zvol branch, only to discover that 2.6.26 kernels don't have any of blk_queue_max_hw_sectors(), blk_queue_physical_block_size(), or blk_queue_io_opt(). So some more compatibility code will unfortunately be needed... supporting Debian Lenny is becoming a real pain.

https://github.com/behlendorf/zfs/tree/zvol
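A hypothetical example of such a compatibility shim, assuming the autoconf-style HAVE_* detection the build system uses elsewhere (the real wrappers live in the branch above):

    /* Pre-2.6.33 kernels split the segment limit into phys/hw
     * variants; map the newer API onto both when it is missing. */
    #ifndef HAVE_BLK_QUEUE_MAX_SEGMENTS
    #define blk_queue_max_segments(q, seg)                          \
            do {                                                    \
                    blk_queue_max_phys_segments(q, seg);            \
                    blk_queue_max_hw_segments(q, seg);              \
            } while (0)
    #endif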

@dechamps (Contributor, Author) commented Feb 3, 2012

I made a new, clean pull request to fix this mess. See #554.

@dechamps closed this Feb 3, 2012