Unified allocation throttling #17020
base: master
Conversation
Force-pushed from d883f10 to ff6e15d.
I am still thinking about whether it would make sense to give a smaller but faster vdev a small write boost at the beginning even on an idle pool, so that it could be used more during reads, even at the cost of somewhat lower write speeds later. I suppose there may be no universal answer.
Force-pushed from 7432c34 to bfeae7c.
This looks great; I especially like the concept of maintaining the minimum queue depth to keep all of the devices busy. I know that in general you are against adding tunables, but I wonder if a few of the magic numbers could be made controllable.
I was thinking about some, but I've decided there is a pretty thin margin between the different factors where the algorithm works as planned, and it would be much easier to mess it up by random tuning than to improve it. I'll think about which ones could make sense.
I've used some custom tools to collect data while benchmarking the code, but I'm not sure what I would expose as kstats. I might think about it, but ideas are welcome.
We don't need ms_selected_time resolution in nanoseconds, and even the millisecond resolution of metaslab_unload_delay_ms is questionable. Reduce it to seconds to avoid gethrtime() calls under a congested lock.
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
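The idea can be shown with a standalone sketch (ordinary userspace C, not the kernel patch; the lock and variable names only mirror ms_lock / ms_selected_time / metaslab_unload_delay_ms for illustration): when the unload delay is measured in whole seconds, a plain seconds clock is enough for the check taken under the lock, and no high-resolution timer call is needed there.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t ms_lock = PTHREAD_MUTEX_INITIALIZER;
static time_t selected_time;                    /* whole seconds, like ms_selected_time */
static const uint64_t unload_delay_ms = 10000;  /* stands in for metaslab_unload_delay_ms */

static void
mark_selected(void)
{
	pthread_mutex_lock(&ms_lock);
	selected_time = time(NULL);     /* cheap seconds clock, no high-resolution call */
	pthread_mutex_unlock(&ms_lock);
}

static int
idle_long_enough(void)
{
	int expired;

	pthread_mutex_lock(&ms_lock);
	/* Second-level precision is enough for a multi-second unload delay. */
	expired = (time(NULL) - selected_time) >= (time_t)(unload_delay_ms / 1000);
	pthread_mutex_unlock(&ms_lock);
	return (expired);
}

int
main(void)
{
	mark_selected();
	printf("idle long enough to unload? %d\n", idle_long_enough());
	return (0);
}
```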
Motivation and Context
The existing allocation throttling had a goal of improving write speed by allocating more data to vdevs that are able to write it faster. But in the process it completely broke the original mechanism, designed to balance vdev space usage. With severe vdev space usage imbalance it is possible that vdevs with higher usage start growing fragmentation sooner than others and, after getting full, stop accepting writes at all. Also, after a vdev addition it might take a very long time for the pool to restore the balance, since the new vdev does not get any real preference unless the old one is already much slower due to fragmentation. In addition, the old throttling was request-based, which was unpredictable with block sizes varying from 512B to 16MB, nor did it make much sense with I/O aggregation, when its 32-100 requests could be aggregated into a few, leaving the device underutilized and submitting fewer and/or shorter requests, or, in the opposite case, trying to queue up to 1.6GB of writes per device.
Description
This change presents a completely new throttling algorithm. Unlike the request-based old one, this one measures the allocation queue in bytes. That makes it possible to integrate with the reworked allocation quota (aliquot) mechanism, which is also byte-based. Unlike the original code, which balanced the vdevs' amounts of free space, this one balances their free/used space fractions. It should result in lower and more uniform fragmentation in the long run.
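As a concrete illustration of the difference (a standalone sketch with invented numbers, not ZFS code): consider a 128 GiB vdev and a 256 GiB vdev that both have 64 GiB free. Balancing free amounts sees them as equal targets, while balancing free fractions prefers the vdev that is only 50% full.

```c
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t cap[2]   = { 128ULL << 30, 256ULL << 30 };
	uint64_t avail[2] = { 64ULL << 30,  64ULL << 30 };

	for (int i = 0; i < 2; i++) {
		/*
		 * Old view: absolute free space, identical for both vdevs.
		 * New view: free fraction, 50% vs. 25%, so the less-full
		 * vdev gets the preference.
		 */
		printf("vdev %d: free = %llu GiB, free fraction = %llu%%\n", i,
		    (unsigned long long)(avail[i] >> 30),
		    (unsigned long long)(avail[i] * 100 / cap[i]));
	}
	return (0);
}
```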
This algorithm still allows write speed to be improved by allocating more data to faster vdevs, but does it in a more controllable way. On top of the space-based allocation quota, it also calculates a minimum queue depth that a vdev is allowed to maintain, and respectively the amount of extra allocations it can receive if it appears faster. That amount is based on the vdev's capacity and space usage, but is applied only when the pool is busy. This way the code can choose between faster writes when needed and better vdev balance when not, with the choice gradually shrinking together with the free space.
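A minimal standalone sketch of how those two pieces could fit together (names and constants are invented for illustration and do not match the actual patch): the byte-based aliquot scales with the vdev's free fraction, and extra bytes beyond it are granted only while the pool is busy and the vdev's queue is below a minimum depth that shrinks as the vdev fills.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
	uint64_t capacity;   /* total bytes */
	uint64_t allocated;  /* used bytes */
	uint64_t queued;     /* bytes currently queued for write */
} vdev_sketch_t;

/* Byte-based aliquot: a base quota scaled by the vdev's free fraction. */
static uint64_t
vdev_aliquot(const vdev_sketch_t *vd, uint64_t base_quota)
{
	uint64_t free_pct = (vd->capacity - vd->allocated) * 100 / vd->capacity;
	return (base_quota * free_pct / 100);
}

/*
 * Extra allocations beyond the aliquot are allowed only while the pool is
 * busy and the vdev's queue is shorter than a minimum depth derived from
 * its remaining free space (the 1/1000 factor here is made up).
 */
static int
vdev_can_take_extra(const vdev_sketch_t *vd, int pool_busy)
{
	uint64_t min_queue = (vd->capacity - vd->allocated) / 1000;
	return (pool_busy && vd->queued < min_queue);
}

int
main(void)
{
	vdev_sketch_t ssd = {
		.capacity = 128ULL << 30,  /* 128 GiB */
		.allocated = 32ULL << 30,  /*  32 GiB used */
		.queued = 8ULL << 20,      /*   8 MiB queued */
	};

	printf("aliquot = %llu MiB, may take extra while busy: %d\n",
	    (unsigned long long)(vdev_aliquot(&ssd, 1ULL << 30) >> 20),
	    vdev_can_take_extra(&ssd, 1));
	return (0);
}
```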
This change also makes the allocation queues per-class, allowing them to throttle independently and in parallel. Allocations that are bounced between classes due to allocation errors will be able to properly throttle in the new class. Allocations that should not be throttled (ZIL, gang, copies) are not, but may still follow the rotor and allocation quota mechanism of the class without disrupting it.
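A data-structure sketch of the per-class idea (names and quotas are invented; the real code keeps this state elsewhere): each allocation class carries its own byte-based accounting, so a write that bounces from one class to another is simply re-checked against the new class's quota.

```c
#include <stdint.h>
#include <stdio.h>

/* One throttle state per allocation class, so classes throttle independently. */
enum alloc_class { CLASS_NORMAL, CLASS_SPECIAL, CLASS_DEDUP, CLASS_COUNT };

typedef struct {
	uint64_t queued_bytes;  /* bytes currently queued in this class */
	uint64_t quota_bytes;   /* byte-based allocation quota for this class */
} class_throttle_t;

static class_throttle_t throttle[CLASS_COUNT] = {
	[CLASS_NORMAL]  = { .quota_bytes = 512ULL << 20 },
	[CLASS_SPECIAL] = { .quota_bytes = 64ULL << 20 },
	[CLASS_DEDUP]   = { .quota_bytes = 64ULL << 20 },
};

/*
 * Admit a write into a class if it fits under that class's quota.  A write
 * bounced from CLASS_SPECIAL to CLASS_NORMAL is re-checked here against
 * CLASS_NORMAL's own state.
 */
static int
class_try_queue(enum alloc_class c, uint64_t size)
{
	if (throttle[c].queued_bytes + size > throttle[c].quota_bytes)
		return (0);
	throttle[c].queued_bytes += size;
	return (1);
}

int
main(void)
{
	printf("queued 16 MiB in special: %d\n",
	    class_try_queue(CLASS_SPECIAL, 16ULL << 20));
	printf("bounced 16 MiB to normal: %d\n",
	    class_try_queue(CLASS_NORMAL, 16ULL << 20));
	return (0);
}
```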
How Has This Been Tested?
Test 1: 2 SSDs with 128GB and 256GB capacity written at full speed
Up to ~25% space usage of the smaller SSD, both devices write at about the same maximum speed. After that, the smaller device is gradually throttled to balance space usage. By the time they are full, the devices differ in space usage by only a few percent. Since users are typically discouraged from running at full capacity to reduce fragmentation, the performance at the beginning is more important than at the end.
Test 2: 2 SSDs with 128GB and 256GB capacity written at a slower speed
Since we do not need more speed, the vdevs maintain an almost perfect space usage balance.
Test 3: SSD and HDD vdevs of the same capacity, but very different performance, written at full speed
While empty, the SSD is allowed to write 2.5 times faster than the HDD. As its space usage grows to ~50%, the SSD is throttled down to the HDD's speed, and after that even slower. By the time they are full, the devices differ in space usage by only a few percent.
Test 4: SSD and HDD vdevs of the same capacity, but very different performance, written at a slower speed
Since we do not need more speed, the SSD is throttled down to the HDD's speed, and instead the two maintain an almost perfect space usage balance.
Test 5: Second vdev addition
First, a pool of one vdev is filled almost to capacity. After that a second vdev is added and the data is overwritten a couple of times. A single overwrite of the data is enough to re-balance the vdevs, even with some overshoot, probably due to the large TXG sizes relative to the device sizes used in the test and to ZFS delayed frees.
Test 6: Parallel sequential write to 12x 5-wide RAIDZ1 of HDDs