Exceeding throughput of pool with posix_fadvise causes excessive zio_cache growth, system OOM and process crashes #15776
Labels

Component: Memory Management (kernel memory management)
Type: Defect (incorrect behavior, e.g. crash, hang)
System information
Describe the problem you're observing
When executing a workload that exceeds the throughput of my zpool, kernel memory usage grows until my userspace applications are OOM-killed.
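A minimal sketch of this kind of workload in Python. This is illustrative only, not the exact program that triggered the issue: the use of `POSIX_FADV_WILLNEED` is an assumption (the issue title only says `posix_fadvise`), and the 16 KiB chunk size mirrors the `recordsize=16K` used in the reproduction below.

```python
import os

def issue_willneed(path, chunk=16 * 1024):
    """Issue one POSIX_FADV_WILLNEED read-ahead hint per chunk-sized
    region of the file, as fast as possible, and return the number of
    hints issued."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        count = 0
        for offset in range(0, size, chunk):
            # Non-blocking hint: asks the kernel to read this region
            # ahead of time; the resulting async reads are queued.
            os.posix_fadvise(fd, offset, min(chunk, size - offset),
                             os.POSIX_FADV_WILLNEED)
            count += 1
        return count
    finally:
        os.close(fd)
```

Each `WILLNEED` hint returns immediately while the read-ahead it triggers is serviced asynchronously, so a tight loop like this can enqueue work far faster than the pool drains it.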
Describe how to reproduce the problem
1. Create a dataset with `recordsize=16K`.
2. Set `zfs_vdev_async_read_max_active=1`.
3. Run a workload that issues `posix_fadvise` read-ahead requests faster than the pool can service them.
4. Run `watch zpool iostat -vqy 1 1` to monitor the `asyncq_read pend` column.
5. Run `slabtop` to monitor the `zio_cache` total size.

Expected behavior: the `asyncq_read pend` and `slabtop` measurements reach some sort of bound or equilibrium at a low total memory usage (<1GB). The application exceeding the read throughput of the pool has backpressure applied by slowing down syscalls.

Actual behavior: the `asyncq_read pend` and `slabtop` measurements grow without apparent bound, until the machine runs out of memory. The machine hangs for 2 minutes until a kernel task watchdog expires, and the OOM killer starts to kill processes. This repeats until the process(es) generating the read requests are killed; finally the read queue shrinks and the system goes back to idle.

Workaround: increase `zfs_vdev_async_read_max_active` until the `asyncq_read pend` drains and the memory usage returns to a stable equilibrium.

Include any warning/errors/backtraces from the system logs
OOM-killer memory messages from journalctl: sanitized.txt
excerpt:
More info
My setup:
I was seeing the `asyncq-read pend` column at >350k queued operations and trending upward, and journalctl records the `zio_cache` exceeding 4.5GB at the time the OOM-killer was triggered.

I'm uncertain about the accounting for all of the memory, as I am not very familiar with profiling kernel memory usage. But I was seeing a clear correlation between the exhaustion of my overall memory and the growth of the `zio_cache` and `asyncq_read pend` size, and the problem went away when I tuned my async reads in the ZIO scheduler.

Solutions?
While the workaround above does solve the pool throughput problem that is the root cause in my case, I think it's extremely user-unfriendly that a throughput problem surfaces as a kernel hang and process crash. There should be some smoother degradation that happens first to avoid reaching the OOM condition.
If there were `zfs_vdev_*_*_max_pending` or similar parameters that imposed a bound on pending queues, and hitting that bound slowed down the affected pool/process with increased read/write latency, this would just be another performance tunable rather than a system-wide crash.

If configuration for this already exists under a different name, please consider changing the defaults to more conservative values and documenting them so that they are easier to discover; I could not find them on this page: https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zio-scheduler
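The bounded-queue semantics proposed above can be illustrated with a short sketch. Nothing here is existing OpenZFS code: `MAX_PENDING` stands in for the hypothetical `zfs_vdev_*_*_max_pending` parameter, and the blocking `submit()` models backpressure applied to the issuing process.

```python
import queue
import threading

# Stand-in for the hypothetical zfs_vdev_*_*_max_pending bound.
MAX_PENDING = 4

# Bounded pending queue: put() blocks once MAX_PENDING entries are
# queued, instead of letting the backlog grow without limit.
pending = queue.Queue(maxsize=MAX_PENDING)

def submit(io):
    """Enqueue an I/O request. When the queue is full this blocks,
    surfacing the pool's throughput limit as added latency for the
    caller rather than as unbounded kernel memory growth."""
    pending.put(io)

def drain():
    """Service queued requests at the pool's actual rate; a None
    entry shuts the drain loop down."""
    while True:
        io = pending.get()
        pending.task_done()
        if io is None:
            break
```

With this shape, a producer that outruns the drain rate simply stalls in `submit()`, and the memory held by pending entries is capped at `MAX_PENDING`.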