
Exceeding throughput of pool with posix_fadvise causes excessive zio_cache growth, system OOM and process crashes #15776

Open · gharris1727 opened this issue Jan 16, 2024 · 0 comments
Labels: Component: Memory Management (kernel memory management), Type: Defect (incorrect behavior, e.g. crash, hang)

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Arch Linux |
| Distribution Version | N/A |
| Kernel Version | 6.6.10-arch1-1 |
| Architecture | x86_64 |
| OpenZFS Version | zfs-2.2.2-1 |

Describe the problem you're observing

When executing a workload that exceeds the read throughput of my zpool, kernel memory usage grows without bound until my userspace applications are OOM-killed.

Describe how to reproduce the problem

  • Use slow hard drives (e.g. 7200 rpm drives, simulated slow devices, etc.)
  • Create a zpool/dataset with a small recordsize (e.g. `recordsize=16K`)
  • Configure `zfs_vdev_async_read_max_active=1`
  • Start a random-read-heavy workload that issues `posix_fadvise` hints
  • Run `watch zpool iostat -vqy 1 1` to monitor the `asyncq_read` pend column
  • Run `slabtop` to monitor the `zio_cache` total size
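
The workload in the last repro step can be sketched as follows. This is a hypothetical stand-in for the real application, assuming a large file on the affected dataset and 16 KiB record-aligned random reads; `fadvise_random_read` and all its parameters are illustrative names, not part of any tool mentioned above:

```python
import os
import random

def fadvise_random_read(path, iterations=1000, bufsize=16 * 1024):
    """Issue POSIX_FADV_WILLNEED hints followed by random 16 KiB reads.

    Each WILLNEED hint asks the kernel to start asynchronous readahead;
    on ZFS these land on the asyncq_read queue, which is the queue that
    grows without bound in this report. Returns total bytes read.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        total = 0
        for _ in range(iterations):
            offset = random.randrange(0, max(size - bufsize, 1))
            offset -= offset % bufsize  # align to the 16K recordsize
            os.posix_fadvise(fd, offset, bufsize, os.POSIX_FADV_WILLNEED)
            total += len(os.pread(fd, bufsize, offset))
        return total
    finally:
        os.close(fd)
```

Running several copies of this in parallel against a pool configured as in the steps above should let the `asyncq_read` pend column be observed growing in `zpool iostat -vqy`.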

Expected behavior: the `asyncq_read` pend and slabtop measurements reach some bound or equilibrium at a low total memory usage (<1 GB), with backpressure applied to the application exceeding the pool's read throughput by slowing down its syscalls.

Actual behavior: the `asyncq_read` pend and slabtop measurements grow without apparent bound until the machine runs out of memory. The machine hangs for 2 minutes until a kernel task watchdog expires, and the OOM killer starts killing processes. This repeats until the process(es) generating the read requests are killed; only then does the read queue drain and the system return to idle.

Workaround: increase `zfs_vdev_async_read_max_active` until the `asyncq_read` pend drains and memory usage returns to a stable equilibrium.
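
For reference, applying the workaround amounts to a single runtime write to the module parameter under `/sys/module/zfs/parameters` (root required, zfs module loaded; no pool export/import needed). A minimal sketch, with the parameter path taken as an argument so the behavior can be checked without a live system:

```python
# Real sysfs path of the tunable on a system with the zfs module loaded.
PARAM = "/sys/module/zfs/parameters/zfs_vdev_async_read_max_active"

def set_queue_depth(value, param=PARAM):
    """Write a new async-read queue depth and return the previous value.

    Writing the sysfs file requires root; the change takes effect at
    runtime. The OpenZFS default for this parameter is 3 (verify on
    your version before relying on that).
    """
    with open(param) as f:
        old = int(f.read())
    with open(param, "w") as f:
        f.write(str(value))
    return old
```

The shell equivalent is simply echoing the new value into the same path.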

Include any warning/errors/backtraces from the system logs

OOM-killer memory messages from journalctl: sanitized.txt
excerpt:

Jan 14 04:30:51 hostname kernel: Mem-Info:
Jan 14 04:30:51 hostname kernel: active_anon:3744 inactive_anon:16451 isolated_anon:0
                                      active_file:6079 inactive_file:2837 isolated_file:0
                                      unevictable:80 dirty:39 writeback:15
                                      slab_reclaimable:25758 slab_unreclaimable:1727102
                                      mapped:36299 shmem:410 pagetables:14436
                                      sec_pagetables:0 bounce:0
                                      kernel_misc_reclaimable:0
                                      free:70559 free_pcp:60 free_cma:0
Jan 14 04:30:51 hostname kernel: Node 0 active_anon:14976kB inactive_anon:65804kB active_file:24316kB inactive_file:11348kB unevictable:320kB isolated(anon):0kB isolated(file):0kB mapped:145196kB dirty:156kB writeback:60kB shmem:1640kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:16384kB writeback_tmp:0kB kernel_stack:25712kB pagetables:57744kB sec_pagetables:0kB all_unreclaimable? no
Jan 14 04:30:51 hostname kernel: Node 0 DMA free:11272kB boost:0kB min:28kB low:40kB high:52kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15368kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jan 14 04:30:51 hostname kernel: lowmem_reserve[]: 0 2896 31963 31963 31963
Jan 14 04:30:51 hostname kernel: Node 0 DMA32 free:146960kB boost:16384kB min:22504kB low:25468kB high:28432kB reserved_highatomic:20480KB active_anon:1364kB inactive_anon:8252kB active_file:4328kB inactive_file:1048kB unevictable:0kB writepending:12kB present:3044884kB managed:2978476kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jan 14 04:30:51 hostname kernel: lowmem_reserve[]: 0 0 29067 29067 29067
Jan 14 04:30:51 hostname kernel: Node 0 Normal free:124004kB boost:65536kB min:126964kB low:156728kB high:186492kB reserved_highatomic:190464KB active_anon:13188kB inactive_anon:56340kB active_file:19340kB inactive_file:12652kB unevictable:320kB writepending:156kB present:30395904kB managed:29771528kB mlocked:320kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jan 14 04:30:51 hostname kernel: lowmem_reserve[]: 0 0 0 0 0
Jan 14 04:30:51 hostname kernel: Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11272kB
Jan 14 04:30:51 hostname kernel: Node 0 DMA32: 4154*4kB (UME) 1741*8kB (UME) 626*16kB (UME) 450*32kB (UME) 366*64kB (UME) 202*128kB (UME) 93*256kB (UMEH) 25*512kB (UMH) 7*1024kB (UMH) 0*2048kB 0*4096kB = 148016kB
Jan 14 04:30:51 hostname kernel: Node 0 Normal: 22143*4kB (UM) 1850*8kB (UM) 49*16kB (UM) 2*32kB (UM) 1*64kB (U) 1*128kB (U) 1*256kB (M) 10*512kB (UM) 13*1024kB (UM) 2*2048kB (M) 0*4096kB = 127196kB
Jan 14 04:30:51 hostname kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jan 14 04:30:51 hostname kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 14 04:30:51 hostname kernel: 14121 total pagecache pages
Jan 14 04:30:51 hostname kernel: 5212 pages in swap cache
Jan 14 04:30:51 hostname kernel: Free swap  = 129392380kB
Jan 14 04:30:51 hostname kernel: Total swap = 134217724kB
Jan 14 04:30:51 hostname kernel: 8364196 pages RAM
Jan 14 04:30:51 hostname kernel: 0 pages HighMem/MovableOnly
Jan 14 04:30:51 hostname kernel: 172853 pages reserved
Jan 14 04:30:51 hostname kernel: 0 pages cma reserved
Jan 14 04:30:51 hostname kernel: 0 pages hwpoisoned
Jan 14 04:30:51 hostname kernel: Unreclaimable slab info:
Jan 14 04:30:51 hostname kernel: Name                      Used          Total
...
Jan 14 04:30:51 hostname kernel: zio_cache            4880135KB    4880135KB

More info

My setup:

  • 1 pool with 5 vdevs, each a 2-way mirror of 7200 rpm drives
  • 4 processes, each with a handful of threads, fadvising and reading data with 16 KiB buffers
  • With the workaround applied, an empty `zio_cache`, and a full ARC, there is ~20 GB of free memory

I was seeing the `asyncq_read` pend column exceed 350k queued operations and still trending upward, and journalctl records the `zio_cache` exceeding 4.5 GB at the time the OOM killer was triggered.

I'm uncertain about the accounting for all of the memory, as I am not very familiar with profiling kernel memory usage, but there was a clear correlation between the exhaustion of my overall memory and the growth of the `zio_cache` and `asyncq_read` pend size. The problem went away once I tuned async reads in the zio scheduler.

Solutions?

While the workaround above does solve the pool-throughput problem that is the root cause in my case, it is extremely user-unfriendly that a throughput problem surfaces as a kernel hang and process crashes. Some smoother degradation should kick in first, before the system ever reaches the OOM condition.

If there were `zfs_vdev_*_*_max_pending` or similar parameters that imposed a bound on the pending queues, and hitting that bound slowed the affected pool/process down with increased read/write latency, this would be just another performance tunable rather than a system-wide crash.
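
As an illustration of the proposal only (the parameter name and mechanism are hypothetical, not existing OpenZFS code): such a bound could behave like a counting semaphore taken at admission, so that once `max_pending` I/Os are outstanding, the issuing thread blocks and the pressure surfaces as syscall latency instead of unbounded slab growth:

```python
import threading

class BoundedPendingQueue:
    """Hypothetical sketch of a zfs_vdev_*_max_pending bound.

    Admission blocks the caller once max_pending operations are in
    flight, applying backpressure to the process issuing the I/O
    rather than letting pending zios accumulate without limit.
    """
    def __init__(self, max_pending):
        self._slots = threading.BoundedSemaphore(max_pending)

    def submit(self, io_fn, *args):
        self._slots.acquire()       # blocks the caller at the bound
        try:
            return io_fn(*args)     # stand-in for issuing the zio
        finally:
            self._slots.release()   # completion frees a slot
```

In this toy version the I/O runs synchronously, so the bound is trivially enforced; in a real zio pipeline the release would happen in the completion path, but the backpressure property is the same.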

If tunables for this already exist under a different name, please consider changing their defaults to more conservative values and documenting them so that they are easier to discover; I could not find them on this page: https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zio-scheduler

@gharris1727 added the "Type: Defect" label on Jan 16, 2024
@behlendorf added the "Component: Memory Management" label on Jan 17, 2024