
Exceeding throughput of pool with posix_fadvise causes excessive zio_cache growth, system OOM and process crashes #15776

Open · gharris1727 opened this issue Jan 16, 2024 · 0 comments
Labels: Component: Memory Management (kernel memory management), Type: Defect (incorrect behavior, e.g. crash, hang)

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Arch Linux |
| Distribution Version | N/A |
| Kernel Version | 6.6.10-arch1-1 |
| Architecture | x86_64 |
| OpenZFS Version | zfs-2.2.2-1 |

Describe the problem you're observing

When executing a workload that exceeds the read throughput of my zpool, kernel memory usage grows without bound until my userspace applications are OOM-killed.

Describe how to reproduce the problem

  • Use slow hard drives (e.g. 7200 rpm drives, simulated slow devices, etc.)
  • Create a zpool/dataset with a small recordsize (e.g. `recordsize=16K`)
  • Configure `zfs_vdev_async_read_max_active=1`
  • Start a random-read-heavy workload that issues `posix_fadvise` hints
  • Run `watch zpool iostat -vqy 1 1` to monitor the `asyncq_read` pend column
  • Run `slabtop` to monitor the `zio_cache` total size
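
The workload in the last repro step can be sketched as follows. This is a hypothetical stand-in for the real application, assuming a large file on the affected dataset and 16 KiB record-aligned random reads; `fadvise_random_read` and all its parameters are illustrative names, not part of any tool mentioned above:

```python
import os
import random

def fadvise_random_read(path, iterations=1000, bufsize=16 * 1024):
    """Issue POSIX_FADV_WILLNEED hints followed by random 16 KiB reads.

    Each WILLNEED hint asks the kernel to start asynchronous readahead;
    on ZFS these land on the asyncq_read queue, which is the queue that
    grows without bound in this report. Returns total bytes read.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        total = 0
        for _ in range(iterations):
            offset = random.randrange(0, max(size - bufsize, 1))
            offset -= offset % bufsize  # align to the 16K recordsize
            os.posix_fadvise(fd, offset, bufsize, os.POSIX_FADV_WILLNEED)
            total += len(os.pread(fd, bufsize, offset))
        return total
    finally:
        os.close(fd)
```

Running several copies of this in parallel against a pool configured as in the steps above should let the `asyncq_read` pend column be observed growing in `zpool iostat -vqy`.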

Expected behavior: the `asyncq_read` pend and slabtop measurements reach some bound or equilibrium at a low total memory usage (<1 GB), with backpressure applied to the application exceeding the pool's read throughput by slowing down its syscalls.

Actual behavior: the `asyncq_read` pend and slabtop measurements grow without apparent bound until the machine runs out of memory. The machine hangs for 2 minutes until a kernel task watchdog expires, and the OOM killer starts killing processes. This repeats until the process(es) generating the read requests are killed; only then does the read queue drain and the system return to idle.

Workaround: increase `zfs_vdev_async_read_max_active` until the `asyncq_read` pend drains and memory usage returns to a stable equilibrium.
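
For reference, applying the workaround amounts to a single runtime write to the module parameter under `/sys/module/zfs/parameters` (root required, zfs module loaded; no pool export/import needed). A minimal sketch, with the parameter path taken as an argument so the behavior can be checked without a live system:

```python
# Real sysfs path of the tunable on a system with the zfs module loaded.
PARAM = "/sys/module/zfs/parameters/zfs_vdev_async_read_max_active"

def set_queue_depth(value, param=PARAM):
    """Write a new async-read queue depth and return the previous value.

    Writing the sysfs file requires root; the change takes effect at
    runtime. The OpenZFS default for this parameter is 3 (verify on
    your version before relying on that).
    """
    with open(param) as f:
        old = int(f.read())
    with open(param, "w") as f:
        f.write(str(value))
    return old
```

The shell equivalent is simply echoing the new value into the same path.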

Include any warning/errors/backtraces from the system logs

OOM-killer memory messages from journalctl: sanitized.txt
excerpt:

Jan 14 04:30:51 hostname kernel: Mem-Info:
Jan 14 04:30:51 hostname kernel: active_anon:3744 inactive_anon:16451 isolated_anon:0
                                      active_file:6079 inactive_file:2837 isolated_file:0
                                      unevictable:80 dirty:39 writeback:15
                                      slab_reclaimable:25758 slab_unreclaimable:1727102
                                      mapped:36299 shmem:410 pagetables:14436
                                      sec_pagetables:0 bounce:0
                                      kernel_misc_reclaimable:0
                                      free:70559 free_pcp:60 free_cma:0
Jan 14 04:30:51 hostname kernel: Node 0 active_anon:14976kB inactive_anon:65804kB active_file:24316kB inactive_file:11348kB unevictable:320kB isolated(anon):0kB isolated(file):0kB mapped:145196kB dirty:156kB writeback:60kB shmem:1640kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:16384kB writeback_tmp:0kB kernel_stack:25712kB pagetables:57744kB sec_pagetables:0kB all_unreclaimable? no
Jan 14 04:30:51 hostname kernel: Node 0 DMA free:11272kB boost:0kB min:28kB low:40kB high:52kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15368kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jan 14 04:30:51 hostname kernel: lowmem_reserve[]: 0 2896 31963 31963 31963
Jan 14 04:30:51 hostname kernel: Node 0 DMA32 free:146960kB boost:16384kB min:22504kB low:25468kB high:28432kB reserved_highatomic:20480KB active_anon:1364kB inactive_anon:8252kB active_file:4328kB inactive_file:1048kB unevictable:0kB writepending:12kB present:3044884kB managed:2978476kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jan 14 04:30:51 hostname kernel: lowmem_reserve[]: 0 0 29067 29067 29067
Jan 14 04:30:51 hostname kernel: Node 0 Normal free:124004kB boost:65536kB min:126964kB low:156728kB high:186492kB reserved_highatomic:190464KB active_anon:13188kB inactive_anon:56340kB active_file:19340kB inactive_file:12652kB unevictable:320kB writepending:156kB present:30395904kB managed:29771528kB mlocked:320kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jan 14 04:30:51 hostname kernel: lowmem_reserve[]: 0 0 0 0 0
Jan 14 04:30:51 hostname kernel: Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11272kB
Jan 14 04:30:51 hostname kernel: Node 0 DMA32: 4154*4kB (UME) 1741*8kB (UME) 626*16kB (UME) 450*32kB (UME) 366*64kB (UME) 202*128kB (UME) 93*256kB (UMEH) 25*512kB (UMH) 7*1024kB (UMH) 0*2048kB 0*4096kB = 148016kB
Jan 14 04:30:51 hostname kernel: Node 0 Normal: 22143*4kB (UM) 1850*8kB (UM) 49*16kB (UM) 2*32kB (UM) 1*64kB (U) 1*128kB (U) 1*256kB (M) 10*512kB (UM) 13*1024kB (UM) 2*2048kB (M) 0*4096kB = 127196kB
Jan 14 04:30:51 hostname kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jan 14 04:30:51 hostname kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 14 04:30:51 hostname kernel: 14121 total pagecache pages
Jan 14 04:30:51 hostname kernel: 5212 pages in swap cache
Jan 14 04:30:51 hostname kernel: Free swap  = 129392380kB
Jan 14 04:30:51 hostname kernel: Total swap = 134217724kB
Jan 14 04:30:51 hostname kernel: 8364196 pages RAM
Jan 14 04:30:51 hostname kernel: 0 pages HighMem/MovableOnly
Jan 14 04:30:51 hostname kernel: 172853 pages reserved
Jan 14 04:30:51 hostname kernel: 0 pages cma reserved
Jan 14 04:30:51 hostname kernel: 0 pages hwpoisoned
Jan 14 04:30:51 hostname kernel: Unreclaimable slab info:
Jan 14 04:30:51 hostname kernel: Name                      Used          Total
...
Jan 14 04:30:51 hostname kernel: zio_cache            4880135KB    4880135KB

More info

My setup:

  • 1 pool with 5 vdevs, each a 2-way mirror of 7200 rpm drives
  • 4 processes, each with a handful of threads, fadvising and reading data with 16 KiB buffers
  • With the workaround applied, an empty `zio_cache`, and a full ARC, there is ~20 GB of free memory

I was seeing the `asyncq_read` pend column exceed 350k queued operations and still trending upward, and journalctl records the `zio_cache` exceeding 4.5 GB at the time the OOM killer was triggered.

I'm uncertain about the accounting for all of the memory, as I am not very familiar with profiling kernel memory usage, but there was a clear correlation between the exhaustion of my overall memory and the growth of the `zio_cache` and `asyncq_read` pend size. The problem went away once I tuned async reads in the zio scheduler.

Solutions?

While the workaround above does solve the pool-throughput problem that is the root cause in my case, it is extremely user-unfriendly that a throughput problem surfaces as a kernel hang and process crashes. Some smoother degradation should kick in first, before the system ever reaches the OOM condition.

If there were `zfs_vdev_*_*_max_pending` or similar parameters that imposed a bound on the pending queues, and hitting that bound slowed the affected pool/process down with increased read/write latency, this would be just another performance tunable rather than a system-wide crash.
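
As an illustration of the proposal only (the parameter name and mechanism are hypothetical, not existing OpenZFS code): such a bound could behave like a counting semaphore taken at admission, so that once `max_pending` I/Os are outstanding, the issuing thread blocks and the pressure surfaces as syscall latency instead of unbounded slab growth:

```python
import threading

class BoundedPendingQueue:
    """Hypothetical sketch of a zfs_vdev_*_max_pending bound.

    Admission blocks the caller once max_pending operations are in
    flight, applying backpressure to the process issuing the I/O
    rather than letting pending zios accumulate without limit.
    """
    def __init__(self, max_pending):
        self._slots = threading.BoundedSemaphore(max_pending)

    def submit(self, io_fn, *args):
        self._slots.acquire()       # blocks the caller at the bound
        try:
            return io_fn(*args)     # stand-in for issuing the zio
        finally:
            self._slots.release()   # completion frees a slot
```

In this toy version the I/O runs synchronously, so the bound is trivially enforced; in a real zio pipeline the release would happen in the completion path, but the backpressure property is the same.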

If tunables for this already exist under a different name, please consider changing their defaults to more conservative values and documenting them so that they are easier to discover; I could not find them on this page: https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zio-scheduler

@gharris1727 added the "Type: Defect" label on Jan 16, 2024
@behlendorf added the "Component: Memory Management" label on Jan 17, 2024