
OOM / Panic on files remove #16037

Open

osleg opened this issue Mar 28, 2024 · 6 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@osleg

osleg commented Mar 28, 2024

Problem

While testing OpenZFS versions 2.1.13-2.1.15 and 2.2.2-2.2.3 on CentOS 8 Stream
with various kernel versions ranging from 4.18.0-408 to 4.18.0-547, on an
Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz with 8GB ECC RAM, we encountered
a memory consumption issue that leads to a kernel panic during disk usage stress testing.

Test setup

The zpool was tested in multiple configurations:

  • Prior to 2.2.3: compression disabled
  • 2.2.3: compression set to the default
  • All versions: variants with all defaults, as well as with ashift=12 and autoexpand enabled (planned for use); other properties left at defaults

Pools are non-mirrored, built from block devices of varying sizes (147GB to
6.7TB) but consistent speed and throughput.

The test runs multiple writers that fill the pool with random-sized files
ranging from 1KB to 2GB. Once the disks are full, all files are removed and
the process repeats (a rough sketch of this loop is given below).
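
As a rough sketch of the cycle (the real test harness differs; the writer count, file naming, and size arithmetic here are illustrative assumptions, not the actual test code):

#!/bin/bash
# Fill/delete stress loop: several writers create random-sized files
# (1KB .. ~2GB) until the pool runs out of space, then everything is
# removed and the cycle starts again. Target path and writer count are
# assumptions for illustration only.
TARGET=/mnt/dir2
WRITERS=8

fill() {
    local id=$1 n=0
    while true; do
        # pick a size between 1KB and ~2GB
        size_kb=$(( (RANDOM * 32768 + RANDOM) % (2 * 1024 * 1024) + 1 ))
        dd if=/dev/urandom of="$TARGET/writer${id}-file${n}" \
           bs=1K count="$size_kb" status=none || break   # stop on ENOSPC
        n=$((n + 1))
    done
}

while true; do
    for i in $(seq "$WRITERS"); do fill "$i" & done
    wait                   # all writers hit ENOSPC: pool is full
    rm -f "$TARGET"/*      # this is where the memory spike happens
done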

Observed issue

Across all tested versions, and most pronounced in versions prior to 2.2.3,
significant memory consumption occurs when files are removed.

Memory usage spikes, consuming all available memory.

The OOM killer activates in an attempt to free memory, resulting in kernel
panics when no further resources are available for the OOM killer to release.

With 8GB RAM, the issue occurs consistently in every test instance on versions
before 2.2.3, and less frequently on 2.2.3 (5 out of 20 CentOS test instances
experienced kernel panics).

  • Increasing RAM to 32GB sometimes mitigates the issue.
  • Removing approximately 500GB of small files consumes around 20GB of memory on a 32GB machine.
  • Memory is predominantly consumed by the zio_buf_... and zio_cache slabs (one way to watch these is shown below).
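
For reference, the slab figures above can be watched live from the SPL proc interface while the removal runs (a sketch; the name patterns below match the slabs discussed later in this issue, and the exact set varies by system):

# watch the zio-related SPL slabs grow during the removal
watch -n1 "grep -E 'zio_cache|zio_link_cache|zio_buf' /proc/spl/kmem/slab"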

Logs

Machine info

This instance is the only one I have left for testing right now:

# cat /etc/os-release
NAME="CentOS Stream"
VERSION="8"

# zfs --version
zfs-2.2.3-1
zfs-kmod-2.2.3-1

# dmidecode -t memory
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0008, DMI type 16, 23 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: Unknown
	Maximum Capacity: 8 GB
	Error Information Handle: Not Provided
	Number Of Devices: 1

Handle 0x0009, DMI type 17, 34 bytes
Memory Device
	Array Handle: 0x0008
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 8 GB
	Form Factor: DIMM
	Set: None
	Locator: Not Specified
	Bank Locator: Not Specified
	Type: DDR4
	Type Detail: Static Column Pseudo-static Synchronous Window DRAM
	Speed: 2933 MT/s
	Manufacturer: Not Specified
	Serial Number: Not Specified
	Asset Tag: Not Specified
	Part Number: Not Specified
	Rank: Unknown
	Configured Memory Speed: Unknown

# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel(R) Corporation
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
BIOS Model name:     Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:            4
CPU MHz:             2999.998
BogoMIPS:            5999.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

# uname -srm
Linux 4.18.0-540.el8.x86_64 x86_64

Issue demo

total 295G
-rw-r--r--. 1 root root 2.0G Mar 20 02:45 Swordsman-13505
-rw-r--r--. 1 root root 2.0G Mar 20 03:37 Swordsman-13554
....
-rw-r--r--. 1 root root 2.0G Mar 20 19:32 Swordsman-14370
-rw-r--r--. 1 root root 884M Mar 20 19:33 Swordsman-14371

# ll | wc -l
138

# du -ch
295G	.
295G	total

# rm -f *
# 
client_loop: send disconnect: Broken pipe

After reconnecting over SSH, the directory still has all the files:

138
# du -ch /mnt/dir2
295G	/mnt/dir2
295G	total

zpool status

  pool: mnt
 state: ONLINE
remove: Removal of vdev 19 copied 28.6G in 0h3m, completed on Thu Mar 21 20:28:28 2024
	14.7M memory used for removed device mappings
config:

	NAME           STATE     READ WRITE CKSUM
	mnt            ONLINE       0     0     0
	  nvme4n1      ONLINE       0     0     0
	  nvme3n1      ONLINE       0     0     0
	  nvme1n1      ONLINE       0     0     0
	  nvme2n1      ONLINE       0     0     0

errors: No known data errors

zpool list

NAME            SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
mnt            3.05T  2.52T   549G        -         -     0%    82%  1.00x    ONLINE  -
  indirect-0       -      -      -        -         -      -      -      -    ONLINE
  indirect-1       -      -      -        -         -      -      -      -    ONLINE
  indirect-2       -      -      -        -         -      -      -      -    ONLINE
  indirect-3       -      -      -        -         -      -      -      -    ONLINE
  indirect-4       -      -      -        -         -      -      -      -    ONLINE
  indirect-5       -      -      -        -         -      -      -      -    ONLINE
  indirect-6       -      -      -        -         -      -      -      -    ONLINE
  indirect-7       -      -      -        -         -      -      -      -    ONLINE
  indirect-8       -      -      -        -         -      -      -      -    ONLINE
  indirect-9       -      -      -        -         -      -      -      -    ONLINE
  indirect-10      -      -      -        -         -      -      -      -    ONLINE
  indirect-11      -      -      -        -         -      -      -      -    ONLINE
  indirect-12      -      -      -        -         -      -      -      -    ONLINE
  indirect-13      -      -      -        -         -      -      -      -    ONLINE
  indirect-14      -      -      -        -         -      -      -      -    ONLINE
  indirect-15      -      -      -        -         -      -      -      -    ONLINE
  indirect-16      -      -      -        -         -      -      -      -    ONLINE
  nvme4n1      2.93T  2.47T   469G        -         -     0%  84.3%      -    ONLINE
  nvme3n1      25.0G  24.3G   169M        -         -     0%  99.3%      -    ONLINE
  indirect-19      -      -      -        -         -      -      -      -    ONLINE
  nvme1n1      25.0G  24.3G   232M        -         -    22%  99.1%      -    ONLINE
  nvme2n1      80.0G   680K  79.5G        -         -     0%  0.00%      -    ONLINE

zpool config

NAME  PROPERTY                       VALUE                          SOURCE
mnt   size                           3.05T                          -
mnt   capacity                       82%                            -
mnt   altroot                        -                              default
mnt   health                         ONLINE                         -
mnt   guid                           8946787721482689307            -
mnt   version                        -                              default
mnt   bootfs                         -                              default
mnt   delegation                     on                             default
mnt   autoreplace                    off                            default
mnt   cachefile                      -                              default
mnt   failmode                       wait                           default
mnt   listsnapshots                  off                            default
mnt   autoexpand                     on                             local
mnt   dedupratio                     1.00x                          -
mnt   free                           549G                           -
mnt   allocated                      2.52T                          -
mnt   readonly                       off                            -
mnt   ashift                         12                             local
mnt   comment                        -                              default
mnt   expandsize                     -                              -
mnt   freeing                        0                              -
mnt   fragmentation                  0%                             -
mnt   leaked                         0                              -
mnt   multihost                      off                            default
mnt   checkpoint                     -                              -
mnt   load_guid                      17249711793930708177           -
mnt   autotrim                       off                            default
mnt   compatibility                  off                            default
mnt   bcloneused                     0                              -
mnt   bclonesaved                    0                              -
mnt   bcloneratio                    1.00x                          -
mnt   feature@async_destroy          enabled                        local
mnt   feature@empty_bpobj            enabled                        local
mnt   feature@lz4_compress           active                         local
mnt   feature@multi_vdev_crash_dump  enabled                        local
mnt   feature@spacemap_histogram     active                         local
mnt   feature@enabled_txg            active                         local
mnt   feature@hole_birth             active                         local
mnt   feature@extensible_dataset     active                         local
mnt   feature@embedded_data          active                         local
mnt   feature@bookmarks              enabled                        local
mnt   feature@filesystem_limits      enabled                        local
mnt   feature@large_blocks           enabled                        local
mnt   feature@large_dnode            enabled                        local
mnt   feature@sha512                 enabled                        local
mnt   feature@skein                  enabled                        local
mnt   feature@edonr                  enabled                        local
mnt   feature@userobj_accounting     active                         local
mnt   feature@encryption             enabled                        local
mnt   feature@project_quota          active                         local
mnt   feature@device_removal         active                         local
mnt   feature@obsolete_counts        active                         local
mnt   feature@zpool_checkpoint       enabled                        local
mnt   feature@spacemap_v2            active                         local
mnt   feature@allocation_classes     enabled                        local
mnt   feature@resilver_defer         enabled                        local
mnt   feature@bookmark_v2            enabled                        local
mnt   feature@redaction_bookmarks    enabled                        local
mnt   feature@redacted_datasets      enabled                        local
mnt   feature@bookmark_written       enabled                        local
mnt   feature@log_spacemap           active                         local
mnt   feature@livelist               enabled                        local
mnt   feature@device_rebuild         enabled                        local
mnt   feature@zstd_compress          enabled                        local
mnt   feature@draid                  enabled                        local
mnt   feature@zilsaxattr             enabled                        local
mnt   feature@head_errlog            active                         local
mnt   feature@blake3                 enabled                        local
mnt   feature@block_cloning          enabled                        local
mnt   feature@vdev_zaps_v2           active                         local

zfs config

NAME  PROPERTY              VALUE                  SOURCE
mnt   type                  filesystem             -
mnt   creation              Sun Mar 10 13:54 2024  -
mnt   used                  2.52T                  -
mnt   available             451G                   -
mnt   referenced            2.52T                  -
mnt   compressratio         1.00x                  -
mnt   mounted               yes                    -
mnt   quota                 none                   default
mnt   reservation           none                   default
mnt   recordsize            128K                   default
mnt   mountpoint            /mnt                   local
mnt   sharenfs              off                    default
mnt   checksum              on                     default
mnt   compression           on                     default
mnt   atime                 on                     default
mnt   devices               on                     default
mnt   exec                  on                     default
mnt   setuid                on                     default
mnt   readonly              off                    default
mnt   zoned                 off                    default
mnt   snapdir               hidden                 default
mnt   aclmode               discard                default
mnt   aclinherit            restricted             default
mnt   createtxg             1                      -
mnt   canmount              on                     default
mnt   xattr                 on                     default
mnt   copies                1                      default
mnt   version               5                      -
mnt   utf8only              off                    -
mnt   normalization         none                   -
mnt   casesensitivity       sensitive              -
mnt   vscan                 off                    default
mnt   nbmand                off                    default
mnt   sharesmb              off                    default
mnt   refquota              none                   default
mnt   refreservation        none                   default
mnt   guid                  11115806655719226472   -
mnt   primarycache          all                    default
mnt   secondarycache        all                    default
mnt   usedbysnapshots       0B                     -
mnt   usedbydataset         2.52T                  -
mnt   usedbychildren        120M                   -
mnt   usedbyrefreservation  0B                     -
mnt   logbias               latency                default
mnt   objsetid              54                     -
mnt   dedup                 off                    default
mnt   mlslabel              none                   default
mnt   sync                  standard               default
mnt   dnodesize             legacy                 default
mnt   refcompressratio      1.00x                  -
mnt   written               2.52T                  -
mnt   logicalused           2.35T                  -
mnt   logicalreferenced     2.35T                  -
mnt   volmode               default                default
mnt   filesystem_limit      none                   default
mnt   snapshot_limit        none                   default
mnt   filesystem_count      none                   default
mnt   snapshot_count        none                   default
mnt   snapdev               hidden                 default
mnt   acltype               off                    default
mnt   context               none                   default
mnt   fscontext             none                   default
mnt   defcontext            none                   default
mnt   rootcontext           none                   default
mnt   relatime              on                     default
mnt   redundant_metadata    all                    default
mnt   overlay               on                     default
mnt   encryption            off                    default
mnt   keylocation           none                   default
mnt   keyformat             none                   default
mnt   pbkdf2iters           0                      default
mnt   special_small_blocks  0                      default

vmcore dmesg

dmesg.txt

Maybe related:

#14732
#15776
#14914

@osleg osleg added the Type: Defect Incorrect behavior (e.g. crash, hang) label Mar 28, 2024
@robn
Member

robn commented Mar 28, 2024

@osleg can you post /proc/spl/kmem/slab from before and after the OOM event? It doesn't need to be exact, but I'd like to see what happens as more files are deleted, through the kernel attempting to reclaim memory, until it finally gives up and kills something.
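
For anyone reproducing this, one simple way to capture those snapshots (a sketch only; the file naming is an assumption that happens to match the logs attached below) is to dump the proc file on a timer until the machine goes down:

# snapshot SPL slab stats once per second; files are named
# slab_<epoch>_<seq>.log so the sequence survives the crash
i=0
while true; do
    cp /proc/spl/kmem/slab "slab_$(date +%s)_${i}.log"
    i=$((i + 1))
    sleep 1
done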

@osleg
Author

osleg commented Apr 1, 2024

@robn sorry, it took me a bit of time to get these. Here are 3 logs: the first from before rm -f /mnt/dir2/* started, the second from right after rm returned, and the third is the last one I was able to fetch before the kernel panic:
slab_1711966416_169.log
slab_1711966417_170.log
slab_1711966419_171.log

@eliran-zada-zesty

Got this issue as well... :-(

@sigkacey

I've also hit this recently and it looks like it is similar if not the same as #6783

@robn
Member

robn commented Oct 29, 2024

@osleg I'm so sorry I missed this.

Your slab output confirms it: see zio_cache, zio_link_cache, abd_t and multiple zio_buf_comb_* slabs ballooning in size. I've seen this pattern before in another context; it happens when we dump a ton of data accesses into the IO pipeline, because they all preallocate memory and just sit and wait their turn.

Intuitively, freeing objects shouldn't require large data allocations; however, freeing an object naturally involves metadata updates. #6783 involves dedup, which isn't in play here, but I notice all those indirect vdevs, which I expect would need to be updated as frees come through, so maybe that's producing a similar effect.

I'm not very familiar with the file deletion codepaths, and even less so with indirect vdevs, so I'll need to read a bunch of code to get an idea of how this stuff works before I can go any further.

@robn
Member

robn commented Oct 30, 2024

I'm closing in on this. If you're still able to do the test, I could use one more bit of information. Create the files, but don't delete anything. Just run: zdb -b mnt (substitute the name of your pool for mnt if it differs).

This will scan the entire pool from userspace, and produce a summary of all the blocks on the pool:

	bp count:              13812632
	ganged count:             46527
	bp logical:        821251308032      avg:  59456
	bp physical:       399499045888      avg:  28922     compression:   2.06
	bp allocated:      425925246976      avg:  30835     compression:   1.93
	bp deduped:                   0    ref>1:      0   deduplication:   1.00
	bp cloned:            303611904    count:  12628
	Normal class:      426201133056     used: 90.21%
	Embedded log class         311296     used:  0.01%

	additional, non-pointer bps of type 0:        305
	Dittoed blocks on same vdev: 1550574
	Dittoed blocks in same metaslab: 2

Basically what's happening is that if we can't process a "free block" operation on the spot, we put it on the IO pipeline. The overhead of that many ops suddenly landing on the pipeline consumes a ton of memory (that's the zio_cache and zio_link_cache).

The thing is, we only do that if freeing the block may also require at least a read (and maybe an update) of some other metadata. Specifically:

  • the block is ganged (not you, pool is not fragmented)
  • the block is on the dedup table (not you, dedup=off)
  • the block possibly has a reference on the BRT (not you, bcloneused=0 and feature@block_cloning is not active)

#6783 and #16697 both involve dedup, which explains them. Yours is less clear. The output from zdb -b should rule out these cases entirely, and I'm hoping will give some other insight.
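
For anyone trying to rule these cases in or out on their own pool, the relevant figures are all visible from userspace (a sketch; the property names are the ones shown earlier in this issue, and mnt is this reporter's pool name):

# dedup and block-cloning state for the pool
zpool get dedupratio,bcloneused,bclonesaved,feature@block_cloning mnt

# gang-block count from the block statistics scan (this walks the whole
# pool and can take a long time; run it before deleting anything)
zdb -b mnt | grep -i ganged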
