
Linux 6.6.x(2.2.5) -> 6.12.x(2.2.7): WRITE FPDMA QUEUED (ATA bus error) #16873

Closed
IvanVolosyuk opened this issue Dec 15, 2024 · 23 comments
Labels: Type: Defect (incorrect behavior, e.g. crash, hang)

@IvanVolosyuk (Contributor) commented Dec 15, 2024

I tried to boot ZFS 2.2.7 with a 6.12.4 kernel with the new PREEMPT_RT, running a local zfs send ... | zfs recv ... while running 3DMark / Port Royal in a Windows 11 VM under kvm/vfio.

I ran this workload perfectly fine on 6.6.xx and ZFS 2.2.5, but now it failed badly: WRITE FPDMA QUEUED errors, hung ZFS kthreads, zed errors. Somehow, the system was writing logs fine on another pool, and zpool status showed zero errors. All my disks are parts of zpools, so no writes can happen outside of ZFS.

System information

Type Version/Name
Distribution Name Gentoo
Distribution Version Live
Kernel Version 6.12.4-gentoo
Architecture x86_64
OpenZFS Version 2.2.7
ECC ram Yes
Device mapper No

Describe the problem you're observing

  • SATA errors: WRITE FPDMA QUEUED. These usually indicate failing disks, but all ZFS error counters were at zero when I ran zpool status, and the SMART data from the disks shows UDMA_CRC_Error_Count, Reallocated_Sector_Ct and Offline_Uncorrectable all at zero (see the smartctl sketch below this list).
  • Multiple hung tasks, including ZFS kthreads.
  • ZED complaining about delays of up to 5 minutes.
  • kvm ignored msrs: irrelevant noise, caused by 3DMark running in Windows 11.
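
For reference, those SMART counters can be read with smartctl from smartmontools; a minimal sketch (the device path is a placeholder):

  smartctl -A /dev/sdX | grep -E 'UDMA_CRC_Error_Count|Reallocated_Sector_Ct|Offline_Uncorrectable'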

Describe how to reproduce the problem

I only got it once before pulling the plug. I believe it may be a bad interaction between a kernel compiled with PREEMPT_RT and ZFS under heavy disk activity from zfs send and kvm.

Run qemu/kvm/vfio with vCPU pinning and run the 3DMark benchmark on a PREEMPT_RT kernel,
with zfs receive running in the background storing around 50G into that pool (a sketch below).
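
A sketch of that background replication step (snapshot and dataset names are placeholders):

  zfs send -R tank/data@snap | zfs recv -u archive/data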
Kernel params:

  • ZFS: zfs_arc_min=16G zfs_arc_max=40G spl.spl_taskq_thread_bind=0 spl.spl_taskq_thread_priority=0 init_on_alloc=0 zfs.l2arc_rebuild_blocks_min_l2size=1 zfs.zfs_bclone_enabled=0
  • zfs_txg_timeout=120, l2arc_trim_ahead=1, zfs_arc_pc_percent=50, l2arc_exclude_special=1
  • ZFS props: sync=disabled, compression=lz4, encryption=off
  • CPU pinning: nohz_full=0-7,16-23 rcu_nocbs=0-7,16-23 irqaffinity=8-15,24-31 rcu_nocb_poll cgroup_no_v1=all

Include any warning/errors/backtraces from the system logs

log.txt
kernel-config.txt

But zpool status reports 0 errors, and the logs were successfully stored on a different SSD-only pool. Interesting bits:

Dec 15 17:30:02 toster kernel: ata10.00: exception Emask 0x10 SAct 0x1000000 SErr 0x4050000 action 0xe frozen
Dec 15 17:30:02 toster kernel: ata10.00: irq_stat 0x00000040, connection status changed
Dec 15 17:30:02 toster kernel: ata10: SError: { PHYRdyChg CommWake DevExch }
Dec 15 17:30:02 toster kernel: ata10.00: failed command: WRITE FPDMA QUEUED
Dec 15 17:30:02 toster kernel: ata10.00: cmd 61/18:c0:d0:f6:1b/00:00:04:01:00/40 tag 24 ncq dma 12288 out
                                        res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Dec 15 17:30:02 toster kernel: ata10.00: status: { DRDY }
Dec 15 17:30:02 toster kernel: ata10: hard resetting link
Dec 15 17:30:03 toster kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 15 17:30:03 toster kernel: ata10.00: configured for UDMA/133
Dec 15 17:30:03 toster kernel: ata10: EH complete
.....

Dec 15 17:37:32 toster kernel: Call Trace:
Dec 15 17:37:32 toster kernel:  <TASK>
Dec 15 17:37:32 toster kernel:  __schedule+0x3b5/0xb60
Dec 15 17:37:32 toster kernel:  schedule+0x27/0xd0
Dec 15 17:37:32 toster kernel:  schedule_timeout+0x83/0x140
Dec 15 17:37:32 toster kernel:  ? timer_recalc_next_expiry+0x110/0x110
Dec 15 17:37:32 toster kernel:  io_schedule_timeout+0x51/0x70
Dec 15 17:37:32 toster kernel:  __cv_timedwait_common+0x116/0x150
Dec 15 17:37:32 toster kernel:  ? dequeue_task_stop+0x80/0x80
Dec 15 17:37:32 toster kernel:  __cv_timedwait_io+0x19/0x20
Dec 15 17:37:32 toster kernel:  zio_wait+0x10c/0x260
Dec 15 17:37:32 toster kernel:  dmu_tx_count_free+0x1ce/0x210
Dec 15 17:37:32 toster kernel:  dmu_free_long_range+0x214/0x500
Dec 15 17:37:32 toster kernel:  receive_writer_thread+0x449/0xb10
Dec 15 17:37:32 toster kernel:  ? preempt_schedule+0x33/0x50
Dec 15 17:37:32 toster kernel:  ? spl_taskq_fini+0x80/0x80
Dec 15 17:37:32 toster kernel:  ? receive_process_write_record+0x290/0x290
Dec 15 17:37:32 toster kernel:  ? spl_taskq_fini+0x80/0x80
Dec 15 17:37:32 toster kernel:  thread_generic_wrapper+0x5a/0x70
Dec 15 17:37:32 toster kernel:  kthread+0xcf/0x100
Dec 15 17:37:32 toster kernel:  ? kthread_park+0x90/0x90
Dec 15 17:37:32 toster kernel:  ret_from_fork+0x31/0x50
Dec 15 17:37:32 toster kernel:  ? kthread_park+0x90/0x90
Dec 15 17:37:32 toster kernel:  ret_from_fork_asm+0x11/0x20
.....
Dec 15 17:38:51 toster zed[6531]: Missed 116 events
Dec 15 17:38:51 toster zed[6531]: Bumping queue length to 2048
Dec 15 17:38:51 toster zed[17318]: eid=144 class=delay pool='archive' vdev=wwn-0x5000c500b699c096-part3 size=12288 offset=1661355368448 priority=0 err=0 flags=0x180180 delay=306724ms bookmark=1731:28:1:21110
Dec 15 17:38:51 toster zed[17319]: eid=145 class=delay pool='archive' vdev=wwn-0x5000c500b699c096-part3 size=12288 offset=613095067648 priority=3 err=0 flags=0x180080 delay=306842ms bookmark=1731:28:0:20924881
Dec 15 17:38:51 toster zed[17326]: eid=146 class=delay pool='archive' vdev=wwn-0x5000c500b699c096-part3 size=12288 offset=613095092224 priority=3 err=0 flags=0x180080 delay=306842ms bookmark=1731:28:0:20924883
Dec 15 17:38:51 toster zed[17330]: eid=148 class=delay pool='games' vdev=wwn-0x5000c500b699c096-part1 size=4096 offset=176128 priority=1 err=0 flags=0x1802c0 delay=210478ms
Dec 15 17:38:51 toster zed[17332]: eid=149 class=delay pool='games' vdev=wwn-0x5000c500b699c096-part1 size=4096 offset=438272 priority=1 err=0 flags=0x1802c0 delay=210478ms
@IvanVolosyuk IvanVolosyuk added the Type: Defect Incorrect behavior (e.g. crash, hang) label Dec 15, 2024
@IvanVolosyuk IvanVolosyuk changed the title ZFS + PREEMPT_RT: WRITE FPDMA QUEUED? ZFS + CONFIG_PREEMPT_RT: WRITE FPDMA QUEUED? Dec 15, 2024
@robn (Member) commented Dec 15, 2024

@IvanVolosyuk Can you try zfs_vdev_disk_classic=0? I don't have a strong theory, but device timeouts on small writes (4K-12K) when the system is under heavy load is a pretty good broad summary of that entire class of problem.
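
For anyone following along: the parameter is read at module load time, so a sketch for checking the current value and making a setting persistent (the modprobe.d file name is illustrative):

  cat /sys/module/zfs/parameters/zfs_vdev_disk_classic
  echo "options zfs zfs_vdev_disk_classic=0" > /etc/modprobe.d/zfs-vdev.conf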

@IvanVolosyuk (Contributor, Author) commented Dec 15, 2024

$ cat /sys/module/zfs/parameters/zfs_vdev_disk_classic
0

Doesn't look like it helps:

[  137.150108] ata3.00: exception Emask 0x10 SAct 0x400000 SErr 0x4050000 action 0xe frozen
[  137.150111] ata3.00: irq_stat 0x00000040, connection status changed
[  137.150112] ata3: SError: { PHYRdyChg CommWake DevExch }
[  137.150114] ata3.00: failed command: WRITE FPDMA QUEUED
[  137.150115] ata3.00: cmd 61/08:b0:48:fc:9a/00:00:26:01:00/40 tag 22 ncq dma 4096 out
                        res 40/00:00:04:01:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[  137.150117] ata3.00: status: { DRDY }
[  137.150120] ata3: hard resetting link
[  138.031062] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  138.061565] ata3.00: configured for UDMA/133
[  138.159183] ata3: EH complete
...
[  281.071042] ata3.00: exception Emask 0x10 SAct 0x80038ee0 SErr 0x40d0000 action 0xe frozen
[  281.071045] ata3.00: irq_stat 0x00000040, connection status changed
[  281.071046] ata3: SError: { PHYRdyChg CommWake 10B8B DevExch }
[  281.071048] ata3.00: failed command: WRITE FPDMA QUEUED
[  281.071049] ata3.00: cmd 61/10:28:e0:6a:12/00:00:20:01:00/40 tag 5 ncq dma 8192 out
                        res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[  281.071052] ata3.00: status: { DRDY }
[  281.071052] ata3.00: failed command: WRITE FPDMA QUEUED
[  281.071053] ata3.00: cmd 61/08:30:f0:6a:12/00:00:20:01:00/40 tag 6 ncq dma 4096 out
                        res 40/00:c0:00:00:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
[  281.071055] ata3.00: status: { DRDY }
[  281.071056] ata3.00: failed command: WRITE FPDMA QUEUED
[  281.071056] ata3.00: cmd 61/10:38:f8:6a:12/00:00:20:01:00/40 tag 7 ncq dma 8192 out
                        res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
[  281.071058] ata3.00: status: { DRDY }

Nothing in zpool events -v; zpool status shows no errors.

@IvanVolosyuk (Contributor, Author) commented Dec 15, 2024

I reproduced this on Linux 6.12, ZFS 2.2.7 without PREEMPT_RT (zfs_vdev_disk_classic=1) :(
I went back to linux-6.6 (zfs-2.2.5) to lick my wounds and restore my sanity; otherwise I'd start to believe that something is wrong with my disks.

@IvanVolosyuk IvanVolosyuk changed the title ZFS + CONFIG_PREEMPT_RT: WRITE FPDMA QUEUED? Linux 6.6.x(2.2.5) -> 6.12.x(2.2.7): WRITE FPDMA QUEUED Dec 15, 2024
@RinCat commented Dec 16, 2024

I saw the same thing on my Gentoo system with 6.12.4 and ZFS 2.2.7 during a zfs scrub.

zpool status shows no errors, so it's hard to tell whether this is a ZFS or a SATA issue. Maybe something changed in how ZFS reads under high I/O pressure.
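
A minimal way to try reproducing along these lines (the pool name is a placeholder; errors like the ones below then show up in the kernel log):

  zpool scrub tank
  journalctl -kf | grep -E 'ata[0-9]'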

[12364.365287] ata6.00: exception Emask 0x10 SAct 0x2009000 SErr 0x280100 action 0x6 frozen
[12364.365292] ata6.00: irq_stat 0x08000000, interface fatal error
[12364.365294] ata6: SError: { UnrecovData 10B8B BadCRC }
[12364.365298] ata6.00: failed command: READ FPDMA QUEUED
[12364.365300] ata6.00: cmd 60/00:60:e8:47:b5/01:00:00:00:00/40 tag 12 ncq dma 131072 in
                        res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[12364.365306] ata6.00: status: { DRDY }
[12364.365308] ata6.00: failed command: READ FPDMA QUEUED
[12364.365309] ata6.00: cmd 60/00:78:e8:46:b5/01:00:00:00:00/40 tag 15 ncq dma 131072 in
                        res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[12364.365314] ata6.00: status: { DRDY }
[12364.365316] ata6.00: failed command: READ FPDMA QUEUED
[12364.365317] ata6.00: cmd 60/00:c8:e8:45:b5/01:00:00:00:00/40 tag 25 ncq dma 131072 in
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[12364.365321] ata6.00: status: { DRDY }
[12364.365324] ata6: hard resetting link
[12364.828292] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[12364.829027] ata6.00: supports DRM functions and may not be fully accessible
[12364.829855] ata6.00: supports DRM functions and may not be fully accessible
[12364.830572] ata6.00: configured for UDMA/133
[12364.830583] ata6: EH complete

@IvanVolosyuk (Contributor, Author) commented:

@RinCat what kernel / ZFS version did you have before you started experiencing this?

@RinCat commented Dec 16, 2024

I think my last setup, Linux 6.11.11 with ZFS 2.2.6, worked fine. I only saw this when I bumped to 6.12 with ZFS 2.2.7.

@IvanVolosyuk (Contributor, Author) commented Dec 16, 2024

@RinCat Another question: do you use rcu_nocbs, nohz_full and/or realtime-priority tasks? My qemu runs at realtime priority with vCPUs pinned to RCU-offloaded CPUs.

@RinCat commented Dec 16, 2024

@IvanVolosyuk no, it's just the normal "Preemptible Kernel (Low-Latency Desktop)". I did not try the new RT config.

@IvanVolosyuk (Contributor, Author) commented Dec 16, 2024

I have a bad feeling about this.
I ran the kernel with CONFIG_PROVE_LOCKING=y and no issues were found, but it is still failing.
I randomly enabled CONFIG_FORCE_TASKS_RCU=y and it is still failing.
On the other hand, I looked at the power draw and it was pretty significant, ~680W on a 1000W PSU. I wonder if my system is at a tipping point and the 6.12 scheduler improvements just brought it closer to power instability manifesting this way, because I can load it slightly more than before.
Update: On the other hand, I managed to load the system a bit more (up to 720W) on the 6.6.66 kernel (ZFS 2.2.7) and didn't get any FPDMA messages there. So, not quite sure about anything.

@RichardBelzer commented:

Maybe it's a coincidence but I'm having the same issue after upgrading to zfs 2.2.7. A ton of errors under high I/O pressure. Switched back to 2.2.6 and the errors went away.

@IvanVolosyuk (Contributor, Author) commented Dec 16, 2024

> Maybe it's a coincidence but I'm having the same issue after upgrading to zfs 2.2.7. A ton of errors under high I/O pressure. Switched back to 2.2.6 and the errors went away.

@RichardBelzer Did you change the kernel version as well with ZFS 2.2.7? I went back to Linux 6.6.66 keeping ZFS 2.2.7 and I don't see the errors. What kernel version do you use?

@IvanVolosyuk (Contributor, Author) commented Dec 17, 2024

Ok, there seem to be 2 issues here.

I figured out the 'WRITE FPDMA QUEUED' part. As of Linux 6.11 (commit 5433f0e7427ae4f5b128d89ec16ccaafc9fef5ee), the default SATA link_power_management_policy was changed from max_performance to med_power_with_dipm. I hit it because I was upgrading from 6.6 LTS to 6.12 LTS. The new link power management makes my system very sad, but it works fine with the old setting. I guess I might want to check the cables at some point. The way to restore the old setting is to change CONFIG_SATA_MOBILE_LPM_POLICY back to 1, or via a udev rule:
https://wiki.archlinux.org/title/Power_management#SATA_Active_Link_Power_Management
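
For example, a rule along the lines the Arch wiki suggests (the file name is illustrative):

  # /etc/udev/rules.d/hd_power_save.rules
  ACTION=="add", SUBSYSTEM=="scsi_host", KERNEL=="host*", ATTR{link_power_management_policy}="max_performance"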

As for CONFIG_PREEMPT_RT, I'll give up for now. I get multiple lockups with no fs reads coming through, and I'm not sure whether it's ZFS related or not.

@IvanVolosyuk IvanVolosyuk changed the title Linux 6.6.x(2.2.5) -> 6.12.x(2.2.7): WRITE FPDMA QUEUED Linux 6.6.x(2.2.5) -> 6.12.x(2.2.7): WRITE FPDMA QUEUED (ATA bus error) Dec 18, 2024
@RinCat commented Dec 18, 2024

@IvanVolosyuk I changed link_power_management_policy to max_performance and still have this issue, so it may be something else.

@IvanVolosyuk (Contributor, Author) commented:

You were on Linux 6.11.x, which had already changed link_power_management_policy, so it should be something different in your case. Also, this issue is specifically about the "ATA bus error" subclass of errors; in general, WRITE FPDMA QUEUED errors can be caused by literally anything, from a failing disk to power delivery problems.
My specific error is identified by SErr 0x40d0000 (PHYRdyChg CommWake 10B8B DevExch), which is a link issue in my case and can be fixed by changing the link_power_management_policy.

@RinCat in your case, 'SError: { UnrecovData 10B8B BadCRC }' means the disk reports that it is dying, or at least has unreadable sectors.

@ebenali commented Dec 25, 2024

On Debian, /usr/src/zfs-2.2.7/dkms.conf has an explicit rule for NOT ON PREEMPT_RT:

PACKAGE_NAME="zfs"
PACKAGE_VERSION="2.2.7"
PACKAGE_CONFIG="/etc/dkms/zfs.conf"
NO_WEAK_MODULES="yes"
if [ -f $kernel_source_dir/.config ]; then
    . $kernel_source_dir/.config
    if [ "$CONFIG_PREEMPT_RT" = "y" ]; then
        BUILD_EXCLUSIVE_KERNEL="NOT ON PREEMPT_RT"
    fi
fi

A previous, possibly related reference is #11097, which mentions rw_semaphore, so there may be a hint there even if the snippets from back then have entirely changed in the current tree.
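
For anyone checking whether their kernel is affected by this exclusion, a quick sketch (where the config lives depends on the distro):

  zgrep CONFIG_PREEMPT_RT= /proc/config.gz 2>/dev/null || grep CONFIG_PREEMPT_RT= /boot/config-$(uname -r)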

@bsdice commented Jan 25, 2025

Just hit this error on Arch when testing out kernel 6.12.11, coming from 6.6.72. The disks are HGST 14 TB SATA and ZFS is 2.3.0.

The problem disappears on 6.6.74, but CONFIG_SATA_MOBILE_LPM_POLICY=3 is already set here.

Is there a final judgement on what is causing this? Is it the new preempt, a SATA driver bug with preempt, ZFS with 6.12, or something else?

@IvanVolosyuk (Contributor, Author) commented:

> Just hit this error on Arch when testing out kernel 6.12.11, coming from 6.6.72. The disks are HGST 14 TB SATA and ZFS is 2.3.0.
>
> The problem disappears on 6.6.74, but CONFIG_SATA_MOBILE_LPM_POLICY=3 is already set here.
>
> Is there a final judgement on what is causing this? Is it the new preempt, a SATA driver bug with preempt, ZFS with 6.12, or something else?

Something else might be changing the link PM policy. Check:

cat /sys/class/scsi_host/host*/link_power_management_policy

It should say max_performance, or at least be the same between the kernel versions. I also have CONFIG_PREEMPT_RT disabled for now. I don't have the problem anymore after setting the link policy to max_performance.

Also, do you have an ATA bus error? The bug is specifically about that; there can be various reasons for WRITE FPDMA QUEUED, including failing disks.
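
A sketch for flipping the policy back at runtime without rebuilding the kernel (takes effect immediately, but does not persist across reboots):

  for p in /sys/class/scsi_host/host*/link_power_management_policy; do
      echo max_performance > "$p"
  done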

@bsdice commented Jan 25, 2025

Thank you for taking the time to reply! Appreciate it. Yes, ATA bus error, on multiple disks.

link_power_management_policy does indeed say max_performance right now. Will check the kernel again in a couple of days.

How did you tweak it: with a kernel config switch during compilation, or udev, or similar? I would like the kernel to just do the right thing from the get-go, without needing a tweak in the initramfs.

Jan 25 02:18:20 meran kernel: ata5.00: exception Emask 0x10 SAct 0x200 SErr 0x840000 action 0x6 frozen
Jan 25 02:18:20 meran kernel: ata5.00: irq_stat 0x08000000, interface fatal error
Jan 25 02:18:20 meran kernel: ata5: SError: { CommWake LinkSeq }
Jan 25 02:18:20 meran kernel: ata5.00: failed command: READ FPDMA QUEUED
Jan 25 02:18:20 meran kernel: ata5.00: cmd 60/08:48:70:79:05/00:00:8f:03:00/40 tag 9 ncq dma 4096 in
                                        res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Jan 25 02:18:20 meran kernel: ata5.00: status: { DRDY }
Jan 25 02:18:20 meran kernel: ata5: hard resetting link
Jan 25 02:18:20 meran kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 25 02:18:20 meran kernel: ata5.00: ACPI cmd f5/00:00:00:00:00:00(SECURITY FREEZE LOCK) filtered out
Jan 25 02:18:20 meran kernel: ata5.00: ACPI cmd b1/c1:00:00:00:00:00(DEVICE CONFIGURATION OVERLAY) filtered out
Jan 25 02:18:20 meran kernel: ata5.00: supports DRM functions and may not be fully accessible
Jan 25 02:18:20 meran kernel: ata5.00: ACPI cmd f5/00:00:00:00:00:00(SECURITY FREEZE LOCK) filtered out
Jan 25 02:18:20 meran kernel: ata5.00: ACPI cmd b1/c1:00:00:00:00:00(DEVICE CONFIGURATION OVERLAY) filtered out
Jan 25 02:18:20 meran kernel: ata5.00: supports DRM functions and may not be fully accessible
Jan 25 02:18:20 meran kernel: ata5.00: configured for UDMA/133
Jan 25 02:18:20 meran kernel: ata5: EH complete

@IvanVolosyuk (Contributor, Author) commented Jan 25, 2025

I just set it in the kernel config: CONFIG_SATA_MOBILE_LPM_POLICY=1. It can be overridden in user-space though, as pointed out in the commit which changed the default and caused the issues for me:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=5433f0e7427ae4f5b128d89ec16ccaafc9fef5ee
The default was zero at that point though, and if I set the policy to zero (firmware default) it causes way more problems for me.

@bsdice commented Jan 25, 2025

Arch Linux has CONFIG_SATA_MOBILE_LPM_POLICY=3 for 6.6 (the old linux-lts package), which I am using, as well as for 6.12 (the linux-lts package rolled to it a few days ago).

Checking
https://github.com/torvalds/linux/blob/v6.6/drivers/ata/ahci.c#L1623 vs.
https://github.com/torvalds/linux/blob/v6.12/drivers/ata/ahci.c#L1724

makes me think that perhaps the AHCI_HFLAG_USE_LPM_POLICY check in 6.6 prevented SATA links to the hard disks from being set to Device Initiated Power Management. And nobody noticed, because who uses hard drives any more, right?

Another workaround is the ahci.mobile_lpm_policy=1 kernel parameter to nail the setting to max performance. If you run spinners, link power saving is probably not the most important issue in overall power usage anyway.
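
For example, on a GRUB-based system (a sketch; adjust for your bootloader):

  # /etc/default/grub
  GRUB_CMDLINE_LINUX_DEFAULT="... ahci.mobile_lpm_policy=1"
  # then regenerate the config: grub-mkconfig -o /boot/grub/grub.cfg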

@bsdice commented Jan 25, 2025

Addendum: Observe the lpm-pol 4 on vanilla 6.12 for the mainboard (Supermicro X11SAT, C236 chipset) AHCI ports.

This looks to be ATA_LPM_MIN_POWER_WITH_PARTIAL, which is one step too far for my HGST HC530 14 TB drives:
https://github.com/torvalds/linux/blob/v6.12/include/linux/libata.h#L517

So newer kernels seem to take link power management suggestions from the AHCI chip or wherever, and this trips up hard drives. The question remains where this target_lpm_policy of 4 is set or coming from, because the kernel's CONFIG_SATA_MOBILE_LPM_POLICY is set to 3.
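
For reference, the lpm-pol numbers map onto enum ata_lpm_policy in the linked libata.h (v6.12):

  # 0 = keep firmware settings   (ATA_LPM_UNKNOWN)
  # 1 = max_performance          (ATA_LPM_MAX_POWER)
  # 2 = medium_power             (ATA_LPM_MED_POWER)
  # 3 = med_power_with_dipm      (ATA_LPM_MED_POWER_WITH_DIPM)
  # 4 = min_power_with_partial   (ATA_LPM_MIN_POWER_WITH_PARTIAL)  <- the lpm-pol 4 above
  # 5 = min_power                (ATA_LPM_MIN_POWER)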

Of note, this is the second time I have seen a non-benign regression introduced by Niklas Cassel, after shooting udisks in the kneecaps: storaged-project/udisks#732.

Jan 25 02:15:21 meran kernel: ata1: SATA max UDMA/133 abar m2048@0xdd74b000 port 0xdd74b100 irq 130 lpm-pol 4
Jan 25 02:15:21 meran kernel: ata2: SATA max UDMA/133 abar m2048@0xdd74b000 port 0xdd74b180 irq 130 lpm-pol 4
Jan 25 02:15:21 meran kernel: ata3: SATA max UDMA/133 abar m2048@0xdd74b000 port 0xdd74b200 irq 130 lpm-pol 4
Jan 25 02:15:21 meran kernel: ata4: SATA max UDMA/133 abar m2048@0xdd74b000 port 0xdd74b280 irq 130 lpm-pol 4
Jan 25 02:15:21 meran kernel: ata5: SATA max UDMA/133 abar m2048@0xdd74b000 port 0xdd74b300 irq 130 lpm-pol 4
Jan 25 02:15:21 meran kernel: ata6: SATA max UDMA/133 abar m2048@0xdd74b000 port 0xdd74b380 irq 130 lpm-pol 4
Jan 25 02:15:21 meran kernel: ata7: SATA max UDMA/133 abar m8192@0xdd210000 port 0xdd210100 irq 131 lpm-pol 0
Jan 25 02:15:21 meran kernel: ata8: SATA max UDMA/133 abar m8192@0xdd210000 port 0xdd210180 irq 132 lpm-pol 0
Jan 25 02:15:21 meran kernel: ata9: SATA max UDMA/133 abar m8192@0xdd210000 port 0xdd210200 irq 133 lpm-pol 0
Jan 25 02:15:21 meran kernel: ata10: SATA max UDMA/133 abar m8192@0xdd210000 port 0xdd210280 irq 134 lpm-pol 0
Jan 25 02:15:21 meran kernel: ata11: SATA max UDMA/133 abar m8192@0xdd210000 port 0xdd210300 irq 135 lpm-pol 0

@bsdice commented Jan 25, 2025

Addendum: ahci.mobile_lpm_policy=1 has no effect; lpm-pol stays at 0 on a Xeon Cascade Lake system.

I am done crawling around in this rabbit hole. It seems I will be using 6.6 for a while longer; showstopper bug.

@bsdice commented Jan 28, 2025

While this issue is closed (and for good reason, since it is not connected to ZFS), here is some more information about it. It is on the front page of all search engines for this bug, and Perplexity and Grok have started quoting it.

My own patch for my type of drives (WDC WD140 "shucked" white-labels) and kernel 6.12.11:

--- a/drivers/ata/libata-core.c 2025-01-29 00:11:52.312553627 +0100
+++ b/drivers/ata/libata-core.c 2025-01-29 00:24:35.000690010 +0100
@@ -4090,6 +4090,9 @@
 	/* Crucial devices with broken LPM support */
 	{ "CT*0BX*00SSD1",		NULL,	ATA_QUIRK_NOLPM },
 
+	/* WD white label hard disks with broken LPM support */
+	{ "WDC WD[81][02468]*",		NULL,	ATA_QUIRK_NOLPM },
+
 	/* 512GB MX100 with MU01 firmware has both queued TRIM and LPM issues */
 	{ "Crucial_CT512MX100*",	"MU01",	ATA_QUIRK_NO_NCQ_TRIM |
 						ATA_QUIRK_ZERO_AFTER_TRIM |

I haven't bothered to submit this upstream because I think they can figure it out themselves in the brief moments when they are actually working on the kernel and aren't busy setting 97% of the budget on fire in non-Linux projects.
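
An alternative that avoids carrying a local patch: the libata.force kernel parameter can apply the same nolpm quirk globally or per port (the port number below is illustrative; take it from dmesg):

  libata.force=nolpm        # disable LPM on all ports
  libata.force=5.00:nolpm   # disable LPM only for device 0 on ata5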
