task txg_sync blocked for more than 120 seconds. #9130
We had these issues with Ubuntu 18.04. I am not sure, but running the rsync with "nice -n 19" may have helped. We have not seen many issues with the latest patches for Ubuntu Eoan. |
@BrainSlayer isn't that the same problem we had with #9034? o.O |
Looks identical, yes. But we triggered the memory allocation problems with wrong flags; here it might be a real out-of-memory problem, which again triggers the endless loop in the memory allocation procedure I love so much.
|
which should be fixed by our patch; so maybe we even fix a very common, recurring problem - great! |
I may have the same problem, although my hang appears when I delete many snapshots (>10000). Is this the same problem? RAIDZ1 on Ubuntu 18.04 LTS with HWE kernel (5.0.0-31-generic). If this is indeed the same problem, I can send you more information, because I still have 30000 snapshots I need to delete and every time I try deleting large amounts, ZFS hangs. |
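One way to narrow this down (a sketch only; the pool, dataset, and snapshot names below are hypothetical) is to delete the snapshots in smaller batches using the %-range form of zfs destroy, with a dry run first, rather than queueing tens of thousands of destroys at once:
zfs destroy -nv tank/data@autosnap-0001%autosnap-0500   # -n = dry run, -v = list what would be destroyed
zfs destroy -v tank/data@autosnap-0001%autosnap-0500    # then destroy that batch and repeat for the next range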
Having the same problem while pools are under heavy IO.
System information
Describe the problem you're observing
txg_sync hangs during heavy IO, in my case during fio benchmarks of the NAS and S860 pools at the same time.
Describe how to reproduce the problem
Run an fio benchmark of both pools at the same time:
Include any warning/errors/backtraces from the system logs
|
I've seen the same issue and just like @BrainSlayer and @c0d3z3r0 it was due to low memory. |
I have the same issue. A bandwidth-limited rsync job at 12MB/s, or a VM starting in VirtualBox, triggers this and the entire file system freezes. It doesn't seem to be memory or CPU related, as both stay low during transfers and file access. Disk IO also stays low.
OS - Debian Stretch
Nov 13 11:23:32 nas kernel: [59329.141166] INFO: task txg_sync:1228 blocked for more than 120 seconds.
avg-cpu: %user %nice %system %iowait %steal %idle
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
|
This might all be related to: |
I'm experiencing a similar issue, seen with ZFS packages from Ubuntu Eoan and packages built from ba434b1, containing the fix from #9583. The issue is seen on a mirrored pool with encrypted datasets, running on a Raspberry Pi 4B, 4 GB RAM, zfs_arc_max=536870912. The Pi is running various Docker workloads backed by the dockerd ZFS storage driver. The average free memory is around 1.5 GB.
|
@lassebm |
@Ornias1993 I wasn't quite sure what to look for. Would you think it's worth a new issue? |
@lassebm Actually, now that I take a second look at it... |
System information
I don't know if it's related but I have a similar issue on Ubuntu Eoan when attempting to rsync large amounts of NAS data from one machine to another. Transfer speeds are around 30MB/s and then suddenly they'll hit 5MB/s when these errors occur and will stay at that speed. Cancelling the rsync and waiting for the io to catch up temporarily fixes it.
|
@kebian |
Ok, I edited my above comment and added those details. Thanks. |
@kebian Awesome... 8GB RAM; are you sure you aren't out of memory? |
Can you share the disk scheduler that is in use by the underlying storage devices? |
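For reference, on Linux the active scheduler can be read straight from sysfs; a quick check (device names here are only examples):
cat /sys/block/sda/queue/scheduler          # the bracketed entry is the active elevator, e.g. [mq-deadline]
grep . /sys/block/sd*/queue/scheduler       # same, for all sd* devices at once
echo none > /sys/block/sda/queue/scheduler  # ZFS normally sets 'none'/'noop' on whole disks it manages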
Welcome to github mister bug stuff! :) |
I'm having the same issue on OpenMediaVault 5.3.10-1 with kernel 5.4.0-0-bpo.4-amd64 and ZFS v0.8.3-1~bpo10+1, although I remember having this issue for a long time.
|
@Dinth, how much RAM do you have, and how big are the files you're moving (total and individual) when you see the error? My errors went away when I increased the RAM I have (from 8GB to 32GB), but obviously that was only a workaround and not a solution. It also only happened when moving thousands of files adding to over a terabyte across a file system boundary. |
16GB of RAM. The files I've been deleting and moving were not big. Today when this happened I was moving a 1GB file and deleting circa 200GB of files. |
Actually, since the last time, ZFS has started doing this constantly, even after a restart on an idling system in single-user mode with the pool unmounted. iostat -mx doesn't show anything abnormal and short SMART tests of the drives were all successful. I have tried increasing the RAM to 24GB but that has not fixed the issue.
(sda is not in a pool and on a different controller) |
Does it import read only? Can you scrub then? What kind of drives? |
It actually freezes the computer when trying to mount ZFS during normal boot.
And I have run zpool upgrade, as that's what zpool was suggesting, and I also found some posts online asking users to update the pool when it freezes the kernel. |
Also, iotop shows that the txg_sync process is reading 250-290K/s, taking 99.9% of available IO. I'm still waiting for zpool upgrade to finish. |
You may also see a difference by playing with the 'sync=' ZFS property; in particular, setting 'sync=always' helped me a little, although it may have been a coincidence. You should also read up on its effects, because it may not be suitable, especially on a production server.
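If you want to experiment with that, a minimal sketch (the pool/dataset name is a placeholder; note that sync=always trades throughput for durability):
zfs get sync tank/data          # show the current value (default is 'standard')
zfs set sync=always tank/data   # force every write to be synchronous
zfs inherit sync tank/data      # revert to the inherited/default setting afterwards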
|
I seem to have hit this last night. I ran
After that command, pretty much everything stopped responding. Apps that were previously open were still responding, and I could SSH in, but most programs were not responding or hung when run. The system just hung instantly: I saw <5% CPU utilization, RAM utilization did not change, and I wasn't able to see any disk IO. dmesg:
Had to hard reset with the power button; after a reboot zpool showed everything healthy when I ran the same command. Edit: after a reboot and reloading deluged, I found that a large number of files which had been open in deluge were all missing. That's on me for not stopping deluged before running the operation, but I also do not believe it should have crashed. Good thing for external backups, right? :) |
@rejsmont Don't suppose you're using WD NVMe sticks (as per #14793, mentioned by @no-usernames-left above)? |
We are facing a similar issue on a Hetzner dedicated server; any clue on how to further troubleshoot our case and hopefully fix it? The server gets restarted multiple times per month and the root cause seems related to this. We are raising a case with Hetzner support too to investigate the hardware error. Please find more info below:

Type / Version-Name
Debian 12.4
Kernel 6.1.0-17-amd64
ZFS loaded module v2.1.11-1
ZFS pool version 5000
ZFS filesystem version 5

zpool status
  pool: pool0
 state: ONLINE
  scan: scrub repaired 0B in 1 days 17:19:33 with 0 errors on Mon Jun 10 17:44:25 2024
config:
  NAME        STATE   READ WRITE CKSUM
  pool0       ONLINE     0     0     0
    mirror-0  ONLINE     0     0     0
      sda     ONLINE     0     0     0
      sdb     ONLINE     0     0     0
errors: No known data errors

zfs list
USED   AVAIL
7.84T  8.54

free -g
       total  used  free  shared  buff/cache  available
Mem:      62    35    26       0           2         27
Swap:     31     0    31

2024-07-09T00:05:01.588756+02:00 CRON[10298]: (root) CMD (if zfs mount | grep -qs "/pool0/Ebackups" ; then /usr/bin/rsnapshot daily > /home/backupuser/log/rsnapshotDaily`date +%a`.log ; else echo "ZFS not mounted. Failed Rsnapshot" ; fi)
2024-07-09T00:08:49.591141+02:00 kernel: [62833.725235] INFO: task txg_sync:1440 blocked for more than 120 seconds.
2024-07-09T00:08:49.591154+02:00 kernel: [62833.725246] Tainted: P OE 6.1.0-17-amd64 #1 Debian 6.1.69-1
2024-07-09T00:08:49.591155+02:00 kernel: [62833.725251] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2024-07-09T00:08:49.591155+02:00 kernel: [62833.725255] task:txg_sync state:D stack:0 pid:1440 ppid:2 flags:0x00004000
2024-07-09T00:08:49.591157+02:00 kernel: [62833.725264] Call Trace:
 __schedule+0x34d/0x9e0
 schedule+0x5a/0xd0
 schedule_timeout+0x94/0x150
 ? __bpf_trace_tick_stop+0x10/0x10
 io_schedule_timeout+0x4c/0x80
 __cv_timedwait_common+0x12f/0x170 [spl]
 ? cpuusage_read+0x10/0x10
 __cv_timedwait_io+0x15/0x20 [spl]
 zio_wait+0x136/0x2b0 [zfs]
 ? bplist_iterate+0x101/0x120 [zfs]
 spa_sync+0x5b3/0xf90 [zfs]
 ? mutex_lock+0xe/0x30
 ? spa_txg_history_init_io+0x113/0x120 [zfs]
 txg_sync_thread+0x227/0x3e0 [zfs]
 ? txg_fini+0x260/0x260 [zfs]
 ? __thread_exit+0x20/0x20 [spl]
 thread_generic_wrapper+0x5a/0x70 [spl]
 kthread+0xda/0x100
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x22/0x30
2024-07-09T00:20:05.431212+02:00 kernel: [63509.565341] mce: [Hardware Error]: Machine check events logged
2024-07-09T00:20:05.431223+02:00 kernel: [63509.565352] [Hardware Error]: Corrected error, no action required.
2024-07-09T00:20:05.431224+02:00 kernel: [63509.565356] [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0xdc2041000000011b
2024-07-09T00:20:05.431225+02:00 kernel: [63509.565367] [Hardware Error]: Error Addr: 0x00000003c24f2c00
2024-07-09T00:20:05.431226+02:00 kernel: [63509.565370] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x004000400a801203
2024-07-09T00:20:05.431227+02:00 kernel: [63509.565376] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2024-07-09T00:20:05.431228+02:00 kernel: [63509.565390] EDAC MC0: 1 CE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0xf493cb offset:0x0 grain:64 syndrome:0x40)
2024-07-09T00:24:01.595919+02:00 backup-hetzner CRON[265086]: (root) CMD (if [ $(date +%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ]; then /usr/lib/zfs-linux/scrub; fi)
2024-07-09T00:26:23.063173+02:00 systemd[1]: Configuration file /etc/systemd/system/zfs-load-key-custom.service is marked executable. Please remove executable permission bits. Proceeding anyway.

cat /etc/systemd/system/zfs-load-key-custom.service
[Unit]
Description=Load ZFS encryption keys
DefaultDependencies=no
After=zfs-import.target
Before=zfs-mount.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/zfs load-key -a
StandardInput=tty-force
[Install]
WantedBy=zfs-mount.service

Any help is appreciated! |
You could try doing this on ds's that you use as a backup:
zfs set primarycache=none $pool/$ds
zfs set secondarycache=none $pool/$ds
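If you try that, it is easy to confirm and to undo (a sketch, using the same $pool/$ds placeholders as above):
zfs get primarycache,secondarycache $pool/$ds   # verify the properties took effect
zfs inherit primarycache $pool/$ds              # revert to the inherited defaults if it doesn't help
zfs inherit secondarycache $pool/$ds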
|
System information
Describe the problem you're observing
Got the above to happen; free indicated that the kernel was using something like 50GB, and it was after using
Describe how to reproduce the problem
Not sure, the above might do it but I'm not going to try to copy 1.5TB again.
Include any warning/errors/backtraces from the system logs
Note the date; the system seemed to be working fine since then and I only noticed it because I compulsively do |
Ouch. So @bombcar, did the scrub show any errors on disk or other weirdness? |
@bombcar What hardware, drives (CMR/SMR?), connectors, controller and backplane are you using? Can you share |
Ok that "TANDBERG" thing is a USB RDX drive, which can sometimes ... be a bit touchy, but it's been working. I don't know how to tell if the traceback came from the main pool, root pool, or the backup pool. There were no entries in the kernel log indicating disk problems whatsoever:
The system is an older Dell Chassis thing with this controller:
I can get the full
The copy appears to have finished correctly, hopefully I didn't get file corruption (which I don't really have any way to see above ZFS itself). |
My memory could be wrong, but I think it's more likely to see this timeout when writing to USB drives because of the nature of the USB-HDD "pipeline": it can see increased latency and dropouts/disconnects, and I'd wager the chances of seeing this timeout are higher when the throughput or IOPS reach the saturation point of the given device or interface. I couldn't quickly decode the model of the drive in the USB enclosure. If it's a larger capacity, it's very likely to be SMR. SMR drives are known to have performance issues with the current OpenZFS codebase. I know this is somewhat anecdotal feedback, but that is what I can say after watching this issue for some years. |
It’s possible, though the copy was from tank to tank and wasn’t hitting the USB at all. |
Ah, ok. So your tank zpool looks like Seagate EXOS X 12TB CMR drives. The copy is very likely to have been fine, but performance would have suffered during the operation, especially around the time of the error. ZFS should throw CRC errors if something went wrong with the data integrity during the operation. I've experienced this error multiple times (dozens) and never seen data integrity issues, only performance issues. For peace of mind, if you still have both SRC and DST files/datasets, you could use a fast hashing algorithm like xxh to checksum both SRC and DST files. You can also look at https://github.com/psy0rz/zfs_autobackup/wiki/zfs-check. We are starting to get off topic, though. If you want to chat further, send me a DM on Reddit. Same username. |
|
Just experienced this on my system: 12x Toshiba MG09 18TiB drives. Root is on a separate ZFS pool with 2x SATA SSDs. NFS hung, client devices hung, restarting nfs-server hung. The host is running Proxmox VE 8.2.4. The ZFS version in use was 2.2.4. The only resolution was a forced shutdown. |
Interesting. Proxmox has ZFS 2.2.6 in its repos (well, that's in its … repo). No idea if that would potentially help eliminate this bug, though. |
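For what it's worth, after upgrading you can confirm which userland and kernel-module versions are actually in use (a quick check, not specific to Proxmox):
zfs version                   # prints the userland and zfs-kmod versions
cat /sys/module/zfs/version   # version of the currently loaded kernel module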
I had to recover the pool as it wouldn't import after the failure. I had to use the -F option to import it, which luckily worked. This caused considerable downtime for me. I've now updated to ZFS 2.2.6, which was indeed in the repo. I sure hope this won't happen again. EDIT: occurred again 28/10/2024 |
Hitting this issue now. Using Proxmox 8.3, ZFS 2.2.6, kernel 6.8.12-4. Almost the exact same stack as #9130 (comment)
This happens during ZFS automount on boot. After I used the systemd debug shell to mask that job, I was able to start the system. However, it happened again - triggered by copying large (4TB) files from a remote machine over a 10GbE network connection, using smb running in an LXC host. In fact, the very first time I hit this issue, it was caused by the same file copy operation. I'm also running zrepl, which constantly creates snapshots and performs send/recv to a remote machine. It prunes old snapshots occasionally. Here are the dmesg logs for when it happened again, with a slightly different stack:
Has anyone figured out a cause or a workaround? I'm about to restore/rebuild my pool from backup. Considering using FreeBSD, as some comments above mention this is not a problem there. |
Just to check an idea, is that system using ZFS encryption? |
Yes, encrypted. Using an LSI HBA and Intel RES2SV240 E91267-203 expander, SAS drives. 128GB ECC RAM. Ryzen 9 5950X, ASRock Taichi X570.
|
Interesting. I've seen other people (credible ones) mention that using ZFS encryption + ZFS send/recv (ie to ship the snapshots to a remote system) is buggy and not at all a good idea for production systems. I'd kind of hoped that the bug fixes in ZFS 2.2.6 might have somehow fixed those particular bugs, but it sounds like there are still problems hanging around. For my systems, that meant switching them to using LUKS/cryptsetup for the encryption part of things instead (still using ZFS snapshots). That's been working in small scale production for several months now. You might need to look into doing something similar too to try and work around the problem. Sorry I don't have better ideas for resolving this. 😦 |
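For anyone considering the same workaround, here is a minimal sketch of that kind of layout, assuming a single disk /dev/sdX, a pool named tank, and a host named backuphost (all placeholders), with the encryption done below ZFS rather than by ZFS itself:
cryptsetup luksFormat /dev/sdX                  # encrypt the raw block device
cryptsetup open /dev/sdX tank_crypt             # unlock it as /dev/mapper/tank_crypt
zpool create tank /dev/mapper/tank_crypt        # build the pool on the mapped device
zfs snapshot tank/data@backup1                  # snapshots and send/recv then operate on plain datasets
zfs send tank/data@backup1 | ssh backuphost zfs recv backuppool/data
The trade-off is that the receiving side sees decrypted data unless it applies its own encryption.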
This doesn't seem to be related to encryption necessarily, nor does it seem to be related to |
Exactly. We also observed this. We were running GitLab on an affected server, and anytime large cleanup tasks were happening or a lot of CI artifacts came flying in, we would see this. |
Indeed, I am not using snapshots or encryption and I'm still affected. Rarely, but still. |
If it's not related to encryption, then it's worse. I suspect this may have caused permanent corruption in the pool. If @1e100 is correct, and it's related to writing at or over the vdev capacity, then why would it also occur during boot up, during zfs mount -a? Additionally, I was previously able to max out my vdev at 700MB/sec over 10GbE doing simple rsyncs and file copies over SMB with no problems. However, both times this happened, it was only writing out at about 200-300MB/sec. In this case, I was using Final Cut Pro to transcode 8K raw files to ProRes directly to the SMB share. What may be interesting is that this would generally produce a single very large file (3.5TB+) sequentially.
I've used ZFS for 10+ years now, starting with FreeNAS 8, and this has totally shaken my belief in this filesystem. I need to unblock myself as quickly as possible. Here's what I plan to try:
1. Proxmox has a 6.11 kernel in their test repository. Will give this a try.
2. If that doesn't work, I will set up TrueNAS Core 13 and see how that works.
After I finish my video edits, I'll redo my primary pool from a backup, this time without encryption. I'll keep encryption on my remote backup server, and not use raw sends like I was doing previously. Of course, this all assumes that TrueCore works - if it doesn't, I'm going to have to rethink using ZFS. |
Not sure if this will help.... I am running zfs 2.2.2-0ubuntu9.1 on Ubuntu 24.04 and have been working around a few of these issues. Maybe they are slightly different. I have 2 main servers and a backup server on Ubuntu 22.04, and their offsite twins on 24.04. Approx 71T of data and 400 million files.
About once a week we will get a kernel GPF on the 24.04 when the 22.04 sends to a 24.04. Some of the datasets are not encrypted being sent to an encrypted dataset. Some are encrypted going to an encrypted dataset.
A couple of days ago I followed some direction from:
https://zfsonlinux.topicbox.com/groups/zfs-discuss/T3bffa1f6549240fe-Maa0c78175ffb4bec3c36e413
Running:
for i in a b c d e f g h i j k ; do
  echo mq-deadline > /sys/block/sd$i/queue/scheduler           # use the mq-deadline IO scheduler
  hdparm -W1 -B254 -S240 -A1 -a128 /dev/sd$i >/dev/null 2>&1   # drive write cache and read-lookahead on, APM/standby settings
  echo 64 > /sys/block/sd$i/queue/read_ahead_kb
  echo "write through" > /sys/block/sd$i/queue/write_cache
  echo 1536 > /sys/block/sd$i/queue/nr_requests
  echo 1280 > /sys/block/sd$i/queue/max_sectors_kb
  echo 0 > /sys/block/sd$i/queue/nomerges
  echo 32767 > /sys/block/sd$i/queue/iosched/read_expire
  echo 32767 > /sys/block/sd$i/queue/iosched/write_expire
  echo 32 > /sys/block/sd$i/device/queue_depth
done
Seems to have added a little stability. I think nr_requests may be at the heart of the issues. On another thread someone suggested changing it to 2 or 4, as the disks may not be the best at managing the queue.
The only other thing that is somewhat of an issue is that when the 24.04 server is rebooted, the dataset has to run through a z_upgrade process that takes a while on the next receive.
|
I did notice this on some of my systems; I think the pattern is that I was seeing it mainly on systems with integrated SATA ports. When I swapped to LSI SAS HBAs instead of the motherboard ports this went away, so maybe ZFS is heavily using some queuing feature that the onboard SATA controller does not implement correctly. |
I tried doing what @prgwiz suggested. It didn't help. FWIW, @howels - I was not using SATA ports; all my drives are in fact SAS drives. Last update: I was able to trigger it when deleting folders directly from the host. (
It did seem related to only one dataset - however, I didn't want to chance it. I've decided to nuke my primary pool - currently restoring off my backup pool. 24h to go! Nice side effect of rebalancing data on my drives after adding some new capacity. Decided to remove encryption while I was at it, just in case it was a problem. If it happens again, I'll run smb/nfs directly on the host instead of in a container/cgroups to see if that's related. |
Did you try setting nr_requests to 2?
|
There are a few different issues here. I will try to respond to each, as it is clear this slipped through the cracks.

@AlexOwen The txg_sync thread waiting excessively is often a symptom of the underlying storage being slow at servicing IOs. The historically most common cause of this would be a disk having trouble getting sector ECC to return a sensible value. Many disks will retry reads for an excessively long time before deciding the sector is bad and reporting it to ZFS. It would seem counter-intuitive for this to block the txg_sync thread, because the txg_sync thread primarily writes, but at a low level ZFS does copy-on-write, and if it only knows a partial amount of what must be written, then it must read the existing record on disk to get the remainder before it can do a write. This can happen more often than you would think, given that various metadata updates only affect partial records. For example, if we are modifying a partial record on a large file, we must read each level of the indirect block tree until we know where the original record is. Then we need to write that out, plus new records for each level of the indirect block tree. Thankfully, these will be cached from the initial reads, sparing us the need to read them again, but if the drive enters error correction on sectors at a few of these levels, the reads needed for us to be able to begin the writes that the txg_sync thread does can be delayed by a significant amount of time. If each one is delayed by 30 seconds, and there are 5 levels, then that would easily exceed 120 seconds and cause a report that the txg_sync thread has hung.

In any case, something like this could be what happened on your machine. You might get some hint about this by looking at SMART data, but usually it won't report how many errors it corrected and only reports reallocated sector counts, which makes this problem a headache to diagnose.

It is possible on some drives to set Error Recovery Control to address this issue: https://en.wikipedia.org/wiki/Error_recovery_control The recommendation to set error recovery control is documented here: Documentation on how to set this is here: I wrote the original recommendation to set it to 0.1 in the documentation and I suggest setting it to that for performance purposes. Interestingly, Wikipedia claims that ZFS has handling for this, but I am not actually aware of any such feature. I have been less active for the past few years, so either the feature was added when I was not looking or Wikipedia is wrong. My suggestion is to follow the documentation on this one, as the others should have modified it if there were any change in this area. Unfortunately, not all drives support error recovery control, but there are still some things you can do, so keep reading.

Regular scrubs might help with this. A scrub will force the drive to attempt to read all written sectors. That will cause delays if it is slow at reading (unless you set error recovery control), but after a while, it will either self-correct or pass an error to ZFS, which will correct the issue. After the scrub has finished, any sectors in a state that causes delays should be fixed.

If regular scrubs do not work, I suggest watching the IO completion times of your drives. One is likely much slower than the others. You can sort of use iostat to do this, but it generates averages rather than histograms. What you really want is a tool that generates histograms so you can catch outliers.
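Coming back to the error recovery control point above: on drives that support SCT ERC, it can usually be inspected and set with smartctl. A sketch, assuming /dev/sdX is a placeholder; values are in units of 100 ms, so 1 corresponds to the 0.1 s recommendation, and the setting is often lost on power cycle, so it has to be reapplied at boot:
smartctl -l scterc /dev/sdX        # show the current read/write recovery limits
smartctl -l scterc,1,1 /dev/sdX    # limit read and write error recovery to roughly 0.1 s each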
For generating those histograms, biolatency from bcc could work: https://github.com/iovisor/bcc/blob/master/tools/biolatency.py Alternatively, you could skip to using this utility to identify the bad drive upfront and then try using a scrub and see if it helps to remediate the drive. If things continue to happen even after a scrub and you have identified the bad drive, you should just replace the drive.

That said, I am not ruling out the possibility of a bug, but I think we should try to rule out the more obvious explanations. Others posting that these issues are not present on non-Linux operating systems suggests the possibility that there is a Linux bug that is affecting us. This would give similar symptoms to what I described here and likely would be detectable using the biolatency tool, although in the case of a Linux bug, I suspect we would see many drives having high outlier IO completion times rather than just one.

There were also a few replies saying that this affects Proxmox but not Ubuntu, and that changing the IO elevator helps. Interestingly, ZFS, when given the disks directly, sets the IO elevator to noop, while Proxmox might be doing its own partitioning, which would prevent ZFS from setting the IO elevator. This ordinarily would not be the end of the world, but it might be related to the issues people are having. What IO elevator is set at

There is a report that replacing a SMR drive with a non-SMR drive solved this problem. SMR could cause sluggishness whenever ZFS tries to write to a sector. Are you using any SMR drives? Finally, are you using data deduplication?

@HiFiPhile @seanmikhaels @kebian @rkorzeniewski @Blackclaws @lukaszaoralek @kyle0r @laurensbl @MatthiasKuehneEllerhold @rgrunbla @sotiris-bos @carmenguarino @bombcar @midzelis Your issues look the same as the original poster's and my advice is the same for you. It is possible that there are multiple issues causing the same symptoms, but I cannot tell your issues apart from the original poster's issue based on what you reported.

@lassebm @rejsmont Your issue is different. In your case, ZFS submitted the request to the block layer and it got stuck. This issue is either in the block layer or your hardware. I suggest filing a separate issue for this.

@Dinth I see from the history that your issue is deduplication related, which is another mechanism by which the txg_sync thread can be delayed.

@queenkjuul @foxx Your issues are different than the others here. You might have hit a bug, although I have a suspicion this bug has been fixed since you posted.

@SpikedCola I really wish I had seen this 4 years ago. Since you ran a scrub and it did not address the issue, my suggestion would have been to skip the advice I gave to the original poster and use a tool such as biolatency from BCC to monitor IO completion times.

@kirscheGIT @0x3333 Your issues are similar to the others, but different. The txg_sync thread is making forward progress faster than others' here, but not fast enough for the zpool command to unblock on it. My advice is the same as it is for the others.

@dm17 USB flash drives tend to be bad at random IO. I suspect the one you have is exceptionally bad at it. I suggest getting a different USB flash drive. You will likely have problems with this one for anything other than writing large files to it using FAT32. I suggest filing a separate issue for this to discuss this there. Maybe we could make this better, but some USB flash drives are really terrible and there is not much hope.
@kjelderg You did not post a backtrace, but you said you had almost exactly the same problem as @rkorzeniewski. My suggestion for you would have been the same as it is for him, although upon reading your later remark that switching to FreeBSD fixed the issue, I am beginning to wonder if there is a Linux bug that is affecting us. :/

@matti Unfortunately, I need more information than "same here" to give advice, but if it really is the same backtrace, then my remarks to the original poster apply.

@khinsen Your remark makes me think there might be a Linux bug affecting us. The biolatency tool I suggested could help prove it.

@Red-Eyed Your backtrace is slightly different than the others. Python did a synchronous write, which must wait for past transaction groups to have finished so that the ZIL record for it will be read upon resume if there is a power loss. However, the transaction group commits are running slowly. They are not slow enough to trigger a warning for the txg_sync thread, but they are slow enough to cause this synchronous write to issue a warning. I suggest filing a separate issue with @kirscheGIT and @0x3333. I believe your issues are either the same or very similar.

@ThatCoffeeGuy This affecting Proxmox but not Ubuntu is a very interesting data point. Thank you.

@agrevtcev This is an interesting data point as well. Thank you.

@phjr That is a great data point. Thanks.

@markusdd The data points on a pool scrub and on Ubuntu Linux not having this issue when Rocky Linux did are interesting. The first means something other than drive error correction is at work on your system. Did you partition the drives yourself or did you let ZFS handle partitioning? If you did the partitioning yourself, what IO elevator is in use? I wonder if the IO elevators being used on Rocky Linux and on Ubuntu are different. Does setting it to none make things better on Rocky Linux?

@kobuki Hearing that this affects you on Ubuntu is interesting, considering that a number of others stated that switching to Ubuntu solved their problem. I suspect that you are having a different underlying problem than the others, but the symptoms are the same. My advice is to try what I suggested to the original poster.

@Real-Ztrawberry You and @MatthiasKuehneEllerhold are correct that this is typically the result of the storage hardware running slowly. The reasons for this can vary, but a common cause would be error correction, which is why I am suggesting people try setting up error recovery control if their drives support it.

@ananthb That is a good data point on this affecting Linux 5.4.y. Did you partition the drives yourself or did you let ZFS handle the partitioning? If you partitioned them yourself, what IO elevator is being used?

@furicle Your issue is distinct from the one others are reporting here. Does running

@dperelman That is a good data point on the USB hub. As for it causing ZFS to be unrecoverable, there is not much that we can do when a pool is on a single drive and that drive hangs. That issue hangs all filesystems, not just ZFS. In any case, your single-drive hanging issue is a distinct problem from what the majority of others are having here, although it might be related to the issues that most of the other Raspberry Pi users are having. That said, I suggest you file a separate issue with your suggestion that we be more robust in this scenario. Please include as much as you can with your suggestion.
@ms-jpq The data not being there when you force a reboot is because the transaction group commit is what contains that data, and forcing a reboot before it has finished means that it never happened. This does not solve the problem of it running slowly, however. I suggest using biolatency to see what is happening in terms of your disks' IO completion times. As for send/recv of a zvol being ridiculously slow, does running

@c-imp11 Unfortunately, I need more data than that to provide a useful reply, although perhaps what I suggested for the others might help you.

@4xoc Nobody running FreeBSD has reported this issue. It is Linux exclusive, and it seems to largely only affect certain Linux distributions. Ubuntu is largely unaffected. There is one person who claimed to be affected on Ubuntu, but I suspect his issue is different than the others and has a different underlying cause, with the same symptoms. |
If anyone is affected by txg_sync being blocked for more than 120 seconds, here is a list of questions:
If you can readily reproduce the problem on your machine, please install iovisor/bcc and use the biolatency tool to see if any drives have abnormal IO latency outliers. You will know them if you see them, since they will be >100ms and possibly even in the 10s of seconds. Note that you will need to specify a long period for biolatency's summaries to catch extremely long IO hangs (for example, 120 seconds would catch most cases of the latencies being in the 10s of seconds). If one drive has abnormal IO latency outliers, does replacing it fix the problem? If all drives have abnormal IO latency outliers, then that indicates a serious problem between the drives and ZFS (e.g. bad controller, bad IO elevator, bad storage driver). Please let me know if you are in a situation where all drives have IO latency outliers. |
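A possible invocation, assuming the bcc tools are installed (the tool may be packaged as biolatency-bpfcc on Debian/Ubuntu):
sudo biolatency -D 120 1    # one 120-second summary, with a separate latency histogram per disk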
I was curious whether the biolatency utility would catch latencies caused by the Linux IO elevator, so I checked the kernel sources. It turns out that the instrumentation it uses fires before the Linux IO elevator is given IOs, so biolatency should catch latencies caused by the Linux IO elevator. Specifically, the starting tracepoint is trace_block_io_start(), which is in this call stack (based on the Linux 6.12.6 sources):
This assumes that device mapper, MD, ZRAM, BRD, multipath and a few other extremely obscure things are not being used under ZFS. Those use a different call path and they do not use the trace point. The full list as found by cscope is:
Interestingly, zvols also qualify to be in that list. biolatency does not instrument bdev_start_io_acct(), which is what most of the things in that list use. |
System information
The pool is 40TB raw (8 x 5TB) in RAIDZ2. De-duplication is off.
Drives are new and verified working with badblocks (4 passes) and smart tests.
Compression is set to lz4.
This has been verified with native encryption on and off on the same hardware.
The drives are in 2 x external thunderbolt enclosures, which would be a red flag had they not run for > 1 year on another system with different (same model) drives without fault.
Describe the problem you're observing
The IO hangs during long rsync jobs (e.g. 1TB+) over the network, or shorter ones locally (one pool to another) (1GB+).
This causes rsync to hang, and any other programs that use the pool will also go unresponsive if started (as would be expected if the IO is hung). E.g. the console is responsive over SSH, but
ls
on the root of the pool hangs. When this happened on ZFS 0.7.x it kernel panicked and crashed the computer, but since 0.8.1 it only causes pool-IO-related tasks to hang; the server remains responsive otherwise.
This seems related to #2934, #7946, #4361, #2611 and #4011.
The system only has 8GB RAM, which may or may not be an issue, but I wouldn't expect a kernel panic regardless. Other users on the related issues report the same error after increasing RAM, or with more RAM initially.
CPU usage is generally high (>80%) and write speeds are low throughout the transfer when the pool is encrypted, but this is expected as I believe the missing kernel symbols patch is not in this release (scheduled for 0.8.2?).
Describe how to reproduce the problem
Any long rsync job seems to crash it after a few hours; a fast, large local rsync job (1GB+ pool to pool) will crash it almost immediately.
Include any warning/errors/backtraces from the system logs
Initially:
Then periodically: