
ZFS corruption related to snapshots post-2.0.x upgrade #12014

Open
jgoerzen opened this issue May 8, 2021 · 251 comments
Labels
Component: Encryption ("native encryption" feature), Status: Triage Needed (New issue which needs to be triaged), Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

@jgoerzen

jgoerzen commented May 8, 2021

System information

Type Version/Name
Distribution Name Debian
Distribution Version Buster
Linux Kernel 5.10.0-0.bpo.5-amd64
Architecture amd64
ZFS Version 2.0.3-1~bpo10+1
SPL Version 2.0.3-1~bpo10+1

Describe the problem you're observing

Since upgrading to 2.0.x and enabling crypto, every week or so, I start to have issues with my zfs send/receive-based backups. Upon investigating, I will see output like this:

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:03:37 with 0 errors on Mon May  3 16:58:33 2021
config:

	NAME         STATE     READ WRITE CKSUM
	rpool        ONLINE       0     0     0
	  nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>

Of note, the <0xeb51> is sometimes a snapshot name; if I zfs destroy the snapshot, it is replaced by this tag.

Bug #11688 implies that zfs destroy on the snapshot and then a scrub will fix it. For me, it did not. If I run a scrub without rebooting after seeing this kind of zpool status output, I get the following in very short order, and the scrub (and eventually much of the system) hangs:

[393801.328126] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
[393801.328129] PANIC at arc.c:3790:arc_buf_destroy()
[393801.328130] Showing stack for process 363
[393801.328132] CPU: 2 PID: 363 Comm: z_rd_int Tainted: P     U     OE     5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1
[393801.328133] Hardware name: Dell Inc. XPS 15 7590/0VYV0G, BIOS 1.8.1 07/03/2020
[393801.328134] Call Trace:
[393801.328140]  dump_stack+0x6d/0x88
[393801.328149]  spl_panic+0xd3/0xfb [spl]
[393801.328153]  ? __wake_up_common_lock+0x87/0xc0
[393801.328221]  ? zei_add_range+0x130/0x130 [zfs]
[393801.328225]  ? __cv_broadcast+0x26/0x30 [spl]
[393801.328275]  ? zfs_zevent_post+0x238/0x2a0 [zfs]
[393801.328302]  arc_buf_destroy+0xf3/0x100 [zfs]
[393801.328331]  arc_read_done+0x24d/0x490 [zfs]
[393801.328388]  zio_done+0x43d/0x1020 [zfs]
[393801.328445]  ? zio_vdev_io_assess+0x4d/0x240 [zfs]
[393801.328502]  zio_execute+0x90/0xf0 [zfs]
[393801.328508]  taskq_thread+0x2e7/0x530 [spl]
[393801.328512]  ? wake_up_q+0xa0/0xa0
[393801.328569]  ? zio_taskq_member.isra.11.constprop.17+0x60/0x60 [zfs]
[393801.328574]  ? taskq_thread_spawn+0x50/0x50 [spl]
[393801.328576]  kthread+0x116/0x130
[393801.328578]  ? kthread_park+0x80/0x80
[393801.328581]  ret_from_fork+0x22/0x30

However I want to stress that this backtrace is not the original cause of the problem, and it only appears if I do a scrub without first rebooting.

After that panic, the scrub stalled -- and a second error appeared:

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Sat May  8 08:11:07 2021
	152G scanned at 132M/s, 1.63M issued at 1.41K/s, 172G total
	0B repaired, 0.00% done, no estimated completion time
config:

	NAME         STATE     READ WRITE CKSUM
	rpool        ONLINE       0     0     0
	  nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>
        rpool/crypt/debian-1/home/jgoerzen/no-backup@[elided]-hourly-2021-05-07_02.17.01--2d:<0x0>

I have found that the solution to this issue is to reboot into single-user mode and run a scrub. Sometimes it takes several scrubs, maybe even with some reboots in between, but eventually it will clear up the issue. If I reboot before scrubbing, I do not get the panic or the hung scrub.
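For reference, that recovery sequence amounts to roughly the following (snapshot names here are placeholders; as noted, the reboot has to come before the scrub):

# destroy any snapshots still named in the error report
zfs destroy rpool/crypt/debian-1/home/jgoerzen/no-backup@bad-snapshot
# reboot into single-user mode, then scrub; scrubbing without the reboot is
# what triggers the arc_buf_destroy() panic shown above
zpool scrub rpool
zpool status -v rpool   # repeat the scrub (and reboot) until the errors clear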

I run this same version of ZoL on two other machines, one of which runs this same kernel version. What is unique about this machine?

  • It is a laptop
  • It uses ZFS crypto (the others use LUKS)

I made a significant effort to rule out hardware issues, including running several memory tests and the built-in Dell diagnostics. I believe I have ruled that out.

Describe how to reproduce the problem

I can't at will. I have to wait for a spell.

Include any warning/errors/backtraces from the system logs

See above

Potentially related bugs

jgoerzen added the Status: Triage Needed and Type: Defect labels on May 8, 2021
@jgoerzen
Author

jgoerzen commented May 8, 2021

Two other interesting tidbits...

When I do the reboot after this issue occurs, the mounting of the individual zfs datasets is S L O W. Several seconds each, and that normally just flies by. After scrubbing, it is back to normal speed of mounting.

The datasets that have snapshot issues vary with each occurrence. Sometimes it's just one, sometimes many. But var is almost always included. (Though its parent, which has almost no activity ever, also is from time to time, so that's odd.)

@jstenback
Contributor

Same symptoms here, more or less. See also issue #11688.

@glueckself

glueckself commented May 9, 2021

I also have the symptom with the corrupted snapshots, without kernel panics so far.

So far it has only affected my Debian system with Linux 5.10 and zfs 2.0.3 (I've turned the server off for today, I can check the exact versions tomorrow). Also, while the system has the 2.0.3 zfs utils + module, the pool is still left on the 0.8.6 format. I wasn't able to execute zfs list -r -t all <affected dataset> - it displayed "cannot iterate filesystems" and only a few snapshots (instead of the tens it should have). Also, I couldn't destroy the affected snapshots because it said they didn't exist anymore. I couldn't send the dataset with syncoid at all.

On the corrupted system, after I got the mail from ZED, I manually ran a scrub at first, after which the zpool status said that there were no errors. However, the next zpool status, seconds after the first, again said that there were errors. Subsequent scrubs didn't clean the errors.

I've rebooted the server into an Ubuntu 20.10 live with zfs 0.8.4-1ubuntu11 (again, sorry that I haven't noted the version, can add it tomorrow) and after a scrub the errors were gone. Following scrubs haven't detected errors anymore. zfs list -r -t all ... again displayed a large list of snapshots.

The errors didn't seem to affect the data on the zvols (all 4 affected snapshots are of zvols). The zvols are used as disks for VMs with ext4 on them. I will verify them tomorrow.
EDIT: I checked one of the VM disks, neither fsck nor dpkg -V (verify checksums of all files installed from a package) could find any errors (except mismatching dpkg-checksums of config files I've changed - that is to be expected).

I have two other Ubuntu 21.04 based systems with zfs-2.0.2-1ubuntu5 which have not been affected so far. However, they already have their pools upgraded to 2. All are snapshotted with sanoid and have the datasets encrypted.

My next step will be to downgrade zfs back to 0.8.6 on the Debian system and see what happens.

EDIT:
More points I've noted while investigating with 0.8.4-1ubuntu11:

  • Creating new snapshots continued working for affected datasets, however destroying them didn't (right now I have 127 "frequently" snapshots (sanoid's term for the most frequent snapshot interval - in my case 15 minutes) instead of the 10 sanoid is configured to keep).
  • With 0.8, the destroying of the affected snapshots worked. Scrubbing afterwards didn't find any errors.

EDIT 2:

  • On 2.0.2 (Ubuntu 21.04 again), sanoid managed to successfully prune (destroy) all remaining snapshots that were past their valid time. A scrub afterwards didn't find any errors. I'll be running 2.0.2 for a while and see what happens.

@dcarosone

dcarosone commented May 21, 2021

I'm seeing this too, on Ubuntu 21.04, also using zfs encryption

I have znapzend running, and it makes a lot of snapshots. Sometimes, some of them are bad, and can't be used (for example, attempting to send them to a replica destination fails). I now use the skipIntermediates option, and so at least forward progress is made on the next snapshot interval.

In the most recent case (this morning) I had something like 4300 errors (many more than I'd seen previously). There are no block-level errors (read/write/cksum). They're cleared after destroying the affected snapshots and scrubbing (and maybe a reboot, depending on .. day?)

Warning! Speculation below:

  • this may be related to a race condition?
  • znapzend wakes up and makes recursive snapshots of about 6 first-level child datasets of rpool (ROOT, home, data, ...) all at the same time (as well as a couple of other pools, some of those still using LUKS for encryption underneath instead).
  • I have been having trouble with the ubuntu-native zsysd, which gets stuck at 100% cpu. Normally I get frustrated and just disable it.
  • However, recently, I have been trying to understand what it's doing and what's going wrong (it tries to collect every dataset and snapshot and property in memory on startup). It seems like this has happened several times in the past few days while I have been letting zsysd run (so more contention for libzfs operations)
  • Update I haven't seen this again since disabling zsysd .. ~3 weeks and counting.

@aerusso
Contributor

aerusso commented Jun 12, 2021

@jgoerzen Can you

  1. Capture the zpool events -v report when one of these "bad" snapshots is created?
  2. Try to zfs send that snapshot (i.e., to zfs send ... | cat >/dev/null; notice the need to use cat).
  3. Reboot, and try to zfs send the snapshot.

In my case, #11688 (which you already reference), I've discovered that rebooting "heals" the snapshot -- at least using the patchset I mentioned there
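Concretely, those three checks look something like this (snapshot names are placeholders):

# 1. capture the event log around the time the "bad" snapshot is created
zpool events -v > /root/zpool-events-$(date +%s).log
# 2. try to read the snapshot back in full; pipe through cat as noted above
zfs send rpool/crypt/debian-1/var@suspect-snap | cat > /dev/null
# 3. reboot, then repeat the same zfs send and compare the result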

@jgoerzen
Author

I'll be glad to. Unfortunately, I rebooted the machine yesterday, so I expect it will be about a week before the problem recurs.

It is interesting to see the discussion today in #11688. The unique factor about the machine that doesn't work for me is that I have encryption enabled. It wouldn't surprise me to see the same thing here, but I will of course wait for it to recur and let you know.

@jgoerzen
Author

Hello @aerusso,

The problem recurred over the weekend and I noticed it this morning.

Unfortunately, the incident that caused it had already expired out of the zpool events buffer (apparently), as it only went as far back as less than an hour ago. However, I did find this in syslog:

Jun 20 01:17:39 athena zed: eid=34569 class=authentication pool='rpool' bookmark=12680:0:0:98
Jun 20 01:17:39 athena zed: eid=34570 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=12680:0:0:242
Jun 20 01:17:40 athena zed: eid=34571 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=12680:0:0:261
...
Jun 20 17:17:39 athena zed: eid=37284 class=authentication pool='rpool' bookmark=19942:0:0:98
Jun 20 17:17:39 athena zed: eid=37285 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=19942:0:0:242
Jun 20 17:17:40 athena zed: eid=37286 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=19942:0:0:261
...
Jun 20 18:17:28 athena zed: eid=37376 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=21921:0:0:2072
Jun 20 18:17:29 athena zed: eid=37377 class=authentication pool='rpool' priority=2 err=5 flags=0x80 bookmark=21921:0:0:2072
Jun 20 18:17:29 athena zed: eid=37378 class=data pool='rpool' priority=2 err=5 flags=0x80 bookmark=21921:0:0:2072
Jun 20 18:17:40 athena zed: eid=37411 class=authentication pool='rpool' bookmark=21923:0:0:0

It should be noted that my hourly snap/send stuff runs at 17 minutes past the hour, so that may explain this timestamp correlation.

zpool status reported:

  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:04:12 with 0 errors on Sun Jun 13 00:28:13 2021
config:

	NAME         STATE     READ WRITE CKSUM
	rpool        ONLINE       0     0     0
	  nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x5c81>:<0x0>
        <0x3188>:<0x0>
        rpool/crypt/debian-1@athena-hourly-2021-06-20_23.17.01--2d:<0x0>
        rpool/crypt/debian-1/var@athena-hourly-2021-06-20_23.17.01--2d:<0x0>
        <0x4de6>:<0x0>

Unfortunately I forgot to attempt to do a zfs send before reboot. Those snapshots, though not referenced directly, would have been included in a send -I that would have been issued. From my logs:

Jun 20 18:17:03 athena simplesnapwrap[4740]: Running: /sbin/zfs send -I rpool/crypt/debian-1/var@__simplesnap_bakfs1_2021-06-20T22:17:02__ rpool/crypt/debian-1/var@__simplesnap_bakfs1_2021-06-20T23:17:03__
Jun 20 18:17:03 athena simplesnap[2466/simplesnapwrap]: internal error: warning: cannot send 'rpool/crypt/debian-1/var@athena-hourly-2021-06-20_23.17.01--2d': Invalid argument

So I think that answers the question.

After a reboot but before a scrub, the zfs send you gave executes fine.

@cbreak-black

I have similar symptoms, on an encrypted single-ssd ubuntu 21.04 boot pool, using stock zfs from ubuntu's repos. Deleting the affected snapshots and scrubbing previously cleared the errors, but on recurrence, repeated scrubbing (without deleting them) caused a deadlock. My system has ECC memory, so it's probably not RAM related.

  • Does this problem happen with slower pools (like hard disk pools?)
  • Does this problem happen with pools that have redundancy?
  • Does this problem happen with pools that don't trim (hard disk pools again?)

@aerusso
Contributor

aerusso commented Jul 4, 2021

@cbreak-black Was there a system restart between the occurrence of the corrupted snapshot and the problems? Restarting has "fixed" this symptom for me (though you will need to scrub twice for the message to disappear, I believe).

I have a suspicion that this may be a version of #10737 , which has an MR under way there. The behavior I am experiencing could be explained by that bug (syncoid starts many zfs sends on my machine, some of which are not finished; SSDs do the send much faster, so are more likely to get deeper into the zfs send before the next command in the pipeline times out; a reboot heals the issue, for me; there's no on disk corruption, as far as I can tell).

I'm holding off on trying to bisect this issue (at least) until testing that MR. (And all the above is conjecture!)

@cbreak-black

@aerusso No, without a restart I got into the scrub-hang, and had to restart hard. Afterwards, the scrub finished, and several of the errors vanished. The rest of the errors vanished after deleting the snapshots and scrubbing again.

@InsanePrawn
Contributor

InsanePrawn commented Jul 4, 2021

Can I join the club too? #10019
Note how it's also at 0x0. Sadly I deleted said snapshot and dataset by now.

@aerusso
Contributor

aerusso commented Jul 4, 2021

@InsanePrawn I can't seem to find commit 4d5b4a33d in any repository I know of (and neither can GitHub, apparently). However, in your report you say this was a "recent git master" and the commit I'm currently betting on being guilty is da92d5c, which was committed in November of the previous year, so I can't use your data point to rule out my theory!

Also, it sounds like you didn't have any good way to reproduce the error --- however, you were using a test pool. Compared to my reproduction strategy (which is just, turn my computer on and browse the web, check mail, etc.) it might be easier to narrow in on a test case (or might have been easier a year and a half ago, when this was all fresh). Anyway, if you have any scripts or ideas of what you were doing that caused this besides "snapshots being created and deleted every couple minutes", it might be useful too. (I already tried lots of snapshot creations and deletions during fio on several datasets in a VM).

@InsanePrawn
Contributor

InsanePrawn commented Jul 4, 2021

Yeah, idk why I didn't go look for the commit in my issue - luckily for us, that server (and pool; it does say yolo, but it's my private server's root pool. It's just that I won't cry much if it breaks; originally built due to then-unreleased crypto) and the git repo on it still exist. Looks like 4d5b4a33d was two systemd-generator commits by me after 610eec4

@InsanePrawn
Contributor

InsanePrawn commented Jul 4, 2021

FWIW the dataset the issue appeared on was an empty filesystem dataset (maybe a single small file inside) that had snapshots (without actual fs activity) taken at quick intervals (somewhere between 30s and 5m) in parallel with a few (5-15) other similarly empty datasets.
Edit: These were being snapshotted and replicated by zrepl, probably in a similar manner to what znapzend does.

The pool is a raidz2 on 3.5" spinning SATA disks.
I'm afraid I have nothing more to add in terms of reproduction :/

Edit: Turns out the dataset also still exists, the defective snapshot however does not anymore. I doubt that's helpful?

@aerusso
Contributor

aerusso commented Jul 5, 2021

@InsanePrawn Does running the zrepl workload reproduce the bug on 2.0.5 (or another recent release?)

I don't think the snapshot is terribly important --- unless you're able to really dig into it with zdb (which I have not developed sufficient expertise to do). Rather, I think it's the workload, hardware setup, and (possibly, but I don't understand the mechanism at all) the dataset itself. Encryption also is a common theme, but that might just affect the presentation (i.e., there's no MAC to fail in the unencrypted, unauthenticated, case).

Getting at zpool events -v showing the error would probably tell us something (see mine).

@cbreak-black

I've since added redundancy to my pool (it's now a mirror with two devices), and disabled autotrim. The snapshot corruption still happens. Still don't know what is causing it. And I also don't know if the corruption happens when creating the snapshot, and only later gets discovered (when I try to zfs send the snapshots), or if snapshots get corrupted some time in between creation and sending.

@aerusso
Contributor

aerusso commented Aug 14, 2021

@cbreak-black Can you enable the all-debug.sh ZEDlet, and put the temporary directory somewhere permanent (i.e., not the default of /tmp/zed.debug.log)?

This will get the output of zpool events -v as it is generated, and will give times, which you can conceivably triangulate with your other logs. There's other information in those logs that is probably useful, too.
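On a typical Linux install that comes down to something like the following (zedlet and config paths vary by distro, so treat these as assumptions):

# link the all-debug zedlet into the active zed.d directory (Debian-ish path shown)
ln -s /usr/lib/zfs-linux/zed.d/all-debug.sh /etc/zfs/zed.d/
# point its output somewhere that survives reboots (default is /tmp/zed.debug.log)
echo 'ZED_DEBUG_LOG="/var/log/zed.debug.log"' >> /etc/zfs/zed.d/zed.rc
systemctl restart zfs-zed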

I'll repeat this here: if anyone gets me a reliable reproducer on a new pool, I have no doubt we'll be able to solve this in short order.

@wohali

wohali commented Sep 1, 2021

Just mentioning here that we saw this on TrueNAS 12.0-U5 with OpenZFS 2.0.5 as well -- see #11688 (comment) for our story.

@rincebrain
Contributor

Since I don't see anyone mentioning it here yet, #11679 contains a number of stories about the ARC getting confused when encryption is involved and, in a very similar looking illumos bug linked from there, eating data at least once.

@gamanakis
Contributor

gamanakis commented Sep 30, 2021

@jgoerzen are you using raw send/receive? If yes this is closely related to #12594.

@jgoerzen
Author

@gamanakis Nope, I'm not using raw (-w).

@phreaker0

it's present in v2.1.1 as well:

Okt 09 01:01:14 tux sanoid[2043026]: taking snapshot ssd/container/debian-test@autosnap_2021-10-08_23:01:14_hourly
Okt 09 01:01:16 tux sanoid[2043026]: taking snapshot ssd/container/debian-test@autosnap_2021-10-08_23:01:14_frequently
Okt 09 01:01:16 tux kernel: VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
Okt 09 01:01:16 tux kernel: PANIC at arc.c:3836:arc_buf_destroy()
Okt 09 01:01:16 tux kernel: Showing stack for process 435
Okt 09 01:01:16 tux kernel: CPU: 2 PID: 435 Comm: z_rd_int_1 Tainted: P           OE     5.4.0-84-generic #94-Ubuntu
Okt 09 01:01:16 tux kernel: Hardware name: GIGABYTE GB-BNi7HG4-950/MKHM17P-00, BIOS F1 05/24/2016
Okt 09 01:01:16 tux kernel: Call Trace:
Okt 09 01:01:16 tux kernel:  dump_stack+0x6d/0x8b
Okt 09 01:01:16 tux kernel:  spl_dumpstack+0x29/0x2b [spl]
Okt 09 01:01:16 tux kernel:  spl_panic+0xd4/0xfc [spl]
Okt 09 01:01:16 tux kernel:  ? kfree+0x231/0x250
Okt 09 01:01:16 tux kernel:  ? spl_kmem_free+0x33/0x40 [spl]
Okt 09 01:01:16 tux kernel:  ? kfree+0x231/0x250
Okt 09 01:01:16 tux kernel:  ? zei_add_range+0x140/0x140 [zfs]
Okt 09 01:01:16 tux kernel:  ? spl_kmem_free+0x33/0x40 [spl]
Okt 09 01:01:16 tux kernel:  ? zfs_zevent_drain+0xd3/0xe0 [zfs]
Okt 09 01:01:16 tux kernel:  ? zei_add_range+0x140/0x140 [zfs]
Okt 09 01:01:16 tux kernel:  ? zfs_zevent_post+0x234/0x270 [zfs]
Okt 09 01:01:16 tux kernel:  arc_buf_destroy+0xfa/0x100 [zfs]
Okt 09 01:01:16 tux kernel:  arc_read_done+0x251/0x4a0 [zfs]
Okt 09 01:01:16 tux kernel:  zio_done+0x407/0x1050 [zfs]
Okt 09 01:01:16 tux kernel:  zio_execute+0x93/0xf0 [zfs]
Okt 09 01:01:16 tux kernel:  taskq_thread+0x2fb/0x510 [spl]
Okt 09 01:01:16 tux kernel:  ? wake_up_q+0x70/0x70
Okt 09 01:01:16 tux kernel:  ? zio_taskq_member.isra.0.constprop.0+0x60/0x60 [zfs]
Okt 09 01:01:16 tux kernel:  kthread+0x104/0x140
Okt 09 01:01:16 tux kernel:  ? task_done+0xb0/0xb0 [spl]
Okt 09 01:01:16 tux kernel:  ? kthread_park+0x90/0x90
Okt 09 01:01:16 tux kernel:  ret_from_fork+0x1f/0x40

@phreaker0

@aerusso you wrote that da92d5c may be the cause of this issue. My workstation at work panics after a couple of days and I need to reset it. Could you provide a branch of 2.1.1 with this commit reverted (as the revert causes merge conflicts I can't fix myself) so I could test whether the machine no longer crashes?

@aerusso
Contributor

aerusso commented Oct 14, 2021

@phreaker0 Unfortunately, the bug that da92d5c introduced (#10737) was fixed by #12299, which I believe is present in all maintained branches now. It does not fix #11688 (which I suspect is the same as this bug).

I'm currently running 0.8.6 on Linux 5.4.y, and am hoping to wait out this bug (I don't have a lot of time right now, or for the foreseeable future). But, if you have a reliable reproducer (or a whole lot of time) you could bisect while running 5.4 (or some other pre-5.10 kernel). I can help anyone who wants to do that. If we can find the guilty commit, I have no doubt this can be resolved.
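Such a bisect has roughly this shape (tags and the reproducer are placeholders; each step needs a module rebuild and reboot):

git clone https://github.com/openzfs/zfs.git && cd zfs
git bisect start
git bisect bad zfs-2.0.0      # first release series where corruption was seen
git bisect good zfs-0.8.6     # known-good baseline
# at each bisect step: build, install, load the module, run the reproducer
sh autogen.sh && ./configure && make -s -j"$(nproc)"
sudo make install && sudo depmod -a && sudo modprobe zfs
# then mark the outcome and let git pick the next commit
git bisect good               # or: git bisect bad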

@HankB

HankB commented Feb 11, 2025

@Germano0 Yes, probably.

I will, of course, defer to the ZFS and kernel devs for any direction they would like this to take. My desire was to see if there was a way to reproducibly provoke corruption as the first step in tracking down the regression. My repo (https://github.com/HankB/provoke_ZFS_corruption) is a bit rough but provides that capability. I'm working to improve that, particularly the instructions.

The next step is to determine which commit introduced the regression via git bisect. I have not yet started that. Anyone following my process will likely find errors and omissions and things that are just not clear. Those should probably be referred to me at my repo and not here. It would certainly help to have others try to reproduce my results.

I'll repeat my offer to either start another issue relevant to this issue in OpenZFS or continue discussion at my repo if anyone thinks this is unnecessarily cluttering this issue.

@wohali

wohali commented Feb 12, 2025

@wohali from what I can tell pyznap does not use "receive -s" by default. Do you know if you were using this feature?

It was likely enabled, yes. I don't recall for sure, but we move enough data that restarting a full transfer would be very bad.

I agree that I don't see how something on the receiving end could trigger this bug, though.

@dcarosone

dcarosone commented Feb 12, 2025

When I was seeing it with znapzend, there was no -s. And, yes, it shouldn't matter how it's received, other than perhaps via some timing and back-pressure effect.

@gamanakis
Contributor

So, I am able to finally hit this on 2.3.0 with the scripts @HankB provides, but it takes a couple of hours on x86_64.
I think it actually fails to decrypt some dbufs; I don't understand why yet.

The failing path which generates the authentication errors is:
arc_untransform() -> arc_buf_fill() -> arc_fill_hdr_crypt() -> arc_hdr_decrypt() -> spa_do_crypt_abd() -> zio_do_crypt_data()

Now the last one fails with ECKSUM. I instrumented the code with:

--- a/module/zfs/dsl_crypt.c
+++ b/module/zfs/dsl_crypt.c
@@ -2827,6 +2832,7 @@ spa_do_crypt_abd(boolean_t encrypt, spa_t *spa, const zbookmark_phys_t *zb,
        int ret;
        dsl_crypto_key_t *dck = NULL;
        uint8_t *plainbuf = NULL, *cipherbuf = NULL;
+       const uint8_t zeroed_mac[ZIO_DATA_MAC_LEN] = {0};

        ASSERT(spa_feature_is_active(spa, SPA_FEATURE_ENCRYPTION));

@@ -2873,6 +2879,25 @@ spa_do_crypt_abd(boolean_t encrypt, spa_t *spa, const zbookmark_phys_t *zb,
        ret = zio_do_crypt_data(encrypt, &dck->dck_key, ot, bswap, salt, iv,
            mac, datalen, plainbuf, cipherbuf, no_crypt);

+       if (ret != 0) {
+
+               if (memcmp(mac, zeroed_mac, ZIO_DATA_MAC_LEN) != 0) {
+                       cmn_err(CE_NOTE, "mac is not zeroed out");
+               }
+
+               cmn_err(CE_NOTE,
+                   "(%i, %s, %p, %d, %p, %p, %u, %s, %p, %p, %d)\n",
+                   encrypt ? "encrypt" : "decrypt",
+                   salt, ot, iv, mac, datalen,
+                   bswap ? "byteswap" : "native_endian", plainbuf,
+                   cipherbuf, *no_crypt);
+
+               cmn_err(CE_NOTE, "\tkey = {");
+               for (int i = 0; i < dck->dck_key.zk_current_key.ck_length/8; i++)
+                       cmn_err(CE_NOTE, "%02x ", ((uint8_t *)dck->dck_key.zk_current_key.ck_data)[i]);
+               cmn_err(CE_NOTE, "do_crypt fails");
+       }
+
        /*
         * Handle injected decryption faults. Unfortunately, we cannot inject
         * faults for dnode blocks because we might trigger the panic in

Getting insight into whether the MAC is indeed zeroed out (and that's why the decryption fails), or into what no_crypt is set to, might help further. I have yet to hit this again, though.

@gamanakis
Contributor

gamanakis commented Feb 14, 2025

Ok, so using the instrumentation above we have:

[Feb14 16:13] NOTICE: MAC is not zeroed out!
[  +0.000006] NOTICE: (52, decrypt, 00000000ef65fc3e, 10, 00000000342f65bd, 0000000017f46b24, 16384, native_endian, 00000000ff8a0891, 000000005eafd8f1, 0)

[  +0.000002] NOTICE:   key = {
[  +0.000001] NOTICE: f3
[  +0.000001] NOTICE: 24
[  +0.000000] NOTICE: 4b
[  +0.000001] NOTICE: 45
[  +0.000001] NOTICE: a9
[  +0.000000] NOTICE: dd
[  +0.000001] NOTICE: b6
[  +0.000001] NOTICE: 4e
[  +0.000001] NOTICE: 2c
[  +0.000000] NOTICE: e3
[  +0.000001] NOTICE: 70
[  +0.000001] NOTICE: 8f
[  +0.000000] NOTICE: 8d
[  +0.000001] NOTICE: a6
[  +0.000001] NOTICE: 10
[  +0.000001] NOTICE: 9a
[  +0.000000] NOTICE: eb
[  +0.000001] NOTICE: 25
[  +0.000001] NOTICE: a1
[  +0.000001] NOTICE: c6
[  +0.000000] NOTICE: 49
[  +0.000001] NOTICE: 04
[  +0.000001] NOTICE: 35
[  +0.000000] NOTICE: 9a
[  +0.000001] NOTICE: 64
[  +0.000001] NOTICE: e9
[  +0.000001] NOTICE: 0a
[  +0.000000] NOTICE: 48
[  +0.000001] NOTICE: a5
[  +0.000001] NOTICE: 4e
[  +0.000001] NOTICE: c1
[  +0.000000] NOTICE: e4
[  +0.000001] NOTICE: do_crypt fails

EDIT: This arc_untransform() (in dmu_send_impl()) is not the one involved..

2393         /*
2394          * If this is a non-raw send of an encrypted ds, we can ensure that
2395          * the objset_phys_t is authenticated. This is safe because this is
2396          * either a snapshot or we have owned the dataset, ensuring that
2397          * it can't be modified.
2398          */
2399         if (!dspp->rawok && os->os_encrypted &&
2400             arc_is_unauthenticated(os->os_phys_buf)) {
2401                 zbookmark_phys_t zb;
2402
2403                 SET_BOOKMARK(&zb, to_ds->ds_object, ZB_ROOT_OBJECT,
2404                     ZB_ROOT_LEVEL, ZB_ROOT_BLKID);
2405                 err = arc_untransform(os->os_phys_buf, os->os_spa,
2406                     &zb, B_FALSE);
2407                 if (err != 0) {
2408                         dsl_pool_rele(dp, tag);
2409                         return (err);
2410                 }
2411
2412                 ASSERT0(arc_is_unauthenticated(os->os_phys_buf));
2413         }

I added a dumpstack into arc_untransform, let's see:

+#include <sys/debug.h>
+#ifdef _KERNEL
+               spl_dumpstack();
+#endif

@gamanakis
Contributor

So, when arc_untransform() fails, we have this stack:

[  +0.000001] Showing stack for process 3634749
[  +0.000002] CPU: 2 PID: 3634749 Comm: zfs Tainted: P           OE     5.15.0-131-generic #141-Ubuntu
[  +0.000003] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/13/2024
[  +0.000001] Call Trace:
[  +0.000002]  <TASK>
[  +0.000002]  show_stack+0x52/0x5c
[  +0.000005]  dump_stack_lvl+0x4a/0x63
[  +0.000003]  dump_stack+0x10/0x16
[  +0.000004]  spl_dumpstack+0x29/0x2f [spl]
[  +0.000011]  arc_untransform+0x96/0xb0 [zfs]
[  +0.000196]  dbuf_read_verify_dnode_crypt+0x16d/0x210 [zfs]
[  +0.000162]  dbuf_read+0x3c/0x5c0 [zfs]
[  +0.000182]  dmu_buf_hold_by_dnode+0x66/0xa0 [zfs]
[  +0.000096]  zap_lockdir+0x87/0xf0 [zfs]
[  +0.000160]  zap_lookup+0x51/0x110 [zfs]
[  +0.000110]  zfs_get_zplprop+0xb7/0x1b0 [zfs]
[  +0.000109]  nvl_add_zplprop+0x36/0xb0 [zfs]
[  +0.000108]  zfs_ioc_objset_zplprops+0xef/0x190 [zfs]
[  +0.000108]  zfsdev_ioctl_common+0x7d1/0xa00 [zfs]
[  +0.000107]  ? kvmalloc_node+0x5e/0xa0
[  +0.000006]  ? _copy_from_user+0x31/0x70
[  +0.000003]  zfsdev_ioctl+0x57/0xf0 [zfs]
[  +0.000100]  __x64_sys_ioctl+0x92/0xd0
[  +0.000004]  x64_sys_call+0x1e5f/0x1fa0
[  +0.000004]  do_syscall_64+0x56/0xb0
[  +0.000003]  ? syscall_exit_to_user_mode+0x2c/0x50
[  +0.000002]  ? x64_sys_call+0x1e5f/0x1fa0
[  +0.000003]  ? clear_bhb_loop+0x45/0xa0
[  +0.000002]  ? clear_bhb_loop+0x45/0xa0
[  +0.000002]  ? clear_bhb_loop+0x45/0xa0
[  +0.000002]  ? clear_bhb_loop+0x45/0xa0
[  +0.000001]  ? clear_bhb_loop+0x45/0xa0
[  +0.000002]  entry_SYSCALL_64_after_hwframe+0x6c/0xd6
[  +0.000004] RIP: 0033:0x7f6eaabac94f
[  +0.000003] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
[  +0.000002] RSP: 002b:00007ffc5473db80 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.000003] RAX: ffffffffffffffda RBX: 0000000000000021 RCX: 00007f6eaabac94f
[  +0.000001] RDX: 00007ffc5473dc50 RSI: 0000000000005a13 RDI: 0000000000000003
[  +0.000002] RBP: 00007ffc54741250 R08: 0000000000000000 R09: 00005597d138ca50
[  +0.000001] R10: 00007f6eaacad460 R11: 0000000000000246 R12: 00005597d13720b0
[  +0.000001] R13: 00007ffc54741288 R14: 00007ffc54741290 R15: 00007ffc5473dc50
[  +0.000003]  </TASK>

@gamanakis
Contributor

gamanakis commented Feb 14, 2025

It is zfs_get_zplprop() -> ... -> dbuf_read() -> dbuf_read_verify_dnode_crypt() -> arc_untransform() which eventually fails, suggesting the arc_untransform() there is not safe.
Kindly pinging @amotin, since you already took a look at that code with 4036b8d.

@amotin
Member

amotin commented Feb 15, 2025

@gamanakis This trace looks interesting. Since the mentioned commit I always thought that the problem might be caused by simultaneous access to encrypted and plain-text data, which the dbuf layer was never designed to handle. But I thought the problem was about access to snapshot data, for which I could not see a race, since before a snapshot is fully received and the TXG is committed it is not mounted and so cannot be accessed. In the stack quoted, though, I see zfs_get_zplprop, which means it is not a data access but an access to dataset/snapshot properties, which I see are stored in a ZAP in dnode 1 of the objset, which I guess shares the same dnode block with other, potentially encrypted dnodes. I'd need to investigate what zfs get ... does for a snapshot that is still in progress. It might be the missing trigger. I am on vacation till next week, so unless I get to it between skiing and resting, somebody could try to create a clean synthetic reproduction (probably including some zfs get while doing encrypted receive) that I could look at when I'm back.

PS: Looking again, it may mean not just any properties, but one of: ZFS_PROP_VERSION, ZFS_PROP_NORMALIZE, ZFS_PROP_UTF8ONLY or ZFS_PROP_CASE. Maybe some others may trigger it too, but through a different code path.
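A rough sketch of such a synthetic attempt on a throwaway file-backed pool (all names, sizes, and the property list are assumptions; this is not a confirmed reproducer):

truncate -s 4G /var/tmp/tank.img
dd if=/dev/urandom of=/var/tmp/tank.key bs=32 count=1
zpool create -O dnodesize=auto tank /var/tmp/tank.img
zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///var/tmp/tank.key tank/src
zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///var/tmp/tank.key tank/dst

# churn plus a chain of snapshots on the encrypted source
for i in $(seq 1 50); do
    dd if=/dev/urandom of=/tank/src/f$i bs=1M count=4 status=none
    zfs snapshot tank/src@s$i
done

# hammer the zplprops path from the stack traces while a non-raw
# incremental send/receive is in flight
while true; do
    zfs get -H version,normalization,utf8only,casesensitivity tank/src tank/dst/copy >/dev/null 2>&1
done &
GETPID=$!

zfs send tank/src@s1 | zfs receive tank/dst/copy
zfs send -I @s1 tank/src@s50 | zfs receive -F tank/dst/copy
kill $GETPID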

@dcarosone

just to reiterate for clarity, this problem occurs on the sender, not the receiver

@amotin
Member

amotin commented Feb 15, 2025

Hmm. Then I have no idea how we can corrupt something without writing, unless it happens during snapshot creation. And IIRC send does not even use the dbuf layer, using ARC and ZIO layers directly, so that guess might be wrong too. But anyway, a more specific reproduction would be good, and the properties access still sounds interesting, or at least new.

@dcarosone

There might be send holds getting written, though I suspect it's not essential for repro because not all the tools will use them.

@dcarosone

might be caused by simultaneous access to encrypted and plain-text data

I stopped seeing the problem after switching to raw sends, so the data is not getting decrypted at the sender..

(there were other changes at the same time, and of course a time window before it was clear it hadn't recurred in a while, but I always suspected this was the key one)

@robn
Member

robn commented Feb 15, 2025

It might be during snapshot creation.

We have a customer with encrypted datasets. They have a script which is effectively zfs snapshot && zfs send -I, that is, very close together, incremental send, not raw send. Fairly regularly (multiple times a day), the send will fail with an IO error, as will followup sends on that snapshot. Deleting the snapshot is the only useful thing they can do with it. The next time it works, or it doesn't, so it definitely feels like something racy. The datasets are under heavy continuous write load, so there will be a delta between the snapshot and the filesystem almost immediately.

I don't have much to go on yet; we're still in information gathering, so not much to offer. But, it has had me thinking about an older issue where a receive could blow up in part because dnode encryption params for an object range being received could be overwritten by the params for the next range before the first range could be synced.

(That was reported in #11893, mitigated in #14358 but possibly never properly understood; we heard about it through a different customer that did not yet have the "fix", and it manifested differently; it wasn't until our own analysis that we understood what was happening, and a good-enough workaround at the time was to just upgrade them).

Now, I don't think its the same problem, but thinking about properties access is new, and it makes me wonder if it has a similar flavour. Could writing and reading from the "same" dbuf in the course of creating the snapshot cause one to get the other's encryption parameters, such that it now can't be read? If so I wonder if a strategically-placed txg_wait_synced() might mitigate the issue?

To be very clear: this is the edge of an idea, not even close to a theory. Mostly, I wanted to hop in and say that actually the write may be involved, and maybe I've seen something like this. Hopefully someone can connect the dots.
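One cheap way to poke at that idea from user space, purely as an experiment and not a proposed fix (dataset names are placeholders), is to force a txg sync between the snapshot and the send:

zfs snapshot -r tank/data@now
zpool sync tank                  # wait for the snapshot's txg to hit disk
zfs send -I tank/data@prev tank/data@now | ssh backuphost zfs receive -F backup/data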

@dcarosone

Does taking a new snapshot cause any property updates to the previous one (which might be undergoing send)?

@HankB

HankB commented Feb 15, 2025

Hmm. Then I have no idea how can we corrupt something without writing, unless it happens during snapshot creation.

I have two thoughts on this (with no knowledge of what's happening under the covers.)

  • Something gets written to mark the boundary from one snapshot to the live dataset.
  • Something gets shared between the snapshot/decryption/send that results in interpretation of corruption when none actually exists.

My first experience with this is on a laptop that uses a single-drive VDEV (bpool/rpool split) and native encryption. Over about 5 years (going back to when Buster == Testing) I have had syncoid backups configured for several datasets in my personal directory, each sent with its own syncoid invocation. When only these are employed, the corruption bug is never triggered. I also have sanoid configured to capture snapshots for these datasets (and the entire pool.)

At one point I thought it would be useful to capture the entire pool so I set up a recursive syncoid backup to do this on an hourly basis. Soon the "permanent errors" began to appear so I stopped the whole pool backups. (It may be important to know that syncoid does not use ZFS recursion but rather iterates through the pool whereas I have snapshots configured using ZFS recursion.)
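For illustration, the two modes differ roughly like this (hostnames and dataset names are placeholders):

# per-dataset backups, one syncoid invocation each -- never triggered the bug here
syncoid rpool/crypt/debian-1/home/user/projects backuphost:backup/projects
syncoid rpool/crypt/debian-1/home/user/mail     backuphost:backup/mail

# whole-pool backup: syncoid -r iterates over every dataset itself rather than
# using ZFS-level recursion -- the "permanent errors" appeared soon after enabling this
syncoid -r rpool backuphost:backup/rpool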

After I stopped the whole pool backups, the "permanent errors" eventually went away. Most recently I turned on whole pool backups prior to upgrading from Stable to Testing and by the next morning there was one permanent error. It has been weeks and is still there but I am confident it will eventually fall off the end. The receiving dataset is not encrypted so the -w flag is not used. All syncoid operations are incremental (except when there are no matching datasets on the receiving end.) The eventual disappearance of these errors leads me to believe the corruption is in the snapshots. Corruption has never been identified in the receiving pool. (Sorry for the shaggy dog story, but I don't know which of these factors might make a difference.)

Thank you all for your interest in this issue.

Edit: (More shaggy dog.) I have also been testing on FreeBSD (on a Pi 4B) and have provoked corruption with both 15 and 13 with the installed ZFS version. With 13 the scripts exited when zpool status reported corruption. Subsequent activity (scrubs only) cleared the error. On Linux, zfs send (invoked by syncoid) always reported errors in conjunction with corruption but I could not find evidence of that on FreeBSD. I mention this in case differences between OpenZFS on BSD vs. Linux and resulting differences in symptoms might offer a further clue.

@gamanakis
Contributor

gamanakis commented Feb 16, 2025

Edit: The more I look into this, the more I believe #12001 has to do with multi-slot dnodes (which may change upon creating new snaps and may affect older snaps).

For the present one, most interfaces in zfs_ioctl.c read from non-owned objsets, which may be a problem if we have multi-slot dnodes.

gamanakis mentioned this issue Feb 16, 2025
@IvanVolosyuk
Contributor

People who still experience the issue, what 'dnodesize' property do you use? The default is 'legacy' and it might be the missing piece of the puzzle. I saw the reproducer uses dnodesize=auto.
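For anyone checking, one way to survey it across a whole pool (pool name is a placeholder):

zfs get -r -t filesystem,volume -o name,value,source dnodesize tank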

@Sieboldianus

Sieboldianus commented Feb 16, 2025

Below is mine.

zfs get all tank_ssd/lxc | grep dnodesize
> tank_ssd/lxc  dnodesize             legacy                    default
zfs get all tank_hdd/data | grep dnodesize
> tank_hdd/data  dnodesize             legacy                    default

I do get a Critical Error while doing syncoid pulls with encrypted datasets (with syncoid options Rw) once in a while, but my zfs pool never saw actual data corruption.

(obvious but:) dnodesize can't be changed after pool creation.

@HankB

HankB commented Feb 16, 2025

People who still experience the issue, what 'dnodesize' property do you use? The default is 'legacy' and it might be the missing piece of the puzzle. I saw the reproducer uses dnodesize=auto.

The instructions for configuring root on ZFS for Debian (https://openzfs.github.io/openzfs-docs/Getting%20Started/Debian/Debian%20Bookworm%20Root%20on%20ZFS.html#step-2-disk-formatting) include the option -O dnodesize=auto. Perhaps @rlaager can comment on why that was chosen. I copied those pool-creation options for creating test pools and use them when I install Debian with root on ZFS, and generally any time I create pools.

Edit: At present I'm testing zfs-0.8.6-1 on kernel 4.19.0-27-amd64 on Debian Buster. It has run for several days without reporting corruption. Newer versions of the S/W on this H/W reported corruption within hours. This configuration also uses dnodesize=auto.
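For context, a pool created along the lines of that guide carries the option like this (a minimal sketch, not the guide's full command; the device path is a placeholder):

zpool create -o ashift=12 \
    -O dnodesize=auto \
    -O encryption=on -O keyformat=passphrase -O keylocation=prompt \
    rpool /dev/disk/by-id/nvme-EXAMPLE-part3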

@gamanakis
Contributor

gamanakis commented Feb 16, 2025

I thought that doing:

diff --git a/module/zfs/zfs_ioctl.c b/module/zfs/zfs_ioctl.c
index b1b0ae544..1c652c8d5 100644
--- a/module/zfs/zfs_ioctl.c
+++ b/module/zfs/zfs_ioctl.c
@@ -2228,9 +2228,11 @@ zfs_ioc_objset_zplprops(zfs_cmd_t *zc)
 {
        objset_t *os;
        int err;
+       static const char *setsl_tag = "mine_tag";

-       /* XXX reading without owning */
-       if ((err = dmu_objset_hold(zc->zc_name, FTAG, &os)))
+       err = dmu_objset_own(zc->zc_name, zc->zc_objset_type,
+           B_TRUE, B_TRUE, setsl_tag, &os);
+       if (err != 0)
                return (err);

        dmu_objset_fast_stat(os, &zc->zc_objset_stats);
@@ -2255,7 +2257,7 @@ zfs_ioc_objset_zplprops(zfs_cmd_t *zc)
        } else {
                err = SET_ERROR(ENOENT);
        }
-       dmu_objset_rele(os, FTAG);
+       dmu_objset_disown(os, B_TRUE, setsl_tag);
        return (err);
 }

would help. However, it doesn't. A slightly different trace, involving setup_featureflags():

[ 8239.651281] Showing stack for process 2801801
[ 8239.651283] CPU: 1 PID: 2801801 Comm: zfs Tainted: P           OE     5.15.0-131-generic #141-Ubuntu
[ 8239.651286] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/13/2024
[ 8239.651287] Call Trace:
[ 8239.651289]  <TASK>
[ 8239.651291]  show_stack+0x52/0x5c
[ 8239.651296]  dump_stack_lvl+0x4a/0x63
[ 8239.651299]  dump_stack+0x10/0x16
[ 8239.651303]  spl_dumpstack+0x29/0x2f [spl]
[ 8239.651316]  arc_untransform+0x96/0xb0 [zfs]
[ 8239.651479]  dbuf_read_verify_dnode_crypt+0x196/0x350 [zfs]
[ 8239.651642]  dbuf_read+0x58/0x760 [zfs]
[ 8239.651773]  ? RW_READ_HELD+0x1a/0x30 [zfs]
[ 8239.651910]  dmu_buf_hold_by_dnode+0x66/0xa0 [zfs]
[ 8239.652045]  zap_lockdir+0x87/0xf0 [zfs]
[ 8239.652213]  zap_lookup_norm+0x5c/0xd0 [zfs]
[ 8239.652378]  zap_lookup+0x16/0x20 [zfs]
[ 8239.652544]  zfs_get_zplprop+0x8d/0x1b0 [zfs]
[ 8239.652713]  setup_featureflags+0x267/0x2e0 [zfs]
[ 8239.652871]  dmu_send_impl+0xe7/0xcb0 [zfs]
[ 8239.653013]  ? queued_spin_unlock+0x9/0x20 [zfs]
[ 8239.653160]  dmu_send_obj+0x265/0x360 [zfs]
[ 8239.653307]  zfs_ioc_send+0x10c/0x280 [zfs]
[ 8239.653473]  ? dump_bytes_cb+0x30/0x30 [zfs]
[ 8239.653644]  zfsdev_ioctl_common+0x697/0x780 [zfs]
[ 8239.653811]  ? __check_object_size.part.0+0x4a/0x150
[ 8239.653814]  ? _copy_from_user+0x31/0x70
[ 8239.653818]  zfsdev_ioctl+0x57/0xf0 [zfs]
[ 8239.653984]  __x64_sys_ioctl+0x92/0xd0
[ 8239.653988]  x64_sys_call+0x1e5f/0x1fa0
[ 8239.653992]  do_syscall_64+0x56/0xb0
[ 8239.654031]  ? clear_bhb_loop+0x45/0xa0
[ 8239.654034]  ? clear_bhb_loop+0x45/0xa0
[ 8239.654035]  ? clear_bhb_loop+0x45/0xa0
[ 8239.654037]  ? clear_bhb_loop+0x45/0xa0
[ 8239.654039]  ? clear_bhb_loop+0x45/0xa0
[ 8239.654040]  entry_SYSCALL_64_after_hwframe+0x6c/0xd6
[ 8239.654044] RIP: 0033:0x7ff196af994f
[ 8239.654047] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
[ 8239.654050] RSP: 002b:00007ffcd069f160 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 8239.654053] RAX: ffffffffffffffda RBX: 000055aac69f5f10 RCX: 00007ff196af994f
[ 8239.654054] RDX: 00007ffcd069f1f0 RSI: 0000000000005a1c RDI: 0000000000000003
[ 8239.654056] RBP: 00007ffcd06a2bf0 R08: 0000000007a6c895 R09: 0000000000000000
[ 8239.654057] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[ 8239.654059] R13: 0000000000027fbb R14: 00007ffcd069f1f0 R15: 0000000000000000
[ 8239.654062]  </TASK>

But again, zfs_get_zplprop() is involved.

@tjikkun

tjikkun commented Feb 17, 2025

People who still experience the issue, what 'dnodesize' property do you use? The default is 'legacy' and it might be the missing piece of the puzzle. I saw the reproducer uses dnodesize=auto.

For us it's legacy

@gamanakis
Contributor

Just another thought: accessing the filesystem properties probably shouldn't require decrypting the dbuf, since zfsprops are not encrypted per se.

@gamanakis
Contributor

As far as I can see, zfs_get_zplprops() does not need access to encrypted buffers.
Perhaps this might provide an easy solution.

@IvanVolosyuk
Contributor

Sorry in advance if I say something stupid, but I was also carried away by the gold rush of bug hunting in this code without much prior knowledge. Here are two findings, and the second one looks pretty suspicious to me.

  1. I can see that in dbuf_read_verify_dnode_crypt() the dnbuf is basically dndb->db_buf, and arc_is_encrypted(dnbuf) is checked outside of the dndb->db_mtx mutex, while it seems that in the majority of the other code it is checked behind the lock. Can this early return cause problems?

  2. I can see that dbuf_read_done() can call dbuf_set_data(), which changes db_buf and will broadcast db->db_changed. On the other hand, dbuf_read_verify_dnode_crypt() uses an unusual pattern where it caches dndb->db_buf into a variable and doesn't reload it after cv_wait(&dndb->db_changed, &dndb->db_mtx) in the loop in this function, and later can pass a potentially stale pointer into arc_untransform(dnbuf, os->os_spa, &zb, B_TRUE) (see the sketch below).
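A generic, self-contained illustration of that caching-across-cv_wait hazard (plain pthread code, not the actual ZFS source; build with cc -pthread):

#include <pthread.h>
#include <stdio.h>

/*
 * A waiter caches a buffer pointer, sleeps on a condition variable, and the
 * signaller swaps the buffer in the meantime (as dbuf_set_data() can via
 * dbuf_read_done()).  Re-reading the pointer after the wait is the remedy.
 */
struct db {
	pthread_mutex_t mtx;
	pthread_cond_t  changed;
	int             done;   /* 0 = read in flight, 1 = read finished */
	const char     *buf;    /* may be replaced while done == 0 */
};

static void *read_done(void *arg) {
	struct db *db = arg;
	pthread_mutex_lock(&db->mtx);
	db->buf = "fresh";          /* dbuf_set_data() analogue */
	db->done = 1;
	pthread_cond_broadcast(&db->changed);
	pthread_mutex_unlock(&db->mtx);
	return NULL;
}

int main(void) {
	struct db db = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, "stale" };
	pthread_t t;

	pthread_mutex_lock(&db.mtx);
	const char *cached = db.buf;        /* cached before the wait: the hazard */
	pthread_create(&t, NULL, read_done, &db);
	while (!db.done)
		pthread_cond_wait(&db.changed, &db.mtx);
	const char *reloaded = db.buf;      /* re-read after the wait: the fix */
	pthread_mutex_unlock(&db.mtx);
	pthread_join(t, NULL);

	/* prints: cached=stale reloaded=fresh */
	printf("cached=%s reloaded=%s\n", cached, reloaded);
	return 0;
}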

@IvanVolosyuk
Contributor

@HankB as you are able to easily reproduce the issue, can you try a patch I drafted in #17069? I wonder if it will make any difference.

@HankB

HankB commented Feb 19, 2025

@IvanVolosyuk I'll give it a shot but it will be a couple days before I can get to it.

@HankB

HankB commented Feb 24, 2025

@HankB as you are able to easily reproduce the issue, can you try a patch I drafted in #17069? I wonder if it will make any difference.

@IvanVolosyuk
I've just produced corruption with ZFS 2.0.0 in a couple of hours and had planned to use that as a baseline for testing your patch. Unfortunately the code for dbuf_read_verify_dnode_crypt() has changed significantly compared to the diff in your PR and I cannot see where the changes would go. Does it make more sense to test this with the latest ZFS release (2.3.0) vs. 2.0.0, which seems to have introduced the bug?

@IvanVolosyuk
Contributor

Yeah, you should probably test it on ZFS 2.3.0 and on 2.3.0 with the patch. @amotin did quite a few changes to fix ZFS encryption problems. I wonder if you'll be able to reproduce the failure with your reproducer on a recent ZFS release.

@HankB

HankB commented Feb 24, 2025

2.3.0 builds on Debian but something's not working - it doesn't find the zfs.ko module. I've asked on the mailing list at https://zfsonlinux.topicbox.com/groups/zfs-discuss/Tf3aa320b5d3f11ef/building-using-2-3-0-on-debian-bookworm
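For anyone repeating this, a typical from-source test cycle looks roughly like the following (tag, PR ref, and build options are assumptions; on Debian the packaged DKMS module may also need removing so the freshly built one is the one that gets loaded):

git clone https://github.com/openzfs/zfs.git && cd zfs
git checkout zfs-2.3.0
git fetch origin pull/17069/head:pr-17069 && git merge pr-17069
sh autogen.sh && ./configure && make -s -j"$(nproc)"
sudo make install && sudo depmod -a
sudo modprobe zfs && modinfo zfs | head -n 3   # confirm the new module is the one loaded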
