ZFS corruption related to snapshots post-2.0.x upgrade #12014
Two other interesting tidbits... When I do the reboot after this issue occurs, the mounting of the individual zfs datasets is S L O W. Several seconds each, and that normally just flies by. After scrubbing, it is back to the normal mounting speed. The datasets that have snapshot issues vary with each occurrence. Sometimes it's just one, sometimes many. But var is almost always included. (Though its parent, which has almost no activity ever, also shows up from time to time, so that's odd.) |
Same symptoms here, more or less. See also issue #11688. |
I also have the symptom with the corrupted snapshots, without kernel panics so far. It has only affected my Debian system with Linux 5.10 and zfs 2.0.3 (I've turned the server off for today; I can check the exact versions tomorrow). Also, while the system has the 2.0.3 zfs utils + module, the pool is still left on the 0.8.6 format. I wasn't able to execute …. On the corrupted system, after I got the mail from ZED, I manually ran a scrub at first, after which the …. I've rebooted the server into an Ubuntu 20.10 live environment with zfs 0.8.4-1ubuntu11. The errors didn't seem to affect the data on the zvols (all 4 affected snapshots are of zvols); the zvols are used as disks for VMs with ext4 on them. I have two other Ubuntu 21.04 based systems with zfs-2.0.2-1ubuntu5 which are not affected so far. However, they already have their pools upgraded to 2.0. All are snapshotted with sanoid and have their datasets encrypted.
EDIT:
EDIT 2:
|
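A quick way to check whether the data inside such zvols is itself intact (as opposed to the snapshot metadata) is to read the device end to end and run a read-only filesystem check. A minimal sketch, with hypothetical pool/zvol names, assuming the VM is shut down and the zvol holds a bare ext4 filesystem (adjust the device path if it contains a partition table):

```sh
# Hypothetical integrity check of a zvol used as a VM disk (VM stopped).
ZVOL=/dev/zvol/tank/vm-disk

# Read the whole device; any ZFS-level read error will surface here.
dd if="$ZVOL" of=/dev/null bs=1M status=progress

# Read-only ext4 check of the guest filesystem inside the zvol.
fsck.ext4 -n -f "$ZVOL"
```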
I'm seeing this too, on Ubuntu 21.04, also using zfs encryption. I have znapzend running, and it makes a lot of snapshots. Sometimes some of them are bad and can't be used (for example, attempting to send them to a replica destination fails). I now use the …. In the most recent case (this morning) I had something like 4300 errors (many more than I'd seen previously). There are no block-level errors (read/write/cksum). They're cleared after destroying the affected snapshots and scrubbing (and maybe a reboot, depending on … day?). Warning! Speculation below:
|
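The destroy-then-scrub cleanup described above can be scripted against the error list that zpool status -v prints. A rough sketch follows; the pool name and the output parsing are assumptions, and the error-report format varies between releases, so review the parsed list by hand before destroying anything:

```sh
# Rough cleanup sketch; POOL is an assumption.  zpool status -v lists
# permanent errors as dataset@snapshot:<object> entries, but the exact
# format varies, so inspect the list before acting on it.
POOL=tank

zpool status -v "$POOL" \
  | awk '/Permanent errors/ {found=1; next} found && /@/ {print $1}' \
  | cut -d: -f1 | sort -u > /tmp/bad-snapshots.txt

cat /tmp/bad-snapshots.txt        # review before destroying anything!

while read -r snap; do
    zfs destroy "$snap"
done < /tmp/bad-snapshots.txt

# Commenters report needing two scrubs (sometimes with a reboot first)
# before the error counter actually clears.
zpool scrub "$POOL"
```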
@jgoerzen Can you …
In my case, #11688 (which you already reference), I've discovered that rebooting "heals" the snapshot -- at least using the patchset I mentioned there |
I'll be glad to. Unfortunately, I rebooted the machine yesterday, so I expect it will be about a week before the problem recurs. It is interesting to see the discussion today in #11688. The unique factor about the machine that doesn't work for me is that I have encryption enabled. It wouldn't surprise me to see the same thing here, but I will of course wait for it to recur and let you know. |
Hello @aerusso, The problem recurred over the weekend and I noticed it this morning. Unfortunately, the incident that caused it had already expired out of the …
It should be noted that my hourly snap/send stuff runs at 17 minutes past the hour, so that may explain this timestamp correlation. zpool status reported:
Unfortunately I forgot to attempt to do a …
So I think that answers the question. After a reboot but before a scrub, the … |
I have similar symptoms, on an encrypted single-SSD Ubuntu 21.04 boot pool, using stock zfs from Ubuntu's repos. Deleting the affected snapshots and scrubbing previously cleared the errors, but on recurrence, repeated scrubbing (without deleting them) caused a deadlock. My system has ECC memory, so it's probably not RAM related.
|
@cbreak-black Was there a system restart between the occurrence of the corrupted snapshot and the problems? Restarting has "fixed" this symptom for me (though you will need to scrub twice for the message to disappear, I believe). I have a suspicion that this may be a version of #10737, which has an MR under way there. The behavior I am experiencing could be explained by that bug (syncoid starts many …). I'm holding off on trying to bisect this issue (at least) until testing that MR. (And all the above is conjecture!) |
@aerusso No, without a restart I got into the scrub-hang, and had to restart hard. Afterwards, the scrub finished, and several of the errors vanished. The rest of the errors vanished after deleting the snapshots and scrubbing again. |
Can I join the club too? #10019 |
@InsanePrawn I can't seem to find commit 4d5b4a33d in any repository I know of (and neither can GitHub, apparently). However, in your report you say this was a "recent git master", and the commit I'm currently betting on being guilty is da92d5c, which was committed in November of the previous year, so I can't use your data point to rule out my theory! Also, it sounds like you didn't have any good way to reproduce the error --- however, you were using a test pool. Compared to my reproduction strategy (which is just: turn my computer on, browse the web, check mail, etc.), it might be easier to narrow in on a test case (or might have been easier a year and a half ago, when this was all fresh). Anyway, if you have any scripts or ideas of what you were doing that caused this besides "snapshots being created and deleted every couple minutes", that might be useful too. (I already tried lots of snapshot creations and deletions during fio on several datasets in a VM.) |
Yeah, idk why I didn't go look for the commit in my issue - luckily for us, that server (and pool; it does say yolo, but it's my private server's root pool. It's just that I won't cry much if it breaks; originally due to then-unreleased crypto) and the git repo on it still exist. Looks like 4d5b4a33d was two systemd-generator commits by me after 610eec4 |
FWIW the dataset the issue appeared on was an empty filesystem dataset (maybe a single small file inside) that had snapshots taken (without actual fs activity) in quick intervals (somewhere between 30s and 5m) in parallel with a few (5-15) other similarly empty datasets. The pool is a raidz2 on 3.5" spinning SATA disks. Edit: Turns out the dataset also still exists; the defective snapshot, however, does not anymore. I doubt that's helpful? |
@InsanePrawn Does running the zrepl workload reproduce the bug on 2.0.5 (or another recent release)? I don't think the snapshot is terribly important --- unless you're able to really dig into it with zdb (which I have not developed sufficient expertise to do). Rather, I think it's the workload, hardware setup, and (possibly, but I don't understand the mechanism at all) the dataset itself. Encryption also is a common theme, but that might just affect the presentation (i.e., there's no MAC to fail in the unencrypted, unauthenticated case). Getting at … |
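For anyone who does want to poke at an affected snapshot with zdb, here is a hedged starting point; the pool and snapshot names are placeholders, and for encrypted datasets zdb's view is limited to what it can read without decrypting:

```sh
# Hypothetical zdb starting points; tank/var@bad stands in for a snapshot
# reported as errored by zpool status -v.
zdb -d tank/var@bad         # list the objects in the snapshot's objset
zdb -dddd tank/var@bad      # same, with per-dnode detail
zdb -b tank                 # traverse the pool and report block statistics
```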
I've since added redundancy to my pool (it's now a mirror with two devices), and disabled autotrim. The snapshot corruption still happens. Still don't know what is causing it. And I also don't know if the corruption happens when creating the snapshot, and only later gets discovered (when I try to zfs send the snapshots), or if snapshots get corrupted some time in between creation and sending. |
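One way to narrow down whether a snapshot is already bad at creation time or only goes bad later, as wondered above, is to attempt a send to /dev/null right after taking it and again later. A minimal sketch with a hypothetical dataset name:

```sh
# Hypothetical probe: is the snapshot sendable immediately after creation?
DS=tank/data
SNAP="$DS@probe-$(date +%s)"

zfs snapshot "$SNAP"

# A non-raw send forces decryption on the sender, which is where the
# authentication/ECKSUM errors reported in this issue show up.
if zfs send "$SNAP" > /dev/null; then
    echo "$SNAP sendable right after creation"
else
    echo "$SNAP already broken at creation time"
fi

# Re-running the same send later (e.g. hourly from cron) would show
# whether an initially-good snapshot degrades afterwards.
```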
@cbreak-black Can you enable the all-debug.sh ZEDlet, and put the temporary directory somewhere permanent (i.e., not the default of …)? This will get the output of …. I'll repeat this here: if anyone gets me a reliable reproducer on a new pool, I have no doubt we'll be able to solve this in short order. |
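For reference, enabling that ZEDlet and moving its output somewhere persistent looks roughly like the following. The zedlet install path varies by distro (e.g. /usr/lib/zfs-linux/zed.d on Debian/Ubuntu), so treat the paths here as assumptions:

```sh
# Enabled zedlets are symlinks in /etc/zfs/zed.d; all-debug.sh ships with
# ZFS but is not enabled by default.
ln -s /usr/libexec/zfs/zed.d/all-debug.sh /etc/zfs/zed.d/all-debug.sh

# all-debug.sh appends to ZED_DEBUG_LOG, which defaults to a file under
# /tmp; point it at persistent storage so the log survives a reboot.
echo 'ZED_DEBUG_LOG="/var/log/zed.debug.log"' >> /etc/zfs/zed.d/zed.rc

systemctl restart zfs-zed.service   # make ZED pick up the new zedlet
```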
Just mentioning here that we saw this on TrueNAS 12.0-U5 with OpenZFS 2.0.5 as well -- see #11688 (comment) for our story. |
Since I don't see anyone mentioning it here yet, #11679 contains a number of stories about the ARC getting confused when encryption is involved and, in a very similar looking illumos bug linked from there, eating data at least once. |
@gamanakis Nope, I'm not using raw (-w). |
it's present in v2.1.1 as well:
|
@aerusso you wrote that da92d5c may be the cause of this issue. My workstation at work panics after a couple of days and I need to reset it. Could you provide a branch of 2.1.1 with this commit reverted (as the revert causes merge conflicts I can't fix myself) so I could test whether the machine no longer crashes? |
@phreaker0 Unfortunately, the bug that da92d5c introduced (#10737) was fixed by #12299, which I believe is present in all maintained branches now. It does not fix #11688 (which I suspect is the same as this bug). I'm currently running 0.8.6 on Linux 5.4.y, and am hoping to wait out this bug (I don't have a lot of time right now, or for the foreseeable future). But if you have a reliable reproducer (or a whole lot of time) you could bisect while running 5.4 (or some other pre-5.10 kernel). I can help anyone who wants to do that. If we can find the guilty commit, I have no doubt this can be resolved. |
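For anyone taking up that offer, the mechanical side of a bisect looks roughly like this. The release tags chosen as good/bad endpoints and the reproducer itself are assumptions you would substitute with your own:

```sh
# Hypothetical bisect skeleton over the OpenZFS tree.
git clone https://github.com/openzfs/zfs.git && cd zfs
git bisect start
git bisect bad  zfs-2.0.0     # earliest release where corruption was seen
git bisect good zfs-0.8.6     # release believed unaffected

# At each bisect step: build, install the modules, reboot onto them, run
# whatever reproducer you trust, then report the result.
sh autogen.sh && ./configure && make -j"$(nproc)" && sudo make install
# ... reboot, run reproducer, then one of:
git bisect good    # reproducer did NOT corrupt anything
git bisect bad     # corruption reproduced
```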
@Germano0 Yes, probably. I will, of course, defer to the ZFS and kernel devs for any direction they would like this to take. My desire was to see if there was a way to reproducibly provoke corruption as the first step in tracking down the regression. My repo (https://github.com/HankB/provoke_ZFS_corruption) is a bit rough but provides that capability. I'm working to improve that, particularly the instructions. The next step is to determine which commit introduced the regression via git bisect. I have not yet started that. Anyone following my process will likely find errors, omissions, and things that are just not clear. Those should probably be referred to me at my repo and not here. It would certainly help to have others try to reproduce my results. I'll repeat my offer to either start another issue relevant to this one in OpenZFS or continue discussion at my repo if anyone thinks this is unnecessarily cluttering this issue. |
It was likely enabled, yes. I don't recall for sure, but we move enough data that restarting a full transfer would be very bad. I agree that I don't see how something on the receiving end could trigger this bug, though. |
When I was seeing it with … |
So, I am finally able to hit this on 2.3.0 with the scripts @HankB provides, but it takes a couple of hours on x86_64: … The failing path which generates the authentication errors is: … Now the last one fails with ECKSUM. I instrumented the code with:
Getting insight into whether the mac is indeed zeroed out (and that's why the decryption fails), or what no_crypt is set to, might help further. I have yet to hit this again, though.
Ok, so using the instrumentation above we have:
EDIT: This arc_untransform() (in dmu_send_impl()) is not the one involved.
I added a dumpstack into arc_untransform(); let's see:
|
So, when arc_untransform() fails, we have this stack:
|
@gamanakis This trace looks interesting. Since the mentioned commit I have always thought that the problem might be caused by simultaneous access to encrypted and plain-text data, which the dbuf layer was never designed to handle. But I thought the problem was about access to snapshot data, for which I could not see a race, since before a snapshot is fully received and the TXG is committed it is not mounted and so cannot be accessed. But in the stack quoted I see … PS: Looking again, it may mean not just any properties, but one of ZFS_PROP_VERSION, ZFS_PROP_NORMALIZE, ZFS_PROP_UTF8ONLY, or ZFS_PROP_CASE. Maybe some others may trigger it also, but through a different code path. |
Just to reiterate for clarity: this problem occurs on the sender, not the receiver. |
Hmm. Then I have no idea how we can corrupt something without writing, unless it happens during snapshot creation. And IIRC send does not even use the dbuf layer, using the ARC and ZIO layers directly, so that guess might be wrong too. But anyway, a more specific reproduction would be good, and the properties access still sounds interesting, or at least new. |
There might be send holds getting written, though I suspect it's not essential for repro because not all the tools will use them. |
I stopped seeing the problem after switching to raw sends, so the data is not getting decrypted at the sender. (There were other changes at the same time, and of course a time window before it was clear it hadn't recurred in a while, but I always suspected this was the key one.) |
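For readers following along, the difference is just the -w/--raw flag: a raw send ships the encrypted records as-is, so nothing is decrypted on the sending side. A sketch with placeholder dataset and host names:

```sh
# Non-raw send: blocks are decrypted on the sender (and re-encrypted on
# an encrypted destination); sender-side decryption is where this bug
# surfaces.
zfs send -R tank/data@snap | ssh backup zfs receive -u pool/backup/data

# Raw send: encrypted records go over the wire untouched, and the
# destination can store them without ever loading the key.
zfs send -w -R tank/data@snap | ssh backup zfs receive -u pool/backup/data
```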
It might be during snapshot creation. We have a customer with encrypted datasets. They have a script which is effectively … I don't have much to go on yet; we're still in information gathering, so not much to offer. But it has had me thinking about an older issue where a receive could blow up in part because dnode encryption params for an object range being received could be overwritten by the params for the next range before the first range could be synced. (That was reported in #11893, mitigated in #14358, but possibly never properly understood; we heard about it through a different customer that did not yet have the "fix", and it manifested differently; it wasn't until our own analysis that we understood what was happening, and a good-enough workaround at the time was to just upgrade them.) Now, I don't think it's the same problem, but thinking about properties access is new, and it makes me wonder if it has a similar flavour. Could writing and reading from the "same" dbuf in the course of creating the snapshot cause one to get the other's encryption parameters, such that it now can't be read? If so, I wonder if a strategically-placed … To be very clear: this is the edge of an idea, not even close to a theory. Mostly, I wanted to hop in and say that actually the write may be involved, and maybe I've seen something like this. Hopefully someone can connect the dots. |
Does taking a new snapshot cause any property updates to the previous one (which might be undergoing send)?
I have two thoughts on this (with no knowledge of what's happening under the covers): …
My first experience with this is on a laptop that is using a single-drive VDEV (…). At one point I thought it would be useful to capture the entire pool, so I set up a recursive …. After I stopped the whole-pool backups, the "permanent errors" eventually go away. Most recently I turned on whole-pool backups prior to upgrading from Stable to Testing, and by the next morning there was one permanent error. It has been weeks and it is still there, but I am confident it will eventually fall off the end. The receiving dataset is not encrypted, so the …. Thank you all for your interest in this issue. Edit: (More shaggy dog.) I have also been testing on FreeBSD (on a Pi 4B) and have provoked corruption with both 15 and 13 with the installed ZFS version. With 13 the scripts exited when … |
Edit: The more I look into this, the more I believe #12001 has to do with multi-slot dnodes (which may change upon creating new snaps and may affect older snaps). For the present one, most interfaces in zfs_ioctl.c read from non-owned objsets, which may be a problem if we have multi-slot dnodes. |
People who still experience the issue, what 'dnodesize' property do you use? The default is 'legacy' and it might be the missing piece of the puzzle. I saw the reproducer uses dnodesize=auto. |
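To answer that quickly across a pool, the property can be listed per dataset; a short sketch, with the pool name as an assumption:

```sh
# Show dnodesize for every filesystem/volume in the pool, plus where the
# value was set (local, inherited, or default).
zfs get -r -t filesystem,volume -o name,value,source dnodesize tank

# Only the datasets that differ from the 'legacy' default:
zfs get -r -H -o name,value dnodesize tank | awk '$2 != "legacy"'
```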
Below is mine.
I do get a Critical Error while doing syncoid pulls with encrypted datasets (with sendoptions Rw) once in a while, but my zfs pool never saw actual data corruption. (Obvious, but: …) |
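For context, a syncoid pull of that shape would look roughly like the following; host, pool, and dataset names are placeholders, and sendoptions=Rw passes -R -w to zfs send, i.e. a recursive raw (undecrypted) replication stream:

```sh
# Hypothetical syncoid pull of an encrypted dataset tree, replicated raw
# so the source side never decrypts the data.
syncoid --sendoptions=Rw \
    root@source-host:tank/encrypted backuppool/encrypted
```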
The instructions for configuring root on ZFS for Debian (https://openzfs.github.io/openzfs-docs/Getting%20Started/Debian/Debian%20Bookworm%20Root%20on%20ZFS.html#step-2-disk-formatting) include the option …. Edit: At present I'm testing … |
I thought that doing:
would help. However, it doesn't. A slightly different trace, involving …
But again, |
For us it's |
Just another thought: accessing the filesystem properties probably shouldn't require decrypting the dbuf, since zfsprops are not encrypted per se. |
As far as I can see, zfs_get_zplprops() does not need access to encrypted buffers. |
Sorry in advance if I say something stupid, but I was also carried away by the gold rush of bug hunting in this code without much prior knowledge. Here are two findings, and the second one looks pretty suspicious to me.
|
@IvanVolosyuk I'll give it a shot but it will be a couple days before I can get to it. |
@IvanVolosyuk |
Yeah, you should probably test it on ZFS 2.3.0 and on 2.3.0 with the patch. @amotin made quite a few changes to fix ZFS encryption problems. I wonder if you'll be able to reproduce the failure with your reproducer on a recent ZFS release. |
2.3.0 builds on Debian but something's not working - it doesn't find the zfs.ko module. I've asked on the mailing list at https://zfsonlinux.topicbox.com/groups/zfs-discuss/Tf3aa320b5d3f11ef/building-using-2-3-0-on-debian-bookworm |
System information
Describe the problem you're observing
Since upgrading to 2.0.x and enabling crypto, every week or so, I start to have issues with my zfs send/receive-based backups. Upon investigating, I will see output like this:
Of note, the <0xeb51> is sometimes a snapshot name; if I zfs destroy the snapshot, it is replaced by this tag.
Bug #11688 implies that zfs destroy on the snapshot and then a scrub will fix it. For me, it did not. If I run a scrub without rebooting after seeing this kind of zpool status output, I get the following in very short order, and the scrub (and eventually much of the system) hangs:
However, I want to stress that this backtrace is not the original cause of the problem, and it only appears if I do a scrub without first rebooting.
After that panic, the scrub stalled -- and a second error appeared:
I have found that the solution to this issue is to reboot into single-user mode and run a scrub. Sometimes it takes several scrubs, maybe even with some reboots in between, but eventually it will clear up the issue. If I reboot before scrubbing, I do not get the panic or the hung scrub.
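The "sometimes it takes several scrubs" part lends itself to a small loop once the machine is back up. A sketch, under the assumptions that the pool is named tank and that zpool wait is available (OpenZFS 2.0+; otherwise poll zpool status for scrub completion):

```sh
#!/bin/sh
# Hypothetical scrub-until-clean loop; POOL is an assumption.
POOL=tank

while true; do
    zpool scrub "$POOL"
    zpool wait -t scrub "$POOL"      # block until this scrub completes

    if zpool status "$POOL" | grep -q "No known data errors"; then
        echo "$POOL reports no known data errors"
        break
    fi
    echo "errors still reported after scrub; running another pass"
done
```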
I run this same version of ZoL on two other machines, one of which runs this same kernel version. What is unique about this machine?
I made a significant effort to rule out hardware issues, including running several memory tests and the built-in Dell diagnostics. I believe I have ruled that out.
Describe how to reproduce the problem
I can't at will. I have to wait for a spell.
Include any warning/errors/backtraces from the system logs
See above
Potentially related bugs
arc_buf_destroy is in "silent corruption for thousands files gives input/output error but cannot be detected with scrub - at least for openzfs 2.0.0" (#11443). The behavior described there has some parallels to what I observe. I am uncertain from the discussion what that means for this.