Silent corruption for thousands of files gives input/output error but cannot be detected with scrub - at least for OpenZFS 2.0.0 #11443
Comments
We're currently investigating this issue here: https://jira.ixsystems.com/browse/NAS-108627
FWIW, I'm seeing this on one of my pools as well. So far I've only ever seen the errors in snapshots, never on actual file data. There are no signs of any actual HW problems, and this all started after upgrading to 2.0. The pattern I'm seeing is that after a system boot things are fine for a number of days, then the errors start showing up. This is a system with ECC RAM and 8 disks in a 4x2 mirrored configuration, and it was solid for ~4 years before the upgrade to 2.0. My other pool is also running 2.0 with no signs of this issue. It also has ECC RAM; the two differences that come to mind are the number of disks (11-disk RAIDZ-3 config) and no 2.0 features enabled (i.e. I didn't run zpool upgrade -a on that system). Scrub has never detected anything on the pool showing these errors, but it does clear them once I delete the snapshots where they happen. I scrub weekly. These are both Linux systems.
@shuther Could you please confirm that the bug here is the same as in #10697? I.e., are you also experiencing this same issue on the 0.8.x branch? @jstenback Could you provide some more details about the corruption you're experiencing (i.e., what is your workload, how frequently are you taking snapshots, what are the exact errors you are getting, relevant logs, etc.)? And can you also confirm that the errors you are getting only showed up after upgrading to 2.0? If this is indeed the case, could you please open another issue, since it sounds like there may be two underlying problems, and it will make diagnosing the problem easier if we can figure out what is a regression and what is a preexisting bug.
For me, it is different from #10697. The files were saved using FreeNAS/Samba/Windows back in 2012 (as per the last modified time); iSCSI was not involved. The files were changed from a VM (Windows) running on the same server (so I guess no network was involved). Given the number of files (>17,000, all from one dataset), I don't think the problem happened when I stored the files; I expect zfs to have caused the corruption later on (resilver, scrub, ...). I did not write to the files using zfs 2.0.
Sure. The system I'm seeing these errors on is my main home system that runs a number of misc things, but nothing particularly heavy. I snapshot hourly, and for one dataset I snapshot every 10 min. The second system I mentioned does little other than replicate the main one as a backup. Thus far every error I've seen has come from the backup system issuing zfs send commands on the main one. The nature of what I see in the logs is that zfs send fails, and in the few instances where I've been able to capture zpool events output when this has happened I see:
The output in zpool status -v lists errors in snapshots. I did indeed only start seeing this behavior with 2.0. Happy to open a new issue if folks believe there's enough information here to move forward on what I have alone.
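For reference, snapshot-level errors typically show up in zpool status -v output along these lines (illustrative only; the pool, dataset, snapshot, and file names are placeholders, not my actual output):

```
errors: Permanent errors have been detected in the following files:

        tank/data@autosnap-2021-01-05-1200:/some/dir/file.dat
        <0x2c5>:<0x1a87>
```

The second, hex-tuple form is what remains once the affected snapshot has been destroyed and the error can no longer be resolved to a path.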
@jstenback Thanks for the follow up. Are you also getting any errors reported from the command below?
(That's a mostly untested command.) You might also do that in a healthy snapshot and an unhealthy snapshot, and see if they give errors. I would think that command is almost equivalent to a (partial) scrub, but apparently that is part of the reported bug.
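A rough stand-in for that kind of full-read check, not the exact command; the dataset mountpoint and snapshot name are hypothetical:

```sh
# Read every file in the snapshot and throw the data away; any file that
# cannot be read cleanly surfaces as an I/O error from cat.
find /tank/data/.zfs/snapshot/somesnap -type f -exec cat {} + > /dev/null
```

Comparing the errors from a known-good snapshot against a suspect one should show whether plain reads fail even though scrub reports nothing.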
Thanks @shuther, the background information is certainly helpful for everyone. What was the order of events, i.e., how long after the upgrade did you first notice the errors?
@aerusso I just noticed the issue now, a bit by chance, so I can't confirm when it happened the first time. The last resilver/replace was done under 2.0. The pool upgrade was also done under 2.0; I added a slog/cache at that time as well. What is strange to me is that the problem is only under one dataset covering 20% of the used pool (I have about 50 different datasets); mainly pictures (in fact, very small files), and it is not an active dataset (read-only use). I don't use dedup; the only somewhat unusual thing is that I use NFSv4 ACLs. Maybe we can focus on why scrub is not detecting the problem; let me know if I can help in any way. NB: I just checked some files in my backup; the files were fine in 2017 and were no longer fine in December 2020. The last (inode-related) change reported by stat (October 19th, 2020) seems to make sense (still on zfs 0.8.x), as I upgraded from FreeNAS 11.3-U5 (installed Oct 1st) to 12.0-RELEASE on Oct 27th.
Can you elaborate on the FreeBSD/Linux versions you are using? Are these both TrueNAS (CORE and SCALE?) or have you used ZFS from upstream FreeBSD/Linux distros?
I have FreeNAS (release versions only; I usually upgrade within 2 weeks of a new release showing up) as the main platform, up to TrueNAS CORE 12 - no development versions.
If I understand correctly, the silent corruption is caused by the "async DMU" patch shipped with FreeNAS v12 but not yet merged into OpenZFS 2.0 (#10377). I don't know why the FreeNAS maintainers decided to ship code marked "work in progress" (according to the GitHub tags) in a RELEASE version.
@shuther can you boot the debug kernel in TrueNAS and see if you can get a panic out of it?
@emichael @willymilkytron iXsystems are working on upstreaming the async DMU and async CoW features. The code in our branch is regularly rebased to keep the upstream commit history clean, which is why the commit in our repo appears to be only a few days old. The pull requests to land these features upstream are from us, and that integration effort is still in progress, but the code itself has been around for years. We ported the features to OpenZFS 2.0 for TrueNAS 12. Given the report in this issue that the problem is also present with OpenZFS 2.0 on Arch Linux (SystemRescue-ZFS), it doesn't seem to me that the async DMU/async CoW code is to blame. We did initially suspect it, but now it's not clear where the problem lies.
The trouble we're having right now is reproducing the problem. If anyone manages to come up with a simple test case that will reproduce this issue from scratch, it will be a great help.
Were these files accessible until the upgrade? When were they last backed up, so you could have an idea when the corruption happened?
@freqlabs With the debug kernel, it crashes during boot (when it imports the pool), and it just reboots. I may need some guidance to capture the error messages (and avoid the automatic reboot). Somebody else (https://www.truenas.com/community/threads/freenas-now-truenas-is-no-longer-stable.89445/post-624526) is facing what seems to be a corrupted superblock; not sure if it is related. The corruption happened between October and last week.
@shuther Please download the debug archive from System -> Advanced -> Save Debug, and provide the contents of ixdiagnose/textdump/panic.txt and ixdiagnose/textdump/ddb.txt here.
It may be unclear where the problem lies, but yesterday someone called "Ryan Moeller" attached a new version of openzfs-debug.ko compiled without the "async dmu" patch in https://jira.ixsystems.com/browse/NAS-108627 and was asking the reporter to boot their production system with it. Who do we trust here? It still seems to me that it seems to you that the async DMU/async CoW code is to blame here.
@freqlabs see attached. I did not find any ddb.txt; could you confirm the path?
I'm seeing the same symptom as @jstenback: I was able to destroy the offending snapshot, but now … The pool was created with openzfs 2.0 about 2 weeks ago on 2 brand-new 2 TB drives. It is a single mirrored vdev with encryption and lz4 compression. No other devices are on that pool. I use pyznap to take snapshots every 15 minutes on the pool. The machine is my home desktop computer and is mostly lightly loaded aside from doing its zfs send/recv backups and backing up one other computer over the network. My versions are: … I have the following traceback in my logs, which may be related:
I have several more tracebacks from hung tasks. I can get those together too if they might be helpful.
@shuther Thanks, I was able to get the information I need from that. panic:
backtrace:
@willymilkytron like I said, it's not clear yet what the problem is. We're seeing that reverting the async DMU and async CoW patch seems to be helping in some cases, but we're also seeing that people like @rbrewer123 are experiencing similar issues without the async patches.
@rbrewer123 thanks for the additional information. Thus far I haven't been able to reproduce the issue, but I'll set up a system and workload similar to yours to try and reproduce it. Is there anything else special about your configuration, perhaps a non-default module option? For your backups, are you doing normal or raw zfs sends? At a minimum the stack trace does indicate an issue in the ARC reference counting. If you're in a position to do so, rebuilding ZFS with debugging enabled will add additional sanity checks throughout the code and may help us isolate the issue.
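For anyone wanting to try that, a debug build from source looks roughly like this (the tag and paths are illustrative):

```sh
git clone https://github.com/openzfs/zfs.git
cd zfs
git checkout zfs-2.0.1            # match the release you are running
sh autogen.sh
./configure --enable-debug        # enables the extra ASSERT/VERIFY sanity checks
make -s -j"$(nproc)"
sudo make install                 # then reload the zfs kernel modules / reboot
```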
@behlendorf I don't think I have any special module options. For backups, pyznap takes snapshots every 15 minutes on the affected pool and once per day sends them to another pool (mounted locally on the same machine). Both pools are encrypted and I don't use raw sends. As for workload, I've been fiddling with my backups in the last few days, so it's possible I was either running a deduplication program on my main pool (the affected pool) and the backup pool (a read-heavy workload), or backing up the networked machine (a write-heavy load, though it's over a wifi link that only gets about 3 MB/s). I'll try to build zfs with debugging to see if anything else turns up.
@behlendorf according to my notes I'm pretty sure this is exactly how I created my pool:
The devices are a partition on each of two 2 TB consumer-market SATA HDDs at 7200 RPM. After creating the pool, I used something like:
to load it up with a replication stream from my previous non-encrypted iteration of the pool, which I think was created under zfs 0.7.x or 0.8.x and had been working fine for a couple of years. To clarify, all datasets are now encrypted on the new pool. The old pool contained a bunch of snapshots of its own, but the corrupted snapshot(s) were created a week or two after moving to the new pool.
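In rough terms the workflow was the following; the device, pool, and snapshot names here are hypothetical stand-ins and the exact options may have differed:

```sh
# Encrypted pool with lz4 compression on two consumer SATA disk partitions
# (hypothetical device names and options).
zpool create -o ashift=12 \
    -O encryption=aes-256-gcm -O keyformat=passphrase \
    -O compression=lz4 \
    tank mirror /dev/disk/by-id/ata-DISK1-part1 /dev/disk/by-id/ata-DISK2-part1

# Load the new pool from the old, non-encrypted pool via a replication stream.
zfs snapshot -r oldpool@migrate
zfs send -R oldpool@migrate | zfs recv -Fu tank/data
```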
FYI: The …
@mabod, thanks for the tip! Apparently I forgot that this pool started life as a single device, and then I attached the second device a day later after loading it up and scrubbing.
@behlendorf is there an easy way to verify on a running system if my zfs kernel module was compiled with debugging enabled?
@rbrewer123 yes, there is. When you load the kmod it'll print
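something along these lines in dmesg when the module loads (illustrative; the exact text varies by version, and a debug build adds a debug marker to the line):

```sh
$ dmesg | grep "ZFS: Loaded module"
ZFS: Loaded module v2.0.1-1, ZFS pool version 5000, ZFS filesystem version 5
# A module built with --enable-debug tags this line, e.g. with "(DEBUG mode)".
```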
FYI for the TrueNAS users, 12.0-U1.1 was released with a workaround for this issue.
@freqlabs I upgraded to 12.0-U1.1; my files still report input/output error. Are you saying that I should not see any new corruption anymore (but my 17,000 files are already corrupted and I should delete them)? Is there a recommended approach to recover the files (maybe using zdb), as my guess (but I could be wrong) is that the content of the files is OK?
@shuther OK, then unfortunately your issue must be different from the one we fixed. There should not have been any actual corruption on disk, only in the buffers passed to applications. This may still be the case, but for a different reason. I would like to believe the scrub is correct. Can you not read the files or detect errors via scrub with an older version of ZFS?
@freqlabs scrub returns no issue. I can't read the files (input/output error); I can't downgrade zfs (I enabled some features). Not sure how I can help further.
I wonder if this one can be similar to #10697 (null ACLs). That would explain the lack of scrub errors.
@IvanVolosyuk I am able to see the content of the file (partially), as per #10697 (comment). Also, changing the ACL does not help (getfacl was returning the same output for every file).
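Roughly what that ACL check looks like (hypothetical paths; the dataset uses NFSv4 ACLs on FreeBSD):

```sh
# A broken and a healthy file in the same dataset report identical NFSv4 ACL entries.
getfacl /mnt/tank/pictures/broken.jpg
getfacl /mnt/tank/pictures/healthy.jpg

# Stripping the ACL back to the basic entries does not make the file readable.
setfacl -b /mnt/tank/pictures/broken.jpg
cat /mnt/tank/pictures/broken.jpg    # still fails with Input/output error
```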
@behlendorf thanks for the tip. A quick update: I was able to upgrade my ZFS kernel module to 2.0.1 a week or so ago. My system got another corrupted snapshot a few days ago that is also permanent like the first, so (as expected) it's still happening with 2.0.1. It's possible I was lightly using the system at the time of that corruption; I'm not sure. I finally wrangled NixOS into building the module with debugging and am running with that as of today.
@rbrewer123 thanks for the update. Definitely let me know if you encounter any problems. Unfortunately, I've thus far not been able to reproduce the issue.
@behlendorf here's a quick update. I'm currently running openzfs 2.0.1 with debugging enabled. I'm noticing that the …
FWIW I also see core dumps from zfs send when attempting to send a corrupted snapshot. Unfortunately my core files get auto-purged so I don't have one handy right now, but if I see one again I can attempt to investigate. I unfortunately am not running a debug build here :(
@behlendorf I got another corrupted snapshot, noticed during my backup procedure. I ran
There are a bunch more hung tasks following that in the logs. The next morning I had to power-cycle because my machine was unresponsive.
@rbrewer123 thanks! Unfortunately, that stack is most likely unrelated, since it's something we have observed rarely with prior releases.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Describe the problem you're observing
The issue raised for a previous version of zfs (#10697) is still there.
The issue appears in FreeNAS (FreeBSD 12) using OpenZFS 2.0; the problem was confirmed using SystemRescue (https://github.com/nchevsky/systemrescue-zfs), which uses a more recent version of zfs. This sounds more like a zfs issue.
The impacted files are read-only (I never changed them; write once, read many times; they should have been part of the pool for at least one year). The files (about 17,000) are all part of the same dataset (mainly .jpg files, but not exclusively, while the dataset contains mostly music).
I have many files that I can't read (cat, less, etc. report: Input/output error); ls still lists the files, however. I replaced a disk recently; not sure if I should try a resilver or if that was the cause of the problem. I checked old snapshots, and the problem is there as well.
I tried to see the content of a file using zdb, and it seemed successful.
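Roughly what that zdb inspection looks like (hypothetical pool, dataset, and file names; the object id comes from ls -i):

```sh
# On ZFS, a file's inode number is its object id within the dataset.
ls -i /mnt/tank/pictures/broken.jpg
#   123456 /mnt/tank/pictures/broken.jpg

# Dump the dnode and block pointer tree for that object.
zdb -ddddd tank/pictures 123456

# Individual blocks can then be read back directly with zdb -R, bypassing
# the POSIX read path:  zdb -R tank <vdev>:<offset>:<size>[:flags]
```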
cp returns: Bad address, and dmesg reports the error below (nothing is logged for cat, etc.):
vm_fault: pager read error
zfs send also returns Input/output error; it is an easy way to spot the problem.
The pool is healthy; scrub did not report any issue. However, after executing the zfs send, one error appeared, connected to a snapshot (the one I used with zfs send); I guess it stopped after the first error.
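Since scrub reports nothing, a crude way to sweep a dataset for affected snapshots is to send each one to /dev/null and watch for failures (the dataset name is a placeholder; this is slow, as every snapshot is sent in full):

```sh
for snap in $(zfs list -H -t snapshot -o name -r tank/pictures); do
    zfs send "$snap" > /dev/null || echo "FAILED: $snap"
done
# Failing sends then show up as errors against those snapshots in:
zpool status -v
```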
Note that the slog/cache were added very recently (the issue was likely present before I added them).
The files are pretty old as per below, but it is as if the inode changed recently, around the time I replaced one disk with a larger one (mid-November, I think; not sure how to check, as I don't see it under zpool history), and I don't know if that is related.
Describe how to reproduce the problem
zfs send or cat file
Include any warning/errors/backtraces from the system logs
On FreeBSD, nothing, except when using cp:
vm_fault: pager read error