storage: node crashes with "Corruption: external file have non zero sequence number" error #36679
@ajkr Lucy is pretty confident that she was running a binary containing @dt's mitigation for this problem. Specifically, we only try to do a moving ingest of an sstable when the link count is 1 prior to the ingest. See #36511. Any ideas what could have gone wrong with that mitigation?

@dt had another thought here: let's snoop the error message and not fatal when this specific error occurs. No RocksDB state has been changed. We can look for this error message, then force a copy of the file and try the ingest again. @ajkr do you see any problems with that approach?
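A minimal sketch of that fallback, assuming hypothetical names (the `ingester` interface and its methods are stand-ins for the real storage-layer ingestion calls, not CockroachDB's actual API):

```go
package storage

import "strings"

// errNonZeroSeqNum is the RocksDB message returned when a previously linked
// file is handed back for ingestion; at that point no RocksDB state has been
// changed yet.
const errNonZeroSeqNum = "external file have non zero sequence number"

// ingester is a stand-in for the storage layer's two ingestion modes.
type ingester interface {
	IngestLink(path string) error // move-ingest via hard link
	IngestCopy(path string) error // ingest a fresh copy of the file
}

// ingestWithFallback tries the moving ingest first and, if RocksDB rejects
// the file with the known corruption message, retries with a copy instead of
// treating the error as fatal.
func ingestWithFallback(db ingester, path string) error {
	err := db.IngestLink(path)
	if err == nil || !strings.Contains(err.Error(), errNonZeroSeqNum) {
		return err
	}
	// The move was rejected; force a copy of the file and retry the ingest.
	return db.IngestCopy(path)
}
```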
The issue behind #36511 was considered a release blocker. Does that mean this one is too?

I thought about inspecting the error message before we introduced the link-count hack. I think it would work, although it feels unclean (especially if we don't have a good idea about why counting links didn't work).
Yes, let's get this added back to the release blocker issue list.
Agreed that it feels unclean. We don't currently have a theory as to why the link-count hack didn't work. @lucy-zhang is going to try to reproduce this crash this week, which will hopefully shed some light on it.
Here is a theory for why the link-count hack didn't work: a compaction could delete the previously ingested (hard-linked) copy of the file before we retry the ingest, dropping the link count back to 1 so the check passes, while that file's blocks can still survive in the block cache.

@ajkr does this sound possible to you?
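For reference, a sketch of the link-count check at the heart of the mitigation (assuming a Unix filesystem where the count is exposed via `syscall.Stat_t`; the function name is illustrative), annotated with the race described above:

```go
package storage

import (
	"fmt"
	"os"
	"syscall"
)

// okToMoveIngest reports whether path currently has a single hard link, the
// condition the #36511 mitigation checks before attempting a move-ingest.
func okToMoveIngest(path string) (bool, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	st, ok := fi.Sys().(*syscall.Stat_t)
	if !ok {
		return false, fmt.Errorf("no stat_t available for %s", path)
	}
	// Race per the theory above: if a compaction has already removed the
	// link created by the earlier ingest, Nlink is back to 1 and this check
	// passes, even though the earlier ingest already happened and the old
	// file's blocks may still be sitting in the block cache.
	return st.Nlink == 1, nil
}
```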
Huh, so maybe link counting isn't the right approach. Maybe we should go back to the idea of touching a sentinel file to get at-most-once semantics?
Peter's scenario makes sense to me (unless there's a process that eagerly evicts blocks from the cache when a file is compacted away). Touching a sentinel file feels cleaner than inspecting the error message, although it seems potentially fairly expensive to create a new file and flush it to disk. Would it be better to write a key to RocksDB to track this?
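A rough sketch of the sentinel-file idea, with hypothetical naming and placement; a real change would also need to clean the marker up once the ingest is known to have completed:

```go
package storage

import (
	"os"
	"path/filepath"
)

// markIngestAttempt creates and syncs a sentinel file next to the sstable
// before the move-ingest, so that a retry after a crash can tell an earlier
// attempt may already have handed the file to RocksDB.
func markIngestAttempt(sstPath string) error {
	f, err := os.Create(sstPath + ".ingesting")
	if err != nil {
		return err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// Sync the parent directory so the marker itself survives a crash.
	dir, err := os.Open(filepath.Dir(sstPath))
	if err != nil {
		return err
	}
	defer dir.Close()
	return dir.Sync()
}

// shouldCopyInstead reports whether a prior attempt left a marker behind, in
// which case the retry should copy the file rather than move it.
func shouldCopyInstead(sstPath string) bool {
	_, err := os.Stat(sstPath + ".ingesting")
	return err == nil
}
```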
There is definitely no such process in RocksDB. The lack of one is exactly why the block cache expands to fill available space even when the on-disk size is much smaller.

Given that I expect the hack to be temporary, I'd go with whatever is most expedient. Creating a sentinel file isn't much more expensive (if at all) than writing a RocksDB key and syncing it.
Sorry, I need to figure out how to prioritize mentions in my notifications. Seems I miss them pretty much every time.

Yes, that looks right. Compaction could delete the ingested file before we re-ingest, causing its link count to be one, and that file's blocks can survive in the block cache. This realization feels kind of unfortunate, though it's really good we know now. It means that even if we track the unique IDs of all files in the DB, e.g., by storing them in the manifest, we still cannot prevent a duplicate file from being move-ingested.
I've already forgotten why using the sstable's file number (instead of the inode number) is problematic. That would solve the problem here, since the file number is allocated anew each time a file is ingested. We'd have to make the cache key unique by including some sort of unique identifier for the DB, but that doesn't seem like a serious hurdle. Concretely, I'm proposing that the cache key look like:
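The concrete key layout from the original comment isn't preserved in this thread; as an illustration only, a key along these lines would capture the idea (Go sketch; the type and field names are assumptions, not the actual RocksDB cache key):

```go
package storage

import "fmt"

// blockCacheKey sketches a key that stays unique across DB instances and
// across re-ingestions of the same physical file: it is built from the
// RocksDB file number (allocated anew on every ingest) plus a per-DB
// identifier, rather than the filesystem inode number.
type blockCacheKey struct {
	dbID        string // unique identifier for the DB instance
	fileNumber  uint64 // RocksDB file number of the table
	blockOffset uint64 // offset of the block within the table file
}

func (k blockCacheKey) String() string {
	return fmt.Sprintf("%s/%d/%d", k.dbID, k.fileNumber, k.blockOffset)
}
```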
Addressed via #36688 |
After discussion with @dt, we are going to close this issue (as part of the release blocker) and open a new issue for any other work/ongoing thinking here.
During testing for #36091, when I was running an index backfill, some nodes ran out of memory and died. I restarted the cluster, and some of the nodes crashed after a few minutes (although I don't remember if they were the same nodes that originally crashed), with this error:
After I restarted the cluster again, the job seemed to resume correctly.
This seems to be the same issue (with the same cause, i.e., restarting nodes after a crash during an index backfill) that was supposed to have been fixed in #36445 (which was merged when I ran this). I haven't tried to reproduce it yet.