
storage: node crashes with "Corruption: external file have non zero sequence number" error #36679

Closed
thoszhang opened this issue Apr 9, 2019 · 11 comments
Assignees
Labels
A-schema-changes · A-storage (relating to our storage engine (Pebble) on-disk storage) · S-1-stability (severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting)

Comments

@thoszhang
Contributor

During testing for #36091, when I was running an index backfill, some nodes ran out of memory and died. I restarted the cluster, and some of the nodes crashed after a few minutes (although I don't remember if they were the same nodes that originally crashed), with this error:

F190409 00:42:44.917430 271 storage/replica_proposal.go:452  [n20,s20,r3342/3:/Table/53/2/"D{.6K"/…-4>\\"…}] while ingesting /mnt/data1/cockroach/auxiliary/sideloading/r0XXXX/r3342/i802.t6.ingested: Corruption: external file have non zero sequence number
goroutine 271 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000057b01, 0xc000057b60, 0x53b4000, 0x1b)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1020 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5b482c0, 0xc000000004, 0x53b40b3, 0x1b, 0x1c4, 0xc008ad20d0, 0xc5)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:878 +0x93d
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3a352c0, 0xc009669ec0, 0x4, 0x2, 0x32dd7b3, 0x16, 0xc00b00e9e8, 0x2, 0x2)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d8
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3a352c0, 0xc009669ec0, 0x1, 0xc000000004, 0x32dd7b3, 0x16, 0xc00b00e9e8, 0x2, 0x2)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x3a352c0, 0xc009669ec0, 0x32dd7b3, 0x16, 0xc00b00e9e8, 0x2, 0x2)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.addSSTablePreApply(0x3a352c0, 0xc009669ec0, 0xc000505300, 0x3a82700, 0xc000447500, 0x3a57ba0, 0xc0052d6300, 0x6, 0x322, 0xc007199500, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_proposal.go:452 +0xef6
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).processRaftCommand(0xc0052ec000, 0x3a352c0, 0xc009669ec0, 0xc00e308428, 0x8, 0x6, 0x322, 0x200000002, 0x1, 0x14, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:1864 +0x148e
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc0052ec000, 0x3a352c0, 0xc009669ec0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:812 +0x13df
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x3a352c0, 0xc009669ec0, 0xc0052ec000, 0x3a352c0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3628 +0x120
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc0003ecc00, 0x3a352c0, 0xc009669ec0, 0xc0067baa80, 0xc00a68bed0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3275 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc0003ecc00, 0x3a352c0, 0xc0009e9560, 0xd0e)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3616 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc0000da900, 0x3a352c0, 0xc0009e9560)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:225 +0x21a
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x3a352c0, 0xc0009e9560)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc00029a6b0, 0xc000b46000, 0xc00029a620)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:200 +0xe1
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:193 +0xa8

After I restarted the cluster again, the job seemed to resume correctly.

This seems to be the same issue (with the same cause, i.e., restarting nodes after a crash during an index backfill) that was supposed to have been fixed in #36445 (which was merged when I ran this). I haven't tried to reproduce it yet.

thoszhang added the S-1-stability, A-storage, and A-schema-changes labels on Apr 9, 2019
dt assigned ajkr, dt, and thoszhang on Apr 9, 2019
@petermattis
Collaborator

@ajkr Lucy is pretty confident that she was running a binary containing @dt's mitigation for this problem. Specifically, we only try to do a moving ingest of an sstable when the link count is 1 prior to the ingest. See #36511. Any ideas what could have gone wrong with that mitigation?
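
For reference, the precondition is roughly the following; `linkCount` and `canMoveIngest` here are an illustrative sketch, not the exact code from #36511:

```go
package sideload

import (
	"fmt"
	"os"
	"syscall"
)

// linkCount returns the number of hard links pointing at path, read from the
// platform stat information (Linux/macOS). Illustrative only; error handling
// is minimal.
func linkCount(path string) (uint64, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	st, ok := fi.Sys().(*syscall.Stat_t)
	if !ok {
		return 0, fmt.Errorf("unexpected stat type for %s", path)
	}
	return uint64(st.Nlink), nil
}

// canMoveIngest reports whether a move-based ingest is allowed under the
// mitigation: only when we hold the sole link to the file. Anything else
// means another link (e.g. one created by a previous ingest) may still
// exist, so the caller should copy the file instead.
func canMoveIngest(sstPath string) bool {
	n, err := linkCount(sstPath)
	return err == nil && n == 1
}
```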

@dt had another thought here: let's snoop the error message and not fatal when this specific error occurs. No RocksDB state has been changed. We can look for this error message and then force a copy of the file and try the ingest again. @ajkr do you see any problems with that approach?
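
Roughly what that fallback could look like, sketched with a hypothetical `ingest` callback and a plain file copy standing in for the real engine calls:

```go
package sideload

import (
	"io"
	"os"
	"strings"
)

// ingestWithCopyFallback first attempts a move-based ingest. If the engine
// rejects the file with the known corruption message, no RocksDB state has
// changed, so we copy the file to a scratch path and retry the ingest with
// the fresh copy. The ingest callback is a hypothetical stand-in for the
// real engine API.
func ingestWithCopyFallback(path string, ingest func(path string) error) error {
	err := ingest(path)
	if err == nil || !strings.Contains(err.Error(), "external file have non zero sequence number") {
		return err
	}
	copyPath := path + ".copy" // hypothetical scratch name
	if cerr := copyFile(path, copyPath); cerr != nil {
		return cerr
	}
	defer os.Remove(copyPath) // best-effort cleanup if the retry also fails
	return ingest(copyPath)
}

// copyFile copies src to dst; a plain byte copy is enough here since the
// point is just to end up with a file whose link count is one.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}
```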

@bdarnell
Contributor

bdarnell commented Apr 9, 2019

The issue behind #36511 was considered a release blocker. Does that mean this one is too?

I thought about inspecting the error message before we introduced the link-count hack. I think it would work, although it feels unclean (especially if we don't have a good idea about why counting links didn't work).

@petermattis
Collaborator

> The issue behind #36511 was considered a release blocker. Does that mean this one is too?

Yes, let's get this added back to the release blocker issue list.

> I thought about inspecting the error message before we introduced the link-count hack. I think it would work, although it feels unclean (especially if we don't have a good idea about why counting links didn't work).

Agreed that it feels unclean. We don't currently have a theory as to why the link-count hack didn't work. @lucy-zhang is going to try to reproduce this crash this week, which will hopefully shed some light on it.

@petermattis
Collaborator

Here is a theory for why the link-count hack didn't work:

  • Ingest external file A as SST X
  • RocksDB adds a hard link to file A (link-count == 2)
  • Crash -> restart
  • SST X is loaded into the block cache
  • Compaction occurs involving SST X which decreases the link-count of file A to 1
  • We try to re-ingest file A without making a copy ... boom!

@ajkr does this sound possible to you?
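
To make the filesystem part of this concrete, here is a tiny stand-alone illustration (temporary, hypothetical paths only) of the link count dropping back to 1 once the extra link goes away:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// Tiny illustration of the link-count behavior in the scenario above: after
// the extra hard link created by the first ingest is removed (the compaction
// step), the original file is back to a link count of 1, so the
// "link count == 1" precondition passes even though the file was already
// ingested once.
func main() {
	dir, err := os.MkdirTemp("", "linkcount")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	orig := filepath.Join(dir, "fileA.sst")
	link := filepath.Join(dir, "ingested-copy.sst")
	if err := os.WriteFile(orig, []byte("sstable bytes"), 0o644); err != nil {
		panic(err)
	}

	show := func(label string) {
		fi, err := os.Stat(orig)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: nlink=%d\n", label, fi.Sys().(*syscall.Stat_t).Nlink)
	}

	show("before ingest")          // nlink=1
	_ = os.Link(orig, link)        // ingest hard-links the external file
	show("after ingest hard-link") // nlink=2
	_ = os.Remove(link)            // compaction later deletes the SST
	show("after compaction")       // nlink=1 again -> move-ingest looks "safe"
}
```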

@dt
Member

dt commented Apr 10, 2019

Huh, so maybe link counting isn't the right approach. Maybe we should go back to the idea of touching a sentinel file to get at-most-once semantics?
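
Loosely, the sentinel idea is something like the sketch below; the marker name and helpers are hypothetical, not the actual sideload layout:

```go
package sideload

import "os"

// markerPath is a hypothetical sentinel written next to a sideloaded SSTable
// before we attempt to move-ingest it.
func markerPath(sstPath string) string { return sstPath + ".ingest-attempted" }

// shouldCopy reports whether an earlier process may already have tried to
// move-ingest this file (the sentinel exists), in which case we must copy
// rather than move to preserve at-most-once move semantics.
func shouldCopy(sstPath string) (bool, error) {
	if _, err := os.Stat(markerPath(sstPath)); err == nil {
		return true, nil
	} else if os.IsNotExist(err) {
		return false, nil
	} else {
		return false, err
	}
}

// recordAttempt creates and syncs the sentinel before the ingest, so that a
// crash anywhere after this point still leaves evidence of the attempt.
// (A fully durable version would also sync the parent directory.)
func recordAttempt(sstPath string) error {
	f, err := os.Create(markerPath(sstPath))
	if err != nil {
		return err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}
```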

@bdarnell
Contributor

Peter's scenario makes sense to me (unless there's a process that eagerly evicts blocks from the cache when a file is compacted away).

Touching a sentinel file feels cleaner than inspecting the error message, although it seems potentially fairly expensive to create a new file and flush it to disk. Would it be better to write a key to rocksdb to track this?

@petermattis
Collaborator

> unless there's a process that eagerly evicts blocks from the cache when a file is compacted away

There is definitely not such a process in RocksDB. The lack of such a process is exactly why the block cache expands to fill available space even when the on-disk size is much smaller.

> Touching a sentinel file feels cleaner than inspecting the error message, although it seems potentially fairly expensive to create a new file and flush it to disk. Would it be better to write a key to rocksdb to track this?

Given that I expect the hack to be temporary, I'd go with whatever is most expedient. Creating a sentinel file isn't much more expensive (if at all) than writing a RocksDB key and syncing it.

@ajkr
Contributor

ajkr commented Apr 10, 2019

> @ajkr does this sound possible to you?

Sorry, I need to figure out how to prioritize mentions in my notifications. It seems I miss them pretty much every time.

Yes, that looks right. Compaction could delete the ingested file before we re-ingest it, causing its link count to drop back to one, and that file's blocks can survive in the block cache.

This realization is kind of unfortunate, though it's really good that we know now. It means that even if we track the unique IDs of all files in the DB, e.g., by storing them in the manifest, we still cannot prevent a duplicate file from being move-ingested.

@petermattis
Collaborator

> This realization is kind of unfortunate, though it's really good that we know now. It means that even if we track the unique IDs of all files in the DB, e.g., by storing them in the manifest, we still cannot prevent a duplicate file from being move-ingested.

I've already forgotten why using the sstable's file number (instead of the inode number) for the cache key is problematic. That would solve the problem here, since the file number is allocated anew each time a file is ingested. We'd have to make the cache key unique by including some sort of unique identifier for the DB, but that doesn't seem like a serious hurdle. Concretely, I'm proposing that the cache key look like: <db-unique-id>/<file-num>/<block-offset>.
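
For what it's worth, a tiny sketch of the key shape I mean; the struct and field names are illustrative, not RocksDB's actual cache-key layout:

```go
package sideload

import "fmt"

// blockCacheKey illustrates the proposed key shape: a DB-unique identifier
// plus the file number and block offset, instead of the file's inode. Because
// the file number is allocated anew on every ingest, re-ingesting the same
// physical file can never collide with stale cached blocks.
type blockCacheKey struct {
	dbID        string // unique per DB instance
	fileNum     uint64 // allocated fresh for every ingested/flushed file
	blockOffset uint64 // offset of the block within the sstable
}

func (k blockCacheKey) String() string {
	return fmt.Sprintf("%s/%d/%d", k.dbID, k.fileNum, k.blockOffset)
}
```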

@awoods187
Contributor

Addressed via #36688

@awoods187
Contributor

After discussion with @dt, we are going to close this issue (as part of the release blocker) and open a new issue for any remaining work and ongoing thinking here.
