storage: node crashes with "Corruption: external file have non zero sequence number" error #36679

thoszhang · 2019-04-09T18:04:31Z

During testing for #36091, when I was running an index backfill, some nodes ran out of memory and died. I restarted the cluster, and some of the nodes crashed after a few minutes (although I don't remember if they were the same nodes that originally crashed), with this error:

F190409 00:42:44.917430 271 storage/replica_proposal.go:452  [n20,s20,r3342/3:/Table/53/2/"D{.6K"/…-4>\\"…}] while ingesting /mnt/data1/cockroach/auxiliary/sideloading/r0XXXX/r3342/i802.t6.ingested: Corruption: external file have non zero sequence number
goroutine 271 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000057b01, 0xc000057b60, 0x53b4000, 0x1b)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1020 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5b482c0, 0xc000000004, 0x53b40b3, 0x1b, 0x1c4, 0xc008ad20d0, 0xc5)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:878 +0x93d
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3a352c0, 0xc009669ec0, 0x4, 0x2, 0x32dd7b3, 0x16, 0xc00b00e9e8, 0x2, 0x2)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d8
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3a352c0, 0xc009669ec0, 0x1, 0xc000000004, 0x32dd7b3, 0x16, 0xc00b00e9e8, 0x2, 0x2)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x3a352c0, 0xc009669ec0, 0x32dd7b3, 0x16, 0xc00b00e9e8, 0x2, 0x2)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.addSSTablePreApply(0x3a352c0, 0xc009669ec0, 0xc000505300, 0x3a82700, 0xc000447500, 0x3a57ba0, 0xc0052d6300, 0x6, 0x322, 0xc007199500, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_proposal.go:452 +0xef6
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).processRaftCommand(0xc0052ec000, 0x3a352c0, 0xc009669ec0, 0xc00e308428, 0x8, 0x6, 0x322, 0x200000002, 0x1, 0x14, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:1864 +0x148e
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc0052ec000, 0x3a352c0, 0xc009669ec0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:812 +0x13df
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x3a352c0, 0xc009669ec0, 0xc0052ec000, 0x3a352c0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3628 +0x120
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc0003ecc00, 0x3a352c0, 0xc009669ec0, 0xc0067baa80, 0xc00a68bed0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3275 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc0003ecc00, 0x3a352c0, 0xc0009e9560, 0xd0e)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3616 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc0000da900, 0x3a352c0, 0xc0009e9560)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:225 +0x21a
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x3a352c0, 0xc0009e9560)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc00029a6b0, 0xc000b46000, 0xc00029a620)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:200 +0xe1
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:193 +0xa8

After I restarted the cluster again, the job seemed to resume correctly.

This seems to be the same issue (with the same cause, i.e., restarting nodes after a crash during an index backfill) that was supposed to have been fixed in #36445 (which was merged when I ran this). I haven't tried to reproduce it yet.

The text was updated successfully, but these errors were encountered:

petermattis · 2019-04-09T20:37:42Z

@ajkr Lucy is pretty confident that she was running a binary containing @dt's mitigation for this problem. Specifically, we only try to do a moving ingest of an sstable when the link count is 1 prior to the ingest. See #36511. Any ideas what could have gone wrong with that mitigation?

@dt had another thought here: let's snoop the error message and not fatal when this specific error occurs. No RocksDB state has been changed. We can look for this error message and then force a copy of the file and try the ingest again. @ajkr do you see any problems with that approach?

bdarnell · 2019-04-09T22:10:14Z

The issue behind #36511 was considered a release blocker. Does that mean this one is too?

I thought about inspecting the error message before we introduced the link-count hack. I think it would work, although it feels unclean (especially if we don't have a good idea about why counting links didn't work)

petermattis · 2019-04-10T13:21:22Z

The issue behind #36511 was considered a release blocker. Does that mean this one is too?

Yes, let's get this added back to the release blocker issue list.

I thought about inspecting the error message before we introduced the link-count hack. I think it would work, although it feels unclean (especially if we don't have a good idea about why counting links didn't work)

Agreed that it feels unclean. We don't currently have a theory as to why the link-count hack didn't work. @lucy-zhang is going to be trying to reproduce this crash this week which will hopefully shed some light on it.

petermattis · 2019-04-10T13:27:57Z

Here is a theory for why the link-count hack didn't work:

Ingest external file A as SST X
RocksDB adds a hard link to file A (link-count == 2)
Crash -> restart
SST X is loaded into the block cache
Compaction occurs involving SST X which decreases the link-count of file A to 1
We try to re-ingest file A without making a copy ... boom!

@ajkr does this sound possible to you?

dt · 2019-04-10T14:22:11Z

Huh, so maybe link counting isn't the right approach. Maybe we should go back to the idea of touching a sentinel file to get at-most-once semantics?

bdarnell · 2019-04-10T15:04:20Z

Peter's scenario makes sense to me (unless there's a process that eagerly evicts blocks from the cache when a file is compacted away).

Touching a sentinel file feels cleaner than inspecting the error message, although it seems potentially fairly expensive to create a new file and flush it to disk. Would it be better to write a key to rocksdb to track this?

petermattis · 2019-04-10T16:57:06Z

unless there's a process that eagerly evicts blocks from the cache when a file is compacted away

There is definitely not such a process in RocksDB. The lack of such a process is exactly why the block cache expands to fill available space even when the on-disk size is much smaller.

Touching a sentinel file feels cleaner than inspecting the error message, although it seems potentially fairly expensive to create a new file and flush it to disk. Would it be better to write a key to rocksdb to track this?

Given that I hack expect the hack to be temporary, I'd go with what is most expedient. Creating a sentinel file isn't much more expensive (if at all) vs writing a RocksDB key and syncing it.

ajkr · 2019-04-10T20:31:28Z

@ajkr does this sound possible to you?

Sorry I need to figure out how to prioritize mentions in my notifications. Seems I miss them pretty much every time.

Yes that looks right. Compaction could delete the ingested file before we re-ingest causing its link count to be one. And that file's blocks can survive in block cache.

This realization feels kind of unfortunate, though it's really good we know now. It means even if we track unique ID of all files in the DB, e.g., by storing them in the manifest, we still cannot prevent a duplicate file from being move-ingested.

petermattis · 2019-04-11T00:37:14Z

This realization feels kind of unfortunate, though it's really good we know now. It means even if we track unique ID of all files in the DB, e.g., by storing them in the manifest, we still cannot prevent a duplicate file from being move-ingested.

I've already forgotten the reason when using the sstable's file number (instead of inode number) is problematic. That would solve the problem here as the file number is allocated anew each time a file is ingested. We'd have to make the cache key unique by including some sort of unique identifier for the DB, but that doesn't seem like a serious hurdle. Concretely, I'm proposing that the cache key look like: <db-unique-id>/<file-num>/<block-offset>.

awoods187 · 2019-04-11T18:47:17Z

Addressed via #36688

awoods187 · 2019-04-11T19:07:11Z

After discussion with @dt we are going to close this issue (as a part of the release blocker) and open a new issue for any other work/ongoing thinking here

thoszhang added S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting A-storage Relating to our storage engine (Pebble) on-disk storage. A-schema-changes labels Apr 9, 2019

dt assigned ajkr, dt and thoszhang Apr 9, 2019

andreimatei mentioned this issue Apr 10, 2019

19.1 release blockers list #35554

Closed

23 tasks

awoods187 closed this as completed Apr 11, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: node crashes with "Corruption: external file have non zero sequence number" error #36679

storage: node crashes with "Corruption: external file have non zero sequence number" error #36679

thoszhang commented Apr 9, 2019

petermattis commented Apr 9, 2019

bdarnell commented Apr 9, 2019

petermattis commented Apr 10, 2019

petermattis commented Apr 10, 2019

dt commented Apr 10, 2019

bdarnell commented Apr 10, 2019

petermattis commented Apr 10, 2019

ajkr commented Apr 10, 2019 •

edited

Loading

petermattis commented Apr 11, 2019

awoods187 commented Apr 11, 2019

awoods187 commented Apr 11, 2019

storage: node crashes with "Corruption: external file have non zero sequence number" error #36679

storage: node crashes with "Corruption: external file have non zero sequence number" error #36679

Comments

thoszhang commented Apr 9, 2019

petermattis commented Apr 9, 2019

bdarnell commented Apr 9, 2019

petermattis commented Apr 10, 2019

petermattis commented Apr 10, 2019

dt commented Apr 10, 2019

bdarnell commented Apr 10, 2019

petermattis commented Apr 10, 2019

ajkr commented Apr 10, 2019 • edited Loading

petermattis commented Apr 11, 2019

awoods187 commented Apr 11, 2019

awoods187 commented Apr 11, 2019

ajkr commented Apr 10, 2019 •

edited

Loading