storage: consistency check failure during import #36861

Closed
3 tasks done
nvanbenschoten opened this issue Apr 16, 2019 · 46 comments · Fixed by #36939 or #36997
Assignees
Labels
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-1: High impact: many users impacted, serious risk of high unavailability or data loss

Comments

@nvanbenschoten
Member

nvanbenschoten commented Apr 16, 2019

To do / understand


This looks very similar to #35424, so it's possible that that issue wasn't fully resolved. I was most of the way through a TPC-C 4k import when a node died due to a consistency check failure.

F190416 01:58:51.634989 172922 storage/replica_consistency.go:220  [n5,consistencyChecker,s5,r590/1:/Table/68/1/{29/4/2…-31/4/1…}] consistency check failed with 1 inconsistent replicas
goroutine 172922 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000056301, 0xc000056300, 0x5449800, 0x1e)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1020 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5bdd700, 0xc000000004, 0x5449860, 0x1e, 0xdc, 0xc008bfa5a0, 0x79)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:878 +0x93d
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3aa1620, 0xc0071899e0, 0x4, 0x2, 0x33b2862, 0x36, 0xc01f06cce0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d8
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3aa1620, 0xc0071899e0, 0x1, 0xc000000004, 0x33b2862, 0x36, 0xc01f06cce0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x3aa1620, 0xc0071899e0, 0x33b2862, 0x36, 0xc01f06cce0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).CheckConsistency(0xc023eeb400, 0x3aa1620, 0xc0071899e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go:220 +0x6ce
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).CheckConsistency(0xc023eeb400, 0x3aa1620, 0xc0071899e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go:229 +0x81b
github.com/cockroachdb/cockroach/pkg/storage.(*consistencyQueue).process(0xc0003de2a0, 0x3aa1620, 0xc0071899e0, 0xc023eeb400, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/consistency_queue.go:125 +0x210

Cockroach SHA: 3ebed10

Notes:

Cluster: nathan-tpcc-geo (stopped, extended for 48h)
Cockroach nodes: 1,2,4,5,7,8,10,11
Inconsistent range: r590
Replicas: nathan-tpcc-geo:2/n2/r3, nathan-tpcc-geo:5/n4/r4, and nathan-tpcc-geo:7/n5/r1
Inconsistent replica: nathan-tpcc-geo:7/n5/r1
Replicas in zones: europe-west2-b, europe-west4-b, and asia-northeast1-b respectively

Initial Investigation

Unlike in the later reproductions of #35424, replica 1's Raft log is an exact prefix of replica 3 and 4's, so this doesn't look like the same issue we saw later in that issue.

I haven't looked at much else yet.

r590 Range _ Debug _ Cockroach Console.pdf

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1 High impact: many users impacted, serious risk of high unavailability or data loss labels Apr 16, 2019
@tbg
Member

tbg commented Apr 16, 2019

We don't have the diff because I messed something up during recent refactors. Running

roachprod ssh nathan-tpcc-geo -- sudo cp -R /mnt/data1/cockroach /mnt/data1/backup

now and then I'll bring the cluster up to recover the diff.

@tbg
Member

tbg commented Apr 16, 2019

Currently the range seems stats-consistent and also passed enqueuing through the full consistency checker. The replicas are still n2,n4,n5 (replicaIDs 1 3 4), so nothing has shifted. I will dig some more but this could mean that we were looking at a false positive of sorts, or a temporary inconsistency that resolved itself.

I'm also not so sure that I messed something up regarding the consistency checker diff. What's really weird is that, typically, the queue first computes a checksum without the diff. Then, when that has finished and found mismatching SHAs, it runs another round, this time with a diff, prints that, and fatals. In the logs, we're supposed to see the log message from line 226.

// No diff was printed, so we want to re-run with diff.
// Note that this will call Fatal recursively in `CheckConsistency` (in the code above).
log.Errorf(ctx, "consistency check failed with %d inconsistent replicas; fetching details",
	inconsistencyCount)
args.WithDiff = true
if _, pErr := r.CheckConsistency(ctx, args); pErr != nil {
	logFunc(ctx, "replica inconsistency detected; could not obtain actual diff: %s", pErr)
}

I don't see that in the (combined) logs from the cluster. But the stack trace shows that we are in the recursive call on line 229.

A clue is that the fatal error is in its own, new, log file. There aren't any past log files. It looks like the logging messed something up; we effectively know nothing about n5 (nathan-tpcc-geo:7) before the fatal; the fatal message is the very first thing we have.

tschottdorf@nathan-tpcc-geo-0007:/home/nathan/logs$ ls -lah
total 1.1M
drwxrwxr-x 4 nathan nathan 4.0K Apr 16 05:29 .
drwxr-xr-x 5 nathan nathan 4.0K Apr 16 05:29 ..
lrwxrwxrwx 1 nathan nathan   69 Apr 16 01:58 cockroach.log -> cockroach.nathan-tpcc-geo-0007.nathan.2019-04-16T01_58_52Z.003778.log
-rw-rw-r-- 1 nathan nathan 495K Apr 16 01:58 cockroach.nathan-tpcc-geo-0007.nathan.2019-04-16T01_58_52Z.003778.log
-rw-rw-r-- 1 nathan nathan 495K Apr 16 01:58 cockroach.stderr.log
-rw-rw-r-- 1 nathan nathan    0 Apr 16 00:41 cockroach.stdout.log
drwxr-xr-x 2 nathan nathan 4.0K Apr 16 00:41 goroutine_dump
drwxr-xr-x 2 nathan nathan 4.0K Apr 16 00:41 heap_profiler
-rw-rw-r-- 1 nathan nathan  37K Apr 16 02:02 roachprod.log

I was going to suggest that the diff that was printed was so large that it forced a log file rotation and triggered a bug through that, but r590 is actually basically empty -- in Nathan's RangeStatus it has 12 system keys and that's it, now it has 16. The diff couldn't be large. And even if so, I don't know of any bugs regarding log rotation, especially not ones that would completely wipe out any prior log files.

cockroach/pkg/util/log/clog.go

Lines 1102 to 1110 in d0f758a

// rotateFile closes the syncBuffer's file and starts a new one.
func (sb *syncBuffer) rotateFile(now time.Time) error {
	if sb.file != nil {
		if err := sb.Flush(); err != nil {
			return err
		}
		if err := sb.file.Close(); err != nil {
			return err
		}

I even checked Nathan's .bash_history to see if he had moved things around, but of course he didn't. And there also aren't any other users' log dirs that the logs could've gone to (now there's one under my user since I restarted the cluster, but that's not it). All other crdb nodes have two log files, corresponding to the two times the cluster was started (around T00_40_58Z and T00_48_20Z). Only n5 has just this one log file, dated to the crash.

@tbg
Member

tbg commented Apr 16, 2019

I ran the following for n=2,5,7 (i.e. the node indexes that have the replica)

roachprod ssh nathan-tpcc-geo:$n -- /home/nathan/cockroach debug range-data /mnt/data1/backup 590 --replicated > 590.$n.repl

and it gets, maybe, more interesting. The ones for n=2,5 are identical. The one for n=7 is very different, which makes some sense since it's only applied up to log position 23, which is presumably where it crashed (22 is the consistency computation at GMT Tuesday, April 16, 2019 1:58:45 AM, 23 an empty entry). (The other two replicas are at 28.) Interestingly, at log 23 (on that node) there's tons of data. On the other nodes, a range deletion at index 27 (reproposed at 28) at time 1555379948806436077 (=GMT Tuesday, April 16, 2019 1:59:08 AM) nuked the range empty (save for some sys keys).

raft.2.txt
raft.5.txt
raft.7.txt

What I've done so far seems to show that when :7 is allowed to catch up to :2 and :5, it'll "look the same", i.e. the range is actually consistent. Whatever the inconsistency was, the range deletions have nuked it.

I don't know exactly what the range deletions were caused by, but my assumption would be that the node crashing failed the import, and that the import tried to clean up by nuking the keyspace empty. This happened within a minute of the crash, so we must expect inconsistencies like that to be wiped out regularly (though usually we'll at least get the diff). This in turn indicates that we'll want to disable this behavior in our nightly tests. Or even better, we need to take a cluster-wide RocksDB checkpoint. I know @andreimatei was looking at this; this seems high priority now.

On the other hand, since the resulting stats are consistent, this indicates that whatever the diff would have been likely didn't include the MVCCStats key.

I'm adding a little TODO list in the top post to summarize the action items/things to understand. Unfortunately, however, it doesn't seem like we're ever going to be able to find out what the reported inconsistency actually was in this case.
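
For illustration, here's a minimal sketch of what a checkpoint-on-inconsistency hook could look like. The Checkpointer interface, noopEngine, and the helper name are hypothetical stand-ins, not the API that was ultimately added; only the auxiliary/checkpoints layout mirrors what shows up later in this thread.

// Illustrative checkpoint-on-inconsistency hook. Checkpointer, noopEngine,
// and maybeCheckpointOnInconsistency are hypothetical names; the idea is
// simply to persist a cheap engine checkpoint before the process fatals.
package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// Checkpointer stands in for whatever the storage engine exposes to create
// a hard-link-based RocksDB checkpoint of the store directory.
type Checkpointer interface {
	CreateCheckpoint(dir string) error
}

// maybeCheckpointOnInconsistency writes a checkpoint under the store's
// auxiliary directory so the on-disk state at the time of the mismatch
// survives the crash and the subsequent import cleanup.
func maybeCheckpointOnInconsistency(eng Checkpointer, auxDir string) error {
	dir := filepath.Join(auxDir, "checkpoints", time.Now().UTC().Format("2006-01-02T15_04_05Z"))
	if err := eng.CreateCheckpoint(dir); err != nil {
		return fmt.Errorf("could not create checkpoint: %v", err)
	}
	return nil
}

// noopEngine is a fake implementation so the sketch runs on its own.
type noopEngine struct{}

func (noopEngine) CreateCheckpoint(dir string) error {
	fmt.Println("would checkpoint into", dir)
	return nil
}

func main() {
	_ = maybeCheckpointOnInconsistency(noopEngine{}, "/mnt/data1/cockroach/auxiliary")
}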

@tbg
Member

tbg commented Apr 16, 2019

@ajkr this one's shaping up to be another puzzler -- it's similar to that private issue regarding the time series in that the stats don't seem to have diverged.

@tbg
Member

tbg commented Apr 16, 2019

Oh no.. I think I might understand why the log file is missing:

cockroach/pkg/util/log/clog.go

Lines 1318 to 1335 in d0f758a

logFilesCombinedMaxSize := atomic.LoadInt64(&LogFilesCombinedMaxSize)
files := selectFiles(allFiles, math.MaxInt64)
if len(files) == 0 {
	return
}
// files is sorted with the newest log files first (which we want
// to keep). Note that we always keep the most recent log file.
sum := files[0].SizeBytes
for _, f := range files[1:] {
	sum += f.SizeBytes
	if sum < logFilesCombinedMaxSize {
		continue
	}
	path := filepath.Join(dir, f.Name)
	if err := os.Remove(path); err != nil {
		fmt.Fprintln(OrigStderr, err)
	}
}

LogFilesCombinedMaxSize defaults to 100 MB. If we wrote a huge diff, we'd rotate to a new file (we do this after 10 MB) and then ask the GC daemon to run. The daemon preserves the latest file, but it will delete the previous file because the latest file alone already bumps us over the 100 MB threshold (assuming it is itself way over 100 MB).

I'd say it's plausible then to assume that we wrote a large diff and then shot ourselves in the foot by GC'ing it right away.
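
To see why a single oversized file is enough to wipe out all older logs, here's a standalone sketch of the GC arithmetic in the snippet quoted above; the file names and sizes are made up for illustration.

// Standalone sketch of the GC arithmetic quoted above: the newest file is
// always kept and its size is counted first, so a single huge diff file
// pushes the running sum over the 100 MB threshold and every older log
// file is deleted. File names and sizes below are made up.
package main

import "fmt"

func main() {
	const logFilesCombinedMaxSize = int64(100 << 20) // 100 MiB default

	// Newest first, mirroring the ordering in the quoted snippet.
	files := []struct {
		name      string
		sizeBytes int64
	}{
		{"cockroach....T01_58_52Z.log (contains the huge diff)", 150 << 20},
		{"cockroach....T00_48_20Z.log", 5 << 20},
		{"cockroach....T00_40_58Z.log", 5 << 20},
	}

	sum := files[0].sizeBytes // the most recent file is always kept
	for _, f := range files[1:] {
		sum += f.sizeBytes
		if sum < logFilesCombinedMaxSize {
			continue
		}
		fmt.Println("would remove:", f.name)
	}
}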

tbg added a commit to tbg/cockroach that referenced this issue Apr 16, 2019
We've repeatedly wanted this to preserve state when finding replica
inconsistencies.

See, for example:

cockroachdb#36861

Release note: None
@andreimatei
Contributor

A few runs, nothing so far. More tomorrow.
I've rebased my roachtest branch and added a geo-distributed import test here https://github.com/andreimatei/cockroach/tree/repro.geo-import-failure

@tbg
Member

tbg commented Apr 17, 2019

Please pick up #36867 as well, or we may get a repro that isn't actionable, like we did before.

tbg added a commit to tbg/cockroach that referenced this issue Apr 17, 2019
We've repeatedly wanted this to preserve state when finding replica
inconsistencies.

See, for example:

cockroachdb#36861

Release note: None
@tbg
Member

tbg commented Apr 17, 2019

@nvanbenschoten or @andreimatei could either of you copy the /mnt/data1/backup directories on the roach nodes to S3? I don't know if we'll need them but there's no good reason for letting this cluster run forever. (I'd do it, but I'm out for the night and didn't think of it earlier)

@nvanbenschoten
Member Author

I already killed the cluster 😬

@tbg
Member

tbg commented Apr 17, 2019

You sure?

$ roachprod ssh nathan-tpcc-geo:7
Warning: Permanently added '34.85.67.36' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 16.04.5 LTS (GNU/Linux 4.15.0-1026-gcp x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud

1 package can be updated.
1 update is a security update.

New release '18.04.2 LTS' available.
Run 'do-release-upgrade' to upgrade to it.


Last login: Wed Apr 17 20:05:59 2019 from 188.97.85.189

@tbg
Member

tbg commented Apr 17, 2019

Oh, I guess it got recreated! Nvm then. Probably wasn't ever going to look at it again anyway.

@nvanbenschoten
Member Author

Luckily I just hit this again on a new version of nathan-tpcc-geo. Like last time, the Raft logs look identical, and like last time, the logs rolled over and deleted the diff. I'm looking to see if there's anything else I can pull out of the cluster.

I wasn't trying to reproduce this, which is why I didn't have any debug info enabled. All future attempts will be run with #36867 and an increased LogFileMaxSize.

@nvanbenschoten
Member Author

The only similarity I see is that, like in the first case, this consistency check failure was hit almost immediately after a split. In this case, we see the following entries:

Term:6 Index:11 : EMPTY
Term:6 Index:12  Put: /Local/Range/Table/59/1/114/43738/QueueLastProcessed/"consistencyChecker"
Term:6 Index:13 compute_checksum:<checksum_id:941c5b0f-7846-4546-8cf2-b1a132cdbf71 >
Term:6 Index:14 compute_checksum:<checksum_id:4cc632b2-4210-4eb0-965c-b92735e47fc1 save_snapshot:true >

Notice that term 6, index 11 is the first raft log term/position combo allowed. That indicates that this range had just split off from its parent range.
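
A rough sketch of that reasoning, assuming the initial raft log position constants are raftInitialLogIndex = 10 and raftInitialLogTerm = 5 (my reading of pkg/storage; treat the exact values as an assumption):

// Sketch of the "fresh split" check above. The constants mirror what I
// believe raftInitialLogIndex / raftInitialLogTerm are in pkg/storage;
// treat the exact values as assumptions.
package main

import "fmt"

const (
	raftInitialLogIndex = 10 // log index a new range starts at (assumed)
	raftInitialLogTerm  = 5  // log term a new range starts at (assumed)
)

type entry struct {
	Term, Index uint64
}

// looksFreshlySplit reports whether the first surviving log entry sits at
// the first position a newly split-off range can append to (the term gets
// bumped by the election that follows the split).
func looksFreshlySplit(first entry) bool {
	return first.Index == raftInitialLogIndex+1 && first.Term == raftInitialLogTerm+1
}

func main() {
	// The log quoted above starts at Term:6 Index:11.
	fmt.Println(looksFreshlySplit(entry{Term: 6, Index: 11})) // true
}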

@andreimatei
Contributor

I got a bunch of repros (6 before I stopped the charade). I was running 8-node clusters across 4 GCP zones, on 4-CPU machines.
I didn't tweak the logging limits, but I believe we have the whole logs for at least 2 of them. We also have RocksDB checkpoints from when the diffs were detected.
Briefly looking at two of them:

  1. cluster andrei-1555535390-11-n8cpu4-geo (GCE_PROJECT=andrei-jepsen):
    Here the diff seems to only contain the stats (part of the RangeAppliedState). Notice that the "expected" counts are negative:
E190417 22:25:14.161556 86032 storage/replica_consistency.go:134  [n4,consistencyChecker,s4,r871/3:/Table/57/1/76{2/9/1…-4/9/61}] replica (n1,s1):2 is inconsistent: expected checksum 6f698c554509b49a1ad50666b095f4bcc1e710d0808c1fe20f66e0e09bf9a4b7c9dc94376d5cdfd7d26a3ef954731b6f4e61d7ea776a5fc6549e03b90a65226d, got 3603fbbddd95bf94b1d6646735ccdb3392a60e388d0455f24f9666b0522f81ea810251cb1d5d866521832bd3a5dabd554e03da4d81da4a17693006fbfd39d446
persisted stats: exp {ContainsEstimates:true LastUpdateNanos:1555539780955426343 IntentAge:0 GCBytesAge:0 LiveBytes:-364 LiveCount:-23 KeyBytes:-285 KeyCount:-23 ValBytes:-79 ValCount:-23 IntentBytes:0 IntentCount:0 SysBytes:565 SysCount:5 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}, got {ContainsEstimates:true LastUpdateNanos:1555539780955426343 IntentAge:0 GCBytesAge:0 LiveBytes:34316122 LiveCount:58552 KeyBytes:1339660 KeyCount:58552 ValBytes:32976462 ValCount:58552 IntentBytes:0 IntentCount:0 SysBytes:565 SysCount:5 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
--- leaseholder
+++ follower
-0.000000000,0 /Local/RangeID/871/r/RangeAppliedState
-    ts:<zero>
-    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x002]\xdbl\xfeY\x03\b8\x10!\x1aR\t'N\xd6m0c\x96\x15 \x94\xfd\xff\xff\xff\xff\xff\xff\xff\x01(\xe9\xff\xff\xff\xff\xff\xff\xff\xff\x010\xe3\xfd\xff\xff\xff\xff\xff\xff\xff\x018\xe9\xff\xff\xff\xff\xff\xff\xff\xff\x01@\xb1\xff\xff\xff\xff\xff\xff\xff\xff\x01H\xe9\xff\xff\xff\xff\xff\xff\xff\xff\x01`\xb5\x04h\x05p\x01"
-    raw mvcc_key/value: 0169f70367727261736b00 120408001000180020002800325ddb6cfe5903083810211a5209274ed66d306396152094fdffffffffffffff0128e9ffffffffffffffff0130e3fdffffffffffffff0138e9ffffffffffffffff0140b1ffffffffffffffff0148e9ffffffffffffffff0160b50468057001
+0.000000000,0 /Local/RangeID/871/r/RangeAppliedState
+    ts:<zero>
+    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x0025\x96\x1d\x85\xb5\x03\b8\x10!\x1a*\t'N\xd6m0c\x96\x15 ھ\xae\x10(\xb8\xc9\x030\x8c\xe2Q8\xb8\xc9\x03@\xce\xdc\xdc\x0fH\xb8\xc9\x03`\xb5\x04h\x05p\x01"
+    raw mvcc_key/value: 0169f70367727261736b00 1204080010001800200028003235961d85b503083810211a2a09274ed66d3063961520dabeae1028b8c903308ce25138b8c90340cedcdc0f48b8c90360b50468057001
  2. cluster andrei-1555535390-13-n8cpu4-geo (GCE_PROJECT=andrei-jepsen):
    Here we have a follower that seems to be completely empty. The leaseholder has a ton of data, present in the diff.
E190417 22:27:35.539789 77285 storage/replica_consistency.go:134  [n6,consistencyChecker,s6,r552/2:/Table/54/1/65{1/6/3…-3/5/1…}] replica (n1,s1):4 is inconsistent: expected checksum 5fe2eaef32849d0569eb74db44cd78a7d48ee7c85128bef37507f47a448b1facc67eb11e03b8d822bf9738732a10fdfd288e200219aca7534cdce5896ba87c83, got 39e9bc6e80dd52f7ce5350dfe2f2ce6bc9a79fd54bf956d5bdf376686a54cc57d33bfdbf42473b3f7d6f8865a7437eae30690d83e907e4427a40f5676b267574
persisted stats: exp {ContainsEstimates:true LastUpdateNanos:1555539836245201052 IntentAge:0 GCBytesAge:0 LiveBytes:34315972 LiveCount:58548 KeyBytes:1339688 KeyCount:58548 ValBytes:32976284 ValCount:58548 IntentBytes:0 IntentCount:0 SysBytes:878 SysCount:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}, got {ContainsEstimates:false LastUpdateNanos:1555539836245201052 IntentAge:0 GCBytesAge:0 LiveBytes:0 LiveCount:0 KeyBytes:0 KeyCount:0 ValBytes:0 ValCount:0 IntentBytes:0 IntentCount:0 SysBytes:878 SysCount:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
--- leaseholder
+++ follower
-0.000000000,0 /Local/RangeID/552/r/RangeAppliedState
-    ts:<zero>
-    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x0025\xffI4{\x03\b7\x10\x1a\x1a*\t\x9cP]M=c\x96\x15 Ľ\xae\x10(\xb4\xc9\x030\xa8\xe2Q8\xb4\xc9\x03@\x9c\xdb\xdc\x0fH\xb4\xc9\x03`\xee\x06h\ap\x01"
-    raw mvcc_key/value: 0169f70228727261736b00 1204080010001800200028003235ff49347b030837101a1a2a099c505d4d3d63961520c4bdae1028b4c90330a8e25138b4c903409cdbdc0f48b4c90360ee0668077001
+0.000000000,0 /Local/RangeID/552/r/RangeAppliedState
+    ts:<zero>
+    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x002\x19m\xb2\xfa\xee\x03\b7\x10\x1a\x1a\x0e\t\x9cP]M=c\x96\x15`\xee\x06h\a"
+    raw mvcc_key/value: 0169f70228727261736b00 12040800100018002000280032196db2faee030837101a1a0e099c505d4d3d63961560ee066807
-1555535595.048202377,0 /Table/54/1/651/6/378/0
-    ts:2019-04-17 21:13:15.048202377 +0000 UTC
-    value:"մ\xcd\x1a\nF\x10Y7pwAnR0TQoU5nH8\x16\x02OE\x16\rPRICALLYCALLY\x16\x11K4CqqLGzrRCg74JuA\x16\x11coaW1dT9RiedgnRGN\x16\x11XSpIXhOAuYmfJebIz\x16\x02EB\x16\t835211111\x16\x100013839382606905\x18ʃʻ\b\x00\x16\x02GC\x15\x054\x8dLK@\x15\x03)\tw\x15\x04\x1a\x8a\x03\xe8\x15\x044\x8a\x03\xe8\x13\x02\x13\x00\x16\x83\aKo3yn9fyLwx0AL4bEuemxBaljZZtx8sWMIEe2HXikqs9Tf9CH6zL7q3FIrk6O9fnif3TtXL8OjaLiEWMAtzFcoAGRqc0GiBIEa1WrbG03f0GiODMNmX9Do7HP3hH0gcB3Mn6QL0gbFk9AFQDXTA2oourvZJ8wzF5ymvwroLaaXWRIo6so4bZuKPBNenKqJKabQkEQQHm9vR8LzpkZjkgEdsthSTfUHkrQnXGxEPK7Xg7w4fhqzgs0bV7hTo5XHdhpdIxMxt9aCNkMkhAqv0M3k7GLIhcpCT6wI2PRuHeCViWnBFQdSOF4hNFnTQL98qtTQBXd28IIpZPbYyuPPzkrdvTBkLx1otV0XtuxW4Eqg4yFBj8dJIQALmjb9sDVA7bD2TDqDr"
...

More looking tomorrow.
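
The "--- leaseholder / +++ follower" output above is essentially a diff of the two replicas' key/value snapshots. Here's a generic sketch of that comparison over two sorted dumps; it's a plain merge walk for illustration, not the actual consistency checker code.

// Minimal sketch of how a "--- leaseholder / +++ follower" style diff can
// be produced from two sorted key/value dumps. This is a generic merge
// walk for illustration, not CockroachDB's actual checksum/diff code.
package main

import "fmt"

type kv struct {
	Key, Value string
}

// diffReplicas prints entries present on only one side or differing.
func diffReplicas(lease, follower []kv) {
	i, j := 0, 0
	for i < len(lease) || j < len(follower) {
		switch {
		case j == len(follower) || (i < len(lease) && lease[i].Key < follower[j].Key):
			fmt.Printf("-%s %s\n", lease[i].Key, lease[i].Value) // leaseholder only
			i++
		case i == len(lease) || follower[j].Key < lease[i].Key:
			fmt.Printf("+%s %s\n", follower[j].Key, follower[j].Value) // follower only
			j++
		default:
			if lease[i].Value != follower[j].Value {
				fmt.Printf("-%s %s\n+%s %s\n", lease[i].Key, lease[i].Value, follower[j].Key, follower[j].Value)
			}
			i++
			j++
		}
	}
}

func main() {
	lease := []kv{{"/Local/RangeID/552/r/RangeAppliedState", "stats+data"}, {"/Table/54/1/651/6/378/0", "row"}}
	follower := []kv{{"/Local/RangeID/552/r/RangeAppliedState", "stats only"}}
	diffReplicas(lease, follower)
}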

@tbg
Member

tbg commented Apr 18, 2019

andrei-1555535390-11-n8cpu4-geo

(n1,s1):2:

raft_applied_index:56 lease_applied_index:33 range_stats:<contains_estimates:true last_update_nanos:1555539780955426343 live_bytes:34316122 live_count:58552 key_bytes:1339660 key_count:58552 val_bytes:32976462 val_count:58552 sys_bytes:565 sys_count:5 >

(n4,s4):3:

raft_applied_index:56 lease_applied_index:33 range_stats:<contains_estimates:true last_update_nanos:1555539780955426343 live_bytes:-364 live_count:-23 key_bytes:-285 key_count:-23 val_bytes:-79 val_count:-23 sys_bytes:565 sys_count:5 >

ContainsEstimates=true is due to #36525, i.e. it doesn't tell us anything except that SSTs were ingested by this range at some point. Looking at the follower's stats this makes sense anyway, so it doesn't tell us much.

The checkpoints worked like a charm. On n4 (the leaseholder with the weird stats) the Raft log reveals... nothing. It's recently been truncated, so all there is is the queue activity.

So let's look at what's actually in r871 on n4:

$ ./cockroach debug check-store /mnt/data1/cockroach/auxiliary/checkpoints/2019-04-17T22\:25\:10Z/
...
(ignored) r871:/Table/57/1/76{2/9/1509-4/9/61} [(n6,s6):4, (n1,s1):2, (n4,s4):3, next=5, gen=110]: diff(actual, claimed): LiveBytes: 34316122 != -364
LiveCount: 58552 != -23
KeyBytes: 1339660 != -285
KeyCount: 58552 != -23
ValBytes: 32976462 != -79
ValCount: 58552 != -23

See something? The data actually in the replica has stats matching those of the follower. The leaseholder persistently has the stats messed up, so this looks like a true divergence.

Here are the logs for r871 (before the inconsistency):

https://gist.github.com/tbg/779bd8813ea8dec2d45579ca9d3e8c2d

There's nothing related to n1 and n4, but a little while before the crash we remove the replica on n5 which results in this message:

removed -58570 (-58575+5) keys in 1ms [clear=0ms commit=1ms]

This suggests that n5 was messed up similarly to n1.
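
Conceptually, the check-store output above comes from recomputing stats from the data and reporting every field that disagrees with the persisted values. A simplified sketch of that comparison (the stats struct is a trimmed stand-in for the real MVCCStats, and the recomputation is deliberately naive):

// Simplified sketch of the "diff(actual, claimed)" comparison above: walk
// the range's user data, recompute a handful of stats, and report every
// field that disagrees with the persisted values. The stats struct is a
// trimmed stand-in for the real MVCCStats, not the actual type.
package main

import "fmt"

type mvccStats struct {
	LiveBytes, LiveCount, KeyBytes, KeyCount, ValBytes, ValCount int64
}

type kv struct {
	Key, Value string
}

// recompute derives stats from the data actually present in the replica.
func recompute(data []kv) mvccStats {
	var s mvccStats
	for _, e := range data {
		s.LiveBytes += int64(len(e.Key) + len(e.Value))
		s.LiveCount++
		s.KeyBytes += int64(len(e.Key))
		s.KeyCount++
		s.ValBytes += int64(len(e.Value))
		s.ValCount++
	}
	return s
}

// diffStats prints the fields where the recomputed ("actual") stats do not
// match the persisted ("claimed") stats, in the style of check-store.
func diffStats(actual, claimed mvccStats) {
	report := func(name string, a, c int64) {
		if a != c {
			fmt.Printf("%s: %d != %d\n", name, a, c)
		}
	}
	report("LiveBytes", actual.LiveBytes, claimed.LiveBytes)
	report("LiveCount", actual.LiveCount, claimed.LiveCount)
	report("KeyBytes", actual.KeyBytes, claimed.KeyBytes)
	report("KeyCount", actual.KeyCount, claimed.KeyCount)
	report("ValBytes", actual.ValBytes, claimed.ValBytes)
	report("ValCount", actual.ValCount, claimed.ValCount)
}

func main() {
	data := []kv{{"/Table/57/1/762/9/1509/0", "row payload"}}
	claimed := mvccStats{LiveBytes: -364, LiveCount: -23, KeyBytes: -285, KeyCount: -23, ValBytes: -79, ValCount: -23}
	diffStats(recompute(data), claimed)
}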

@tbg
Member

tbg commented Apr 18, 2019

andrei-1555535390-13-n8cpu4-geo

On n6 (leaseholder) the stats agree with recomputation, so there's tons of data. The Raft log is basically only log truncations and the consistency checks again. On n1 (follower), the range stats key is

0.000000000,0 /Local/RangeID/552/r/RangeAppliedState (0x0169f70228727261736b00): raft_applied_index:55 lease_applied_index:26 range_stats:<last_update_nanos:1555539836245201052 sys_bytes:878 sys_count:7 >

i.e. very very empty. And guess what? It also agrees with recomputation (here's the dump of the data in it).

@tbg
Member

tbg commented Apr 18, 2019

This looks like there are two different failure modes already.

In the first case, the leaseholder (and probably another follower, since removed) had the same data, but diverging stats. Tentatively speaking, it seems as if the leaseholder had at least forgotten to add the stats for the data it ingested at some point, though the fact that its stats are negative indicates that more might be going on.

In the second case, both replicas are internally consistent (i.e. stats match data) but one replica is missing tons of stuff. This looks more classically like what you'd expect if a replica "skipped" a raft command erroneously, similar to #35424.

It's frustrating not to have the Raft logs; I think we'd really need them to tell us what's up. @andreimatei can you disable the raft log queue and see if you can get a repro that way?

@tbg
Member

tbg commented Apr 18, 2019

BTW, in Nathan's original issue, it seems like yet another failure mode: the stats were the same, but one node was missing data (and this was then swept under the rug by the range deletions).

@tbg
Member

tbg commented Apr 18, 2019

For another addendum: there's no clear argument that the inconsistencies immediately followed splits.

In Andrei's first repro, the split happens a minute before the inconsistency is detected. Shortly after the split we see the (then) leaseholder n5 send a snapshot to n6. That snapshot is basically empty. n5 is later removed, which results in the inconsistent log line claiming to have removed a large negative number of keys. This number is just the (in-mem) stats of the range, so it got really messed up. This points at a split, as nothing else would've generated a negative delta.

@andreimatei let's try flipping this assertion on

const expensiveSplitAssertion = false

In the second repro, I can't find anything suspicious-looking; there are empty-looking snapshots right after the split so if those are to be trusted, the range starts out empty. It's worth pointing out that since ContainsEstimates=true in all the repros I've seen so far, SSTs have been ingested by all of the involved ranges directly (the RHS of a split starts out with no estimates).

Need more repros! This problem is way worse than I anticipated, but since it repros so easily I doubt that we'll have trouble tracking it down given enough time. BTW @andreimatei can you always share the SHAs that you're running? Right now I don't even know if it's master or release-19.1.
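
For context, my understanding of what flipping expensiveSplitAssertion on buys us: the split trigger recomputes the LHS stats from scratch and fatals if they disagree with the delta-derived stats. A stand-alone sketch of that shape, with hypothetical helper names rather than the real batcheval code:

// Sketch of what the expensive split assertion checks, as I understand it:
// after the split trigger computes the LHS stats by delta, recompute them
// from scratch by scanning the data and die loudly on any mismatch.
// recomputeStats, maybeAssertSplitStats, and the stats type are stand-ins.
package main

import (
	"fmt"
	"os"
	"reflect"
)

const expensiveSplitAssertion = true

type mvccStats struct {
	LiveBytes, LiveCount, KeyBytes, KeyCount int64
}

// recomputeStats stands in for an MVCC scan over the left-hand side's span.
func recomputeStats(lhsData []string) mvccStats {
	var s mvccStats
	for _, kv := range lhsData {
		s.LiveBytes += int64(len(kv))
		s.LiveCount++
		s.KeyBytes += int64(len(kv))
		s.KeyCount++
	}
	return s
}

func maybeAssertSplitStats(deltaDerived mvccStats, lhsData []string) {
	if !expensiveSplitAssertion {
		return
	}
	if recomputed := recomputeStats(lhsData); !reflect.DeepEqual(deltaDerived, recomputed) {
		fmt.Fprintf(os.Stderr, "LHS split stats diverged: delta-derived %+v != recomputed %+v\n",
			deltaDerived, recomputed)
		os.Exit(1)
	}
}

func main() {
	lhs := []string{"a", "bb"}
	maybeAssertSplitStats(mvccStats{LiveBytes: 3, LiveCount: 2, KeyBytes: 3, KeyCount: 2}, lhs)
	fmt.Println("split stats assertion passed")
}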

@tbg
Member

tbg commented Apr 18, 2019

I found Andrei's branch, so I'm gonna start a repro cycle of my own.

@tbg
Member

tbg commented Apr 18, 2019

Up and running, my branches are

crdb binary: https://github.com/tbg/cockroach/tree/repro/import-geo-36861
roachtest: https://github.com/tbg/cockroach/tree/repro.geo-import-failure (I made the test use a monitor and stop nodes when it fails)

and the command

./bin/roachtestandrei run import-geo --debug --cpu-quota=1000 --roachprod=bin/roachprod --count=100 --parallelism=10 --port=8080

@nvanbenschoten
Member Author

I'm running another iteration with the following diff. I expect it to have no issues.

No consistency check failure in sight.

Specifically, before #35261 Replica.mu.proposals[cmdID].RaftCommand was immutable after the command was proposed to Raft. Anyone could take a command ID and find its corresponding RaftCommand exactly as it looked when proposed.

Heh, @tbg actually picked up on the need for this invariant back when we introduced this inlining code: 9170399#diff-0f4e61b240de001f1b97a87fc31af043R341.

I'd rather lose the "happy case" that at best applies to only one of the replicas.

Agreed. The "happy case" doesn't seem like it's buying us all that much since it only applies on the leaseholder.
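
To make the invariant violation concrete, here's a toy illustration of the aliasing hazard described above; the types and fields are stand-ins, not the real ProposalData or RaftCommand.

// Toy illustration of the aliasing hazard: if the in-memory copy of a
// proposed command is mutated on reproposal, any code that later treats
// that copy as "what was appended to the log" (as the sideload inlining
// fast path did on the leaseholder) can disagree with followers, which
// only ever see the bytes in the log. Types and fields are stand-ins.
package main

import "fmt"

type raftCommand struct {
	MaxLeaseIndex int
	Payload       string
}

func main() {
	// Command as originally proposed and appended to the raft log.
	cmd := &raftCommand{MaxLeaseIndex: 7, Payload: "AddSSTable(...)"}
	proposals := map[string]*raftCommand{"cmd-id-1": cmd}
	loggedBytes := fmt.Sprintf("%+v", *cmd) // what followers will apply

	// Lease index error: the immediate reproposal mutates the *same* struct.
	proposals["cmd-id-1"].MaxLeaseIndex = 9

	// Leaseholder fast path: reuse the in-memory command instead of
	// re-decoding the log entry. It no longer matches the logged bytes.
	inMemoryBytes := fmt.Sprintf("%+v", *proposals["cmd-id-1"])
	fmt.Println("followers apply: ", loggedBytes)
	fmt.Println("leaseholder uses:", inMemoryBytes)
	fmt.Println("diverged:", loggedBytes != inMemoryBytes) // true
}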

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 18, 2019
…ading SSTs

Fixes cockroachdb#36861.

This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals`
were immutable over the lifetime of a Raft proposal. This invariant was violated
by cockroachdb#35261, which allowed a lease index error to trigger an immediate reproposal.
This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`.
Combined with aliasing between multiple Raft proposals due to reproposals due to
ticks, this resulted in cases where a leaseholder's Raft logs could diverge from
its followers and cause Raft groups to become inconsistent.

Release note: None
@tbg
Member

tbg commented Apr 18, 2019

I'd also rather fix the bug.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 19, 2019
…ading SSTs

Fixes cockroachdb#36861.

This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals`
were immutable over the lifetime of a Raft proposal. This invariant was violated
by cockroachdb#35261, which allowed a lease index error to trigger an immediate reproposal.
This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`.
Combined with aliasing between multiple Raft proposals due to reproposals due to
ticks, this resulted in cases where a leaseholder's Raft logs could diverge from
its followers and cause Raft groups to become inconsistent.

Release note: None
craig bot pushed a commit that referenced this issue Apr 19, 2019
36939: storage: remove optimization to use in-memory RaftCommand when sideloading SSTs r=nvanbenschoten a=nvanbenschoten

Fixes #36861.

This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals` were immutable over the lifetime of a Raft proposal. This invariant was violated by #35261, which allowed a lease index error to trigger an immediate reproposal. This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`. Combined with aliasing between multiple Raft proposals due to reproposals from ticks, this resulted in cases where a leaseholder's Raft logs could diverge from its followers and cause Raft groups to become inconsistent.

We can justify the new cost of unmarshalling `RaftCommands` in `maybeInlineSideloadedRaftCommand` on leaseholders because https://github.com/cockroachdb/cockroach/pull/36670/files#diff-8a33b5571aa9ab154d4f3c2a16724697R230 just landed. That change removes an equally sized allocation and memory copy from the function on both leaseholders and followers.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in #36939 Apr 19, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 19, 2019
…ading SSTs

Fixes cockroachdb#36861.

This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals`
were immutable over the lifetime of a Raft proposal. This invariant was violated
by cockroachdb#35261, which allowed a lease index error to trigger an immediate reproposal.
This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`.
Combined with aliasing between multiple Raft proposals due to reproposals due to
ticks, this resulted in cases where a leaseholder's Raft logs could diverge from
its followers and cause Raft groups to become inconsistent.

Release note: None
@tbg
Member

tbg commented Apr 19, 2019

Reopening until we've added the geo import test to our suite. What cadence should we be running that? It seems to tickle new behavior. From what I saw during the repros, the geo distribution causes costly reproposals. I assume we'll get value out of observing this test.

@tbg tbg reopened this Apr 19, 2019
tbg added a commit to tbg/cockroach that referenced this issue Apr 19, 2019
This should avoid losing the diff printed to the logs, as observed in:
cockroachdb#36861

We don't necessarily care about the large diffs (though they can't hurt
to have around) but about whatever else is in the deleted logs. Without
logs, debugging is typically much more difficult.

Release note: None
craig bot pushed a commit that referenced this issue Apr 19, 2019
36580: roachprod: rationalize refreshing of DNS entries r=andreimatei a=andreimatei

Before this patch, the DNS entries were refreshed even if we hadn't
properly listed the AWS clusters, which would cause their DNS entries to
be removed inadvertently.
This patch lifts the decision about whether to refresh or not out of the
GCE code and into the top level.

Release note: None

36950: storage: disable log file size limit before consistency fatal r=andreimatei a=tbg

This should avoid losing the diff printed to the logs, as observed in:
#36861

We don't necessarily care about the large diffs (though they can't hurt
to have around) but about whatever else is in the deleted logs. Without
logs, debugging is typically much more difficult.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 22, 2019
Closes cockroachdb#36861.

This test runs a geo-distributed import in the same configuration
as the one we saw cause issues with cockroachdb#36861.

I'm testing now to see if this will actually trigger the consistency
failure without the fix from cockroachdb#36939.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Apr 23, 2019
We've repeatedly wanted this to preserve state when finding replica
inconsistencies.

See, for example:

cockroachdb#36861

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Apr 23, 2019
This should avoid losing the diff printed to the logs, as observed in:
cockroachdb#36861

We don't necessarily care about the large diffs (though they can't hurt
to have around) but about whatever else is in the deleted logs. Without
logs, debugging is typically much more difficult.

Release note: None
craig bot pushed a commit that referenced this issue Apr 29, 2019
36997: roachtest: create new import/tpcc/warehouses=4000/geo test r=nvanbenschoten a=nvanbenschoten

Closes #36861.

This test runs a geo-distributed import in the same configuration
as the one we saw cause issues with #36861.

I'm testing now to see if this will actually trigger the consistency
failure without the fix from #36939.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in #36997 Apr 29, 2019
ajwerner pushed a commit to ajwerner/cockroach that referenced this issue May 15, 2019
Closes cockroachdb#36861.

This test runs a geo-distributed import in the same configuration
as the one we saw cause issues with cockroachdb#36861.

I'm testing now to see if this will actually trigger the consistency
failure without the fix from cockroachdb#36939.

Release note: None
irfansharif pushed a commit to irfansharif/cockroach that referenced this issue Oct 30, 2019
We've repeatedly wanted this to preserve state when finding replica
inconsistencies.

See, for example:

cockroachdb#36861

Release note: None