storage: consistency check failure during import #36861

Closed
3 tasks done
nvanbenschoten opened this issue Apr 16, 2019 · 46 comments · Fixed by #36939 or #36997
Assignees
Labels
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-1: High impact: many users impacted, serious risk of high unavailability or data loss

Comments

@nvanbenschoten
Member

nvanbenschoten commented Apr 16, 2019

To do / understand


This looks very similar to #35424, so it's possible that that issue wasn't fully resolved. I was most of the way through a TPC-C 4k import when a node died due to a consistency check failure.

F190416 01:58:51.634989 172922 storage/replica_consistency.go:220  [n5,consistencyChecker,s5,r590/1:/Table/68/1/{29/4/2…-31/4/1…}] consistency check failed with 1 inconsistent replicas
goroutine 172922 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000056301, 0xc000056300, 0x5449800, 0x1e)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1020 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5bdd700, 0xc000000004, 0x5449860, 0x1e, 0xdc, 0xc008bfa5a0, 0x79)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:878 +0x93d
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3aa1620, 0xc0071899e0, 0x4, 0x2, 0x33b2862, 0x36, 0xc01f06cce0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d8
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3aa1620, 0xc0071899e0, 0x1, 0xc000000004, 0x33b2862, 0x36, 0xc01f06cce0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x3aa1620, 0xc0071899e0, 0x33b2862, 0x36, 0xc01f06cce0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).CheckConsistency(0xc023eeb400, 0x3aa1620, 0xc0071899e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go:220 +0x6ce
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).CheckConsistency(0xc023eeb400, 0x3aa1620, 0xc0071899e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go:229 +0x81b
github.com/cockroachdb/cockroach/pkg/storage.(*consistencyQueue).process(0xc0003de2a0, 0x3aa1620, 0xc0071899e0, 0xc023eeb400, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/consistency_queue.go:125 +0x210

Cockroach SHA: 3ebed10

Notes:

Cluster: nathan-tpcc-geo (stopped, extended for 48h)
Cockroach nodes: 1,2,4,5,7,8,10,11
Inconsistent range: r590
Replicas: nathan-tpcc-geo:2/n2/r3, nathan-tpcc-geo:5/n4/r4, and nathan-tpcc-geo:7/n5/r1
Inconsistent replica: nathan-tpcc-geo:7/n5/r1
Replicas in zones: europe-west2-b, europe-west4-b, and asia-northeast1-b respectively

Initial Investigation

Unlike in the later reproductions of #35424, replica 1's Raft log is an exact prefix of replica 3 and 4's, so this doesn't look like the same issue we saw later in that issue.

I haven't looked at much else yet.

r590 Range _ Debug _ Cockroach Console.pdf

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1 High impact: many users impacted, serious risk of high unavailability or data loss labels Apr 16, 2019
@tbg
Member

tbg commented Apr 16, 2019

We don't have the diff because I messed something up during recent refactors. Running

roachprod ssh nathan-tpcc-geo -- sudo cp -R /mnt/data1/cockroach /mnt/data1/backup

now and then I'll bring the cluster up to recover the diff.

@tbg
Member

tbg commented Apr 16, 2019

Currently the range seems stats-consistent and also passed enqueuing through the full consistency checker. The replicas are still n2,n4,n5 (replicaIDs 1 3 4), so nothing has shifted. I will dig some more but this could mean that we were looking at a false positive of sorts, or a temporary inconsistency that resolved itself.

I'm also not so sure that I messed something up regarding the consistency checker diff. What's really weird is that, typically, the queue first computes a checksum without the diff. Then, when that has finished and found mismatching SHAs, it runs another round, this time with a diff, prints that, and fatals. In the logs, we're supposed to see the log message from line 226.

// No diff was printed, so we want to re-run with diff.
// Note that this will call Fatal recursively in `CheckConsistency` (in the code above).
log.Errorf(ctx, "consistency check failed with %d inconsistent replicas; fetching details",
	inconsistencyCount)
args.WithDiff = true
if _, pErr := r.CheckConsistency(ctx, args); pErr != nil {
	logFunc(ctx, "replica inconsistency detected; could not obtain actual diff: %s", pErr)
}

I don't see that in the (combined) logs from the cluster. But the stack trace shows that we are in the recursive call on line 229.

A clue is that the fatal error is in its own, new, log file. There aren't any past log files. It looks like the logging messed something up; we effectively know nothing about n5 (nathan-tpcc-geo:7) before the fatal; the fatal message is the very first thing we have.

tschottdorf@nathan-tpcc-geo-0007:/home/nathan/logs$ ls -lah
total 1.1M
drwxrwxr-x 4 nathan nathan 4.0K Apr 16 05:29 .
drwxr-xr-x 5 nathan nathan 4.0K Apr 16 05:29 ..
lrwxrwxrwx 1 nathan nathan   69 Apr 16 01:58 cockroach.log -> cockroach.nathan-tpcc-geo-0007.nathan.2019-04-16T01_58_52Z.003778.log
-rw-rw-r-- 1 nathan nathan 495K Apr 16 01:58 cockroach.nathan-tpcc-geo-0007.nathan.2019-04-16T01_58_52Z.003778.log
-rw-rw-r-- 1 nathan nathan 495K Apr 16 01:58 cockroach.stderr.log
-rw-rw-r-- 1 nathan nathan    0 Apr 16 00:41 cockroach.stdout.log
drwxr-xr-x 2 nathan nathan 4.0K Apr 16 00:41 goroutine_dump
drwxr-xr-x 2 nathan nathan 4.0K Apr 16 00:41 heap_profiler
-rw-rw-r-- 1 nathan nathan  37K Apr 16 02:02 roachprod.log

I was going to suggest that the diff that was printed was so large that it forced a log file rotation and triggered a bug through that, but r590 is actually basically empty -- in Nathan's RangeStatus it has 12 system keys and that's it, now it has 16. The diff couldn't be large. And even if so, I don't know of any bugs regarding log rotation, especially not ones that would completely wipe out any prior log files.

cockroach/pkg/util/log/clog.go

Lines 1102 to 1110 in d0f758a

// rotateFile closes the syncBuffer's file and starts a new one.
func (sb *syncBuffer) rotateFile(now time.Time) error {
	if sb.file != nil {
		if err := sb.Flush(); err != nil {
			return err
		}
		if err := sb.file.Close(); err != nil {
			return err
		}

I even checked Nathan's .bash_history to see if he had moved things around, but of course he didn't. And there also aren't any other users' log dirs that the logs could've gone to (now there's one under my user since I restarted the cluster, but that's not it). All other crdb nodes have two log files, corresponding to the two times the cluster was started (around T00_40_58Z and T00_48_20Z). Only n5 has just this one log file, dated to the crash.

@tbg
Member

tbg commented Apr 16, 2019

I ran the following for n=2,5,7 (i.e. the node indexes that have the replica)

roachprod ssh nathan-tpcc-geo:$n -- /home/nathan/cockroach debug range-data /mnt/data1/backup 590 --replicated > 590.$n.repl

and it gets, maybe, more interesting. The ones for n=2,5 are identical. The one for n=7 is very different, which makes some sense since it's only applied up to log position 23, which is presumably where it crashed (22 is the consistency computation at GMT Tuesday, April 16, 2019 1:58:45 AM, 23 an empty entry). (The other two replicas are at 28.) Interestingly, at log 23 (on that node) there's tons of data. On the other nodes, a range deletion at index 27 (reproposed at 28) at time 1555379948806436077 (=GMT Tuesday, April 16, 2019 1:59:08 AM) nuked the range empty (save for some sys keys).

raft.2.txt
raft.5.txt
raft.7.txt

What I've done so far seems to show that when :7 is allowed to catch up to :2 and :5, it'll "look the same", i.e. the range is actually consistent. Whatever the inconsistency was, the range deletions have nuked it.

I don't know exactly what the range deletions were caused by, but my assumption would be that the node crashing failed the import, and that the import tried to clean up by nuking the keyspace empty. This happened within a minute of the crash, so we must expect inconsistencies like that to be wiped out regularly (though usually we'll at least get the diff). This in turn indicates that we'll want to disable this behavior in our nightly tests. Or even better, we need to take a cluster-wide RocksDB checkpoint. I know @andreimatei was looking at this; this seems high priority now.

On the other hand, since the resulting stats are consistent, this indicates that whatever the diff would have been likely didn't include the MVCCStats key.

I'm adding a little TODO list in the top post to summarize the action items/things to understand. Unfortunately, however, it doesn't seem like we're ever going to be able to find out what the reported inconsistency actually was in this case.
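
For illustration, here's a minimal sketch of what a checkpoint-on-inconsistency hook could look like. The Checkpointer interface, noopEngine, and the helper name are hypothetical stand-ins, not the API that was ultimately added; only the auxiliary/checkpoints layout mirrors what shows up later in this thread.

// Illustrative checkpoint-on-inconsistency hook. Checkpointer, noopEngine,
// and maybeCheckpointOnInconsistency are hypothetical names; the idea is
// simply to persist a cheap engine checkpoint before the process fatals.
package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// Checkpointer stands in for whatever the storage engine exposes to create
// a hard-link-based RocksDB checkpoint of the store directory.
type Checkpointer interface {
	CreateCheckpoint(dir string) error
}

// maybeCheckpointOnInconsistency writes a checkpoint under the store's
// auxiliary directory so the on-disk state at the time of the mismatch
// survives the crash and the subsequent import cleanup.
func maybeCheckpointOnInconsistency(eng Checkpointer, auxDir string) error {
	dir := filepath.Join(auxDir, "checkpoints", time.Now().UTC().Format("2006-01-02T15_04_05Z"))
	if err := eng.CreateCheckpoint(dir); err != nil {
		return fmt.Errorf("could not create checkpoint: %v", err)
	}
	return nil
}

// noopEngine is a fake implementation so the sketch runs on its own.
type noopEngine struct{}

func (noopEngine) CreateCheckpoint(dir string) error {
	fmt.Println("would checkpoint into", dir)
	return nil
}

func main() {
	_ = maybeCheckpointOnInconsistency(noopEngine{}, "/mnt/data1/cockroach/auxiliary")
}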

@tbg
Member

tbg commented Apr 16, 2019

@ajkr this one's shaping up to be another puzzler -- it's similar to that private issue regarding the time series in that the stats don't seem to have diverged.

@tbg
Member

tbg commented Apr 16, 2019

Oh no.. I think I might understand why the log file is missing:

cockroach/pkg/util/log/clog.go

Lines 1318 to 1335 in d0f758a

logFilesCombinedMaxSize := atomic.LoadInt64(&LogFilesCombinedMaxSize)
files := selectFiles(allFiles, math.MaxInt64)
if len(files) == 0 {
	return
}
// files is sorted with the newest log files first (which we want
// to keep). Note that we always keep the most recent log file.
sum := files[0].SizeBytes
for _, f := range files[1:] {
	sum += f.SizeBytes
	if sum < logFilesCombinedMaxSize {
		continue
	}
	path := filepath.Join(dir, f.Name)
	if err := os.Remove(path); err != nil {
		fmt.Fprintln(OrigStderr, err)
	}
}

LogFilesCombinedMaxSize defaults to 100 MB. If we wrote a huge diff, we'd rotate to a new file (we do this after 10 MB) and then ask the GC daemon to run. The daemon preserves the latest file, but it will delete the previous file because the latest file alone already bumps us over the 100 MB threshold (assuming it is itself way over 100 MB).

I'd say it's plausible then to assume that we wrote a large diff and then shot ourselves in the foot by GC'ing it right away.
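
To see why a single oversized file is enough to wipe out all older logs, here's a standalone sketch of the GC arithmetic in the snippet quoted above; the file names and sizes are made up for illustration.

// Standalone sketch of the GC arithmetic quoted above: the newest file is
// always kept and its size is counted first, so a single huge diff file
// pushes the running sum over the 100 MB threshold and every older log
// file is deleted. File names and sizes below are made up.
package main

import "fmt"

func main() {
	const logFilesCombinedMaxSize = int64(100 << 20) // 100 MiB default

	// Newest first, mirroring the ordering in the quoted snippet.
	files := []struct {
		name      string
		sizeBytes int64
	}{
		{"cockroach....T01_58_52Z.log (contains the huge diff)", 150 << 20},
		{"cockroach....T00_48_20Z.log", 5 << 20},
		{"cockroach....T00_40_58Z.log", 5 << 20},
	}

	sum := files[0].sizeBytes // the most recent file is always kept
	for _, f := range files[1:] {
		sum += f.sizeBytes
		if sum < logFilesCombinedMaxSize {
			continue
		}
		fmt.Println("would remove:", f.name)
	}
}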

tbg added a commit to tbg/cockroach that referenced this issue Apr 16, 2019
We've repeatedly wanted this to preserve state when finding replica
inconsistencies.

See, for example:

cockroachdb#36861

Release note: None
@andreimatei
Contributor

A few runs, nothing so far. More tomorrow.
I've rebased my roachtest branch and added a geo-distributed import test here https://github.com/andreimatei/cockroach/tree/repro.geo-import-failure

@tbg
Member

tbg commented Apr 17, 2019

Please pick up #36867 as well, or we may get a repro that isn't actionable, like we did before.

tbg added a commit to tbg/cockroach that referenced this issue Apr 17, 2019
We've repeatedly wanted this to preserve state when finding replica
inconsistencies.

See, for example:

cockroachdb#36861

Release note: None
@tbg
Member

tbg commented Apr 17, 2019

@nvanbenschoten or @andreimatei could either of you copy the /mnt/data1/backup directories on the roach nodes to S3? I don't know if we'll need them but there's no good reason for letting this cluster run forever. (I'd do it, but I'm out for the night and didn't think of it earlier)

@nvanbenschoten
Member Author

I already killed the cluster 😬

@tbg
Member

tbg commented Apr 17, 2019

You sure?

$ roachprod ssh nathan-tpcc-geo:7
Warning: Permanently added '34.85.67.36' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 16.04.5 LTS (GNU/Linux 4.15.0-1026-gcp x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud

1 package can be updated.
1 update is a security update.

New release '18.04.2 LTS' available.
Run 'do-release-upgrade' to upgrade to it.


Last login: Wed Apr 17 20:05:59 2019 from 188.97.85.189

@tbg
Member

tbg commented Apr 17, 2019

Oh, I guess it got recreated! Nvm then. Probably wasn't ever going to look at it again anyway.

@nvanbenschoten
Member Author

Luckily I just hit this again on a new version of nathan-tpcc-geo. Like last time, the Raft logs look identical, and like last time, the logs rolled over and deleted the diff. I'm looking to see if there's anything else I can pull out of the cluster.

I wasn't trying to reproduce this, which is why I didn't have any debug info enabled. All future attempts will be run with #36867 and an increased LogFileMaxSize.

@nvanbenschoten
Member Author

The only similarity I see is that, like in the first case, this consistency check failure was hit almost immediately after a split. In this case, we see the following entries:

Term:6 Index:11 : EMPTY
Term:6 Index:12  Put: /Local/Range/Table/59/1/114/43738/QueueLastProcessed/"consistencyChecker"
Term:6 Index:13 compute_checksum:<checksum_id:941c5b0f-7846-4546-8cf2-b1a132cdbf71 >
Term:6 Index:14 compute_checksum:<checksum_id:4cc632b2-4210-4eb0-965c-b92735e47fc1 save_snapshot:true >

Notice that term 6, index 11 is the first raft log term/position combo allowed. That indicates that this range had just split off from its parent range.
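
A rough sketch of that reasoning, assuming the initial raft log position constants are raftInitialLogIndex = 10 and raftInitialLogTerm = 5 (my reading of pkg/storage; treat the exact values as an assumption):

// Sketch of the "fresh split" check above. The constants mirror what I
// believe raftInitialLogIndex / raftInitialLogTerm are in pkg/storage;
// treat the exact values as assumptions.
package main

import "fmt"

const (
	raftInitialLogIndex = 10 // log index a new range starts at (assumed)
	raftInitialLogTerm  = 5  // log term a new range starts at (assumed)
)

type entry struct {
	Term, Index uint64
}

// looksFreshlySplit reports whether the first surviving log entry sits at
// the first position a newly split-off range can append to (the term gets
// bumped by the election that follows the split).
func looksFreshlySplit(first entry) bool {
	return first.Index == raftInitialLogIndex+1 && first.Term == raftInitialLogTerm+1
}

func main() {
	// The log quoted above starts at Term:6 Index:11.
	fmt.Println(looksFreshlySplit(entry{Term: 6, Index: 11})) // true
}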

@andreimatei
Contributor

I got a bunch of repros (6 before I stopped the charade). I was running 8-node clusters across 4 GCP zones, on 4-CPU machines.
I didn't tweak the logging limits, but I believe we have the whole logs for at least 2 of them. We also have RocksDB checkpoints from when the diffs were detected.
Briefly looking at two of them:

  1. cluster andrei-1555535390-11-n8cpu4-geo (GCE_PROJECT=andrei-jepsen):
    Here the diff seems to only contain the stats (part of the RangeAppliedState). Notice that the "expected" counts are negative:
E190417 22:25:14.161556 86032 storage/replica_consistency.go:134  [n4,consistencyChecker,s4,r871/3:/Table/57/1/76{2/9/1…-4/9/61}] replica (n1,s1):2 is inconsistent: expected checksum 6f698c554509b49a1ad50666b095f4bcc1e710d0808c1fe20f66e0e09bf9a4b7c9dc94376d5cdfd7d26a3ef954731b6f4e61d7ea776a5fc6549e03b90a65226d, got 3603fbbddd95bf94b1d6646735ccdb3392a60e388d0455f24f9666b0522f81ea810251cb1d5d866521832bd3a5dabd554e03da4d81da4a17693006fbfd39d446
persisted stats: exp {ContainsEstimates:true LastUpdateNanos:1555539780955426343 IntentAge:0 GCBytesAge:0 LiveBytes:-364 LiveCount:-23 KeyBytes:-285 KeyCount:-23 ValBytes:-79 ValCount:-23 IntentBytes:0 IntentCount:0 SysBytes:565 SysCount:5 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}, got {ContainsEstimates:true LastUpdateNanos:1555539780955426343 IntentAge:0 GCBytesAge:0 LiveBytes:34316122 LiveCount:58552 KeyBytes:1339660 KeyCount:58552 ValBytes:32976462 ValCount:58552 IntentBytes:0 IntentCount:0 SysBytes:565 SysCount:5 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
--- leaseholder
+++ follower
-0.000000000,0 /Local/RangeID/871/r/RangeAppliedState
-    ts:<zero>
-    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x002]\xdbl\xfeY\x03\b8\x10!\x1aR\t'N\xd6m0c\x96\x15 \x94\xfd\xff\xff\xff\xff\xff\xff\xff\x01(\xe9\xff\xff\xff\xff\xff\xff\xff\xff\x010\xe3\xfd\xff\xff\xff\xff\xff\xff\xff\x018\xe9\xff\xff\xff\xff\xff\xff\xff\xff\x01@\xb1\xff\xff\xff\xff\xff\xff\xff\xff\x01H\xe9\xff\xff\xff\xff\xff\xff\xff\xff\x01`\xb5\x04h\x05p\x01"
-    raw mvcc_key/value: 0169f70367727261736b00 120408001000180020002800325ddb6cfe5903083810211a5209274ed66d306396152094fdffffffffffffff0128e9ffffffffffffffff0130e3fdffffffffffffff0138e9ffffffffffffffff0140b1ffffffffffffffff0148e9ffffffffffffffff0160b50468057001
+0.000000000,0 /Local/RangeID/871/r/RangeAppliedState
+    ts:<zero>
+    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x0025\x96\x1d\x85\xb5\x03\b8\x10!\x1a*\t'N\xd6m0c\x96\x15 ھ\xae\x10(\xb8\xc9\x030\x8c\xe2Q8\xb8\xc9\x03@\xce\xdc\xdc\x0fH\xb8\xc9\x03`\xb5\x04h\x05p\x01"
+    raw mvcc_key/value: 0169f70367727261736b00 1204080010001800200028003235961d85b503083810211a2a09274ed66d3063961520dabeae1028b8c903308ce25138b8c90340cedcdc0f48b8c90360b50468057001
  2. cluster andrei-1555535390-13-n8cpu4-geo (GCE_PROJECT=andrei-jepsen):
    Here we have a follower that seems to be completely empty. The leaseholder has a ton of data, present in the diff.
E190417 22:27:35.539789 77285 storage/replica_consistency.go:134  [n6,consistencyChecker,s6,r552/2:/Table/54/1/65{1/6/3…-3/5/1…}] replica (n1,s1):4 is inconsistent: expected checksum 5fe2eaef32849d0569eb74db44cd78a7d48ee7c85128bef37507f47a448b1facc67eb11e03b8d822bf9738732a10fdfd288e200219aca7534cdce5896ba87c83, got 39e9bc6e80dd52f7ce5350dfe2f2ce6bc9a79fd54bf956d5bdf376686a54cc57d33bfdbf42473b3f7d6f8865a7437eae30690d83e907e4427a40f5676b267574
persisted stats: exp {ContainsEstimates:true LastUpdateNanos:1555539836245201052 IntentAge:0 GCBytesAge:0 LiveBytes:34315972 LiveCount:58548 KeyBytes:1339688 KeyCount:58548 ValBytes:32976284 ValCount:58548 IntentBytes:0 IntentCount:0 SysBytes:878 SysCount:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}, got {ContainsEstimates:false LastUpdateNanos:1555539836245201052 IntentAge:0 GCBytesAge:0 LiveBytes:0 LiveCount:0 KeyBytes:0 KeyCount:0 ValBytes:0 ValCount:0 IntentBytes:0 IntentCount:0 SysBytes:878 SysCount:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
--- leaseholder
+++ follower
-0.000000000,0 /Local/RangeID/552/r/RangeAppliedState
-    ts:<zero>
-    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x0025\xffI4{\x03\b7\x10\x1a\x1a*\t\x9cP]M=c\x96\x15 Ľ\xae\x10(\xb4\xc9\x030\xa8\xe2Q8\xb4\xc9\x03@\x9c\xdb\xdc\x0fH\xb4\xc9\x03`\xee\x06h\ap\x01"
-    raw mvcc_key/value: 0169f70228727261736b00 1204080010001800200028003235ff49347b030837101a1a2a099c505d4d3d63961520c4bdae1028b4c90330a8e25138b4c903409cdbdc0f48b4c90360ee0668077001
+0.000000000,0 /Local/RangeID/552/r/RangeAppliedState
+    ts:<zero>
+    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x002\x19m\xb2\xfa\xee\x03\b7\x10\x1a\x1a\x0e\t\x9cP]M=c\x96\x15`\xee\x06h\a"
+    raw mvcc_key/value: 0169f70228727261736b00 12040800100018002000280032196db2faee030837101a1a0e099c505d4d3d63961560ee066807
-1555535595.048202377,0 /Table/54/1/651/6/378/0
-    ts:2019-04-17 21:13:15.048202377 +0000 UTC
-    value:"մ\xcd\x1a\nF\x10Y7pwAnR0TQoU5nH8\x16\x02OE\x16\rPRICALLYCALLY\x16\x11K4CqqLGzrRCg74JuA\x16\x11coaW1dT9RiedgnRGN\x16\x11XSpIXhOAuYmfJebIz\x16\x02EB\x16\t835211111\x16\x100013839382606905\x18ʃʻ\b\x00\x16\x02GC\x15\x054\x8dLK@\x15\x03)\tw\x15\x04\x1a\x8a\x03\xe8\x15\x044\x8a\x03\xe8\x13\x02\x13\x00\x16\x83\aKo3yn9fyLwx0AL4bEuemxBaljZZtx8sWMIEe2HXikqs9Tf9CH6zL7q3FIrk6O9fnif3TtXL8OjaLiEWMAtzFcoAGRqc0GiBIEa1WrbG03f0GiODMNmX9Do7HP3hH0gcB3Mn6QL0gbFk9AFQDXTA2oourvZJ8wzF5ymvwroLaaXWRIo6so4bZuKPBNenKqJKabQkEQQHm9vR8LzpkZjkgEdsthSTfUHkrQnXGxEPK7Xg7w4fhqzgs0bV7hTo5XHdhpdIxMxt9aCNkMkhAqv0M3k7GLIhcpCT6wI2PRuHeCViWnBFQdSOF4hNFnTQL98qtTQBXd28IIpZPbYyuPPzkrdvTBkLx1otV0XtuxW4Eqg4yFBj8dJIQALmjb9sDVA7bD2TDqDr"
...

More looking tomorrow.
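
The "--- leaseholder / +++ follower" output above is essentially a diff of the two replicas' key/value snapshots. Here's a generic sketch of that comparison over two sorted dumps; it's a plain merge walk for illustration, not the actual consistency checker code.

// Minimal sketch of how a "--- leaseholder / +++ follower" style diff can
// be produced from two sorted key/value dumps. This is a generic merge
// walk for illustration, not CockroachDB's actual checksum/diff code.
package main

import "fmt"

type kv struct {
	Key, Value string
}

// diffReplicas prints entries present on only one side or differing.
func diffReplicas(lease, follower []kv) {
	i, j := 0, 0
	for i < len(lease) || j < len(follower) {
		switch {
		case j == len(follower) || (i < len(lease) && lease[i].Key < follower[j].Key):
			fmt.Printf("-%s %s\n", lease[i].Key, lease[i].Value) // leaseholder only
			i++
		case i == len(lease) || follower[j].Key < lease[i].Key:
			fmt.Printf("+%s %s\n", follower[j].Key, follower[j].Value) // follower only
			j++
		default:
			if lease[i].Value != follower[j].Value {
				fmt.Printf("-%s %s\n+%s %s\n", lease[i].Key, lease[i].Value, follower[j].Key, follower[j].Value)
			}
			i++
			j++
		}
	}
}

func main() {
	lease := []kv{{"/Local/RangeID/552/r/RangeAppliedState", "stats+data"}, {"/Table/54/1/651/6/378/0", "row"}}
	follower := []kv{{"/Local/RangeID/552/r/RangeAppliedState", "stats only"}}
	diffReplicas(lease, follower)
}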

@tbg
Member

tbg commented Apr 18, 2019

andrei-1555535390-11-n8cpu4-geo

(n1,s1):2:

raft_applied_index:56 lease_applied_index:33 range_stats:<contains_estimates:true last_update_nanos:1555539780955426343 live_bytes:34316122 live_count:58552 key_bytes:1339660 key_count:58552 val_bytes:32976462 val_count:58552 sys_bytes:565 sys_count:5 >

(n4,s4):3:

raft_applied_index:56 lease_applied_index:33 range_stats:<contains_estimates:true last_update_nanos:1555539780955426343 live_bytes:-364 live_count:-23 key_bytes:-285 key_count:-23 val_bytes:-79 val_count:-23 sys_bytes:565 sys_count:5 >

ContainsEstimates=true is due to #36525, i.e. it doesn't tell us anything except that SSTs were ingested by this range at some point. Looking at the follower's stats this makes sense anyway, so it doesn't tell us much.

The checkpoints worked like a charm. On n4 (the leaseholder with the weird stats) the Raft log reveals... nothing. It's recently been truncated, so all there is is the queue activity.

So let's look at what's actually in r871 on n4:

$ ./cockroach debug check-store /mnt/data1/cockroach/auxiliary/checkpoints/2019-04-17T22\:25\:10Z/
...
(ignored) r871:/Table/57/1/76{2/9/1509-4/9/61} [(n6,s6):4, (n1,s1):2, (n4,s4):3, next=5, gen=110]: diff(actual, claimed): LiveBytes: 34316122 != -364
LiveCount: 58552 != -23
KeyBytes: 1339660 != -285
KeyCount: 58552 != -23
ValBytes: 32976462 != -79
ValCount: 58552 != -23

See something? The data actually in the replica has stats matching those of the follower. The leaseholder persistently has the stats messed up, so this looks like a true divergence.

Here are the logs for r871 (before the inconsistency):

https://gist.github.com/tbg/779bd8813ea8dec2d45579ca9d3e8c2d

There's nothing related to n1 and n4, but a little while before the crash we remove the replica on n5 which results in this message:

removed -58570 (-58575+5) keys in 1ms [clear=0ms commit=1ms]

This suggests that n5 was messed up similarly to n1.
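
Conceptually, the check-store output above comes from recomputing stats from the data and reporting every field that disagrees with the persisted values. A simplified sketch of that comparison (the stats struct is a trimmed stand-in for the real MVCCStats, and the recomputation is deliberately naive):

// Simplified sketch of the "diff(actual, claimed)" comparison above: walk
// the range's user data, recompute a handful of stats, and report every
// field that disagrees with the persisted values. The stats struct is a
// trimmed stand-in for the real MVCCStats, not the actual type.
package main

import "fmt"

type mvccStats struct {
	LiveBytes, LiveCount, KeyBytes, KeyCount, ValBytes, ValCount int64
}

type kv struct {
	Key, Value string
}

// recompute derives stats from the data actually present in the replica.
func recompute(data []kv) mvccStats {
	var s mvccStats
	for _, e := range data {
		s.LiveBytes += int64(len(e.Key) + len(e.Value))
		s.LiveCount++
		s.KeyBytes += int64(len(e.Key))
		s.KeyCount++
		s.ValBytes += int64(len(e.Value))
		s.ValCount++
	}
	return s
}

// diffStats prints the fields where the recomputed ("actual") stats do not
// match the persisted ("claimed") stats, in the style of check-store.
func diffStats(actual, claimed mvccStats) {
	report := func(name string, a, c int64) {
		if a != c {
			fmt.Printf("%s: %d != %d\n", name, a, c)
		}
	}
	report("LiveBytes", actual.LiveBytes, claimed.LiveBytes)
	report("LiveCount", actual.LiveCount, claimed.LiveCount)
	report("KeyBytes", actual.KeyBytes, claimed.KeyBytes)
	report("KeyCount", actual.KeyCount, claimed.KeyCount)
	report("ValBytes", actual.ValBytes, claimed.ValBytes)
	report("ValCount", actual.ValCount, claimed.ValCount)
}

func main() {
	data := []kv{{"/Table/57/1/762/9/1509/0", "row payload"}}
	claimed := mvccStats{LiveBytes: -364, LiveCount: -23, KeyBytes: -285, KeyCount: -23, ValBytes: -79, ValCount: -23}
	diffStats(recompute(data), claimed)
}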

@tbg
Member

tbg commented Apr 18, 2019

andrei-1555535390-13-n8cpu4-geo

On n6 (leaseholder) the stats agree with recomputation, so there's tons of data. The Raft log is basically only log truncations and the consistency checks again. On n1 (follower), the range stats key is

0.000000000,0 /Local/RangeID/552/r/RangeAppliedState (0x0169f70228727261736b00): raft_applied_index:55 lease_applied_index:26 range_stats:<last_update_nanos:1555539836245201052 sys_bytes:878 sys_count:7 >

i.e. very very empty. And guess what? It also agrees with recomputation (here's the dump of the data in it).

@tbg
Member

tbg commented Apr 18, 2019

This looks like there are two different failure modes already.

In the first case, the leaseholder (and probably another follower, since removed) had the same data, but diverging stats. Tentatively speaking, it seems as if the leaseholder had at least forgotten to add the stats for the data it ingested at some point, though the fact that its stats are negative indicates that more might be going on.

In the second case, both replicas are internally consistent (i.e. stats match data) but one replica is missing tons of stuff. This looks more classically like what you'd expect if a replica "skipped" a raft command erroneously, similar to #35424.

It's frustrating not to have the Raft logs; I think we'd really need them to tell us what's up. @andreimatei can you disable the raft log queue and see if you can get a repro that way?

@tbg
Member

tbg commented Apr 18, 2019

BTW, in Nathan's original issue, it seems like yet another failure mode: the stats were the same, but one node was missing data (and this was then swept under the rug by the range deletions).

@tbg
Member

tbg commented Apr 18, 2019

For another addendum: there's no clear argument that the inconsistencies immediately followed splits.

In Andrei's first repro, the split happens a minute before the inconsistency is detected. Shortly after the split we see the (then) leaseholder n5 send a snapshot to n6. That snapshot is basically empty. n5 is later removed, which results in the inconsistent log line claiming to have removed a large negative number of keys. This number is just the (in-mem) stats of the range, so it got really messed up. This points at a split, as nothing else would've generated a negative delta.

@andreimatei let's try flipping this assertion on

const expensiveSplitAssertion = false

In the second repro, I can't find anything suspicious-looking; there are empty-looking snapshots right after the split so if those are to be trusted, the range starts out empty. It's worth pointing out that since ContainsEstimates=true in all the repros I've seen so far, SSTs have been ingested by all of the involved ranges directly (the RHS of a split starts out with no estimates).

Need more repros! This problem is way worse than I anticipated, but since it repros so easily I doubt that we'll have trouble tracking it down given enough time. BTW @andreimatei can you always share the SHAs that you're running? Right now I don't even know if it's master or release-19.1.
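
For context, my understanding of what flipping expensiveSplitAssertion on buys us: the split trigger recomputes the LHS stats from scratch and fatals if they disagree with the delta-derived stats. A stand-alone sketch of that shape, with hypothetical helper names rather than the real batcheval code:

// Sketch of what the expensive split assertion checks, as I understand it:
// after the split trigger computes the LHS stats by delta, recompute them
// from scratch by scanning the data and die loudly on any mismatch.
// recomputeStats, maybeAssertSplitStats, and the stats type are stand-ins.
package main

import (
	"fmt"
	"os"
	"reflect"
)

const expensiveSplitAssertion = true

type mvccStats struct {
	LiveBytes, LiveCount, KeyBytes, KeyCount int64
}

// recomputeStats stands in for an MVCC scan over the left-hand side's span.
func recomputeStats(lhsData []string) mvccStats {
	var s mvccStats
	for _, kv := range lhsData {
		s.LiveBytes += int64(len(kv))
		s.LiveCount++
		s.KeyBytes += int64(len(kv))
		s.KeyCount++
	}
	return s
}

func maybeAssertSplitStats(deltaDerived mvccStats, lhsData []string) {
	if !expensiveSplitAssertion {
		return
	}
	if recomputed := recomputeStats(lhsData); !reflect.DeepEqual(deltaDerived, recomputed) {
		fmt.Fprintf(os.Stderr, "LHS split stats diverged: delta-derived %+v != recomputed %+v\n",
			deltaDerived, recomputed)
		os.Exit(1)
	}
}

func main() {
	lhs := []string{"a", "bb"}
	maybeAssertSplitStats(mvccStats{LiveBytes: 3, LiveCount: 2, KeyBytes: 3, KeyCount: 2}, lhs)
	fmt.Println("split stats assertion passed")
}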

@tbg
Member

tbg commented Apr 18, 2019

I found Andrei's branch, so I'm gonna start a repro cycle of my own.

@tbg
Member

tbg commented Apr 18, 2019

Up and running, my branches are

crdb binary: https://github.com/tbg/cockroach/tree/repro/import-geo-36861
roachtest: https://github.com/tbg/cockroach/tree/repro.geo-import-failure (I made the test use a monitor and stop nodes when it fails)

and the command

./bin/roachtestandrei run import-geo --debug --cpu-quota=1000 --roachprod=bin/roachprod --count=100 --parallelism=10 --port=8080

@nvanbenschoten
Member Author

I'm running another iteration with the following diff. I expect it to have no issues.

No consistency check failure in sight.

Specifically, before #35261 Replica.mu.proposals[cmdID].RaftCommand was immutable after the command was proposed to Raft. Anyone could take a command ID and find its corresponding RaftCommand exactly as it looked when proposed.

Heh, @tbg actually picked up on the need for this invariant back when we introduced this inlining code: 9170399#diff-0f4e61b240de001f1b97a87fc31af043R341.

I'd rather lose the "happy case" that at best applies to only one of the replicas.

Agreed. The "happy case" doesn't seem like it's buying us all that much since it only applies on the leaseholder.
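
To make the invariant violation concrete, here's a toy illustration of the aliasing hazard described above; the types and fields are stand-ins, not the real ProposalData or RaftCommand.

// Toy illustration of the aliasing hazard: if the in-memory copy of a
// proposed command is mutated on reproposal, any code that later treats
// that copy as "what was appended to the log" (as the sideload inlining
// fast path did on the leaseholder) can disagree with followers, which
// only ever see the bytes in the log. Types and fields are stand-ins.
package main

import "fmt"

type raftCommand struct {
	MaxLeaseIndex int
	Payload       string
}

func main() {
	// Command as originally proposed and appended to the raft log.
	cmd := &raftCommand{MaxLeaseIndex: 7, Payload: "AddSSTable(...)"}
	proposals := map[string]*raftCommand{"cmd-id-1": cmd}
	loggedBytes := fmt.Sprintf("%+v", *cmd) // what followers will apply

	// Lease index error: the immediate reproposal mutates the *same* struct.
	proposals["cmd-id-1"].MaxLeaseIndex = 9

	// Leaseholder fast path: reuse the in-memory command instead of
	// re-decoding the log entry. It no longer matches the logged bytes.
	inMemoryBytes := fmt.Sprintf("%+v", *proposals["cmd-id-1"])
	fmt.Println("followers apply: ", loggedBytes)
	fmt.Println("leaseholder uses:", inMemoryBytes)
	fmt.Println("diverged:", loggedBytes != inMemoryBytes) // true
}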

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 18, 2019
…ading SSTs

Fixes cockroachdb#36861.

This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals`
were immutable over the lifetime of a Raft proposal. This invariant was violated
by cockroachdb#35261, which allowed a lease index error to trigger an immediate reproposal.
This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`.
Combined with aliasing between multiple Raft proposals due to reproposals due to
ticks, this resulted in cases where a leaseholder's Raft logs could diverge from
its followers and cause Raft groups to become inconsistent.

Release note: None
@tbg
Member

tbg commented Apr 18, 2019

I'd also rather fix the bug.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 19, 2019
…ading SSTs

Fixes cockroachdb#36861.

This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals`
were immutable over the lifetime of a Raft proposal. This invariant was violated
by cockroachdb#35261, which allowed a lease index error to trigger an immediate reproposal.
This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`.
Combined with aliasing between multiple Raft proposals due to reproposals due to
ticks, this resulted in cases where a leaseholder's Raft logs could diverge from
its followers and cause Raft groups to become inconsistent.

Release note: None
craig bot pushed a commit that referenced this issue Apr 19, 2019
36939: storage: remove optimization to use in-memory RaftCommand when sideloading SSTs r=nvanbenschoten a=nvanbenschoten

Fixes #36861.

This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals` were immutable over the lifetime of a Raft proposal. This invariant was violated by #35261, which allowed a lease index error to trigger an immediate reproposal. This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`. Combined with aliasing between multiple Raft proposals due to reproposals from ticks, this resulted in cases where a leaseholder's Raft logs could diverge from its followers and cause Raft groups to become inconsistent.

We can justify the new cost of unmarshalling `RaftCommands` in `maybeInlineSideloadedRaftCommand` on leaseholders because https://github.com/cockroachdb/cockroach/pull/36670/files#diff-8a33b5571aa9ab154d4f3c2a16724697R230 just landed. That change removes an equally sized allocation and memory copy from the function on both leaseholders and followers.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in #36939 Apr 19, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 19, 2019
…ading SSTs

Fixes cockroachdb#36861.

This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals`
were immutable over the lifetime of a Raft proposal. This invariant was violated
by cockroachdb#35261, which allowed a lease index error to trigger an immediate reproposal.
This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`.
Combined with aliasing between multiple Raft proposals due to reproposals due to
ticks, this resulted in cases where a leaseholder's Raft logs could diverge from
its followers and cause Raft groups to become inconsistent.

Release note: None
@tbg
Member

tbg commented Apr 19, 2019

Reopening until we've added the geo import test to our suite. What cadence should we be running that? It seems to tickle new behavior. From what I saw during the repros, the geo distribution causes costly reproposals. I assume we'll get value out of observing this test.

@tbg tbg reopened this Apr 19, 2019
tbg added a commit to tbg/cockroach that referenced this issue Apr 19, 2019
This should avoid losing the diff printed to the logs, as observed in:
cockroachdb#36861

We don't necessarily care about the large diffs (though they can't hurt
to have around) but about whatever else is in the deleted logs. Without
logs, debugging is typically much more difficult.

Release note: None
craig bot pushed a commit that referenced this issue Apr 19, 2019
36580: roachprod: rationalize refreshing of DNS entries r=andreimatei a=andreimatei

Before this patch, the DNS entries were refreshed even if we hadn't
properly listed the AWS clusters, which would cause their DNS entries to
be removed inadvertently.
This patch lifts the decision about whether to refresh or not out of the
GCE code and into the top level.

Release note: None

36950: storage: disable log file size limit before consistency fatal r=andreimatei a=tbg

This should avoid losing the diff printed to the logs, as observed in:
#36861

We don't necessarily care about the large diffs (though they can't hurt
to have around) but about whatever else is in the deleted logs. Without
logs, debugging is typically much more difficult.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 22, 2019
Closes cockroachdb#36861.

This test runs a geo-distributed import in the same configuration
as the one we saw cause issues with cockroachdb#36861.

I'm testing now to see if this will actually trigger the consistency
failure without the fix from cockroachdb#36939.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Apr 23, 2019
We've repeatedly wanted this to preserve state when finding replica
inconsistencies.

See, for example:

cockroachdb#36861

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Apr 23, 2019
This should avoid losing the diff printed to the logs, as observed in:
cockroachdb#36861

We don't necessarily care about the large diffs (though they can't hurt
to have around) but about whatever else is in the deleted logs. Without
logs, debugging is typically much more difficult.

Release note: None
craig bot pushed a commit that referenced this issue Apr 29, 2019
36997: roachtest: create new import/tpcc/warehouses=4000/geo test r=nvanbenschoten a=nvanbenschoten

Closes #36861.

This test runs a geo-distributed import in the same configuration
as the one we saw cause issues with #36861.

I'm testing now to see if this will actually trigger the consistency
failure without the fix from #36939.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in #36997 Apr 29, 2019
ajwerner pushed a commit to ajwerner/cockroach that referenced this issue May 15, 2019
Closes cockroachdb#36861.

This test runs a geo-distributed import in the same configuration
as the one we saw cause issues with cockroachdb#36861.

I'm testing now to see if this will actually trigger the consistency
failure without the fix from cockroachdb#36939.

Release note: None
irfansharif pushed a commit to irfansharif/cockroach that referenced this issue Oct 30, 2019
We've repeatedly wanted this to preserve state when finding replica
inconsistencies.

See, for example:

cockroachdb#36861

Release note: None