
RESTORE of large dataset with Pebble knocks nodes offline #64591

Closed
dankinder opened this issue May 3, 2021 · 3 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-triaged blathers was able to find an owner

Comments

@dankinder

Describe the problem

I'm restoring a large dataset: in S3 it's 2.5 TB across ~264k objects. CRDB v20.2.8.

When using pebble as the storage engine, on a fresh empty cluster, several nodes go offline in less than a minute and the restore can't complete. (Note that this is a RESTORE DATABASE <> FROM <s3 url> command.)
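The command in question has roughly this shape (the database name and S3 URL below are placeholders, not taken from the report; on v20.2 the AWS credentials are passed as URL query parameters):

```sql
-- Placeholder names: "mydb" and the bucket/path are illustrative only.
RESTORE DATABASE mydb
FROM 's3://my-bucket/backups/mydb?AWS_ACCESS_KEY_ID=<key>&AWS_SECRET_ACCESS_KEY=<secret>';
```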

I have tried using more nodes and using as few as will fit the data; in all cases a minority of the nodes seem to stop updating liveness, i.e. they report as "suspect" in the console and log messages like `unable to move closed timestamp forward: not live`.

For the heck of it I tested restoring into just 1 node. That successfully created ~264 ranges and then sat there merging ranges and slowly importing. This was bound to fail by running out of space but at least showed it could get further.

The only thing that fixed this was moving back to rocksdb as the storage engine. This succeeded on the first try.

I'm guessing this is related to comments on #19699

I could send more debug data if you want, but this seems very easy to reproduce: just run a restore of a large dataset into a pebble cluster of any size >1.

@dankinder dankinder added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label May 3, 2021
@blathers-crl

blathers-crl bot commented May 3, 2021

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels May 3, 2021
@itsbilal
Contributor

itsbilal commented May 4, 2021

Thanks for filing this issue! We regularly do large restores to Pebble clusters and haven't seen something like this yet, so we'd need more info from the problematic cluster / nodes. Specifically, the cockroach and pebble logs (cockroach*.log and cockroach-pebble*.log respectively) from the problematic nodes after a failed restore would help us a lot here.
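One way to gather the logs the maintainer asks for above is a small script per node. This is a sketch under assumptions: `bundle_logs` is a hypothetical helper name, and the log directory varies with how your nodes were started (`--log-dir`, default under the store directory).

```shell
# Hypothetical helper: tar up the cockroach*.log and cockroach-pebble*.log
# files from one node's log directory. Pass an absolute path for the output
# tarball, since tar runs from inside the log directory.
bundle_logs() {
  log_dir="$1"   # e.g. cockroach-data/logs (depends on --log-dir)
  out="$2"       # e.g. /tmp/node1-restore-logs.tgz
  # Archive both log families; cd first so the tarball holds bare filenames.
  ( cd "$log_dir" && tar -czf "$out" cockroach*.log cockroach-pebble*.log )
}
```

Run it on each problematic node after a failed restore and attach the resulting tarballs to the issue.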

This doesn't quite seem like #19699 - from the symptoms it seems like a node didn't ack a heartbeat quickly enough, maybe because the storage engine didn't respond quickly enough. I can't think of why a restore would make that happen, but the logs would help give us more clarity. Thanks!

@dankinder
Author

Well, I am confounded, because now I can't reproduce this... I tried starting a fresh cluster several times the same way I did before (varying number of nodes and whatnot), and it's working fine. Sorry for the trouble, I'll reopen if I see this happen again.
