
RESTORE of large dataset with Pebble knocks nodes offline #64591

Closed
dankinder opened this issue May 3, 2021 · 3 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-triaged blathers was able to find an owner

Comments

@dankinder

Describe the problem

I'm restoring a large dataset: in S3 it's 2.5 TB across ~264k objects. CRDB v20.2.8.

When using pebble as the storage engine, on a fresh empty cluster, several nodes go offline in less than a minute and the restore can't complete. (Note that this is a RESTORE DATABASE <> FROM <s3 url> command.)
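The command in question has roughly this shape (the database name and S3 URL below are placeholders, not taken from the report; on v20.2 the AWS credentials are passed as URL query parameters):

```sql
-- Placeholder names: "mydb" and the bucket/path are illustrative only.
RESTORE DATABASE mydb
FROM 's3://my-bucket/backups/mydb?AWS_ACCESS_KEY_ID=<key>&AWS_SECRET_ACCESS_KEY=<secret>';
```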

I have tried using more nodes and using as few as will fit the data; in all cases a minority of the nodes seem to stop updating liveness, i.e. they report as "suspect" in the console and log messages like `unable to move closed timestamp forward: not live`.

For the heck of it I tested restoring into just 1 node. That successfully created ~264 ranges and then sat there merging ranges and slowly importing. This was bound to fail by running out of space but at least showed it could get further.

The only thing that fixed this was moving back to rocksdb as the storage engine. This succeeded on the first try.

I'm guessing this is related to comments on #19699

I could send more debug data if you want, but this seems very easy to reproduce: just run a restore of a large dataset into a pebble cluster of any size >1.

@dankinder dankinder added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label May 3, 2021
@blathers-crl

blathers-crl bot commented May 3, 2021

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels May 3, 2021
@itsbilal
Contributor

itsbilal commented May 4, 2021

Thanks for filing this issue! We regularly do large restores to Pebble clusters and haven't seen something like this yet, so we'd need more info from the problematic cluster / nodes. Specifically, the cockroach and pebble logs (cockroach*.log and cockroach-pebble*.log respectively) from the problematic nodes after a failed restore would help us a lot here.
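One way to gather the logs the maintainer asks for above is a small script per node. This is a sketch under assumptions: `bundle_logs` is a hypothetical helper name, and the log directory varies with how your nodes were started (`--log-dir`, default under the store directory).

```shell
# Hypothetical helper: tar up the cockroach*.log and cockroach-pebble*.log
# files from one node's log directory. Pass an absolute path for the output
# tarball, since tar runs from inside the log directory.
bundle_logs() {
  log_dir="$1"   # e.g. cockroach-data/logs (depends on --log-dir)
  out="$2"       # e.g. /tmp/node1-restore-logs.tgz
  # Archive both log families; cd first so the tarball holds bare filenames.
  ( cd "$log_dir" && tar -czf "$out" cockroach*.log cockroach-pebble*.log )
}
```

Run it on each problematic node after a failed restore and attach the resulting tarballs to the issue.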

This doesn't quite seem like #19699 - from the symptoms it seems like a node didn't ack a heartbeat quickly enough, maybe because the storage engine didn't respond quickly enough. I can't think of why a restore would make that happen, but the logs would help give us more clarity. Thanks!

@dankinder
Author

Well, I am confounded, because now I can't reproduce this... I tried starting a fresh cluster several times the same way I did before (varying number of nodes and whatnot), and it's working fine. Sorry for the trouble, I'll reopen if I see this happen again.
