RESTORE of large dataset with Pebble knocks nodes offline #64591
Labels

- C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
- O-community: Originated from the community
- X-blathers-triaged: blathers was able to find an owner
Describe the problem
I'm restoring a large dataset: in S3 it's 2.5 TB / ~264k objects. This is CRDB v20.2.8.
When using Pebble as the storage engine on a fresh, empty cluster, several nodes go offline in less than a minute and the restore can't complete. (Note that this is a `RESTORE DATABASE <> FROM <s3 url>` command.) I have tried using more nodes, and using as few as will fit the data; in all cases a minority of the nodes seem to stop updating liveness, i.e. they report as "suspect" in the console and log messages like `unable to move closed timestamp forward: not live`.

For the heck of it I tested restoring into just one node. That successfully created ~264 ranges and then sat there merging ranges and slowly importing. It was bound to fail by running out of space, but it at least showed the restore could get further.
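For reference, the statement looked roughly like the sketch below; the database name and bucket path are hypothetical placeholders, not the real ones from my cluster:

```sql
-- Hypothetical names; substitute the actual database and S3 path.
RESTORE DATABASE mydb
  FROM 's3://my-backup-bucket/mydb?AWS_ACCESS_KEY_ID=<key>&AWS_SECRET_ACCESS_KEY=<secret>';
```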
The only thing that fixed this was moving back to RocksDB as the storage engine; with RocksDB the restore succeeded on the first try.
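In case it helps anyone reproduce the workaround, a minimal sketch of the engine switch, assuming a v20.2 binary where the engine is chosen per node with `--storage-engine` (the other flags are placeholders for the actual topology):

```
# Restart each node with RocksDB instead of the Pebble default (hypothetical flags).
cockroach start --storage-engine=rocksdb --store=<path> --join=<node1,node2,node3> --insecure
```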
I'm guessing this is related to the comments on #19699.
I could send more debug data if you want, but this seems very easy to reproduce: just run a restore of a large dataset into a Pebble cluster of any size >1.
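A minimal repro sketch of what I mean, with hypothetical paths and addresses (Pebble is the default engine in v20.2, so no engine flag is needed):

```
# Start and initialize a multi-node cluster on the default Pebble engine.
cockroach start --insecure --store=<path> --join=<node1,node2,node3>
cockroach init --insecure --host=<node1>
# Kick off the restore from any node; nodes start reporting as suspect within a minute.
cockroach sql --insecure --host=<node1> -e "RESTORE DATABASE <> FROM '<s3 url>';"
```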