storage: don't crash on new seqno error from rocks ingest #36688

dt · 2019-04-09T22:16:26Z

storage: don't crash on new seqno error from rocks ingest
If rocks has already compacted our file away, the link count might not
be >1 but it could still reject repeated ingestion. We can just fall
back to the copy, and any real error will be surfaced when we try to
ingest it.

Release note: none.
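
A minimal sketch of the fallback described above, not the code in this PR. The names `engineErr`, `ingestFunc`, `copyFunc`, and `ingestWithFallback` are hypothetical; the only detail taken from the thread is the error text quoted in #36679. If the first ingest fails with the seqno error, the file is copied and the copy is ingested, so a real problem with the file still fails on the second attempt.

```go
package main

import "strings"

// seqnoErrText is the message fragment quoted in issue #36679.
const seqnoErrText = "external file have non zero sequence number"

// engineErr is a hypothetical stand-in for an error returned by the
// storage engine; it exists only so the sketch is self-contained.
type engineErr string

func (e engineErr) Error() string { return string(e) }

// ingestFunc and copyFunc are hypothetical hooks for the engine's ingest
// call and for a helper that copies a file and returns the new path.
type ingestFunc func(path string) error
type copyFunc func(src string) (string, error)

// isSeqnoErr reports whether err looks like the repeated-ingestion error.
func isSeqnoErr(err error) bool {
	return err != nil && strings.Contains(err.Error(), seqnoErrText)
}

// ingestWithFallback tries to ingest path directly. If the engine rejects
// it with the seqno error, it copies the file and ingests the copy. Any
// other error, and any error from the second attempt, is returned as-is.
func ingestWithFallback(path string, ingest ingestFunc, copyFile copyFunc) error {
	err := ingest(path)
	if !isSeqnoErr(err) {
		return err // nil on success, or an unrelated error
	}
	copied, copyErr := copyFile(path)
	if copyErr != nil {
		return copyErr
	}
	return ingest(copied)
}
```

Because the matched error only triggers a retry and is never treated as success, a false-positive match costs one extra copy rather than hiding a failure.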

petermattis

See my comment in #36679. If that looks reasonable to you, I'd want to include that scenario in a comment here.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajkr and @petermattis)

petermattis · 2019-04-10T18:21:05Z

@ajkr and @bdarnell How do you feel about this approach vs a sentinel file that indicates an external SST has already been ingested? The bulk IO folks are asking for someone on Core to take over this change as they want to focus on testing. This is high priority as we need to get some bandaid into the 19.1 release.

bdarnell

I think I like this approach better than the sentinel file, since the sentinel file introduces additional cleanup concerns. The main thing that worried me about the error matching is that it might result in hiding true errors, but on further thought that's not really a problem because we fall back to copying, and if the file is really the problem we'll fail the second time and won't hide the error.

Just add a comment about the scenario in which this matters and the fact that it's safe to err on the side of swallowing the error because the second ingestion attempt will catch it if it's a real error.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajkr)

ajkr · 2019-04-10T20:46:42Z

I would've leaned slightly towards the sentinel file to avoid the string matching dependency between here and RocksDB. Given that fixing the root cause appears to require major changes in RocksDB (evicting files deleted by compaction and tracking a unique ID in the manifest), I don't think it'll be fixed anytime soon. So this band-aid may last a while, and RocksDB can change its error messages as it wishes.

bdarnell · 2019-04-10T20:47:51Z

Good point. Let me be more specific then: I prefer the error matching fix for 19.1, with a more robust solution to come in 19.2.

ajkr · 2019-04-10T20:53:30Z

Or you could attempt the copy approach more broadly, like on any rocksdb::Status::Corruption error. I can't think of any safety issue with that.
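
To make the difference between the two matching strategies concrete, here is a small sketch. It assumes the status category shows up as a `Corruption:` prefix in the error text, as it does in the title of #36679; real code might inspect a status code instead. `engineErr`, `narrowMatch`, and `broadMatch` are hypothetical names.

```go
package main

import "strings"

// engineErr is a hypothetical stand-in for an error surfaced by the
// storage engine.
type engineErr string

func (e engineErr) Error() string { return string(e) }

// narrowMatch accepts only the seqno message quoted in #36679.
func narrowMatch(err error) bool {
	return err != nil &&
		strings.Contains(err.Error(), "external file have non zero sequence number")
}

// broadMatch accepts any error whose text starts with "Corruption:".
// It does not depend on the exact wording of one message.
func broadMatch(err error) bool {
	return err != nil && strings.HasPrefix(err.Error(), "Corruption:")
}
```

Either predicate would only decide whether to retry with a copy, so the broader one trades extra copies on genuine corruption for less coupling to a specific message.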

dt · 2019-04-10T21:51:32Z

I updated the comment and commit message -- ready for another look.
I'm hesitant to do anything more broad or fancier since I just want to backport this.

ajkr

Understood.

Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained

bdarnell · 2019-04-10T21:53:31Z

LGTM

dt · 2019-04-10T21:57:32Z

bors r+

craig · 2019-04-10T22:35:36Z

Build failed (retrying...)

craig · 2019-04-10T23:12:00Z

Build failed

dt · 2019-04-10T23:17:16Z

bors r+

36688: storage: don't crash on new seqno error from rocks ingest r=dt a=dt

storage: don't crash on new seqno error from rocks ingest

If rocks has already compacted our file away, the link count might not
be >1 but it could still reject repeated ingestion. We can just fall
back to the copy, and any real error will be surfaced when we try to
ingest it.

Release note: none.

Co-authored-by: David Taylor <[email protected]>

craig · 2019-04-10T23:56:47Z

Build succeeded

dt requested review from ajkr, petermattis and a team April 9, 2019 22:16

petermattis reviewed Apr 10, 2019

bdarnell approved these changes Apr 10, 2019

dt force-pushed the ingest-no-crash branch from b2e4aa2 to 9ea2f0a Compare April 10, 2019 21:40

ajkr approved these changes Apr 10, 2019

dt mentioned this pull request Apr 10, 2019

release-19.1: storage: don't crash on new seqno error from rocks ingest #36739

Merged

craig bot merged commit 9ea2f0a into cockroachdb:master Apr 10, 2019

dt deleted the ingest-no-crash branch April 11, 2019 12:48

awoods187 mentioned this pull request Apr 11, 2019

storage: node crashes with "Corruption: external file have non zero sequence number" error #36679

Closed
