storage: don't crash on new seqno error from rocks ingest #36688

Merged (1 commit) on Apr 10, 2019

Member

@dt dt commented Apr 9, 2019

storage: don't crash on new seqno error from rocks ingest
If rocks has already compacted our file away, the link count might not
be >1 but it could still reject repeated ingestion. We can just fall
back to the copy, and any real error will be surfaced when we try to
ingest it.

Release note: none.

@dt dt requested review from ajkr, petermattis and a team April 9, 2019 22:16
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@petermattis petermattis left a comment

See my comment in #36679. If that looks reasonable to you, I'd want to include that scenario in a comment here.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr and @petermattis)

@petermattis
Collaborator

@ajkr and @bdarnell How do you feel about this approach vs. a sentinel file that indicates an external SST has already been ingested? The bulk IO folks are asking for someone on Core to take over this change, as they want to focus on testing. This is high priority because we need to get some band-aid into the 19.1 release.

Contributor

@bdarnell bdarnell left a comment

I think I like this approach better than the sentinel file, since the sentinel file introduces additional cleanup concerns. The main thing that worried me about the error matching is that it might result in hiding true errors, but on further thought that's not really a problem because we fall back to copying, and if the file is really the problem we'll fail the second time and won't hide the error.

Just add a comment about the scenario in which this matters and the fact that it's safe to err on the side of swallowing the error because the second ingestion attempt will catch it if it's a real error.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr)

@ajkr
Contributor

ajkr commented Apr 10, 2019

I would've leaned slightly towards the sentinel file, to avoid the string-matching dependency between here and RocksDB. Given that fixing the root cause appears to require major changes in RocksDB (evicting files deleted by compaction and tracking a unique ID in the manifest), I don't think it'll be fixed anytime soon. So this band-aid may last a while, and RocksDB can change its error messages as it wishes.

@bdarnell
Contributor

Good point. Let me be more specific then: I prefer the error matching fix for 19.1, with a more robust solution to come in 19.2.

@ajkr
Contributor

ajkr commented Apr 10, 2019

Or you could attempt the copy approach more broadly, like on any rocksdb::Status::Corruption error. I can't think of any safety issue with that.

@dt dt force-pushed the ingest-no-crash branch from b2e4aa2 to 9ea2f0a Compare April 10, 2019 21:40
@dt
Member Author

dt commented Apr 10, 2019

I updated the comment and commit message; ready for another look.
I'm hesitant to do anything broader or fancier since I just want to backport this.

Contributor

@ajkr ajkr left a comment

Understood.

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@bdarnell
Contributor

LGTM

@dt
Member Author

dt commented Apr 10, 2019

bors r+

@craig
Contributor

craig bot commented Apr 10, 2019

Build failed (retrying...)

@craig
Contributor

craig bot commented Apr 10, 2019

Build failed

@dt
Member Author

dt commented Apr 10, 2019

bors r+

craig bot pushed a commit that referenced this pull request Apr 10, 2019
36688: storage: don't crash on new seqno error from rocks ingest r=dt a=dt

Co-authored-by: David Taylor <[email protected]>
@craig
Contributor

craig bot commented Apr 10, 2019

Build succeeded

@craig craig bot merged commit 9ea2f0a into cockroachdb:master Apr 10, 2019
@dt dt deleted the ingest-no-crash branch April 11, 2019 12:48