
release-19.1: batcheval: estimate stats in EvalAddSSTable #36525

Merged
merged 1 commit into cockroachdb:release-19.1 from dt:backport19.1-35231
Apr 4, 2019

Conversation

dt
Member

@dt dt commented Apr 4, 2019

Backport 1/1 commits from #35231.

/cc @cockroachdb/release


Adding the SST's stats directly, without removing the underlying stats and computing merged stats,
avoids some potentially expensive repeated reads/recomputation of the underlying existing data when
repeatedly adding sparse, wide SSTs to existing, full-ish ranges.

When adding one big SST to an empty range, the previous approach was not particularly expensive:
recomputing the existing stats was cheap, as the underlying range was empty-ish, and computing
the merged stats could mostly iterate sequentially over just the SST.

However, when adding an SST with keys distributed over an existing, non-empty range, the
recomputation of the existing span stats actually has to read and process a large amount
of existing data, and the merged recompute has to iterate over it as well -- and in doing
so must jump between the Go-side merge iterator and the underlying RocksDB-backed range.

Instead, we can optimistically assume that all the keys/data in the SST being added are in fact
new additions -- i.e. not shadowing existing keys -- and simply add its stats to the range
stats without recomputing. This will be incorrect in some cases -- when a key does shadow
an existing one or, in particular, when retrying and all keys are shadowed -- but by flipping
the flag in the stats to indicate that they contain estimates, we can document our assumption's
imperfection. When the discrepancy is found and fixed by a recompute, the flag will prevent
a consistency checker error from being raised.
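
A minimal Go sketch of that idea, with illustrative names: `mvccStats` stands in for the relevant `enginepb.MVCCStats` fields and `applySSTStatsOptimistically` for the eval-time bookkeeping. This is an assumption-laden sketch of the technique, not the actual code in this PR.

```go
package main

import "fmt"

// mvccStats is an illustrative stand-in for the handful of
// enginepb.MVCCStats fields that matter here; it is not the real type.
type mvccStats struct {
	LiveBytes, LiveCount int64
	KeyBytes, ValBytes   int64
	ContainsEstimates    bool
}

// add accumulates other's counters into s.
func (s *mvccStats) add(other mvccStats) {
	s.LiveBytes += other.LiveBytes
	s.LiveCount += other.LiveCount
	s.KeyBytes += other.KeyBytes
	s.ValBytes += other.ValBytes
}

// applySSTStatsOptimistically folds the SST's own stats into the range's
// stats without recomputing over existing data, assuming no SST key shadows
// an existing key. Because that assumption can be wrong (e.g. a retry where
// every key is shadowed), ContainsEstimates is set so a later recompute
// corrects the numbers rather than the consistency checker treating the
// mismatch as corruption.
func applySSTStatsOptimistically(rangeStats *mvccStats, sstStats mvccStats) {
	rangeStats.add(sstStats)
	rangeStats.ContainsEstimates = true
}

func main() {
	rangeStats := mvccStats{LiveBytes: 1 << 20, LiveCount: 1000, KeyBytes: 512 << 10, ValBytes: 512 << 10}
	sstStats := mvccStats{LiveBytes: 4 << 20, LiveCount: 5000, KeyBytes: 2 << 20, ValBytes: 2 << 20}
	applySSTStatsOptimistically(&rangeStats, sstStats)
	fmt.Printf("%+v\n", rangeStats) // ContainsEstimates is now true
}
```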

These estimated stats are probably good enough 'as-is', but a followup could send an explicit
CheckConsistency command to all ranges that were sent SSTs during a given bulk operation to
expedite the correction of any incorrect estimates.
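
A hypothetical sketch of such a follow-up, assuming an illustrative `consistencyChecker` client interface and `span` type (stand-ins for a real KV client and `roachpb.Span`); nothing here is the actual CockroachDB API.

```go
package bulksketch

import (
	"context"
	"fmt"
)

// span is an illustrative stand-in for roachpb.Span.
type span struct{ Key, EndKey []byte }

// consistencyChecker is a hypothetical client interface; in CockroachDB the
// real entry point would be a CheckConsistency request routed via the KV layer.
type consistencyChecker interface {
	CheckConsistency(ctx context.Context, start, end []byte) error
}

// expediteStatsRecompute requests a consistency check over every span that
// received SSTs during a bulk operation, so stats marked ContainsEstimates
// are recomputed promptly rather than lazily.
func expediteStatsRecompute(ctx context.Context, db consistencyChecker, ingested []span) error {
	for _, sp := range ingested {
		if err := db.CheckConsistency(ctx, sp.Key, sp.EndKey); err != nil {
			return fmt.Errorf("consistency check for [%s, %s): %w", sp.Key, sp.EndKey, err)
		}
	}
	return nil
}
```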

Release note (performance improvement): Speed up bulk data ingestion during index backfills and IMPORT.

@dt dt requested review from bdarnell, tbg, nvanbenschoten and a team April 4, 2019 12:53
@cockroach-teamcity
Member

This change is Reviewable

@tbg
Member

tbg commented Apr 4, 2019 via email

@dt dt merged commit 5a76863 into cockroachdb:release-19.1 Apr 4, 2019
@dt dt deleted the backport19.1-35231 branch April 4, 2019 17:18