Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

builtins: add crdb_internal.fingerprint builtin #91124

Merged
merged 1 commit into from
Nov 24, 2022

Conversation

adityamaru
Copy link
Contributor

@adityamaru adityamaru commented Nov 2, 2022

This change adds a crdb_internal.fingerprint builtin
that accepts a startTime, endTime, startKey and endKey
to define the interval the user wants to fingerprint. The builtin
is powered by sending an ExportRequest with the defined intervals
but with the ExportFingerprint option set to true.

Setting this option on the ExportRequest means that instead of
writing all point and rangekeys to an SST and sending them back to
the client, command evaluation will use the newly introduced
fingerprintWriter (#90848) when exporting keys. This writer
computes an fnv64 hash of the key/timestamp, value for each point key
and maintains a running XOR aggregate of all the point keys processed
as part of the ExportRequest. Rangekeys are not fingerprinted during
command evaluation, but instead returned to the client in a
pebble SST. This is because range keys do not have a stable,
discrete identity and so it is up to the caller to define a deterministic
ingerprinting scheme across all returned range keys.

The ExportRequest sent as part of this builtin does not set any DistSender
limit, thereby allowing concurrent execution across ranges. We are not
concerned about the ExportResponses growing too large since the SSTs
will only contain rangekeys that should be few in number. If this assumption
is proved incorrect in the future, we can revisit setting a TargetBytes
to the header of the BatchRequest.

Fixes: #89336

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@adityamaru adityamaru force-pushed the export-request-hookup branch 2 times, most recently from cc0d368 to 89cac0f Compare November 2, 2022 16:20
@shermanCRL shermanCRL requested a review from baoalvin1 November 7, 2022 18:54
@adityamaru adityamaru force-pushed the export-request-hookup branch from 89cac0f to be88a29 Compare November 17, 2022 21:29
@adityamaru adityamaru marked this pull request as ready for review November 17, 2022 21:31
@adityamaru adityamaru requested review from a team as code owners November 17, 2022 21:31
@adityamaru adityamaru requested a review from a team November 17, 2022 21:31
@adityamaru adityamaru changed the title [WIP] builtins: add crdb_internal.fingerprint builtin builtins: add crdb_internal.fingerprint builtin Nov 17, 2022
@adityamaru
Copy link
Contributor Author

Pushing this out to get some initial comments, and ideas for more interesting cases to test.

@adityamaru adityamaru force-pushed the export-request-hookup branch from be88a29 to 084cc68 Compare November 17, 2022 22:32
Copy link
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the test you wrote is about what I expected to see.

I suppose we could generate range tombstones through actual sql operations and then assert that splits and merges don't affect the fingerprint.

If we modified this to create tenant-agonostic fingerprints by default, we could also integrate it into a bunch (all?) of the tenant to tenant tests in a follow-up PR.

@adityamaru
Copy link
Contributor Author

friendly ping @stevendanna @erikgrinaker, if there are no blocking comments then I'd like to start using this in our C2C tests to shake out bugs/issues.

I suppose we could generate range tombstones through actual sql operations and then assert that splits and merges don't affect the fingerprint.

The test does have a case where it issues an admin split and ensures that we see two ExportRequests that are then combined by distsender. I think we'll also see more coverage once C2C in the face of SQL operations that write tombstones start comparing fingerprints.

Copy link
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks reasonable to me. Thanks for working on it!

@erikgrinaker
Copy link
Contributor

Sorry, I'm running a bit behind on code reviews, will have a quick look now.

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few issues that should be straightforward to fix (and some nits that you can ignore at will), feel free to merge once resolved.

} else if !ok {
break
}
hasPoint, _ := iter.HasPointAndRange()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This likely doesn't matter here at all, but combined point/range key iteration is usually a fair bit more expensive than iterating over them separately. We could set up a point-only iterator at the start of the function, and check if a seek lands on a valid position (found a point key), and then use range-only iteration for the fingerprinting.

This isn't going to matter unless a span has a bunch of range keys though, which we don't really expect to see, but I suppose it could e.g. in the case of import cancellations of coarsely interleaved data. Feel free to ignore this or leave a comment for later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh TIL, changed to first use a point iter to assert we don't have any point keys.

if err := fw.hash(fw.timestampBuf); err != nil {
return 0, err
}
if err := fw.hashValue(v.Value); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we're fingerprinting the encoded value including the MVCCValueHeader (if any), while MVCCExportToFingerprint fingerprints the inner roachpb.Value contained in MVCCValue.Value.RawBytes. We should do the same here, by decoding the MVCCValue first.

This is particularly important because the MVCCValueHeader may contain data that isn't guaranteed to be the same across clusters or datasets, such as the MVCCValueHeader.LocalTimestamp, even though the SQL user data is identical.

This deserves a test case, where datasets with differing (or empty/non-empty) value headers yield identical fingerprints, both for point keys and range keys. TestMVCCHistories can generate this by passing localTs with a value below ts to put or del_range_ts.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Copy link
Contributor Author

@adityamaru adityamaru Nov 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch indeed, i think we already have the test you outlined for point keys over here -

put k=/b ts=2 v=b localTs=4 tenant-prefix=11 init-checksum
. Since the fingerprints for tenant 10 and tenant 11 are the same after stripping tenant prefix and checksum it proves that we don't fingerprint the localTS iiuc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have FingerprintRangekeys I think I can teach TestMVCCHistories to also fingerprint rangekeys instead of printing out the rangekeys. Let me try that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, tweaked the datadriven driver to compute the rangekey fingerprint and XOR'ing it with the point key fingerprint instead of printing rangekeys. I also added a test to export_fingerprint_tenant where the rangekey in tenant 10 has a localTS but an identical rangekey in tenant 11 doesn't. The fingerprints continue to match which proves we are discarding the MVCCValueHeader before fingerprinting.

},
tree.Overload{
Types: tree.ArgTypes{
{"span", types.BytesArray},
Copy link
Contributor

@erikgrinaker erikgrinaker Nov 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we usually pass the start/end keys separately to functions like these? No idea what the convention or recommendation is, just wondering.

Copy link
Contributor Author

@adityamaru adityamaru Nov 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think crdb_internal.scan has an overload for both passing in a span or a start/end key. I was optimizing for the use case of crdb_internal.fingerprint(crdb_internal.tenant_span(<id>)) or crdb_internal.fingerprint(crdb_internal.table_span(...)) but its likely we'll add the other overload soon enough. I'll leave it as a follow up when we need it.

@adityamaru adityamaru force-pushed the export-request-hookup branch 2 times, most recently from 4a43407 to 4e57fb8 Compare November 23, 2022 20:44
@adityamaru
Copy link
Contributor Author

TFTRs!

bors r=stevendanna,erikgrinaker

@craig
Copy link
Contributor

craig bot commented Nov 24, 2022

Build failed:

@adityamaru
Copy link
Contributor Author

Unsure what happened here.

bors retry

@craig
Copy link
Contributor

craig bot commented Nov 24, 2022

Build failed:

@adityamaru
Copy link
Contributor Author

oh, rebasing:
panic: Multiple signatures have oid 2045: [pg_blocking_pids() -> int[] crdb_internal.fingerprint

This change adds a `crdb_internal.fingerprint` builtin
that accepts a `startTime`, `endTime`, `startKey` and `endKey`
to define the interval the user wants to fingerprint. The builtin
is powered by sending an ExportRequest with the defined intervals
but with the `ExportFingerprint` option set to true.

Setting this option on the ExportRequest means that instead of
writing all point and rangekeys to an SST and sending them back to
the client, command evaluation will use the newly introduced
`fingerprintWriter` (cockroachdb#90848) when exporting keys. This writer
computes an `fnv64` hash of the key/timestamp, value for each point key
and maintains a running XOR aggregate of all the point keys processed
as part of the ExportRequest. Rangekeys are not fingerprinted during
command evaluation, but instead returned to the client in a
pebble SST. This is because range keys do not have a stable,
discrete identity and so it is up to the caller to define a deterministic
ingerprinting scheme across all returned range keys.

The ExportRequest sent as part of this builtin does not set any DistSender
limit, thereby allowing concurrent execution across ranges. We are not
concerned about the ExportResponses growing too large since the SSTs
will only contain rangekeys that should be few in number. If this assumption
is proved incorrect in the future, we can revisit setting a `TargetBytes`
to the header of the BatchRequest.

Fixes: cockroachdb#89336

Release note: None
@adityamaru adityamaru force-pushed the export-request-hookup branch from 4e57fb8 to 1acfcc8 Compare November 24, 2022 16:57
@adityamaru
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Nov 24, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

c2c: add fingerprinting for internal testing
5 participants