builtins: add crdb_internal.fingerprint builtin #91124

adityamaru · 2022-11-02T13:36:39Z

This change adds a crdb_internal.fingerprint builtin
that accepts a startTime, endTime, startKey and endKey
to define the interval the user wants to fingerprint. The builtin
is powered by sending an ExportRequest with the defined intervals
but with the ExportFingerprint option set to true.

Setting this option on the ExportRequest means that instead of
writing all point and rangekeys to an SST and sending them back to
the client, command evaluation will use the newly introduced
fingerprintWriter (#90848) when exporting keys. This writer
computes an fnv64 hash of the key/timestamp, value for each point key
and maintains a running XOR aggregate of all the point keys processed
as part of the ExportRequest. Rangekeys are not fingerprinted during
command evaluation, but instead returned to the client in a
pebble SST. This is because range keys do not have a stable,
discrete identity and so it is up to the caller to define a deterministic
fingerprinting scheme across all returned range keys.
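To make the point-key scheme above concrete, here is a minimal sketch of hashing each key/timestamp/value and folding the hashes into a running XOR. It is not the actual `fingerprintWriter`: the `kv` type, the `hashKV` and `fingerprint` helpers, the big-endian timestamp encoding, and the choice of FNV-1 (`fnv.New64`) over FNV-1a are all assumptions made for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// kv is a hypothetical stand-in for one versioned point key.
// It is not a CockroachDB type.
type kv struct {
	key   []byte
	wall  int64 // timestamp, simplified to a single integer
	value []byte
}

// hashKV returns a 64-bit FNV-1 hash over key, timestamp and value.
func hashKV(e kv) uint64 {
	h := fnv.New64()
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], uint64(e.wall))
	// Writes to a hash.Hash never return an error.
	h.Write(e.key)
	h.Write(ts[:])
	h.Write(e.value)
	return h.Sum64()
}

// fingerprint XORs the per-key hashes into one aggregate.
func fingerprint(kvs []kv) uint64 {
	var fp uint64
	for _, e := range kvs {
		fp ^= hashKV(e)
	}
	return fp
}

func main() {
	a := kv{key: []byte("/a"), wall: 1, value: []byte("x")}
	b := kv{key: []byte("/b"), wall: 2, value: []byte("y")}
	// XOR is commutative, so iteration order does not matter.
	fmt.Println(fingerprint([]kv{a, b}) == fingerprint([]kv{b, a})) // prints true
}
```

Because XOR is commutative and associative, the aggregate does not depend on the order in which keys are processed, which is what lets per-range results be combined later.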

The ExportRequest sent as part of this builtin does not set any DistSender
limit, thereby allowing concurrent execution across ranges. We are not
concerned about the ExportResponses growing too large since the SSTs
will only contain rangekeys that should be few in number. If this assumption
is proved incorrect in the future, we can revisit setting a TargetBytes
to the header of the BatchRequest.

Fixes: #89336

Release note: None

pkg/sql/sem/builtins/builtins.go

adityamaru · 2022-11-17T21:32:10Z

Pushing this out to get some initial comments, and ideas for more interesting cases to test.

stevendanna

Overall the test you wrote is about what I expected to see.

I suppose we could generate range tombstones through actual sql operations and then assert that splits and merges don't affect the fingerprint.

If we modified this to create tenant-agnostic fingerprints by default, we could also integrate it into a bunch (all?) of the tenant to tenant tests in a follow-up PR.

pkg/sql/sem/builtins/builtins.go

pkg/kv/kvserver/batcheval/cmd_export.go

pkg/sql/sem/builtins/fingerprint_builtin_test.go

adityamaru · 2022-11-23T14:43:27Z

friendly ping @stevendanna @erikgrinaker, if there are no blocking comments then I'd like to start using this in our C2C tests to shake out bugs/issues.

> I suppose we could generate range tombstones through actual sql operations and then assert that splits and merges don't affect the fingerprint.

The test does have a case where it issues an admin split and ensures that we see two ExportRequests that are then combined by distsender. I think we'll also see more coverage once the C2C tests that run SQL operations writing tombstones start comparing fingerprints.
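As a rough illustration of why a split should not change the point-key fingerprint, the sketch below computes one XOR aggregate over a whole span and compares it with the XOR of two per-range aggregates. How the responses are really combined is not shown in this thread; the `hashEntry`, `rangeFingerprint` and `combine` helpers and the string-encoded entries are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashEntry hashes one entry; entries are plain strings here for brevity.
func hashEntry(s string) uint64 {
	h := fnv.New64()
	h.Write([]byte(s)) // hash.Hash writes never return an error
	return h.Sum64()
}

// rangeFingerprint is the XOR aggregate a single range would report.
func rangeFingerprint(entries []string) uint64 {
	var fp uint64
	for _, e := range entries {
		fp ^= hashEntry(e)
	}
	return fp
}

// combine merges per-range fingerprints, assuming XOR is the merge rule.
func combine(parts ...uint64) uint64 {
	var fp uint64
	for _, p := range parts {
		fp ^= p
	}
	return fp
}

func main() {
	all := []string{"a@1=x", "b@2=y", "c@3=z"}
	whole := rangeFingerprint(all)
	// Simulate a split after the first key: two ranges, two partial results.
	split := combine(rangeFingerprint(all[:1]), rangeFingerprint(all[1:]))
	fmt.Println(whole == split) // prints true
}
```

The same property holds for any split point, since XOR is associative; it says nothing about range keys, which are handled separately.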

stevendanna

Overall this looks reasonable to me. Thanks for working on it!

erikgrinaker · 2022-11-23T16:24:53Z

Sorry, I'm running a bit behind on code reviews, will have a quick look now.

erikgrinaker

A few issues that should be straightforward to fix (and some nits that you can ignore at will), feel free to merge once resolved.

pkg/roachpb/api.proto

pkg/storage/fingerprint_writer.go

erikgrinaker · 2022-11-23T16:41:47Z

pkg/storage/fingerprint_writer.go

+		} else if !ok {
+			break
+		}
+		hasPoint, _ := iter.HasPointAndRange()


This likely doesn't matter here at all, but combined point/range key iteration is usually a fair bit more expensive than iterating over them separately. We could set up a point-only iterator at the start of the function, and check if a seek lands on a valid position (found a point key), and then use range-only iteration for the fingerprinting.

This isn't going to matter unless a span has a bunch of range keys, which we don't really expect to see, but I suppose it could happen, e.g. in the case of import cancellations of coarsely interleaved data. Feel free to ignore this or leave a comment for later.

ahh TIL, changed to first use a point iter to assert we don't have any point keys.

erikgrinaker · 2022-11-23T16:46:09Z

pkg/storage/fingerprint_writer.go

+			if err := fw.hash(fw.timestampBuf); err != nil {
+				return 0, err
+			}
+			if err := fw.hashValue(v.Value); err != nil {


Here, we're fingerprinting the encoded value including the MVCCValueHeader (if any), while MVCCExportToFingerprint fingerprints the inner roachpb.Value contained in MVCCValue.Value.RawBytes. We should do the same here, by decoding the MVCCValue first.

This is particularly important because the MVCCValueHeader may contain data that isn't guaranteed to be the same across clusters or datasets, such as the MVCCValueHeader.LocalTimestamp, even though the SQL user data is identical.

This deserves a test case, where datasets with differing (or empty/non-empty) value headers yield identical fingerprints, both for point keys and range keys. TestMVCCHistories can generate this by passing localTs with a value below ts to put or del_range_ts.

Good catch!

nice catch indeed, I think we already have the test you outlined for point keys in cockroach/pkg/storage/testdata/mvcc_histories/export_fingerprint_tenant (line 11 in d316af0):

put k=/b ts=2 v=b localTs=4 tenant-prefix=11 init-checksum

Since the fingerprints for tenant 10 and tenant 11 are the same after stripping tenant prefix and checksum it proves that we don't fingerprint the localTS iiuc.

Now that we have FingerprintRangekeys I think I can teach TestMVCCHistories to also fingerprint rangekeys instead of printing out the rangekeys. Let me try that.

Okay, tweaked the datadriven driver to compute the rangekey fingerprint and XOR it with the point key fingerprint instead of printing rangekeys. I also added a test to export_fingerprint_tenant where the rangekey in tenant 10 has a localTS but an identical rangekey in tenant 11 doesn't. The fingerprints continue to match, which proves we are discarding the MVCCValueHeader before fingerprinting.
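The idea discussed in this thread, hashing only the inner value and not the header, can be sketched with a toy encoding. This is not the real MVCCValue format: the one-byte length prefix and the `encode`, `rawValue` and `fingerprintValue` helpers are invented for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
)

// encode builds a toy value: one length byte, the header, then the raw value.
// Headers must be shorter than 256 bytes in this toy format.
func encode(header, raw []byte) []byte {
	out := append([]byte{byte(len(header))}, header...)
	return append(out, raw...)
}

// rawValue strips the toy header and returns only the inner value.
func rawValue(enc []byte) ([]byte, error) {
	if len(enc) == 0 || int(enc[0]) > len(enc)-1 {
		return nil, errors.New("malformed toy value")
	}
	return enc[1+int(enc[0]):], nil
}

// fingerprintValue hashes the inner value only, ignoring any header.
func fingerprintValue(enc []byte) (uint64, error) {
	raw, err := rawValue(enc)
	if err != nil {
		return 0, err
	}
	h := fnv.New64()
	h.Write(raw) // hash.Hash writes never return an error
	return h.Sum64(), nil
}

func main() {
	withHeader := encode([]byte("localTs=4"), []byte("b"))
	noHeader := encode(nil, []byte("b"))
	a, _ := fingerprintValue(withHeader)
	b, _ := fingerprintValue(noHeader)
	// Same user data, different headers, same fingerprint.
	fmt.Println(a == b) // prints true
}
```

Hashing the full encoded bytes instead would make the two values above differ, which is the cross-cluster mismatch the review comment warns about.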

erikgrinaker · 2022-11-23T16:53:35Z

pkg/sql/sem/builtins/builtins.go

+		},
+		tree.Overload{
+			Types: tree.ArgTypes{
+				{"span", types.BytesArray},


Don't we usually pass the start/end keys separately to functions like these? No idea what the convention or recommendation is, just wondering.

I think crdb_internal.scan has an overload for both passing in a span or a start/end key. I was optimizing for the use case of crdb_internal.fingerprint(crdb_internal.tenant_span(<id>)) or crdb_internal.fingerprint(crdb_internal.table_span(...)) but it's likely we'll add the other overload soon enough. I'll leave it as a follow-up when we need it.

pkg/sql/sem/builtins/builtins.go

pkg/storage/fingerprint_writer.go

pkg/sql/sem/builtins/builtins.go

adityamaru · 2022-11-24T14:31:19Z

TFTRs!

bors r=stevendanna,erikgrinaker

craig · 2022-11-24T14:57:38Z

Build failed:

Bazel Essential CI (Cockroach)

adityamaru · 2022-11-24T16:12:37Z

Unsure what happened here.

bors retry

craig · 2022-11-24T16:21:51Z

Build failed:

Bazel Essential CI (Cockroach)

adityamaru · 2022-11-24T16:30:54Z

oh, rebasing:
panic: Multiple signatures have oid 2045: [pg_blocking_pids() -> int[] crdb_internal.fingerprint

adityamaru · 2022-11-24T18:46:11Z

bors r+

craig · 2022-11-24T19:36:57Z

Build succeeded:

Bazel Essential CI (Cockroach)

adityamaru force-pushed the export-request-hookup branch 2 times, most recently from cc0d368 to 89cac0f Compare November 2, 2022 16:20

rafiss reviewed Nov 2, 2022

adityamaru mentioned this pull request Nov 2, 2022

batcheval: add option to trim tenant prefix and value metadata before fingerprinting #91150

Closed

shermanCRL requested a review from baoalvin1 November 7, 2022 18:54

adityamaru force-pushed the export-request-hookup branch from 89cac0f to be88a29 Compare November 17, 2022 21:29

adityamaru requested review from stevendanna and erikgrinaker November 17, 2022 21:31

adityamaru marked this pull request as ready for review November 17, 2022 21:31

adityamaru requested review from a team as code owners November 17, 2022 21:31

adityamaru requested a review from a team November 17, 2022 21:31

adityamaru changed the title ~~[WIP] builtins: add crdb_internal.fingerprint builtin~~ builtins: add crdb_internal.fingerprint builtin Nov 17, 2022

adityamaru force-pushed the export-request-hookup branch from be88a29 to 084cc68 Compare November 17, 2022 22:32

stevendanna reviewed Nov 18, 2022

adityamaru force-pushed the export-request-hookup branch from 084cc68 to de7237f Compare November 18, 2022 17:50

adityamaru requested a review from stevendanna November 18, 2022 17:52

adityamaru force-pushed the export-request-hookup branch from de7237f to 67047b8 Compare November 23, 2022 02:57

stevendanna approved these changes Nov 23, 2022

shermanCRL mentioned this pull request Nov 23, 2022

c2c: add fingerprinting for internal testing #89336

Closed

erikgrinaker approved these changes Nov 23, 2022

adityamaru force-pushed the export-request-hookup branch 2 times, most recently from 4a43407 to 4e57fb8 Compare November 23, 2022 20:44

adityamaru force-pushed the export-request-hookup branch from 4e57fb8 to 1acfcc8 Compare November 24, 2022 16:57

craig bot merged commit b5be006 into cockroachdb:master Nov 24, 2022

mgartner mentioned this pull request Feb 15, 2023

sql: crdb_internal.fingerprint panics with null arguments #97097

Closed

yuzefovich mentioned this pull request Nov 25, 2024

sql: disable gossip-based physical planning by default #135034

Merged
