[persist] Structured file format #31080

bkirwi · 2025-01-16T23:52:06Z

Motivation

Adds a new on-disk file format for Persist - the non-dual-write structured-data-only version.

We'd like this format to be supported in our on-prem releases, so folks aren't stuck with the migration midpoint.

Tips for reviewer

I've intentionally left this off in CI outside of the parallel workload. For our existing cloud envs I think it's most important to verify that the (substantial) refactorings here don't cause any regression for existing flag settings. However, it is enabled in the parallel workload, so we should be able to spot any bugs. (And empirically those tests are good at finding bugs in this corner of Persist.)

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

def- · 2025-01-17T22:01:47Z

I've intentionally left this off in CI outside of the parallel workload. For our existing cloud envs I think it's most important to verify that the (substantial) refactorings here don't cause any regression for existing flag settings. However, it is enabled in the parallel workload, so we should be able to spot any bugs. (And empirically those tests are good at finding bugs in this corner of Persist.)

One risk here is that what the parallel workload finds is usually harder to reproduce than failures in other tests. I'll trigger a one-off run at least to see what would happen if we default all of CI to structured. Edit: Done: #31101

bkirwi · 2025-01-17T22:03:24Z

Aight! Note that feature benchmarks and upgrade tests are very likely to fail; unsure about the rest of them.

def- · 2025-01-17T22:06:26Z

That kind of failure is ok. I'm more hoping to catch some wrong result or panic.

def-

No wrong results, no panics, just some timeouts and the expected upgrade errors. So all good from my side.

bkirwi · 2025-01-21T15:04:40Z

Found and fixed the performance regression - this should be ready for review.

(Nightly run with clean benchmarks. The only failure is in data ingest, which I think is the recent Azure issue?)

def- · 2025-01-21T15:14:25Z

(Nightly run with clean benchmarks. The only failure is in data ingest, which I think is the recent Azure issue?)

Correct, please ignore.

ParkMyCar

WOOHOO!

ParkMyCar · 2025-01-22T21:55:56Z

src/persist-client/src/fetch.rs

+            let next = if self.part_cursor < self.timestamps.len() {
+                let next_idx = self.part_cursor;
+                self.part_cursor += 1;
+                let mut t = T::decode(self.timestamps.values()[next_idx].to_le_bytes());


This is probably fine, but I'm a bit surprised to see a .to_le_bytes() here? Feels like a bit of an abstraction leakage

bkirwi · 2025-01-24

Added a comment about this. It's an artifact of not always reading data from ColumnarRecords, where this code used to live. The comment notes that we should re-encapsulate it once we're all in on structured data, if that feels like enough to cover it?
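For readers following this thread, the re-encapsulation being discussed could look roughly like the sketch below. It is only an illustration under stated assumptions, not the code in this PR: `FixedWidthCodec` and `TimestampColumn` are hypothetical names, the trait is a local stand-in for whatever provides `T::decode` in the diff, and a plain `&[i64]` slice stands in for the result of `self.timestamps.values()`. The idea is that the `.to_le_bytes()` call lives in one place, so callers only see decoded timestamps.

```rust
/// Hypothetical stand-in for the trait behind `T::decode` in the diff:
/// a type that round-trips through exactly eight bytes.
trait FixedWidthCodec: Sized {
    fn encode(&self) -> [u8; 8];
    fn decode(bytes: [u8; 8]) -> Self;
}

impl FixedWidthCodec for u64 {
    fn encode(&self) -> [u8; 8] {
        self.to_le_bytes()
    }
    fn decode(bytes: [u8; 8]) -> Self {
        u64::from_le_bytes(bytes)
    }
}

/// Hypothetical wrapper over a column of raw `i64` timestamps.
/// The byte-order detail is confined to `get`.
struct TimestampColumn<'a> {
    raw: &'a [i64],
}

impl<'a> TimestampColumn<'a> {
    fn new(raw: &'a [i64]) -> Self {
        Self { raw }
    }

    fn len(&self) -> usize {
        self.raw.len()
    }

    /// Decodes the timestamp at `idx`, or returns `None` if `idx` is out of bounds.
    fn get<T: FixedWidthCodec>(&self, idx: usize) -> Option<T> {
        self.raw.get(idx).map(|raw| T::decode(raw.to_le_bytes()))
    }
}

/// Walks the column with a cursor, similar in spirit to `part_cursor` in the diff.
fn decode_all<T: FixedWidthCodec>(col: &TimestampColumn<'_>) -> Vec<T> {
    let mut out = Vec::with_capacity(col.len());
    let mut cursor = 0;
    while let Some(t) = col.get::<T>(cursor) {
        out.push(t);
        cursor += 1;
    }
    out
}
```

With a wrapper along these lines, the iterator body would call something like `col.get(next_idx)` and never touch the raw bytes directly.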

bkirwi force-pushed the structured-on-disk-2 branch 2 times, most recently from ef767c6 to 60af8bb January 17, 2025 19:25

bkirwi marked this pull request as ready for review January 17, 2025 19:47

bkirwi requested review from a team as code owners January 17, 2025 19:47

def- approved these changes Jan 18, 2025

bkirwi added 10 commits January 21, 2025 10:03

Allow encoding batches with only structured data

1274067

Allow decoding structured-only data

7a12e7d

Add a new batch format variant

d8352f6

Enable structured-only writes in CI

66bcbaf

Decode structured-only data from proto

2bf7edc

Fill in codec data at read time where needed

668fbc3

Validate method now always has decoded types available

74c5052

Normalize away the encoded part

27438aa

We built this for the streaming iterator, but it works well here too.

Make decoding codec data optional in FetchedPart

6078362

Refactor FetchedPart to avoid as much decoding as possible

e60157d

In particular, don't decode K/Vs when we have an override set, and make sure every K/V only gets decoded once.

bkirwi force-pushed the structured-on-disk-2 branch from 43bc2c2 to e60157d January 21, 2025 15:03

bkirwi requested a review from ParkMyCar January 21, 2025 20:48

ParkMyCar approved these changes Jan 22, 2025

Comments suggested by review

41d0aea

bkirwi Jan 24, 2025
