Stop using arrow unions #6388

Open
emilk opened this issue May 20, 2024 · 4 comments
Labels
🏹 arrow (Apache Arrow) · codegen/idl · project (Tracking issues for so-called "Projects") · 🎄 tracking issue (issue that tracks a bunch of subissues)

Comments

@emilk
Member

emilk commented May 20, 2024

Why

Arrow unions have downsides:

  • slow serialization/deserialization (no zero-copy)
  • hard to codegen (especially for Python)
  • complex for users who want the raw arrow data (see the sketch below)
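
As a small illustration of the last point, here is a minimal pyarrow sketch (illustrative only, not Rerun code) contrasting a plain primitive column with a dense union column:

import numpy as np
import pyarrow as pa

# A plain primitive array: converts to numpy without copying.
plain = pa.array([1.0, 2.0, 3.0], type=pa.float32())
view = plain.to_numpy(zero_copy_only=True)

# A dense union mixing two payload types: every consumer now has to
# dispatch on type codes and value offsets before touching any value.
types = pa.array([0, 1, 0], type=pa.int8())
offsets = pa.array([0, 0, 1], type=pa.int32())
children = [
    pa.array([1.0, 3.0], type=pa.float32()),
    pa.array([2], type=pa.uint8()),
]
union = pa.UnionArray.from_dense(types, offsets, children)

# There is no zero-copy path to numpy here; reading element i means
# something like (type codes happen to match child indices in this sketch):
#   child = union.field(union.type_codes[i].as_py())
#   value = child[union.value_offsets[i].as_py()]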

TODO

  • TimeRangeBoundary
  • TensorBuffer

Related

@teh-cmc
Member

teh-cmc commented May 24, 2024

Punting on datatype conversions by simplifying types

archetype Image {
    pixel_buffer: PixelBuffer,
    pixel_format: PixelFormat,
    resolution: Resolution2D,
    stride: Option<PixelStride>,
}

enum PixelFormat {
    /* Image formats */
    RGBA8_SRGB_22,
    RG32F,
    NV12,
    // ...

    /* Depth formats */
    F16,
    F32,
    F32_LINEAR_XXX,
    // ...

    /* Segmentation formats */
    U8,
    U16,
    // ...
}

archetype ImageEncoded {
    blob: ImageBlob,
    media_type: Option<MediaType>,
}

archetype DepthImage {
    depth_buffer: PixelBuffer,
    depth_format: PixelFormat,
    depth_meter: DepthMeter,
    resolution: Resolution2D,
    stride: Option<PixelStride>,
}

archetype SegmentationImage {
    buffer: PixelBuffer,
    buffer_format: PixelFormat,
    resolution: Resolution2D,
    stride: Option<PixelStride>,
}

component PixelStride {
    bytes_per_row: u32,
    bytes_per_plane: Option<u32>,
}
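
A minimal pyarrow sketch of how the proposed Image archetype could map to plain, union-free arrow columns (names and exact layouts here are assumptions, not the real generated schema):

import pyarrow as pa

# Hypothetical per-row columns; every one of them is a plain arrow type.
image_schema = pa.schema([
    ("pixel_buffer", pa.list_(pa.uint8())),        # raw bytes
    ("pixel_format", pa.uint8()),                  # enum encoded as an integer code
    ("resolution",   pa.list_(pa.uint32(), 2)),    # fixed-size [width, height]
    ("stride",       pa.struct([
        ("bytes_per_row",   pa.uint32()),
        ("bytes_per_plane", pa.uint32()),          # nullable, like Option<u32>
    ])),
])
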
  • Tensor:
    We generate archetypes and components for all tensor variants (TensorF32, TensorU8, etc.) and make sure they share the same indicator (see the sketch after this list).
archetype TensorU8 {
    buffer: BufferU8,

    // One of these
    shape: TensorShape,
    shape: Vec<TensorDimension>,
}

component BufferU8 {
    data: [u8],
}

archetype TensorF32 {
    buffer: BufferF32,

    // One of these
    shape: TensorShape,
    shape: Vec<TensorDimension>,
}

component BufferF32 {
    data: [f32],
}
  • Mesh3D (more specifically, the embedded albedo texture):
    We stay away from data-oriented entity path references because they A) are effectively promises and B) prevent us from knowing what we're querying ahead of time.
    Nothing changes about the archetype itself (we just remove the TensorData): logging an Image (or an ImageEncoded) at the same path is now the approved ™️ way of setting up an albedo texture.
    PRs are welcome for SDK-side helpers to do this.

  • Transform3D:

// Two possibilities:
// - Only legal to set one of them
// - Or apply them all in deterministic order
archetype Transform {
    mat4: Option<Mat4>,
    translation: Option<Translation3>,
    mat3: Option<Mat3>,
    rotation: Option<Rotation3D>,
    scale3: Option<Scale3D>,
    scale: Option<Scale>,
}
  • AnnotationContext:
// TODO: Separate the skeleton stuff in its own archetype -- figure it out.
archetype AnnotationContext {
    class_ids: Vec<ClassId>,
    colors: Vec<Color>,
    labels: Vec<Text>,
}
  • Scalars of different sizes:
    We generate archetypes for all scalar variants (ScalarF32, ScalarU8, etc.) and make sure they share the same indicator.
    At some point, we should actually generate templated types in the target languages, if only for sanity.
archetype ScalarU8 {
    value: ScalarU8,
}

component ScalarU8 {
    value: u8,
}

archetype ScalarF32 {
    value: ScalarF32,
}

component ScalarF32 {
    value: f32,
}
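
A minimal pyarrow sketch of the per-dtype buffers above (names are illustrative, not the generated schema). Each variant is a plain list-array of primitives, so the flattened values convert to numpy without copying:

import numpy as np
import pyarrow as pa

# One row per logged batch; one column per dtype variant.
buffer_u8  = pa.array([np.arange(6, dtype=np.uint8)],           type=pa.list_(pa.uint8()))
buffer_f32 = pa.array([np.linspace(0, 1, 6, dtype=np.float32)], type=pa.list_(pa.float32()))

u8_values  = buffer_u8.flatten().to_numpy(zero_copy_only=True)   # zero-copy
f32_values = buffer_f32.flatten().to_numpy(zero_copy_only=True)  # zero-copy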

Conclusion

  • No datatype conversions
  • No heterogeneous cells and stuff
  • Massively improves the raw arrow experience ™️
  • No more guessing pixel formats from tensor shapes (or only as a last resort, at the very least)

Punting on field accessor DSL by simplifying types

  • Killing field selection DSL:
    Offer ways to "augment" chunks, i.e. derive new chunks from existing chunks, adding arbitrary extra columns in the process.
    This can happen at log time (SDK-side), offline, or server-side (ingestion time, fetching time), or whenever.
    It doesn't matter when: the user gets notified of the chunks, and is free to add any list-arrays of their own making.
    Example:
  • User logs some structured data { velocity: f32, konfidence: bool }.
  • User now wants to plot velocity, but isn't able to re-log the data for whatever reason.
  • User augments the chunk and/or creates a new one with a velocity column: just do the struct extraction and copy the data into a dedicated column (see the sketch below this list).
    Update: either augment a chunk or create a new one.
    Update 2: always create a new one.
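
A hedged pyarrow sketch of the augmentation step in the example above (the "velocity"/"konfidence" names come from the example; the rest is just one way to do the extraction):

import pyarrow as pa

# The originally logged struct column.
logged = pa.table({
    "data": pa.array(
        [{"velocity": 1.5, "konfidence": True},
         {"velocity": 2.0, "konfidence": False}],
        type=pa.struct([("velocity", pa.float32()), ("konfidence", pa.bool_())]),
    )
})

# Pull the struct field out and materialize it as a dedicated column
# in a new table (per "Update 2": always create a new one).
velocity = logged.column("data").combine_chunks().field("velocity")
derived = logged.append_column("velocity", velocity)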

Conclusion

  • More powerful than any DSL that is implementable in the mid-term
  • "Totally valid workaround for not having a DSL" -- Someone said that, names were redacted
  • 👍

Other random killings

  • data-oriented entity references:
    We don't do those -- they are akin to promises and require inspecting the data to know the query plan, which is a no-no at the moment.

  • blueprint entity references:
    Maybe at some point -- doesn't really matter, it's very orthogonal to everything else.
    Far simpler than data-oriented references anyhow.

  • Clear:
    oh god

@jleibs
Member

jleibs commented May 27, 2024

Some additional notes on the above:

Why should Images use an untyped buffer + pixel format while tensors use a typed buffer?

While at first glance this proposal might seem to introduce an inconsistency, in practice it serves to highlight the fundamental differences between these two approaches to data representation.

Images are a way of describing a (possibly multi-channel) pixel value over a 2D image plane.

Images are almost always specifically grounded in data received from sensors or sent to displays. This usage, as it relates to purpose-built hardware, has given rise to pragmatic ways of describing these pixel values more efficiently for purposes of implementation. It is not uncommon for pixel encodings to pack data in ways that simply don't align with a uniform-shape tensor representation; see chroma subsampling, Bayer patterns, etc. It is also quite common to consider an approximate or interpolated pixel value, since the data is inherently 2D-spatial.

As such, a raw buffer + image encoding really is the most authentic representation we can achieve. For many low-level image libraries or sensor drivers, we should be able to directly map this structure to an API that lets us access or load the raw image buffer + some metadata.

On the other hand, Tensors are much more generally mapped to multi-dimensional arrays. They are often used in pure data and computational contexts that have nothing to do with images. Due to the wildly varied applications, the patterns of tensor compression (beyond things like run-length encoding, or sparse/dense representation) are much more varied and domain-specific. This means there simply aren't equivalent forms of tensor encoding that are as common/applicable as what you see in images. In this case, a strongly typed buffer of primitives dramatically simplifies questions of indexing and tensor-value access. This is the exact approach taken by the Arrow tensor spec (https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor). Again, most tensor libraries work under this assumption, and so feeding a tensor library from a typed buffer + shape will be the most natural way to work with this data.
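
A minimal sketch of that consumption pattern, assuming a typed values buffer plus a separate shape (this is just the idea, not the canonical extension's exact schema):

import numpy as np
import pyarrow as pa

# Typed buffer + shape: reconstruct an ndarray with no unions and no casting.
buffer = pa.array(np.arange(12, dtype=np.float32))
shape = (3, 4)

tensor = buffer.to_numpy(zero_copy_only=True).reshape(shape)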

What about "RGB" Tensors?

All that said, it's still a very common pattern for a user to decode an image into an HxWxC (or CxHxW) tensor. And this is, in fact, what many users will expect to provide as an input. A numpy ndarray is a tensor -- not an image buffer.

Even for users working with images, whether the user expects to provide an Image (buffer + encoding) or a Tensor (ndarray) will heavily depend on where the user sits in the software stack of their organization.

Rather than fight against this, we may also want to support an "ImageTensor" archetype, which would be a Tensor datatype that we know stores the pixels of an image in one of the common tensor arrangements. This would not support any pixel-encoded images, only those that had already been decoded into multi-channel tensors.

@jleibs
Member

jleibs commented May 27, 2024

Most of the choices for working with tensors fall into one of the following categories.

Typed buffer, multiple data-types (the proposal)

Pros:

  • When processing a chunk, the raw arrow data is much easier to work with
  • Opportunity to align with the official arrow spec for tensor representation
  • Aligns with our long-term direction of wanting to have multiple types and datatype conversions

Cons:

  • Multi-datatype representation means we must either proliferate typed components or introduce datatype conversions.

The current hypothesis is that proliferating types is a known challenge and can be mostly automated with a mixture of code-gen and some helper code, whereas datatype conversions are an unknown challenge.

Still, this puts us on a pathway where, once we support multi-typed components, we mostly delete a bunch of code and everything gets simpler. Any type conversions move from visualizer-space to data-query-space, but the types and arrow representations we work with don't actually need to change.
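
For example, a query-time conversion over a typed column is a single, well-supported arrow compute call (pyarrow sketch, not Rerun API):

import pyarrow as pa
import pyarrow.compute as pc

scalars_u8 = pa.array([1, 2, 3], type=pa.uint8())
scalars_f64 = pc.cast(scalars_u8, pa.float64())   # cast at query time, not in the visualizer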

Untyped buffer with type-id

Pros

  • Avoids arrow unions while maintaining a single datatype.

Cons

  • Forces arrow users to do annoying user-space datatype casting.
  • Doesn't align with our long-term goals

Typed buffer with union

Pros

  • Status quo. Already works.

Cons

  • Forces arrow users to do annoying, poorly supported union operations when loading or reading tensors.

emilk added the 🏹 arrow (Apache Arrow) label on Jul 8, 2024
emilk assigned jleibs and Wumpf and unassigned jleibs on Jul 8, 2024