Stop using arrow unions #6388

Open
emilk opened this issue May 20, 2024 · 4 comments
Labels
🏹 arrow (Apache Arrow) · codegen/idl · project (Tracking issues for so-called "Projects") · 🎄 tracking issue (issue that tracks a bunch of subissues)

Comments

@emilk
Member

emilk commented May 20, 2024

Why

Arrow unions have downsides:

  • slow serialization/deserialization (no zero-copy)
  • hard to codegen (especially for Python)
  • complex for users who want the raw arrow data (see the sketch below)
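
As a small illustration of the last point, here is a minimal pyarrow sketch (illustrative only, not Rerun code) contrasting a plain primitive column with a dense union column:

import numpy as np
import pyarrow as pa

# A plain primitive array: converts to numpy without copying.
plain = pa.array([1.0, 2.0, 3.0], type=pa.float32())
view = plain.to_numpy(zero_copy_only=True)

# A dense union mixing two payload types: every consumer now has to
# dispatch on type codes and value offsets before touching any value.
types = pa.array([0, 1, 0], type=pa.int8())
offsets = pa.array([0, 0, 1], type=pa.int32())
children = [
    pa.array([1.0, 3.0], type=pa.float32()),
    pa.array([2], type=pa.uint8()),
]
union = pa.UnionArray.from_dense(types, offsets, children)

# There is no zero-copy path to numpy here; reading element i means
# something like (type codes happen to match child indices in this sketch):
#   child = union.field(union.type_codes[i].as_py())
#   value = child[union.value_offsets[i].as_py()]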

TODO

  • TimeRangeBoundary
  • TensorBuffer

Related

@teh-cmc
Member

teh-cmc commented May 24, 2024

Punting on datatype conversions by simplifying types

archetype Image {
    pixel_buffer: PixelBuffer,
    pixel_format: PixelFormat,
    resolution: Resolution2D,
    stride: Option<PixelStride>,
}

enum PixelFormat {
    /* Image formats */
    RGBA8_SRGB_22,
    RG32F,
    NV12,
    // ...

    /* Depth formats */
    F16,
    F32,
    F32_LINEAR_XXX,
    // ...

    /* Segmentation formats */
    U8,
    U16,
    // ...
}

archetype ImageEncoded {
    blob: ImageBlob,
    media_type: Option<MediaType>,
}

archetype DepthImage {
    depth_buffer: PixelBuffer,
    depth_format: PixelFormat,
    depth_meter: DepthMeter,
    resolution: Resolution2D,
    stride: Option<PixelStride>,
}

archetype SegmentationImage {
    buffer: PixelBuffer,
    buffer_format: PixelFormat,
    resolution: Resolution2D,
    stride: Option<PixelStride>,
}

component PixelStride {
    bytes_per_row: u32,
    bytes_per_plane: Option<u32>,
}
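
A minimal pyarrow sketch of how the proposed Image archetype could map to plain, union-free arrow columns (names and exact layouts here are assumptions, not the real generated schema):

import pyarrow as pa

# Hypothetical per-row columns; every one of them is a plain arrow type.
image_schema = pa.schema([
    ("pixel_buffer", pa.list_(pa.uint8())),        # raw bytes
    ("pixel_format", pa.uint8()),                  # enum encoded as an integer code
    ("resolution",   pa.list_(pa.uint32(), 2)),    # fixed-size [width, height]
    ("stride",       pa.struct([
        ("bytes_per_row",   pa.uint32()),
        ("bytes_per_plane", pa.uint32()),          # nullable, like Option<u32>
    ])),
])
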
  • Tensor:
    We generate archetypes and components for all tensor variants (TensorF32, TensorU8, etc.) and make sure they share the same indicator (see the sketch after this list).
archetype TensorU8 {
    buffer: BufferU8,

    // One of these
    shape: TensorShape,
    shape: Vec<TensorDimension>,
}

component BufferU8 {
    data: [u8],
}

archetype TensorF32 {
    buffer: BufferF32,

    // One of these
    shape: TensorShape,
    shape: Vec<TensorDimension>,
}

component BufferF32 {
    data: [f32],
}
  • Mesh3D (more specifically, the embedded albedo texture):
    We stay away from data-oriented entity path references because they A) are effectively promises and B) prevent us from knowing what we're querying ahead of time.
    Nothing changes about the archetype itself (we just remove the TensorData): logging an Image (or an ImageEncoded) at the same path is now the approved ™️ way of setting up an albedo texture.
    PRs are welcome for SDK-side helpers to do this.

  • Transform3D:

// Two possibilities:
// - Only legal to set one of them
// - Or apply them all in deterministic order
archetype Transform {
    mat4: Option<Mat4>,
    translation: Option<Translation3>,
    mat3: Option<Mat3>,
    rotation: Option<Rotation3D>,
    scale3: Option<Scale3D>,
    scale: Option<Scale>,
}
  • AnnotationContext:
// TODO: Separate the skeleton stuff in its own archetype -- figure it out.
archetype AnnotationContext {
    class_ids: Vec<ClassId>,
    colors: Vec<Color>,
    labels: Vec<Text>,
}
  • Scalars of different sizes:
    We generate archetypes for all scalar variants (ScalarF32, ScalarU8, etc.) and make sure they share the same indicator.
    At some point, we should actually generate templated types in the target languages, if only for sanity.
archetype ScalarU8 {
    value: ScalarU8,
}

component ScalarU8 {
    value: u8,
}

archetype ScalarF32 {
    value: ScalarF32,
}

component ScalarF32 {
    value: f32,
}
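
A minimal pyarrow sketch of the per-dtype buffers above (names are illustrative, not the generated schema). Each variant is a plain list-array of primitives, so the flattened values convert to numpy without copying:

import numpy as np
import pyarrow as pa

# One row per logged batch; one column per dtype variant.
buffer_u8  = pa.array([np.arange(6, dtype=np.uint8)],           type=pa.list_(pa.uint8()))
buffer_f32 = pa.array([np.linspace(0, 1, 6, dtype=np.float32)], type=pa.list_(pa.float32()))

u8_values  = buffer_u8.flatten().to_numpy(zero_copy_only=True)   # zero-copy
f32_values = buffer_f32.flatten().to_numpy(zero_copy_only=True)  # zero-copy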

Conclusion

  • No datatype conversions
  • No heterogeneous cells and stuff
  • Massively improves the raw arrow experience ™️
  • No more guessing pixel formats from tensor shapes (or only as a last resort, at the very least)

Punting on field accessor DSL by simplifying types

  • Killing field selection DSL:
    Offer ways to "augment" chunks, i.e. derive new chunks from existing chunks, adding arbitrary extra columns in the process.
    This can happen at log time (SDK-side), offline, or server-side (ingestion time, fetching time), or whenever.
    It doesn't matter when: the user gets notified of the chunks, and is free to add any list-arrays of their own making.
    Example:
  • User logs some structured data { velocity: f32, konfidence: bool }.
  • User now wants to plot velocity, but isn't able to re-log the data for whatever reason.
  • User augments the chunk and/or creates a new one with a velocity column: just do the struct extraction and copy the data into a dedicated column (see the sketch below this list).
    Update: either augment a chunk or create a new one.
    Update 2: always create a new one.
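
A hedged pyarrow sketch of the augmentation step in the example above (the "velocity"/"konfidence" names come from the example; the rest is just one way to do the extraction):

import pyarrow as pa

# The originally logged struct column.
logged = pa.table({
    "data": pa.array(
        [{"velocity": 1.5, "konfidence": True},
         {"velocity": 2.0, "konfidence": False}],
        type=pa.struct([("velocity", pa.float32()), ("konfidence", pa.bool_())]),
    )
})

# Pull the struct field out and materialize it as a dedicated column
# in a new table (per "Update 2": always create a new one).
velocity = logged.column("data").combine_chunks().field("velocity")
derived = logged.append_column("velocity", velocity)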

Conclusion

  • More powerful than any DSL that is implementable in the mid-term
  • "Totally valid workaround for not having a DSL" -- Someone said that, names were redacted
  • 👍

Other random killings

  • data-oriented entity references:
    We don't do those -- they are akin to promises and require inspecting the data to know the query plan, which is a no-no at the moment.

  • blueprint entity references:
    Maybe at some point -- doesn't really matter, it's very orthogonal to everything else.
    Far simpler than data-oriented references anyhow.

  • Clear:
    oh god

@jleibs
Member

jleibs commented May 27, 2024

Some additional notes on the above:

Why should Images use an untyped buffer + pixel format while tensors use a typed buffer?

While at first glance this proposal might seem to introduce an inconsistency, in practice it serves to highlight the fundamental differences between these two approaches to data representation.

Images are a way of describing a (possibly multi-channel) pixel value over a 2D image plane.

Images are almost always specifically grounded in data received from sensors or sent to displays. This usage, as it relates to purpose-built hardware, has given rise to pragmatic ways of describing these pixel values more efficiently for purposes of implementation. It is not uncommon for pixel encodings to pack data in ways that simply don't align with a uniform-shape tensor representation; see chroma subsampling, Bayer patterns, etc. It is also quite common to consider an approximate or interpolated pixel value, since the data is inherently 2D-spatial.

As such, a raw buffer + image encoding really is the most authentic representation we can achieve. For many low-level image libraries or sensor drivers, we should be able to directly map this structure to an API that lets us access or load the raw image buffer + some metadata.

On the other hand, Tensors are much more generally mapped to multi-dimensional arrays. They are often used in pure data and computational contexts that have nothing to do with images. Due to the wildly varied applications, the patterns of tensor compression (beyond things like run-length encoding, or sparse/dense representation) are much more varied and domain-specific. This means there simply aren't equivalent forms of tensor encoding that are as common/applicable as what you see in images. In this case, a strongly typed buffer of primitives dramatically simplifies questions of indexing and tensor-value access. This is the exact approach taken by the Arrow tensor spec (https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor). Again, most tensor libraries work under this assumption, and so feeding a tensor library from a typed buffer + shape will be the most natural way to work with this data.
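
A minimal sketch of that consumption pattern, assuming a typed values buffer plus a separate shape (this is just the idea, not the canonical extension's exact schema):

import numpy as np
import pyarrow as pa

# Typed buffer + shape: reconstruct an ndarray with no unions and no casting.
buffer = pa.array(np.arange(12, dtype=np.float32))
shape = (3, 4)

tensor = buffer.to_numpy(zero_copy_only=True).reshape(shape)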

What about "RGB" Tensors?

All that said, it's still a very common pattern for a user to decode an image into an HxWxC (or CxHxW) tensor. And this is, in fact, what many users will expect to provide as an input. A numpy ndarray is a tensor -- not an image buffer.

Even for users working with images, whether the user expects to provide an Image (buffer + encoding) or a Tensor (ndarray) will heavily depend on where the user sits in the software stack of their organization.

Rather than fight against this, we may also want to support an "ImageTensor" archetype, which would be a Tensor datatype that we know stores the pixels of an image in one of the common tensor arrangements. This would not support any pixel-encoded images, only those that had already been decoded into multi-channel tensors.

@jleibs
Member

jleibs commented May 27, 2024

Most of the choices for working with tensors fall into one of the following categories.

Typed buffer, multiple data-types (the proposal)

Pros:

  • When processing a chunk, the raw arrow data is much easier to work with
  • Opportunity to align with the official arrow spec for tensor representation
  • Aligns with our long-term direction of wanting to have multiple types and datatype conversions

Cons:

  • Multi-datatype representation means we must either proliferate typed components or introduce datatype conversions.

The current hypothesis is that proliferating types is a known challenge and can be mostly automated with a mixture of code-gen and some helper code, whereas datatype conversions are an unknown challenge.

Still, this puts us on a pathway where, once we support multi-typed components, we mostly delete a bunch of code and everything gets simpler. Any type conversions move from visualizer-space to data-query-space, but the types and arrow representations we work with don't actually need to change.
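
For example, a query-time conversion over a typed column is a single, well-supported arrow compute call (pyarrow sketch, not Rerun API):

import pyarrow as pa
import pyarrow.compute as pc

scalars_u8 = pa.array([1, 2, 3], type=pa.uint8())
scalars_f64 = pc.cast(scalars_u8, pa.float64())   # cast at query time, not in the visualizer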

Untyped buffer with type-id

Pros

  • Avoids arrow unions while maintaining a single datatype.

Cons

  • Forces arrow users to do annoying user-space datatype casting.
  • Doesn't align with our long-term goals

Typed buffer with union

Pros

  • Status quo. Already works.

Cons

  • Forces arrow users to do annoying, poorly supported union operations when loading or reading tensors.

emilk added the 🏹 arrow (Apache Arrow) label on Jul 8, 2024
emilk assigned jleibs and Wumpf and unassigned jleibs on Jul 8, 2024