Support Deserializing Serde DataTypes to Arrow #3949

rtyler · 2023-03-26T03:42:10Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I have found that creating serde_json::Value objects is a lot easier than trying to construct RecordBatch objects. This is especially useful when using serde's deserialization code for data ingestion. Crates like [serde_arrow](https://crates.io/crates/serde_arrow) are outdated and add unnecessary complexity.

I have taken to using Decoder to convert Value objects to RecordBatches easily.

Describe the solution you'd like

Rename or move the Decoder so that it is no longer presented as a way to deserialize raw JSON buffers (the BufReader approach that it uses now), but rather as a way to convert pre-deserialized JSON Value objects.

Describe alternatives you've considered

If Decoder gets bounced out of the repo, I would probably just refactor it into its own crate and push that 😄

tustvold · 2023-03-26T12:11:25Z

The major reason for wanting to remove the old decoder, aside from its incredibly underwhelming performance, is that the code is very hard to maintain and reason about, in particular the way it handles nested types is very convoluted, and is likely also incorrect.

That being said, I wonder if we can accommodate your use-case in a different manner. You mention "using serde's deserialization code"; does this mean that the types are known statically ahead of time? I ask as providing a nicer story for creating RecordBatch from statically typed rows is something I have been playing around with, would this work for your use-case, or are the types not known at compile time?

rtyler · 2023-03-26T17:48:45Z

> the code is very hard to maintain and reason about, in particular the way it handles nested types is very convoluted

Yes I would agree with that assessment 😆

> creating RecordBatch from statically typed rows is something I have been playing around with, would this work for your use-case, or are the types not known at compile time?

Yes, the data is more or less known ahead of time. The work that I am doing is similar to what kafka-delta-ingest does, which is converting JSON data into Delta formatted data. In my case I want to deserialize into structs with Serde in order to perform some operations on the data. If I have a path to re-use serde's Value type, which is shared between serde_json and serde_yaml (and others), then I can use that as my "intermediary" format for deserializing known typed data before converting that into a RecordBatch for writing out to Delta.
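For illustration, the row-to-column pivot implied by "creating RecordBatch from statically typed rows" can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the arrow-rs API: `Reading`, `Columns`, and `rows_to_columns` are hypothetical names, and plain `Vec`s stand in for Arrow array builders.

```rust
// Hypothetical statically typed row; the field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct Reading {
    sensor: String,
    value: Option<f64>, // None models a null cell
}

// Plain Vec-backed stand-in for a set of Arrow column builders (assumption:
// a real implementation would use the arrow crate's builders instead).
#[derive(Debug, Default, PartialEq)]
struct Columns {
    sensor: Vec<String>,
    value: Vec<Option<f64>>,
}

// Pivot a slice of rows into one vector per field, preserving row order.
fn rows_to_columns(rows: &[Reading]) -> Columns {
    let mut cols = Columns::default();
    for row in rows {
        cols.sensor.push(row.sensor.clone());
        cols.value.push(row.value);
    }
    cols
}
```

Calling `rows_to_columns(&rows)` on a slice of `Reading` values yields one `Vec` per field, each with the same length as the input. Because the row type is known at compile time, the column set is fixed and no intermediate `Value` is needed.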

trueleo · 2023-03-29T10:26:05Z

> the code is very hard to maintain and reason about, in particular the way it handles nested types is very convoluted, and is likely also incorrect.

I understand that this can be a PITA to maintain, but I also agree with what @rtyler said:

> creating serde_json::Value objects is a lot easier than trying to construct RecordBatch objects.

At Parseable we want to support basic schema evolution at the top level, so we flatten JSON before converting it into a RecordBatch. To do that we need to convert it to serde_json::Value first (it would be amazing to have a flatten algorithm that does this at a very low level, but currently we don't have that).

So for us it is hard to move away from the Value -> RecordBatch flow. If the maintainers of arrow-rs feel strongly about deprecating this, then we will probably end up maintaining Decoder ourselves.

edit: I am curious to check how fast RawDecoder is compared to the older Decoder. Maybe we can justify serializing Value back to bytes.
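The top-level flattening described above could look roughly like the following. This is a minimal sketch under stated assumptions: the `Value` enum is a hand-rolled stand-in for a type like serde_json::Value so that the example needs only the standard library, and `flatten`, `flatten_into`, and the separator argument are hypothetical names, not part of arrow-rs or Parseable. Nested objects are hoisted to the top level, while arrays are passed through untouched.

```rust
use std::collections::BTreeMap;

// Minimal stand-in for a JSON value type such as serde_json::Value
// (assumption: only the variants needed for this sketch).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

// Recursively hoist nested object fields into `out`, joining key segments
// with `sep`. Scalars and arrays are copied through unchanged; an empty
// nested object contributes no keys.
fn flatten_into(prefix: &str, value: &Value, sep: &str, out: &mut BTreeMap<String, Value>) {
    match value {
        Value::Object(fields) => {
            for (key, child) in fields {
                let path = if prefix.is_empty() {
                    key.clone()
                } else {
                    format!("{}{}{}", prefix, sep, key)
                };
                flatten_into(&path, child, sep, out);
            }
        }
        other => {
            out.insert(prefix.to_string(), other.clone());
        }
    }
}

// Flatten one JSON object into a single-level map of dotted paths to values.
fn flatten(value: &Value, sep: &str) -> BTreeMap<String, Value> {
    let mut out = BTreeMap::new();
    flatten_into("", value, sep, &mut out);
    out
}
```

With `"."` as the separator, an input like `{"a": {"b": 1}, "d": "x"}` would come out with the keys `a.b` and `d`. A `BTreeMap` is used so that key order is deterministic; how to flatten lists is deliberately left out of the sketch.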

tustvold · 2023-03-29T10:31:12Z

> it'll be amazing to have a flatten algorithm that does this at very low level but currently we don't have that

Perhaps you could file a ticket for this, flattening structs at least is trivial. I'm not sure what schema transformation flattens lists though...

> I am curious to check how fast RawDecoder is compared to older Decoder

It's at least twice as fast, but the JSON writer is currently very inefficient, and so is likely to eat into that severely.

tustvold · 2023-03-30T10:09:41Z

#3979 contains a POC of how we could support this going forward, PTAL

…o `RawDecoder` (#3949) (#3979)

* Add serde support to RawDecoder (#3949)
* Clippy
* More examples
* Use BTreeMap for deterministic test output
* Use new Field constructors
* Review feedback

tustvold · 2023-04-07T12:12:00Z

label_issue.py automatically added labels {'arrow'} from #3979

rtyler added the enhancement Any new improvement worthy of a entry in the changelog label Mar 26, 2023

tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 26, 2023

Improve array builder documentation (apache#3949)

071ebd8

tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 26, 2023

Improve array builder documentation (apache#3949)

cbf2185

tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 26, 2023

Improve array builder documentation (apache#3949)

666d316

tustvold mentioned this issue Mar 26, 2023

Improve array builder documentation (#3949) #3951

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 26, 2023

Add ListBuilder::append_value (apache#3949)

159e8d5

tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 26, 2023

Add ListBuilder::append_value (apache#3949)

5846406

tustvold mentioned this issue Mar 26, 2023

Add ListBuilder::append_value (#3949) #3954

Merged

tustvold added a commit that referenced this issue Mar 28, 2023

Add ListBuilder::append_value (#3949) (#3954)

eb36d37

* Add ListBuilder::append_value (#3949) * Review feedback

tustvold added a commit that referenced this issue Mar 28, 2023

Improve array builder documentation (#3949) (#3951)

5620612

* Improve array builder documentation (#3949) * Review feedback

tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 30, 2023

Add serde support to RawDecoder (apache#3949)

e8d203f

tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 30, 2023

Add serde support to RawDecoder (apache#3949)

d267ed2

tustvold mentioned this issue Mar 30, 2023

Support Rust structures --> RecordBatch by adding Serde support to RawDecoder (#3949) #3979

Merged

tustvold closed this as completed in #3979 Apr 5, 2023

tustvold added the arrow Changes to the arrow crate label Apr 7, 2023

tustvold changed the title ~~Consider renaming rather than removing Decoder~~ Support Deserializing Serde DataTypes to Arrow Apr 7, 2023
