Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Deserializing Serde DataTypes to Arrow #3949

Closed
rtyler opened this issue Mar 26, 2023 · 6 comments · Fixed by #3979
Closed

Support Deserializing Serde DataTypes to Arrow #3949

rtyler opened this issue Mar 26, 2023 · 6 comments · Fixed by #3979
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@rtyler
Copy link
Contributor

rtyler commented Mar 26, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I have found that creating serde_json::Value objects is a lot easier than trying to construct RecordBatch objects. This is especially useful when using serde's deserialization code for data ingestion. Crates like [serde_arrow}(https://crates.io/crates/serde_arrow) are out-dated and add unnecessary complexity

I have taken to using Decoder to convert Value objects to RecordBatchs easily.

Describe the solution you'd like

Renaming or moving the Decoder to where it's not considered for use on deserializing raw JSON buffers, e.g. the BufReader approach that it uses now, but rather can be used for naturally converting pre-deserialized JSON Value objects.

Describe alternatives you've considered

If Decoder gets bounced out of the repo, I would probably just refactor it into its own crate and push that 😄

@rtyler rtyler added the enhancement Any new improvement worthy of a entry in the changelog label Mar 26, 2023
@tustvold
Copy link
Contributor

tustvold commented Mar 26, 2023

The major reason for wanting to remove the old decoder, aside from its incredibly underwhelming performance, is that the code is very hard to maintain and reason about, in particular the way it handles nested types is very convoluted, and is likely also incorrect.

That being said I wonder if we can accommodate your use-case in a different manner, you mention "using serde's deserialization code", does this mean that the types are known statically ahead of time? I ask as providing a nicer story for creating RecordBatch from statically typed rows is something I have been playing around with, would this work for your use-case, or are the types not known at compile time?

tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 26, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 26, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 26, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 26, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 26, 2023
@rtyler
Copy link
Contributor Author

rtyler commented Mar 26, 2023

the code is very hard to maintain and reason about, in particular the way it handles nested types is very convolute

Yes I would agree with that assessment 😆

creating RecordBatch from statically typed rows is something I have been playing around with, would this work for your use-case, or are the types not known at compile time?

Yes the data is more or less known ahead of time. The work that I am doing is similar to what kafka-delta-ingest does, which is converting JSON data into Delta formatted data. In my cases I want to deserialize into structs with Serde in order to perform some operations with the data. If I have a path to re-use serde's Value type which is shared between serde_json and serde_yaml (and others) then I can use that as my "intermediary" format for deserializing known typed data before converting that into a RecordBatch for writing out to Delta.

tustvold added a commit that referenced this issue Mar 28, 2023
* Add ListBuilder::append_value (#3949)

* Review feedback
tustvold added a commit that referenced this issue Mar 28, 2023
* Improve array builder documentation (#3949)

* Review feedback
@trueleo
Copy link
Contributor

trueleo commented Mar 29, 2023

the code is very hard to maintain and reason about, in particular the way it handles nested types is very convoluted, and is likely also incorrect.

I understand that this can be a PITA to maintain but i also agree with what @rtyler said

creating serde_json::Value objects is a lot easier than trying to construct RecordBatch objects.

At Parseable we want to support basic schema evolution on top level. So we flatten JSON before converting it into RecordBatch. To do that we need to convert it to serde_json::Value first ( it'll be amazing to have a flatten algorithm that does this at very low level but currently we don't have that ).

So for us it is hard to move away from Value -> RecordBatch flow. If maintainers at arrow-rs feel strongly about deprecating this then we will probably end up maintaining Decode ourselves.

edit: I am curious to check how fast RawDecoder is compared to older Decoder. Maybe we can justify serializing Value back to bytes.

@tustvold
Copy link
Contributor

tustvold commented Mar 29, 2023

it'll be amazing to have a flatten algorithm that does this at very low level but currently we don't have that

Perhaps you could file a ticket for this, flattening structs at least is trivial. I'm not sure what schema transformation flattens lists though...

I am curious to check how fast RawDecoder is compared to older Decoder

It's at least twice as fast, but the JSON writer is very inefficient currently and so is likely to eat into that severely

@tustvold
Copy link
Contributor

#3979 contains a POC of how we could support this going forward, PTAL

tustvold added a commit that referenced this issue Apr 5, 2023
…o `RawDecoder` (#3949) (#3979)

* Add serde support to RawDecoder (#3949)

* Clippy

* More examples

* Use BTreeMap for deterministic test output

* Use new Field constructors

* Review feedback
@tustvold
Copy link
Contributor

tustvold commented Apr 7, 2023

label_issue.py automatically added labels {'arrow'} from #3979

@tustvold tustvold added the arrow Changes to the arrow crate label Apr 7, 2023
@tustvold tustvold changed the title Consider renaming rather than removing Decoder Support Deserializing Serde DataTypes to Arrow Apr 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
3 participants