Support Deserializing Serde DataTypes to Arrow #3949
Comments
The major reason for wanting to remove the old decoder, aside from its incredibly underwhelming performance, is that the code is very hard to maintain and reason about; in particular, the way it handles nested types is very convoluted, and is likely also incorrect. That being said, I wonder if we can accommodate your use case in a different manner. You mention "using serde's deserialization code"; does this mean that the types are known statically ahead of time? I ask as providing a nicer story for creating …
Yes I would agree with that assessment 😆
Yes, the data is more or less known ahead of time. The work that I am doing is similar to what kafka-delta-ingest does, which is converting JSON data into Delta-formatted data. In my case I want to deserialize into structs with Serde in order to perform some operations with the data. If I have a path to re-use serde's …
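For context, a minimal sketch of that flow, assuming a hypothetical `SensorReading` struct (the names and the unit conversion are illustrative, not taken from kafka-delta-ingest):

```rust
// A minimal sketch of the ingestion flow described above; the struct and
// field names are hypothetical.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct SensorReading {
    device_id: String,
    temperature: f64,
}

fn main() -> Result<(), serde_json::Error> {
    let raw = br#"{"device_id": "abc-123", "temperature": 21.5}"#;

    // Reuse serde's derived Deserialize impl to get a typed struct...
    let mut reading: SensorReading = serde_json::from_slice(raw)?;

    // ...perform some operation on the typed data...
    reading.temperature = reading.temperature * 9.0 / 5.0 + 32.0;

    // ...and the open question in this issue is how to then turn `reading`
    // (or a serde_json::Value built from it) into an Arrow RecordBatch.
    println!("{reading:?}");
    Ok(())
}
```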
I understand that this can be a PITA to maintain, but I also agree with what @rtyler said.
At Parseable we want to support basic schema evolution at the top level, so we flatten JSON before converting it into `RecordBatch`. To do that we need to convert it to `serde_json::Value` first (it would be amazing to have a flatten algorithm that does this at a very low level, but currently we don't have that). So for us it is hard to move away from …

Edit: I am curious to check how fast `RawDecoder` is compared to the older `Decoder`. Maybe we can justify serializing `Value` back to bytes.
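For illustration, a sketch of the kind of top-level flatten described here, joining nested object keys with a dot; this is an assumption about the approach, not Parseable's actual algorithm:

```rust
// An illustrative recursive flatten over serde_json::Value: nested object
// keys are joined with a dot. This is a sketch, not Parseable's code.
use serde_json::{Map, Value};

fn flatten(prefix: Option<&str>, value: Value, out: &mut Map<String, Value>) {
    match value {
        Value::Object(map) => {
            for (k, v) in map {
                let key = match prefix {
                    Some(p) => format!("{p}.{k}"),
                    None => k,
                };
                flatten(Some(&key), v, out);
            }
        }
        // Lists and scalars are kept as-is; as noted in the reply below,
        // it is not obvious what schema transformation should flatten lists.
        other => {
            out.insert(prefix.unwrap_or_default().to_string(), other);
        }
    }
}

fn main() {
    let nested = serde_json::json!({"a": {"b": 1, "c": {"d": true}}, "e": [1, 2]});
    let mut flat = Map::new();
    flatten(None, nested, &mut flat);
    assert_eq!(flat.get("a.b"), Some(&Value::from(1)));
    assert_eq!(flat.get("a.c.d"), Some(&Value::Bool(true)));
}
```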
Perhaps you could file a ticket for this; flattening structs at least is trivial. I'm not sure what schema transformation flattens lists though...
It's at least twice as fast, but the JSON writer is currently very inefficient and so is likely to eat into that severely.
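A sketch of the round trip being weighed here: serialize an existing `Value` back to bytes and feed it to the streaming decoder. This uses the `ReaderBuilder`/`Decoder` names from current `arrow_json` releases; at the time of this thread the type was still called `RawDecoder`:

```rust
// Serialize a serde_json::Value back to bytes and feed it to the
// byte-oriented streaming decoder.
use std::sync::Arc;

use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int64, true),
    ]));
    let mut decoder = ReaderBuilder::new(schema).build_decoder()?;

    let value = serde_json::json!({"a": 1});

    // Pay the serialization cost in order to reuse the fast decoder.
    let bytes = serde_json::to_vec(&value)?;
    decoder.decode(&bytes)?;

    let batch = decoder.flush()?.expect("one batch");
    assert_eq!(batch.num_rows(), 1);
    Ok(())
}
```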
#3979 contains a POC of how we could support this going forward, PTAL
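For reference, the direction that POC sketches is feeding types implementing `serde::Serialize` directly to the decoder, skipping the intermediate byte buffer. A sketch assuming the `Decoder::serialize` method that eventually landed in `arrow_json` (the `Row` struct is hypothetical):

```rust
// Feed anything implementing serde::Serialize straight to the decoder,
// with no intermediate JSON byte buffer.
use std::sync::Arc;

use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};
use serde::Serialize;

#[derive(Serialize)]
struct Row {
    a: i64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int64, true),
    ]));
    let mut decoder = ReaderBuilder::new(schema).build_decoder()?;

    // Works for derived structs and for serde_json::Value alike.
    decoder.serialize(&[Row { a: 1 }, Row { a: 2 }])?;

    let batch = decoder.flush()?.expect("one batch");
    assert_eq!(batch.num_rows(), 2);
    Ok(())
}
```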
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I have found that creating `serde_json::Value` objects is a lot easier than trying to construct `RecordBatch` objects. This is especially useful when using serde's deserialization code for data ingestion. Crates like [serde_arrow](https://crates.io/crates/serde_arrow) are outdated and add unnecessary complexity. I have taken to using `Decoder` to convert `Value` objects to `RecordBatch`es easily.
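Roughly, the `Value` to `RecordBatch` path described here looks like the following, assuming the old `arrow::json::reader::Decoder` API as it existed around the time of this issue (exact signatures may differ between releases):

```rust
// Convert already-deserialized serde_json::Values into a RecordBatch
// using the old Decoder; a sketch against the API of that era.
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use arrow::json::reader::{Decoder, DecoderOptions};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int64, true),
    ]));
    let decoder = Decoder::new(schema, DecoderOptions::new());

    // Values built elsewhere (e.g. via serde), rather than raw bytes.
    let values = vec![serde_json::json!({"a": 1}), serde_json::json!({"a": 2})];
    let mut value_iter = values.into_iter().map(Ok);

    let batch = decoder.next_batch(&mut value_iter)?.expect("one batch");
    assert_eq!(batch.num_rows(), 2);
    Ok(())
}
```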
Describe the solution you'd like
Renaming or moving the `Decoder` to where it's not considered for use on deserializing raw JSON buffers (e.g. the `BufReader` approach that it uses now), but rather can be used for naturally converting pre-deserialized JSON `Value` objects.

Describe alternatives you've considered
If `Decoder` gets bounced out of the repo, I would probably just refactor it into its own crate and push that 😄