-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move away from protobuf #249
Comments
It's a crying shame no one seems to have stood up by now to make a decent alternative C++ implementation for protobuf, because in the end, there's nothing wrong with the binary format. It's however also a crying shame Google couldn't have made the JSON format more subtly esoteric if they tried (all those wholly unnecessary case conversions and extension things!), because if that had also been sane, I would have made one by now. I'm not looking forward to re-engineering the validator if this goes through, but on the other hand, I had some ideas for refactoring the Acero/Substrait interface to be much less hardcoded that were ultimately blocked (in my head) by a need to expose the generated protobuf sources. For the unaware, we're painstakingly hiding them currently because exposing them breaks everything immediately if someone were to try to link libarrow to something else that also uses Substrait, and even now it's likely to subtly break when those Substrait versions differ even slightly, thanks to libprotobuf's descriptor table global. I'm not well-versed in any of the alternatives, but are there any that rely either on a library or code generation? Needing both is just the absolute worst of both worlds, especially when (like in protobuf) the generator and library version must match down to the build number. Also, something that decouples generator options and namespaces (that would be set by the user) from the descriptor files (provided by Substrait) so #207 can be closed, then printed out and burned at the stake for good measure. |
Flatbuffers generated code is header-only and, AFAICT, mostly self-contained. The flatbuffers headers are required but no linking is involved. |
Forgive my ignorance, but can https://github.com/nanopb/nanopb help, or are the issues too fundamental? |
I honestly have no idea. There's also https://github.com/protobuf-c/protobuf-c/ . Unknown is how mature these implementations, and whether they are able to handle incorrect data gracefully (i.e. return an error instead of crashing). |
I haven't looked at nanopb very extensively, but I have seen that it is C-only (i.e. no C++ abstractions) and that it's predominantly intended for streaming rather than object-oriented access, in order to be able to handle messages larger than what can fit in memory. That's a big thing when your memory is measured in kilobytes as it is for microcontrollers, but not so much when it's measured in gigabytes. All that means that the API will be terrible for what we want to do with it in Arrow. It also surely can't handle the JSON format. Honestly, I think it'd be better to build our own converter if this is the best option we've got. upb might be more promising, but it's also only C, and they warn that their C ABI is not I haven't looked at protobuf-c yet. |
cc @lidavidm |
upb apparently supports JSON and declares "no global state" which is probably the biggest problem we've run into so far right? But no releases or stable API is concerning. (And rather than not yet being stable, they explicitly declare they won't be stable.) From a quick search, protobuf-c doesn't implement JSON either. |
Streaming protobuf (e.g. nanopb) would be possible but somewhat tedious due
to the way extension sets are built. We would probably need to do a
two-pass conversion where we first convert to "Acero w/placeholders" and
then fill in the placeholders when the extension set is eventually streamed
in.
I'm +1 for an experiment using a different protobuf library. upb seems
most promising from the choices listed.
Interestingly, upb is also a google project (maintained by the same
protocol-buffers team, though different people it seems).
I'm +0 on flatbuffers. I agree it is much better for C++ development.
However, I think this is a problem unique to C++. In C#, for example,
flatbuffers is a rather annoying dependency. No one publishes built
flatbuffers packages that I can tell (e.g. no nuget packages) and so it
must be built from souce (which requires setting up a C++ dev env) as a
custom build step or, more typically, the artifacts are committed to the
repository and building them is a manual step. In contrast, protobuf has
full support from Microsoft (primarily due to grpc) and compiler tools to
support automatic integration into a build process. I don't know what the
situation is for Java.
…On Tue, Jul 19, 2022 at 6:58 AM David Li ***@***.***> wrote:
upb apparently supports JSON and declares "no global state" which is
probably the biggest problem we've run into so far right? But no releases
or stable API is concerning. (And rather than not *yet* being stable,
they explicitly declare they won't be stable.)
From a quick search, protobuf-c doesn't implement JSON either.
—
Reply to this email directly, view it on GitHub
<#249 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAM6CXORRHZA23RJLZBQ24DVU2YBPANCNFSM537AVP7Q>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
The order in which the fields are stored by protobuf and thus will stream in is explicitly undefined and not guaranteed to be stable even between consecutive serializations of the same message in the same context. This problem is not limited to extension sets; the order in which Substrait is to be parsed in order to always know the current schema is fairly strict in general. For example, the emit in RelCommon always comes after the rest of the relation. I don't see how you could stream this without making a huge mess or first building a Substrait-level tree... which is, like, the ONE job that a code generator for a deserializer has IMO. |
There's always Avro... |
It appears this discussion has gone stale. I'm going to close this because I don't know of anyone working on this and I'm not sure what further can be added to the discussion. At this point I think it would only be practical to address this by first adding support for a new serialization format and then later dropping or deprecating support for protobuf. Work to add a new serialization format would always be welcome but should be handled under a new issue. |
https://github.com/mapbox/protozero is another C++ implementation, but you need to write parsers by hand. |
Protocol buffers has massive usability issues in C++. The latest episode of this is protocolbuffers/protobuf#9947 , which is a giant PITA at the moment for Arrow C++ (and will be for other downstream users, no doubt).
In addition, protobuf is maintained in the "Google open source way", which means it's mostly developed internally, and an occasional sync job pushes many unrelated changes as a single changeset to the public repository.
There are usable, reasonable, lighter and more performant alternatives to protobuf with wide adoption and wide availability, such as Flatbuffers.
cc @wesm @westonpace
The text was updated successfully, but these errors were encountered: