Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move away from protobuf #249

Closed
pitrou opened this issue Jul 19, 2022 · 12 comments
Closed

Move away from protobuf #249

pitrou opened this issue Jul 19, 2022 · 12 comments

Comments

@pitrou
Copy link
Contributor

pitrou commented Jul 19, 2022

Protocol buffers has massive usability issues in C++. The latest episode of this is protocolbuffers/protobuf#9947 , which is a giant PITA at the moment for Arrow C++ (and will be for other downstream users, no doubt).

In addition, protobuf is maintained in the "Google open source way", which means it's mostly developed internally, and an occasional sync job pushes many unrelated changes as a single changeset to the public repository.

There are usable, reasonable, lighter and more performant alternatives to protobuf with wide adoption and wide availability, such as Flatbuffers.

cc @wesm @westonpace

@jvanstraten
Copy link
Contributor

It's a crying shame no one seems to have stood up by now to make a decent alternative C++ implementation for protobuf, because in the end, there's nothing wrong with the binary format. It's however also a crying shame Google couldn't have made the JSON format more subtly esoteric if they tried (all those wholly unnecessary case conversions and extension things!), because if that had also been sane, I would have made one by now.

I'm not looking forward to re-engineering the validator if this goes through, but on the other hand, I had some ideas for refactoring the Acero/Substrait interface to be much less hardcoded that were ultimately blocked (in my head) by a need to expose the generated protobuf sources. For the unaware, we're painstakingly hiding them currently because exposing them breaks everything immediately if someone were to try to link libarrow to something else that also uses Substrait, and even now it's likely to subtly break when those Substrait versions differ even slightly, thanks to libprotobuf's descriptor table global.

I'm not well-versed in any of the alternatives, but are there any that rely either on a library or code generation? Needing both is just the absolute worst of both worlds, especially when (like in protobuf) the generator and library version must match down to the build number.

Also, something that decouples generator options and namespaces (that would be set by the user) from the descriptor files (provided by Substrait) so #207 can be closed, then printed out and burned at the stake for good measure.

@pitrou
Copy link
Contributor Author

pitrou commented Jul 19, 2022

I'm not well-versed in any of the alternatives, but are there any that rely either on a library or code generation?

Flatbuffers generated code is header-only and, AFAICT, mostly self-contained. The flatbuffers headers are required but no linking is involved.
(there is a Flatbuffers library, but my understanding is that it only contains code for the code generation facilities themselves)

@nealrichardson
Copy link
Contributor

Forgive my ignorance, but can https://github.com/nanopb/nanopb help, or are the issues too fundamental?

@pitrou
Copy link
Contributor Author

pitrou commented Jul 19, 2022

I honestly have no idea. There's also https://github.com/protobuf-c/protobuf-c/ . Unknown is how mature these implementations, and whether they are able to handle incorrect data gracefully (i.e. return an error instead of crashing).

@jvanstraten
Copy link
Contributor

jvanstraten commented Jul 19, 2022

I haven't looked at nanopb very extensively, but I have seen that it is C-only (i.e. no C++ abstractions) and that it's predominantly intended for streaming rather than object-oriented access, in order to be able to handle messages larger than what can fit in memory. That's a big thing when your memory is measured in kilobytes as it is for microcontrollers, but not so much when it's measured in gigabytes. All that means that the API will be terrible for what we want to do with it in Arrow. It also surely can't handle the JSON format. Honestly, I think it'd be better to build our own converter if this is the best option we've got.

upb might be more promising, but it's also only C, and they warn that their C ABI is not yet stable. It's intended moreso as an alternative backend for other languages (Python etc.). I also don't know if it actually fixes all the libprotobuf problems.

I haven't looked at protobuf-c yet.

@pitrou
Copy link
Contributor Author

pitrou commented Jul 19, 2022

cc @lidavidm

@lidavidm
Copy link

upb apparently supports JSON and declares "no global state" which is probably the biggest problem we've run into so far right? But no releases or stable API is concerning. (And rather than not yet being stable, they explicitly declare they won't be stable.)

From a quick search, protobuf-c doesn't implement JSON either.

@westonpace
Copy link
Member

westonpace commented Jul 19, 2022 via email

@jvanstraten
Copy link
Contributor

The order in which the fields are stored by protobuf and thus will stream in is explicitly undefined and not guaranteed to be stable even between consecutive serializations of the same message in the same context. This problem is not limited to extension sets; the order in which Substrait is to be parsed in order to always know the current schema is fairly strict in general. For example, the emit in RelCommon always comes after the rest of the relation. I don't see how you could stream this without making a huge mess or first building a Substrait-level tree... which is, like, the ONE job that a code generator for a deserializer has IMO.

@adamkennedy
Copy link

There's always Avro...

@westonpace
Copy link
Member

It appears this discussion has gone stale. I'm going to close this because I don't know of anyone working on this and I'm not sure what further can be added to the discussion.

At this point I think it would only be practical to address this by first adding support for a new serialization format and then later dropping or deprecating support for protobuf. Work to add a new serialization format would always be welcome but should be handled under a new issue.

@louwers
Copy link

louwers commented Jun 8, 2024

https://github.com/mapbox/protozero is another C++ implementation, but you need to write parsers by hand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants