-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[schemaregistry]ProtobufNative Schema Support #8372
[schemaregistry]ProtobufNative Schema Support #8372
Conversation
@hnail Can you send the PIP of adding native protobuf support to dev@ mailing list? So the community can review the PIP and we can add the PIP to Pulsar's wiki page. |
send <[DISCUSS] PIP-Add ProtobufNative Schema Support> to [email protected] done |
/pulsarbot run-failure-checks |
@sijie @codelipenghui Has there been progresses on this review ? |
...ain/java/org/apache/pulsar/broker/service/schema/ProtobufNativeSchemaCompatibilityCheck.java
Outdated
Show resolved
Hide resolved
...ava/org/apache/pulsar/broker/service/schema/validator/ProtobufNativeSchemaDataValidator.java
Outdated
Show resolved
Hide resolved
...ava/org/apache/pulsar/broker/service/schema/validator/ProtobufNativeSchemaDataValidator.java
Outdated
Show resolved
Hide resolved
@hnail overall the change looks really good. I made a few comments. PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, just left some minor comments.
pulsar-client/src/main/java/org/apache/pulsar/client/impl/schema/ProtobufNativeSchemaUtils.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/apache/pulsar/client/impl/schema/generic/GenericProtobufNativeSchema.java
Outdated
Show resolved
Hide resolved
@congbobo184 Could you please help review this PR? |
pulsar-client-api/src/main/java/org/apache/pulsar/client/api/Schema.java
Outdated
Show resolved
Hide resolved
pulsar-client-api/src/main/java/org/apache/pulsar/client/api/Schema.java
Outdated
Show resolved
Hide resolved
…chema/validator/ProtobufNativeSchemaDataValidator.java Co-authored-by: Sijie Guo <[email protected]>
…chema/validator/ProtobufNativeSchemaDataValidator.java Co-authored-by: Sijie Guo <[email protected]>
…chema.java Co-authored-by: congbo <[email protected]>
…chema.java Co-authored-by: congbo <[email protected]>
…t_protobuf_native_schema
/pulsarbot run-failure-checks |
Fixes apache#7642 Fixes apache#7674 ### Motivation Protobuf Native Schema Support : [PIP-Add ProtobufNative Schema Support ](https://docs.google.com/document/d/1XR_MNOuSXyig-CKsdVhr6IXvFwziBRdSoS3oEUiLFe8/edit?usp=sharing) This PR proposes to import a new protobuf-v3 schema based on protobuf native `Descriptor`, Current `ProtobufSchema` is based on AVRO schema, this Causes the following restrictions : - Current [ProtobufSchema](https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/schema/ProtobufSchema.java) based on avro-protobuf only works when JVM classloader can load `Java implementation class of GeneratedMessageV3` . this is unfriendly for long-running server ( e.g. Presto ), restart server for update class is unfriendly. - Describe `protobuf` schema by `avro` schema will cause losses of information, so based current [ProtobufSchema](https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/schema/ProtobufSchema.java) can’t support `AutoConsume` by `DynamicMessage`. - The amount of support avro language is less than support protobuf language. In consideration of backward compatibility , we add a new schema type `SchemaType.PROTOBUF_NATIVE` base on protobuf-v3 native Descriptor instead of modify SchemaType.PROTOBUF, aim to support `GenericProtobufNativeSchema` and `AutoConsumeSchema` for PROTOBUF. ### Modifications Describe in [PIP-Add ProtobufNative Schema Support ](https://docs.google.com/document/d/1XR_MNOuSXyig-CKsdVhr6IXvFwziBRdSoS3oEUiLFe8/edit?usp=sharing)
Add docs based on #8372 and update the green-highlighted parts. ![schema doc architecture](https://user-images.githubusercontent.com/50226895/103272642-5fa07380-49f8-11eb-8098-6eee7f930e93.png)
Fixes #4747 Fixes #7652 ### Motivation PIP-71: https://github.com/apache/pulsar/wiki/PIP-71:-Pulsar-SQL-migrate-SchemaHandle-to-presto-decoder **Pip-Doc** : [[PIP-71][SQL]Migrate SchemaHandle to Presto-decoder](https://docs.google.com/document/d/1KwG0GoHccju4-QNPfvT6tOwhp5Fvs6-iZlfLooPxTDM/edit?usp=sharing) In current version , pulsar-presto deserialize fields rely on SchemaHandler , but this causes the following restrictions : - **Metadata**: current nested field is dissociate with presto ParameterizedType , It treated nested field as a separated field , so presto compiler can’t understand the type hierarchy . nested field should be Row type in presto (e.g. Hive struct type support) . In the same way,array \ map type also shoud associate with presto ParameterizedTypes. - **Decoder** : SchemaHandler is hard to work with `RecordCursor.getObject()` to support ROW,MAP,ARRAY .etc The **motivations** of this pull request : - ` PulsarMetadata` take advantage of `ParameterizedType` to describe `row/array/map` Type instead of resolve nested columns in pulsar-presto connecter. - Customize `RowDecoder | RowDecoderFactory | ColumnDecoder` to work with pulsar interface, and with some our own extensions compare to presto original version , we can support more type for backward compatible (e.g. ` TIMESTAMP\DATE\TIME\Real\ARRAY\MAP\ROW ` support). - Decouple avro or schema type with `pulsar-presto main module` (RecordSet,ConnectorMetadata .etc ), aim to friendly with other schema type ( [ProtobufNative](#8372) 、thrift etc..). ### Modifications Describe in [PIP-71: Pulsar SQL migrate SchemaHandle to presto decoder](https://docs.google.com/document/d/1KwG0GoHccju4-QNPfvT6tOwhp5Fvs6-iZlfLooPxTDM/edit?usp=sharing) ---- ### Does this pull request potentially affect one of the following parts: *If `yes` was chosen, please highlight the changes* - Dependencies (does it add or upgrade a dependency): (**yes** ) - The public API: (no) - The schema: ( no) - The default values of configurations: (no) - The wire protocol: (no) - The rest endpoints: (no) - The admin cli options: (no) - Anything that affects deployment: (no) ### Documentation - Does this pull request introduce a new feature? (yes) [[PIP][SQL]Migrate SchemaHandle to Presto-decoder](https://docs.google.com/document/d/1KwG0GoHccju4-QNPfvT6tOwhp5Fvs6-iZlfLooPxTDM/edit?usp=sharing) * codeStyle fix * Update pulsar-sql/presto-pulsar/src/test/java/org/apache/pulsar/sql/presto/TestPulsarConnector.java Co-authored-by: ran <[email protected]> * Update pulsar-sql/presto-pulsar/src/test/java/org/apache/pulsar/sql/presto/TestPulsarConnector.java Co-authored-by: ran <[email protected]> * Update pulsar-sql/presto-pulsar/src/test/java/org/apache/pulsar/sql/presto/TestPulsarConnector.java Co-authored-by: ran <[email protected]> * add keyValue\Primitive schema test && add schema cyclic definition detect * merge master * merge master Co-authored-by: wangguowei <[email protected]> Co-authored-by: ran <[email protected]>
I have a question about FileDescriptorSet field. In https://github.com/apache/pulsar/blob/c01b1eeda3221bdbf863bf0f3f8373e93d90adef/pulsar-client/src/test/java/org/apache/pulsar/client/impl/schema/ProtobufNativeSchemaTest.java file there is an example test class. It is generated from Test.proto and ExternalMessage.proto files. The problem is that, no matter what I try with protoc , I can not get same FileDescriptorSet content. It's always slightly different. I tried csharp, java, cpp output, with or without --descriptor_set_out - I always get different byte array (and base64 string). And I can not create producer/consumer on apache pulsar client with FileDescriptorSet generated by me, I get com.google.protobuf.InvalidProtocolBufferException.invalidEndTag exception which as far as I googled tells that data is corrupted. When I try to create producer with FileDescriptorSet data from test, it works. What am I doing wrong while generating descriptors using protoc ? |
hello , happy for your take notice of this pull request .
The two realize above works for me :
so , I think the reason is new ObjectMapper().writeValueAsBytes(schemaData); which serialize byte[] to String , may be is 'UTF-8' or 'Base64' ? |
I found the culprit here, I deleted important information from source .proto file like java_package and java_outer_classname, and protoc was generating different info which apache pulsar was not able to use. |
I recommend you can use JAVA API to create Schema like this :
The java_package and java_outer_classname restrictions which you mentioned may need check in other language client when create schema by POJO API , be appreciate if your have time to help pull a issue to describe the problem for future optimization . |
I created issue here about csharp/java mismatch in generating Descriptors |
@hnail I think some bug in deserialize method of ProtobufNativeSchemaUtils. I generated descriptor using probuf-net library, here it is CgpUZXN0LnByb3RvEgVwcm90byImCgpYWFhNZXNzYWdlEgsKA2ZvbxgBIAEoCRILCgNiYXIYAiABKAFCD6oCDG5hdGl2ZUNoZWNrMmIGcHJvdG8z . If you check it on https://protogen.marcgravell.com/decode , it decodes fine. Here is json file with some additional fields: {"fileDescriptorSet":"ClMKClRlc3QucHJvdG8SC25hdGl2ZUNoZWNrIjAKClhYWE1lc3NhZ2USEAoDZm9vGAEgASgJUgNmb28SEAoDYmFyGAIgASgBUgNiYXJiBnByb3RvMw==","rootMessageTypeName":"proto.XXXMessage","rootFileDescriptorName":"Test.proto"} When I try to send it to apache-pulsar server, I get null-pointer exception on line 104 at Line 104 in 2a1828c
I created sample java app copy-pasting code from ProtobufNativeSchemaUtils. Here is simpler version of deserializer:
output is:
So Descriptor is fine! But if I try to use code from the repo:
I get nullpointer exception on line with getPackage(); - similar to apache-pulsar exception. |
@fjod Hi , Your Simpler Test code :
It is works fine . |
Fixes #4747 Fixes #7652 ### Motivation PIP-71: https://github.com/apache/pulsar/wiki/PIP-71:-Pulsar-SQL-migrate-SchemaHandle-to-presto-decoder **Pip-Doc** : [[PIP-71][SQL]Migrate SchemaHandle to Presto-decoder](https://docs.google.com/document/d/1KwG0GoHccju4-QNPfvT6tOwhp5Fvs6-iZlfLooPxTDM/edit?usp=sharing) In current version , pulsar-presto deserialize fields rely on SchemaHandler , but this causes the following restrictions : - **Metadata**: current nested field is dissociate with presto ParameterizedType , It treated nested field as a separated field , so presto compiler can’t understand the type hierarchy . nested field should be Row type in presto (e.g. Hive struct type support) . In the same way,array \ map type also shoud associate with presto ParameterizedTypes. - **Decoder** : SchemaHandler is hard to work with `RecordCursor.getObject()` to support ROW,MAP,ARRAY .etc The **motivations** of this pull request : - ` PulsarMetadata` take advantage of `ParameterizedType` to describe `row/array/map` Type instead of resolve nested columns in pulsar-presto connecter. - Customize `RowDecoder | RowDecoderFactory | ColumnDecoder` to work with pulsar interface, and with some our own extensions compare to presto original version , we can support more type for backward compatible (e.g. ` TIMESTAMP\DATE\TIME\Real\ARRAY\MAP\ROW ` support). - Decouple avro or schema type with `pulsar-presto main module` (RecordSet,ConnectorMetadata .etc ), aim to friendly with other schema type ( [ProtobufNative](apache/pulsar#8372) 、thrift etc..). ### Modifications Describe in [PIP-71: Pulsar SQL migrate SchemaHandle to presto decoder](https://docs.google.com/document/d/1KwG0GoHccju4-QNPfvT6tOwhp5Fvs6-iZlfLooPxTDM/edit?usp=sharing) ---- ### Does this pull request potentially affect one of the following parts: *If `yes` was chosen, please highlight the changes* - Dependencies (does it add or upgrade a dependency): (**yes** ) - The public API: (no) - The schema: ( no) - The default values of configurations: (no) - The wire protocol: (no) - The rest endpoints: (no) - The admin cli options: (no) - Anything that affects deployment: (no) ### Documentation - Does this pull request introduce a new feature? (yes) [[PIP][SQL]Migrate SchemaHandle to Presto-decoder](https://docs.google.com/document/d/1KwG0GoHccju4-QNPfvT6tOwhp5Fvs6-iZlfLooPxTDM/edit?usp=sharing) * codeStyle fix * Update pulsar-sql/presto-pulsar/src/test/java/org/apache/pulsar/sql/presto/TestPulsarConnector.java Co-authored-by: ran <[email protected]> * Update pulsar-sql/presto-pulsar/src/test/java/org/apache/pulsar/sql/presto/TestPulsarConnector.java Co-authored-by: ran <[email protected]> * Update pulsar-sql/presto-pulsar/src/test/java/org/apache/pulsar/sql/presto/TestPulsarConnector.java Co-authored-by: ran <[email protected]> * add keyValue\Primitive schema test && add schema cyclic definition detect * merge master * merge master Co-authored-by: wangguowei <[email protected]> Co-authored-by: ran <[email protected]>
Fixes #7642
Fixes #7674
Motivation
Protobuf Native Schema Support : PIP-Add ProtobufNative Schema Support
This PR proposes to import a new protobuf-v3 schema based on protobuf native
Descriptor
, CurrentProtobufSchema
is based on AVRO schema, this Causes the following restrictions :Java implementation class of GeneratedMessageV3
. this is unfriendly for long-running server ( e.g. Presto ), restart server for update class is unfriendly.protobuf
schema byavro
schema will cause losses of information, so based current ProtobufSchema can’t supportAutoConsume
byDynamicMessage
.In consideration of backward compatibility , we add a new schema type
SchemaType.PROTOBUF_NATIVE
base on protobuf-v3 native Descriptor instead of modify SchemaType.PROTOBUF, aim to supportGenericProtobufNativeSchema
andAutoConsumeSchema
for PROTOBUF.Modifications
Describe in PIP-Add ProtobufNative Schema Support
Usage Example
Create ProtobufNative schema Example
Class-Based schema consumer Example
ProtobufNative AUTO_CONSUME Example
Does this pull request potentially affect one of the following parts:
If
yes
was chosen, please highlight the changesDocumentation