Use external config schema to construct Python SchemaTransform payload #26100
R: @chamikaramj
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
Codecov Report
@@            Coverage Diff             @@
##           master   #26100      +/-   ##
==========================================
+ Coverage   71.41%   72.00%   +0.58%
==========================================
  Files         782      748      -34
  Lines      102856   101109    -1747
==========================================
- Hits        73457    72805     -652
+ Misses      27922    26827    -1095
  Partials     1477     1477
Flags with carried forward coverage won't be shown.
... and 52 files with indirect coverage changes
Run Python_Xlang_Gcp_Direct PostCommit
Run Python_Xlang_Gcp_Dataflow PostCommit
    self._kwargs = kwargs

  def _get_schema_proto_and_payload(self, **kwargs):
do you also want to check that there are no kwargs beyond those in the schema?
yeah good idea, will add that check
Thanks.
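The check agreed on in this thread could look roughly like the following. This is a standalone sketch, not the code added in the PR: `validate_kwargs` is a hypothetical helper, and a plain list of field names stands in for the external configuration schema.

```python
def validate_kwargs(kwargs, schema_field_names):
  """Raise ValueError if kwargs has names absent from the configuration schema.

  Hypothetical helper: schema_field_names is an assumed stand-in for the
  field names of the external SchemaTransform's configuration schema.
  """
  unknown = sorted(set(kwargs) - set(schema_field_names))
  if unknown:
    raise ValueError(
        'Unexpected kwargs %s; the configuration schema only has fields %s' %
        (unknown, list(schema_field_names)))


# Every kwarg is a known field, so this passes silently.
validate_kwargs({'table': 't'}, ['table', 'query'])
```

Failing early on an unknown name turns a typo in a kwarg into a clear error on the Python side, before any payload is built.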
@@ -180,14 +180,52 @@ def _get_named_tuple_instance(self):


class SchemaTransformPayloadBuilder(PayloadBuilder):
  def __init__(self, identifier, **kwargs):
    self._identifier = identifier
  def __init__(self, schematransform_config, strict_schema=False, **kwargs):
I think it should be possible to use SchemaTransforms without the full config or the schema (i.e., just using the schema transform ID and a set of kwargs). Can you adjust the change so that the additional validation is optional?
    self._kwargs = kwargs

  def _get_schema_proto_and_payload(self, **kwargs):
Can we move the additional checks before this call and continue to use the existing external._get_schema_proto_and_payload() method?
"SchemaTransform's configuration fields: %s" %
(kwargs_fields, external_config_schema_fields))

# The discover API allows us to obtain an ordered configuration schema
I think instead of the "strict_schema" option, we should do a "rearrange_based_on_discovery" option. If the option is not provided, we use kwargs as is, without the overhead of the additional RPC (this will work for anything other than TypedSchemaTransformProvider). For TypedSchemaTransformProvider, we would set the "rearrange_based_on_discovery" option to true and rearrange kwargs based on a discovery call before the "_get_schema_proto_and_payload" invocation. WDYT?
Sounds good, I like keeping the less costly option as the default. So the default can continue using the existing _get_schema_proto_and_payload(). However, I think we can't use this method for the rearrange_based_on_discovery option because it builds the payload based off kwargs only.
We can rearrange kwargs before the method call and use the same method, can't we?
yeah that probably works actually, will try that
Run Python_Xlang_Gcp_Direct PostCommit
Run Portable_Python PreCommit

R: @chamikaramj
Thanks. Just one comment.
LGTM. Thanks.
Development on external SchemaTransforms may lead to changes in how their configurations look (i.e., adding and removing fields, changing the order of fields). In addition, most Java SchemaTransformProviders currently inherit from TypedSchemaTransformProvider, which infers its configuration schema using a configuration class and AutoValueSchema. While this approach is very convenient, the ordering of fields in the inferred schema is unfortunately not consistent. All of this is to say that the configuration schema of external transforms is prone to change.
When we use an external SchemaTransform in Python, we build a payload that includes the configuration fields. These are the same fields used to set up the external SchemaTransform. Currently, we only use the input kwargs to construct the payload, so we are blind to what the external configuration schema actually is. The changes in this PR make it so that we first fetch the external configuration schema and then construct the payload in accordance with that schema.
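To illustrate the idea of building the payload in the order the external schema dictates, here is a minimal standalone sketch. It is not the PR's implementation: `rearrange_kwargs`, the field names, and the use of a plain list in place of the discovered configuration schema are all assumptions made for illustration.

```python
def rearrange_kwargs(kwargs, schema_field_names):
  """Return a dict with kwargs reordered to follow schema_field_names.

  Hypothetical helper. Fields that the caller did not supply are skipped,
  on the assumption that they are optional in the external schema.
  """
  ordered = {
      name: kwargs[name] for name in schema_field_names if name in kwargs
  }
  leftover = sorted(set(kwargs) - set(ordered))
  if leftover:
    raise ValueError('kwargs not found in the schema: %s' % leftover)
  return ordered


# Assumed field order, standing in for the result of a discovery call.
discovered_order = ['table', 'query', 'create_disposition']
user_kwargs = {'create_disposition': 'CREATE_NEVER', 'table': 'dataset.t'}

ordered_kwargs = rearrange_kwargs(user_kwargs, discovered_order)
print(list(ordered_kwargs))  # ['table', 'create_disposition']
```

Because Python dicts preserve insertion order, the reordered dict can be passed on as `**ordered_kwargs` to whatever builds the payload, so the existing payload-building code would not need to change.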