extract-jdbc: add and adopt JdbcPartition and JdbcPartitionFactory #44458
Closing because I'm merging the top of the stack #44482 into master using slash-approve-and-merge.
Replaces #44398
Fixes airbytehq/airbyte-internal-issues#9093
This PR makes the Bulk CDK's JDBC toolkit useful for JDBC source connectors which aren't Oracle.
The current JDBC toolkit's greatest shortcoming (and blocker for porting source-mysql) is how tightly it's coupled to the stream state value model defined by
This model is going to make implementing CTID or XMIN states really annoying at best! Also forget about re-using the existing state values serialized by the existing source-mysql connector...
Therefore, the connector needs to fully own the state value model and the translation of stream state values into SQL queries. For this purpose this PR introduces the `JdbcPartition` and `JdbcPartitionFactory` interfaces:

- A `JdbcPartition` object represents a specific subset of data within a table and provides a query to access it, as well as the state value to use as a checkpoint for having emitted said data.
- A `JdbcPartitionFactory` object maps the input state value into a `JdbcPartition` object representing the remaining data to read in the table; the factory can optionally subdivide that into smaller partitions.

`DefaultJdbcStreamStateValue` which can be used by naive JDBC source connectors.

A pleasant consequence of this design is that the `PartitionsCreatorFactory`, `PartitionsCreator` and `PartitionReader` implementations for JDBC connectors are much more database-agnostic now:

- `JdbcPartitionReader` is the `PartitionReader` implementation for JDBC source connectors and it focuses on running an arbitrary query and packaging its results; it comes in a resumable and in a non-resumable flavour.
- `JdbcPartitionsCreator` is the `PartitionsCreator` implementation for JDBC source connectors and it focuses on computing cursor column upper bounds and `fetchSize` values; it comes in a sequential and in a concurrent (i.e. splitting) flavour.
- `JdbcPartitionsCreatorFactory` is the `PartitionsCreatorFactory` implementation for JDBC source connectors and merely glues `JdbcPartitionFactory` to the read operation machinery.

What this means for porting legacy JDBC source connectors like source-mysql to the bulk CDK is this:
- `source-mysql/src/resources/internal_models.yaml` needs to be kept if we are to maintain compatibility (and we really should).
- `class MySqlPartitionFactory : JdbcPartitionFactory<MySqlSharedState, MySqlStreamState, MySqlPartition>` to deserialize the above state values into `MySqlPartition` objects.
- `MySqlPartition` subclass may or may not be a thin wrapper around a `DefaultJdbcPartition` subclass, with mysql-specific fairy-dust sprinkled around it, especially to serialize `internal_models.yaml` objects.
- `MySqlSharedState` and `MySqlStreamState` may be even more bare-bones, who knows.

Review guide
I considered breaking down this PR into more commits, but the changes mainly consist of deleting and adding whole files, so I'm not sure how useful that would be. Here's a suggested file reading order:
1. `JdbcPartitionsCreatorFactory.kt` which has the `PartitionsCreatorFactory` implementations for JDBC source connectors.
2. `JdbcPartitionFactory`, I would look at `JdbcPartitionFactory.kt` and `JdbcPartition.kt` next, which should give an idea of what a `JdbcPartition` is.
3. `Default*.kt` files for the implementations, or with `JdbcPartitionsCreator.kt` and `JdbcPartitionReader.kt` which have the `PartitionsCreator` and `PartitionReader` implementations for source connectors.
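
For readers who want a feel for the partition/factory split before opening the files, here is a minimal, self-contained sketch of the idea described in this PR: a partition provides a query plus the state value to checkpoint, and a factory turns an incoming state value into the partition of remaining data and can optionally subdivide it. Every name below (`SketchPartition`, `SketchPartitionFactory`, `IdRangePartition`, the `cursor` state key, the `id` column) is hypothetical and made up for illustration; the real `JdbcPartition` and `JdbcPartitionFactory` declarations in the toolkit have different type parameters and signatures.

```kotlin
// Hypothetical, simplified stand-ins. These are NOT the Bulk CDK's actual declarations.

/** A subset of a table's rows, plus the checkpoint to emit once they have been read. */
interface SketchPartition {
    /** SQL selecting exactly the rows in this partition. */
    fun query(): String

    /** State value to checkpoint after all rows of this partition are emitted. */
    fun completeState(): Map<String, String>
}

/** Maps an incoming state value to the partition of data still left to read. */
interface SketchPartitionFactory<P : SketchPartition> {
    /** Returns null when the state says there is nothing left to read. */
    fun create(table: String, state: Map<String, String>?, upperBound: Long): P?

    /** Optionally subdivides a partition; by default it is left whole. */
    fun split(partition: P, boundaries: List<Long>): List<P> = listOf(partition)
}

/** Rows with lowerExclusive < id <= upperInclusive (assumes a numeric `id` cursor column). */
data class IdRangePartition(
    val table: String,
    val lowerExclusive: Long?,
    val upperInclusive: Long,
) : SketchPartition {
    override fun query(): String {
        val lower = lowerExclusive?.let { "id > $it AND " } ?: ""
        return "SELECT * FROM $table WHERE ${lower}id <= $upperInclusive ORDER BY id"
    }

    override fun completeState(): Map<String, String> =
        mapOf("cursor" to upperInclusive.toString())
}

/** The connector, not the toolkit, decides how the state value is laid out and parsed. */
object IdRangePartitionFactory : SketchPartitionFactory<IdRangePartition> {
    override fun create(
        table: String,
        state: Map<String, String>?,
        upperBound: Long,
    ): IdRangePartition? {
        val checkpoint = state?.get("cursor")?.toLong()
        if (checkpoint != null && checkpoint >= upperBound) return null
        return IdRangePartition(table, checkpoint, upperBound)
    }

    override fun split(
        partition: IdRangePartition,
        boundaries: List<Long>,
    ): List<IdRangePartition> {
        val lo = partition.lowerExclusive
        // Keep only boundaries strictly inside the partition's range.
        val inner = boundaries
            .filter { it < partition.upperInclusive && (lo == null || it > lo) }
            .distinct()
            .sorted()
        val lowers = listOf(lo) + inner
        val uppers = inner + partition.upperInclusive
        return lowers.zip(uppers) { l, u -> IdRangePartition(partition.table, l, u) }
    }
}

fun main() {
    // Resume from a previous checkpoint at id = 100, with the table currently ending at id = 300.
    val resumed = IdRangePartitionFactory.create("users", mapOf("cursor" to "100"), 300L)!!
    println(resumed.query())
    // Subdivide the remaining range at id = 200; each piece carries its own checkpoint.
    IdRangePartitionFactory.split(resumed, listOf(200L)).forEach { println(it.completeState()) }
}
```

The point of the sketch is that the generic reader only ever needs `query()` and `completeState()`, so a connector with a different state model (for example, one preserving legacy serialized state values) would only swap out the partition and factory classes.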