Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On-the-fly migrations of persisted catalogs #21757

Conversation

gosusnp
Copy link
Contributor

@gosusnp gosusnp commented Jan 23, 2023

What

With the introduction of protocol version v1, the assumption is that all the protocol objects handled by the platform will be v1. We are persisting AirbyteCatalog and ConfiguredAirbyteCatalog in our database which breaks the previous assumption as the platform will be reading v0 objects.

Closes #21234
Closes https://github.com/airbytehq/airbyte-internal-issues/issues/2535
Closes #21542

How

Add code to migrate version 0 of the AirbyteCatalog and ConfiguredAirbyteCatalog when reading from the database if needed. This is done by adding helpers to specifically detect v0 data types to infer the version of the objects and using this code in our DbConverters.

Update Frontend to support data types v1 as well.

Recommended reading order

  1. airbyte-commons-protocol/src/main/java/io/airbyte/commons/protocol/migrations/v1/CatalogMigrationV1Helper.java is the logic to detect version and apply the migration.
  2. the rest is about updating the reads we need to cover

@gosusnp gosusnp requested a review from a team as a code owner January 23, 2023 23:31
@octavia-squidington-iv octavia-squidington-iv added area/documentation Improvements or additions to documentation area/platform issues related to the platform area/worker Related to worker labels Jan 23, 2023
@gosusnp gosusnp temporarily deployed to more-secrets January 23, 2023 23:33 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 23, 2023 23:33 — with GitHub Actions Inactive
@github-actions
Copy link
Contributor

github-actions bot commented Jan 23, 2023

Affected Connector Report

NOTE ⚠️ Changes in this PR affect the following connectors. Make sure to do the following as needed:

  • Run integration tests
  • Bump connector or module version
  • Add changelog
  • Publish the new version

✅ Sources (25)

Connector Version Changelog Publish
source-alloydb 1.0.36
source-alloydb-strict-encrypt 1.0.36 🔵
(ignored)
🔵
(ignored)
source-bigquery 0.2.3
source-clickhouse 0.1.15
source-clickhouse-strict-encrypt 0.1.15 🔵
(ignored)
🔵
(ignored)
source-cockroachdb 0.1.19
source-cockroachdb-strict-encrypt 0.1.19 🔵
(ignored)
🔵
(ignored)
source-db2 0.1.17
source-db2-strict-encrypt 0.1.17 🔵
(ignored)
🔵
(ignored)
source-dynamodb 0.1.0
source-jdbc 0.3.5 🔵
(ignored)
🔵
(ignored)
source-mongodb-strict-encrypt 0.1.19 🔵
(ignored)
🔵
(ignored)
source-mongodb-v2 0.1.19
source-mssql 0.4.28
source-mssql-strict-encrypt 0.4.28 🔵
(ignored)
🔵
(ignored)
source-mysql 1.0.20
source-mysql-strict-encrypt 1.0.20 🔵
(ignored)
🔵
(ignored)
source-oracle 0.3.22
source-oracle-strict-encrypt 0.3.22 🔵
(ignored)
🔵
(ignored)
source-postgres 1.0.40
source-postgres-strict-encrypt 1.0.40 🔵
(ignored)
🔵
(ignored)
source-redshift 0.3.16
source-scaffold-java-jdbc 0.1.0 🔵
(ignored)
🔵
(ignored)
source-snowflake 0.1.29
source-tidb 0.2.2
  • See "Actionable Items" below for how to resolve warnings and errors.

❌ Destinations (16)

Connector Version Changelog Publish
destination-bigquery 1.2.13
destination-bigquery-denormalized 1.2.12
(diff seed version)
destination-clickhouse 0.2.2
(changelog missing)
destination-clickhouse-strict-encrypt 0.2.2 🔵
(ignored)
🔵
(ignored)
destination-jdbc 0.3.14 🔵
(ignored)
🔵
(ignored)
destination-mssql 0.1.22
destination-mssql-strict-encrypt 0.1.22 🔵
(ignored)
🔵
(ignored)
destination-mysql 0.1.20
destination-mysql-strict-encrypt 0.1.21
(mismatch: 0.1.20)
🔵
(ignored)
🔵
(ignored)
destination-oracle 0.1.19
destination-oracle-strict-encrypt 0.1.19 🔵
(ignored)
🔵
(ignored)
destination-postgres 0.3.26
destination-postgres-strict-encrypt 0.3.26 🔵
(ignored)
🔵
(ignored)
destination-redshift 0.3.55
destination-snowflake 0.4.46
destination-tidb 0.1.0
  • See "Actionable Items" below for how to resolve warnings and errors.

👀 Other Modules (1)

  • base-normalization

Actionable Items

(click to expand)

Category Status Actionable Item
Version
mismatch
The version of the connector is different from its normal variant. Please bump the version of the connector.

doc not found
The connector does not seem to have a documentation file. This can be normal (e.g. basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug.
Changelog
doc not found
The connector does not seem to have a documentation file. This can be normal (e.g. basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug.

changelog missing
There is no chnagelog for the current version of the connector. If you are the author of the current version, please add a changelog.
Publish
not in seed
The connector is not in the seed file (e.g. source_definitions.yaml), so its publication status cannot be checked. This can be normal (e.g. some connectors are cloud-specific, and only listed in the cloud seed file). Please double-check to make sure that it is not a bug.

diff seed version
The connector exists in the seed file, but the latest version is not listed there. This usually means that the latest version is not published. Please use the /publish command to publish the latest version.

@gosusnp gosusnp temporarily deployed to more-secrets January 24, 2023 02:00 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 24, 2023 02:00 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 24, 2023 18:12 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 24, 2023 18:12 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 24, 2023 19:38 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 24, 2023 19:38 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 24, 2023 19:47 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 24, 2023 19:47 — with GitHub Actions Inactive
@github-actions
Copy link
Contributor

github-actions bot commented Jan 24, 2023

Airbyte Code Coverage

File Coverage [84.84%] 🍏
SchemaMigrations.java 98.14% 🍏
SchemaMigrationV1.java 96.39% 🍏
DefaultJobPersistence.java 95.54% 🍏
ConfigRepository.java 81.87% 🍏
DbConverter.java 80.42% 🍏
CatalogMigrationV1Helper.java 0%
NormalizationActivityImpl.java 0%
Total Project Coverage 24.42%

@gosusnp gosusnp temporarily deployed to more-secrets January 24, 2023 23:20 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 24, 2023 23:20 — with GitHub Actions Inactive
Copy link
Contributor

@edgao edgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CatalogMIgrationV1Helper looks right. I skimmed through the rest of the pr and didn't see anything noticeably off, though I definitely didn't fully understand all of it 🤷

also confirmed that UI changes have the right schema->type mappings.

@@ -917,20 +919,32 @@ private static Job getJobFromRecord(final Record record) {
return new Job(record.get(JOB_ID, Long.class),
Enums.toEnum(record.get("config_type", String.class), ConfigType.class).orElseThrow(),
record.get("scope", String.class),
Jsons.deserialize(record.get("config", String.class), JobConfig.class),
parseJobConfigFromString(record.get("config", String.class)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would we have a jobconfig with a v0 schema? Wouldn't we upgrade the schema before creating the jobconfig?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're updating on the fly, there is no migrations of the DB. All jobs that were created before the upgrade will have v0 data.

…-the-quick-way' into gosusnp/on-the-fly-catalog-migrations
@octavia-squidington-iv octavia-squidington-iv removed area/documentation Improvements or additions to documentation area/server labels Jan 25, 2023
@gosusnp gosusnp temporarily deployed to more-secrets January 25, 2023 01:31 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 25, 2023 01:31 — with GitHub Actions Inactive
@github-actions
Copy link
Contributor

github-actions bot commented Jan 25, 2023

Platform Test Results

   239 files  +  48     239 suites  +48   24m 59s ⏱️ + 19m 25s
1 642 tests +200  1 631 ✔️ +195  11 💤 +7  0  - 2 
1 662 runs  +207  1 651 ✔️ +202  11 💤 +7  0  - 2 

Results for commit d3cd495. ± Comparison against base commit c3193ae.

♻️ This comment has been updated with latest results.

Comment on lines 631 to 640
"airbyte.datatype.WellKnownTypes.json#/definitions/String": "String",
"airbyte.datatype.WellKnownTypes.json#/definitions/BinaryData": "Binary Data",
"airbyte.datatype.WellKnownTypes.json#/definitions/Number": "Number",
"airbyte.datatype.WellKnownTypes.json#/definitions/Integer": "Integer",
"airbyte.datatype.WellKnownTypes.json#/definitions/Boolean": "Boolean",
"airbyte.datatype.WellKnownTypes.json#/definitions/Date": "Date",
"airbyte.datatype.WellKnownTypes.json#/definitions/TimestampWithTimezone": "Timestamp with Timezone",
"airbyte.datatype.WellKnownTypes.json#/definitions/TimestampWithoutTimezone": "Timestamp without Timezone",
"airbyte.datatype.WellKnownTypes.json#/definitions/TimeWithTimezone": "Time with Timezone",
"airbyte.datatype.WellKnownTypes.json#/definitions/TimeWithoutTimezone": "Time without Timezone",
Copy link
Contributor

@edmundito edmundito Jan 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we still want to use the airbyte.datatype values below this so that we don't have duplicate strings. The utility function may need a way to translate these types, for example:

"WellKnownTypes.json#/definitions/TimestampWithTimezone" -> "airbyte.datatype.timestamp_with_timezone"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a map in useTranslateDataType utility and updated the strings here.

@gosusnp gosusnp temporarily deployed to more-secrets January 25, 2023 18:37 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 25, 2023 18:37 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 25, 2023 19:08 — with GitHub Actions Inactive
Copy link
Contributor

@edmundito edmundito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frontend changes look good and tested on areas that use the data type

Copy link
Member

@colesnodgrass colesnodgrass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look fine.

Are we emitting any metrics when upgradeschemaifNeeded is need to track how often that method is actually upgrading schemas? If we're not, that is probably something we should track and emit to DD.

Comment on lines +1187 to +1189
// We do not apply the on-the-fly migration here because the only caller is getOrInsertActorCatalog
// which is using this to figure out if the catalog has already been inserted. Migrating on the fly
// here will cause us to add a duplicate each time we check for existence of a catalog.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems risky, granted this is a private method so there is less risk, but I wonder if this code should be moved to the getOrInsertActorCatalog method to prevent any future issues from occurring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am open to inlining this in getOrInsertActorCatalog. One extra note is that this behavior (the insert only inserts when it's actually different) is currently tested.

@@ -3,19 +3,37 @@ import { useIntl } from "react-intl";

export interface AirbyteConnectorData {
type: string;
$ref?: string;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any tests need to be updated or added for this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UI change happened in this PR because it was breaking some integration tests.

@gosusnp gosusnp temporarily deployed to more-secrets January 26, 2023 19:47 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 26, 2023 19:47 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 26, 2023 21:32 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 26, 2023 22:57 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 26, 2023 22:57 — with GitHub Actions Inactive
@gosusnp gosusnp temporarily deployed to more-secrets January 26, 2023 23:01 — with GitHub Actions Inactive
Copy link
Contributor

@benmoriceau benmoriceau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I'll let Cole to approve

@gosusnp gosusnp merged commit 8889370 into gosusnp/platform-use-protocol-v1-the-quick-way Jan 27, 2023
@gosusnp gosusnp deleted the gosusnp/on-the-fly-catalog-migrations branch January 27, 2023 18:18
gosusnp added a commit that referenced this pull request Jan 30, 2023
* Add Airbyte Protocol V1 support.

* Fix VersionedAirbyteStreamFactoryTest

* Remove AirbyteMessageMigrationV0 example

* Add Protocol Version constants

* 🎉Updated normalization to handle new datatypes (#19721)

* Updated normalization simple stream processing to handle new datatypes

* Updated normalization nested stream processing to handle new datatypes

* Updated normalization nested stream processing to handle new datatypes

* Updated normalization drop_scd_catalog processing to handle new datatypes

* Updated normalization ephemeral test processing to handle new datatypes

* fixed more tests for normalization

* fixed more tests for normalization

* fixed more tests for normalization

* fixed more tests for normalization

* fixed more issues

* fixed more issues (clickhouse)

* fixed more issues

* fixed more issues

* fixed more issues

* added binary type processing for some DBs

* cleared commented code and moved some hardcodes to processing as macro

* fixed codestyle and cleared commented code

* minor refactor

* minor refactor

* minor refactor

* fixed bool cast error

* fixed dict->str cast error

* fixed is_combining_node cast py check

* removed commented code

* removed commented code

* committed autogenerated normalization_test_output files

* committed autogenerated normalization_test_output files (new files)

* refactored utils.py

* Updated utils.py to use Callable functions and get rid of property_type in is_number and is_bool functions

* committed autogenerated normalization_test_output files (new files)

* fixed typo in TIMESTAMP_WITH_TIMEZONE_TYPE

* updated stream_processor to handle string type first as a wider type

* fixed arrays normalization by updating is_simple_property method as per new approaches

* format

Co-authored-by: Edward Gao <[email protected]>

* Update airbyte protocol migration (#20745)

* Extract MigrationContainer from AirbyteMessageMigrator

* Add ConfiguredAirbyteCatalogMigrations

* Add ConfiguredAirbyteCatalog to AirbyteMessageMigrations

* Enable ConfiguredAirbyteCatalog migration

* Fix tests

* Remove extra this.

* Add missing docs

* Typo

Co-authored-by: Edward Gao <[email protected]>

* Data types update: Implement protocol message migrations (#19240)

* Extract MigrationContainer from AirbyteMessageMigrator

* Add ConfiguredAirbyteCatalogMigrations

* Add ConfiguredAirbyteCatalog to AirbyteMessageMigrations

* Enable ConfiguredAirbyteCatalog migration

* set up scaffolding

* [wip] more scaffolding, basic unit test

* minimal green code

* [wip] add failing test for other primitive types

* correct version number

* handle basic primitive type decls

* add implicit cases

* add recursive schema

* formatting

* comment

* support not

* fix indentation

* handle all nested schema cases

* handle boolean schemas

* verify empty schema handling

* cleanup

* extract map

* code organization

* extract method

* reformat

* [wip] more tests, minor fix type array handling

* corrected test

* cleanup

* reformat

* switch to v1

* add support for multityped fields

* missed test case

* nested test class

* basic record upgrade

* implement record upgrades

* slight refactor

* comments+clarificationso

* extract constants

* (partly) correct model classes

* add de/ser

* formatting

* extract constants

* fix json reference

* update docs

* switch to v1 models

* fix compile+test

* add base64 handling

* use vnull

* Data types update: Implement protocol message downgrade path (#19909)

* rough skeleton for passing catalog into migration

* basic test

* more scaffolding

* basic implementation

* add primitives test

* add in other tests (nested fields currently failing)

* add formats

* impleent oneOf handling

* formatting

* oneOf handling

* better tests

* comments + organization

* progress

* basic test case

* downgrade objects, ish

* basic array implementation

* handle numeric failure

* test for new type

* handle array items

* empty schema handling

* first pass at oneof handling

* add more tests+handling

* more tests

* comments

* add empty oneof test case

* format + reorganize

* more reorganize

* fix name

* also downgrade binary data

* only import vnull

* move migrations into v1 package

* extract schema mutation code

* comment

* extract schema migration to new class

* extract record downgrade logic for future use

* format

* fix build after rebase

* rename private method for consistency

* also implement configuredcatalog migrations >.>

* quick and dirty tests

* slight cleanup

* fix tests

* pmd

* pmd test

* null check on message objects

* maybe fix acceptance tests?

* fix name

* extract constants

* more fixes

* tmp

* meh

* fix cdc acc tests

* revert to master source-postgres

* remove log messages

* revert other misc hacks

* integers are valid cursors

* remove unrelated change

* fix build

* fix build more?

* [MUST REVERT] use dev normalization

* capture kube logs

* also here?

* no debug logs?

* delete dup from merging

* add final everywhere

* revert test changes

Co-authored-by: Jimmy Ma <[email protected]>

* On-the-fly migrations of persisted catalogs (#21757)

* On the fly catalog migration for normalization activity

* On the fly catalog migration for job persistence

* On the fly migration for standard sync persistence

* On the fly migration for airbyte catalogs

* Refactor code to share JsonSchema traversal

* Add V0 Data type search function

* PMD and Format

* Fix getOrInsertActorCatalog and ConfigRepositoryE2E tests

* Null-proofing CatalogMigrationV1Helper

* More null checks

* Fix test

* Format

* Add data type v1 support to the FE

* Changes AC test check to check exited ps (#21672)

some docker compose changes no longer show exited
processes.  this broke out test

this change should fix master

tested in a runner that failed

* Move wellknown types mapping to the utility function

* use protocolv1 normalization

---------

Co-authored-by: Topher Lubaway <[email protected]>
Co-authored-by: Edward Gao <[email protected]>

* Update protocol support range (#21996)

* bump normalization version to 0.3.0

* Add version check on normalization (#22048)

* Add normalization min version check

* Add visible for testing

---------

Co-authored-by: Edward Gao <[email protected]>
Co-authored-by: Eugene <[email protected]>
Co-authored-by: Topher Lubaway <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/frontend Related to the Airbyte webapp area/platform issues related to the platform area/worker Related to worker
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants