Source MySQL CDC: _airbyte_start_at and _airbyte_end_at = _airbyte_emitted_at #9215

Closed
octavia-squidington-iii opened this issue Dec 30, 2021 · 2 comments · Fixed by #9281

Comments

@octavia-squidington-iii
Collaborator

Is this your first time deploying Airbyte: Yes
OS Version / Instance: EC2
Deployment: Docker-compose evaluation
Airbyte Version: 0.35.1-alpha
Source name/version: MySQL 8.0 (RDS)
Destination name/version: BigQuery
Step: After first incremental sync
Description: The values of _airbyte_start_at and _airbyte_end_at in my table are incorrect: both are set to the time of the most recent incremental sync. All the changes that occurred during the interval are correctly tagged, and _cdc_updated_at is also accurate, but _airbyte_start_at and _airbyte_end_at equal _airbyte_emitted_at in all cases.

https://airbytehq.slack.com/archives/C01MFR03D5W/p1640845411497200?thread_ts=1640845411.497200&cid=C01MFR03D5W

@zrait

zrait commented Dec 30, 2021

I mentioned this in the Slack thread, but I'm posting it here for context. It seems that when no cursor is explicitly specified, StreamProcessor should compute start and end times from _cdc_updated_at instead of falling back on _airbyte_emitted_at. With CDC and MySQL (which uses a source-defined cursor), it seems like that will always be the case. I'm pretty sure that if the default dbt normalization code fell back on _cdc_updated_at, it would properly handle connectors that use source-defined cursors, but I've only looked very briefly at the relevant code.
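To illustrate why the choice of cursor column matters, here is a minimal sketch of how start and end times can be derived from a cursor in a history-style table. This is not the normalization code itself: `add_scd_bounds`, the `id` primary key, and the sample rows are hypothetical, and the column names follow the ones used in this thread.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]


def add_scd_bounds(rows: List[Row], primary_key: str, cursor: str) -> List[Row]:
    """Return copies of `rows` with start/end times derived from `cursor`.

    Hypothetical helper: for each primary key, versions are ordered by the
    cursor column; a version starts at its own cursor value and ends at the
    cursor value of the next version (None for the latest version).
    """
    by_key: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        by_key[row[primary_key]].append(row)

    out: List[Row] = []
    for versions in by_key.values():
        ordered = sorted(versions, key=lambda r: r[cursor])
        for current, nxt in zip(ordered, ordered[1:] + [None]):
            out.append({
                **current,
                "_airbyte_start_at": current[cursor],
                "_airbyte_end_at": nxt[cursor] if nxt is not None else None,
            })
    return out


# Made-up sample: two changes to the same row, both picked up by one sync.
rows = [
    {"id": 1, "status": "new",
     "_cdc_updated_at": "2021-12-29T10:00:00Z",
     "_airbyte_emitted_at": "2021-12-30T06:00:00Z"},
    {"id": 1, "status": "paid",
     "_cdc_updated_at": "2021-12-29T15:30:00Z",
     "_airbyte_emitted_at": "2021-12-30T06:00:00Z"},
]

by_emitted = add_scd_bounds(rows, "id", "_airbyte_emitted_at")
by_cdc = add_scd_bounds(rows, "id", "_cdc_updated_at")
```

With `_airbyte_emitted_at` as the cursor, the first version gets a start and end that both equal the sync time, which matches the behavior reported above. With `_cdc_updated_at`, the same version spans the two actual change times.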

I think @ChristopheDuong has been most involved with the relevant code, so it would be great for him to chime in.

@ChristopheDuong
Contributor

ChristopheDuong commented Jan 4, 2022

I haven't done the integration of CDC data in normalization, but you are right: at the moment, the code prefers the emitted_at column over the CDC updated-at column.

Reversing the two would probably work better, since the CDC columns should be more relevant than runtime dates.

The fallback logic in the get_cursor_field method might need to change too.
For now, it falls back to the emitted_at column. It could try one of the CDC columns first instead, as long as all tests still pass.
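As a rough sketch of the reordered fallback (not the actual get_cursor_field implementation), the selection could look something like the following. The function name `pick_cursor_column` and its parameters are hypothetical, and the CDC column name is the one used in this thread, so it would need to match whatever the connector really emits.

```python
from typing import Iterable, Optional

# Column names as written in this thread (assumptions, not verified against the connector).
CDC_UPDATED_AT = "_cdc_updated_at"
EMITTED_AT = "_airbyte_emitted_at"


def pick_cursor_column(columns: Iterable[str], user_cursor: Optional[str] = None) -> str:
    """Choose the column used to order versions of a record.

    Preference order in this sketch:
      1. an explicitly configured cursor, if it exists in the stream
      2. the CDC updated-at column, if the stream has one
      3. the emitted-at column as a last resort
    """
    available = set(columns)
    if user_cursor and user_cursor in available:
        return user_cursor
    if CDC_UPDATED_AT in available:
        return CDC_UPDATED_AT
    return EMITTED_AT


cdc_stream = ["id", "status", CDC_UPDATED_AT, EMITTED_AT]
plain_stream = ["id", "status", "updated_at", EMITTED_AT]

cdc_choice = pick_cursor_column(cdc_stream)
plain_choice = pick_cursor_column(plain_stream)
explicit_choice = pick_cursor_column(plain_stream, user_cursor="updated_at")
```

Here a CDC stream with no explicit cursor resolves to the CDC column, while a non-CDC stream with no cursor still falls back to the emitted-at column, so existing behavior for those streams would stay the same.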
