Eradicate errors in newly-introduced history-mode tables #41

Merged 3 commits on Aug 21, 2023

Conversation

fivetran-jamie
Collaborator

@fivetran-jamie fivetran-jamie commented Aug 17, 2023

PR Overview

This PR will address the following Issue/Feature:
extension of #40

The addition of the _fivetran_active field altered the grain of the *_history tables: certain changes in Google Ads (such as budgetary changes) may or may not change the updated_at field, but will still pass new records to the Fivetran connector.
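To make the grain problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table contents are invented for illustration (they are not from the internal dataset mentioned in this PR), and the query is only the general kind of group-by check a column-combination uniqueness test performs, not the compiled test itself.

```python
import sqlite3

# In-memory stand-in for a history table; all rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table ad_group_history (
        ad_group_id integer,
        updated_at text,
        _fivetran_start text,
        _fivetran_active integer
    )
    """
)
conn.executemany(
    "insert into ad_group_history values (?, ?, ?, ?)",
    [
        # Assumed scenario: a budget change creates a new record for
        # ad group 1 without changing updated_at.
        (1, "2023-08-01 00:00:00", "2023-08-01 00:00:00", 0),
        (1, "2023-08-01 00:00:00", "2023-08-05 00:00:00", 1),
        (2, "2023-08-02 00:00:00", "2023-08-02 00:00:00", 1),
    ],
)

# Any (ad_group_id, updated_at) pair that appears more than once
# breaks uniqueness on that grain.
duplicates = conn.execute(
    """
    select ad_group_id, updated_at, count(*) as n
    from ad_group_history
    group by ad_group_id, updated_at
    having count(*) > 1
    """
).fetchall()

print(duplicates)
```

Under these assumptions the pair for ad group 1 comes back as a duplicate, which is the shape of failure the uniqueness tests report.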

This PR will result in the following new package version:

v0.9.3 -- nothing should change for people without the _fivetran_active column. For those with it, this PR will fix the errors popping up around the new grain.

Please detail what change(s) this PR introduces and any additional information that should be known during the review of this PR:

PR Checklist

Basic Validation

Please acknowledge that you have successfully performed the following commands locally:

  • dbt compile
  • dbt run --full-refresh
  • dbt run
  • dbt test
  • dbt run --vars (if applicable)

Before marking this PR as "ready for review" the following have been applied:

  • The appropriate issue has been linked and tagged
  • You are assigned to the corresponding issue and this PR
  • BuildKite integration tests are passing

Detailed Validation

Please acknowledge that the following validation checks have been performed prior to marking this PR as "ready for review":

  • You have validated these changes and assure this PR will address the respective Issue/Feature.
  • You are reasonably confident these changes will not impact any other components of this package or any dependent packages.
  • You have provided details below around the validation steps performed to gain confidence in these changes.

I actually did find an internal dataset with the new _fivetran_active field. Here's the output of dbt run + dbt test using main:

18:35:32  Completed with 3 errors and 0 warnings:
18:35:32  
18:35:32  Failure in test dbt_utils_unique_combination_of_columns_stg_google_ads__ad_group_history_ad_group_id__updated_at (models/stg_google_ads.yml)
18:35:32    Got 174 results, configured to fail if != 0
18:35:32  
18:35:32    compiled Code at target/compiled/google_ads_source/models/stg_google_ads.yml/dbt_utils_unique_combination_o_0c1cbeb5a9539431a7fbce6af1a21d7a.sql
18:35:32  
18:35:32  Failure in test dbt_utils_unique_combination_of_columns_stg_google_ads__ad_history_ad_id__ad_group_id__updated_at (models/stg_google_ads.yml)
18:35:32    Got 7 results, configured to fail if != 0
18:35:32  
18:35:32    compiled Code at target/compiled/google_ads_source/models/stg_google_ads.yml/dbt_utils_unique_combination_o_0cf5dbf0b60dae1b36794a079a6f8b74.sql
18:35:32  
18:35:32  Failure in test dbt_utils_unique_combination_of_columns_stg_google_ads__campaign_history_campaign_id__updated_at (models/stg_google_ads.yml)
18:35:32    Got 1 result, configured to fail if != 0
18:35:32  
18:35:32    compiled Code at target/compiled/google_ads_source/models/stg_google_ads.yml/dbt_utils_unique_combination_o_bd5040437362e14b36ab7ce3eaa14d1d.sql
18:35:32  
18:35:32  Done. PASS=23 WARN=0 ERROR=3 SKIP=0 TOTAL=26

and using this working branch:

18:38:44  Completed successfully
18:38:44  
18:38:44  Done. PASS=26 WARN=0 ERROR=0 SKIP=0 TOTAL=26

Moreover, @kaogilvie verified in the thread of #40 that this fix worked for them

Standard Updates

Please acknowledge that your PR contains the following standard updates:

  • Package versioning has been appropriately indexed in the following locations:
    • indexed within dbt_project.yml
    • indexed within integration_tests/dbt_project.yml
  • CHANGELOG has individual entries for each respective change in this PR
  • [NA] README updates have been applied (if applicable)
  • [NA] DECISIONLOG updates have been updated (if applicable)
  • Appropriate yml documentation has been added (if applicable)

dbt Docs

Please acknowledge that after the above were all completed the below were applied to your branch:

  • docs were regenerated (unless this PR does not include any code or yml updates)

If you had to summarize this PR in an emoji, which would it be?

🧯

@fivetran-jamie fivetran-jamie self-assigned this Aug 17, 2023
@fivetran-avinash fivetran-avinash self-requested a review August 17, 2023 21:30
Contributor

@fivetran-avinash fivetran-avinash left a comment

Hey @fivetran-jamie, is the intent of the PR to filter out all _fivetran_active = false records? This PR does eliminate the test failures but it also filters out all the false records and there are generally a lot of them.

For example, comparing the ad group history model before and after, over 90% of the records have been removed. I know it's eventually filtered out in the final google_ads package, but not sure if we want to be that restrictive in the source package.

I wonder if the best way to eliminate the test failures is to min/max _fivetran_start and _fivetran_end on the id/updated_at grains (if _fivetran_start and _fivetran_end exist). Although that is a bit of a heftier PR to apply that logic.
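A rough sketch of the min/max idea, again with Python's sqlite3 and invented rows. The column names follow the ones mentioned in this thread; the sample values, including the far-future end date on the open record, are assumptions, and how to pick the surviving values of the other (non-key) columns is left open here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table ad_group_history (
        ad_group_id integer,
        updated_at text,
        _fivetran_start text,
        _fivetran_end text
    )
    """
)
conn.executemany(
    "insert into ad_group_history values (?, ?, ?, ?)",
    [
        # Two records share (ad_group_id, updated_at); sample data only.
        (1, "2023-08-01", "2023-08-01", "2023-08-05"),
        (1, "2023-08-01", "2023-08-05", "2023-08-09"),
        (1, "2023-08-09", "2023-08-09", "9999-12-31"),
    ],
)

# Collapse each (ad_group_id, updated_at) group to one row that spans
# the earliest start and the latest end seen within the group.
collapsed = conn.execute(
    """
    select
        ad_group_id,
        updated_at,
        min(_fivetran_start) as _fivetran_start,
        max(_fivetran_end) as _fivetran_end
    from ad_group_history
    group by ad_group_id, updated_at
    order by ad_group_id, updated_at
    """
).fetchall()

for row in collapsed:
    print(row)
```

This keeps one row per id/updated_at pair while preserving history across distinct updated_at values, at the cost of extra aggregation logic in every history model.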

Also, do we need to create an issue for this PR for tags/GitHub tracking?

@fivetran-jamie
Collaborator Author

that is a very good point, as the staging models basically become non-historical with this filter on... we could simply swap updated_at with _fivetran_start in the uniqueness tests. i do wonder though if users would prefer to filter out non-active records for computational reasons
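a quick sketch of that swap, using Python's sqlite3 with made-up rows. duplicate_keys is a hypothetical helper written for this example, and the result relies on the assumption that _fivetran_start is distinct for each record of a given id:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table ad_group_history (
        ad_group_id integer,
        updated_at text,
        _fivetran_start text,
        _fivetran_active integer
    )
    """
)
conn.executemany(
    "insert into ad_group_history values (?, ?, ?, ?)",
    [
        # Sample data only: ad group 1 has two records with one updated_at.
        (1, "2023-08-01", "2023-08-01", 0),
        (1, "2023-08-01", "2023-08-05", 1),
        (2, "2023-08-02", "2023-08-02", 1),
    ],
)


def duplicate_keys(columns):
    """Return the key combinations that occur more than once (hypothetical helper)."""
    cols = ", ".join(columns)
    query = (
        f"select {cols}, count(*) from ad_group_history "
        f"group by {cols} having count(*) > 1"
    )
    return conn.execute(query).fetchall()


old_grain = duplicate_keys(["ad_group_id", "updated_at"])
new_grain = duplicate_keys(["ad_group_id", "_fivetran_start"])
total_rows = conn.execute("select count(*) from ad_group_history").fetchone()[0]

print(len(old_grain), len(new_grain), total_rows)
```

with these rows the updated_at grain has one duplicated key and the _fivetran_start grain has none, and no records are dropped to get there.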

curious what @fivetran-joemarkiewicz thinks (and if any package users want to chime in, i'm all ears 👂 🌽 )

@fivetran-joemarkiewicz
Contributor

@fivetran-avinash thank you for critically reviewing this PR and having a keen eye on how we may possibly keep the historical records so customers may still leverage them. However, after discussing this with the product team we decided the best immediate approach is to filter out the historical records for this first phase of the history rollouts.

I think this is something we should discuss more as a team to determine how we should best approach these newer history tables in connectors, as they will be added to more connectors in the future. In the past, we have simply taken the approach of filtering out any non-active records to make it easier for users to leverage the data in the staging models without needing to account for any historical nuance. This is similar to what we did in the Salesforce package originally to counteract historical data (although we did add a variable to introduce history records if the customer wanted), though I do not feel the variable is the correct route going forward.

We can discuss in our next data team meeting, and with customers, how we may want to handle these going forward, but for right now we should filter them out to avoid the errors the customers are seeing.

Contributor

@fivetran-avinash fivetran-avinash left a comment

Based on the note from @fivetran-joemarkiewicz above, I've gone ahead and approved!
