attempted fix for hierarchy stream primary key #28
Conversation
attempted fix for hierarchy stream primary key
overhauld of the primary key designations and flatten of schemas to make them possible
Thanks for this PR @jlloyd-widen. I'm trying to fix #30 first; generally the idea here seems right! Thanks for adding the test, I'll come back after I get #30 fixed.
I'm a little worried about responses for GAQL queries like the one at https://github.com/AutoIDM/tap-googleads/blob/main/tap_googleads/streams.py#L316 . Looks like you can have multiple nested objects returned by Google's API. Maybe we just pull the primary key out of the record and put it at the root level and leave the rest? I haven't dug in enough to know.
This is pretty well addressed by your PR. I wonder if leaving the flattening up to the USER would be appropriate here, since we could hop onto the new functionality in https://gitlab.com/meltano/sdk/-/merge_requests/236/diffs#5419fd77f507b41e5e0fe17b6fbf7375fbf2c3eb . That doesn't answer the question of how to get a primary key for each of these reports, though. The work you put in to make the primary key for each stream is good. Things I'm thinking about:
I'm not in love with the surrogate key idea. We're at a trade-off point: do we keep the data as close to the source as possible by leaving it in the same structure as before and adding a surrogate key, or do we de-nest the objects ourselves? Just questions, not really answers to anything.
Gave this a shot with some realistic data; it fails on the
Also fails on the
Added Data for one of the failures
@visch This got me thinking that the best way to handle this probably wouldn't be to introduce flattening like I have here, but instead to introduce the ability to copy (not cut or flatten) the primary keys from wherever they naturally are in the JSON structure to the root level of the record. You could use JSON paths to designate each primary key. For example, if the primary key JSON path points at object_1.example_primary_key, then:

{
  "root_data_1": "foo",
  "object_1": {
    "example_primary_key": 1,
    "other_data_1": "bar"
  }
}

would become:

{
  "root_data_1": "foo",
  "example_primary_key": 1,
  "object_1": {
    "example_primary_key": 1,
    "other_data_1": "bar"
  }
}

The downside is that this introduces duplicate data into almost all of the streams; the upsides are
I think I might do this in a separate PR to demonstrate and then we can decide on which approach is preferred. I'll link all these issues and PRs once I have it.
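To make the idea concrete, here is a minimal sketch of that copy-to-root approach. The helper name, dotted-path format, and placement of the copied key are my own assumptions for illustration; they are not the tap's actual API.

from typing import Any, Dict, List

def copy_primary_keys_to_root(record: Dict[str, Any], pk_paths: List[str]) -> Dict[str, Any]:
    """Copy each primary key value from its nested location up to the record root.

    The nested objects are left untouched, so the record keeps the shape the API
    returned; only the key values are duplicated at the top level.
    """
    for path in pk_paths:
        levels = path.split(".")
        val: Any = record
        for level in levels:
            val = val[level]      # walk down one level per path segment
        record[levels[-1]] = val  # expose the leaf value at the root
    return record

record = {"root_data_1": "foo", "object_1": {"example_primary_key": 1, "other_data_1": "bar"}}
copy_primary_keys_to_root(record, ["object_1.example_primary_key"])
# record is now {"root_data_1": "foo", "object_1": {...}, "example_primary_key": 1}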
Revert "overhauld of the primary key designations and flatten of schemas to m…"
Revert "attempted fix for hierarchy stream primary key"
I think I like that idea best! Maybe the key name could be something like _sdc_primary_key to follow Singer conventions thus far?
That makes sense for a single primary key. What's the naming convention for a composite primary key? UPDATE: Actually there isn't a convention, but the closest pattern I could come up with was
Hmm, I was thinking we'd say SCHEMA primary_keys = ['_sdc_primary_key'] and if
We may want to add a check to be sure the generated primary key is actually unique? But that might be overkill, honestly. Maybe I'm missing something? Let me know! I guess we could always just not use
That's a better idea. I'm in favor of using the
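If we did want that uniqueness check, a rough sketch (purely illustrative, not code from this PR) could simply track the generated _sdc_primary_key values per stream and warn on a collision:

import logging
from typing import Set

logger = logging.getLogger(__name__)
seen_keys: Set[str] = set()  # generated _sdc_primary_key values seen so far in this stream

def check_generated_pk(pk_value: str) -> None:
    """Warn if a generated primary key value repeats within a single sync."""
    if pk_value in seen_keys:
        logger.warning("Duplicate generated primary key: %s", pk_value)
    seen_keys.add(pk_value)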
Fix pk with config
@visch updated the PR to match the discussion. Give it a run and let me know what you think. Hopefully this also resolves the errors you were getting.
Skimmed the PR; generally looks good. I'll deep dive once we get these bugs ironed out.
@visch Your account apparently doesn't have the field available that I was using to set a primary key, so I used an alternative field in the same query. Give this new commit a shot in your account and see if it works.
Ran this and it did work! Doing a code review now.
# walk the nested record down one level per path segment
for i, level in enumerate(levels):
    val = val[level]
    # at the leaf, append the value to the composite key string (":"-separated)
    if i == len(levels) - 1:
        pk_str += ":" + str(val) if pk_str else str(val)
I know for MSSQL there's a limit to the size a PK index can be: https://docs.microsoft.com/en-us/sql/relational-databases/tables/primary-and-foreign-key-constraints?view=sql-server-ver15#:~:text=A%20table%20can%20contain%20only,key%20length%20of%20900%20bytes.
Postgres has a similar limit, I believe. I think we're fine for now, but we may have to evaluate this at some point and maybe just md5 the whole thing.
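If key length ever does become a problem, a sketch of the md5 idea above (not what the tap does today) would be to hash the composite string so the stored key stays a fixed 32 characters:

import hashlib

def hashed_pk(pk_str: str) -> str:
    # md5 keeps the stored primary key at 32 hex characters, well under MSSQL's 900-byte index limit
    return hashlib.md5(pk_str.encode("utf-8")).hexdigest()

hashed_pk("1234567890:9876543210:2022-01-01")  # -> 32-character hex digest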
Made some changes to queries while this was being written. We'll have to merge this PR, but I think we'll be pretty close to set after that. I'll go ahead and merge them and make a PR for this branch on your GitHub. We'll see how this goes :D
Merged with latest changes on AutoIDM main branch
Merged your PR on my repo's branch, so this should not have merge conflicts anymore and should be up to date.
Added a few more things (you were too fast for me, and I forgot a few things, ha). Could you run the latest stuff with your Google Ads data to be sure it works on your side? After that I think we're good to go. Part of me wants to run an audit on some data here to be sure we didn't miss any data in the migration. I'll think about that; if this works on your side I may just merge it!
@@ -0,0 +1,32 @@
from tap_googleads.utils import replicate_pk_at_root
Thanks for writing tests!
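For readers following along, a test for this helper might look roughly like the sketch below. The (record, key_paths) signature of replicate_pk_at_root is an assumption on my part and may not match the actual implementation in this PR.

from tap_googleads.utils import replicate_pk_at_root

def test_replicate_pk_at_root():
    # hypothetical signature: (record, list of dotted key paths)
    record = {"object_1": {"example_primary_key": 1, "other_data_1": "bar"}}
    result = replicate_pk_at_root(record, ["object_1.example_primary_key"])
    assert result["example_primary_key"] == 1            # key copied to the root
    assert result["object_1"]["other_data_1"] == "bar"   # nested data left untouched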
Fix pk merged
Sorry for the happy trigger finger ;) I ran all streams locally with your latest changes. Everything ran smoothly, so I merged it. Let me know if you have any other edits.
* Sync all customers for a given stream
* Add logging to see when we retry requests
* Update currently_syncing with customerId too. Write state as soon as we update it
* Add the customerId to the bookmark keys
* Add shuffle for customerId and tap_stream_id; add shuffle unit tests
* Bug fix for when currently_syncing is null
* Fix exception handling typeError
* Fix none cases for currently_syncing
* Fix currently_syncing to write a tuple we can read in later
* Add get_customer_ids so we can use it in the tests
* Fix manipulated_state to account for customer_ids
* Update assertion for currently_syncing
* Fix currently syncing assertion
* Move bookmark access into Full Table assertions section Full Table doesn't need the "stream_name and customer id" key logic
* Remove duplicate assertion
* Revert 6db016e7ec29c2b00973b671c1efdf9451aca9c2
* Update bookmark to read stream->customer->replication_key
* Update tap to write bookmarks as stream->customer->replication_key
* Update manipulated state to nest stream->customer->replication_key
* Run bookmark assertions for every customer
* Fix dict comprehension typo
* Fix conflict with main
* Remove `get_state_key` again, use env var instead of hardcoded value
* Add missing dependency
* Move currently-syncing-null-out to the end of sync to prevent gaps
* Sort selected_streams and customers to guarantee consistency across runs
* Don't let the tap write (None, None)
* Sort selected_streams and customers effectively
* Update currently_syncing test assertions
* Add sort functions for streams and customers
* Update `shuffle` to handle a missing value
* Update unit tests to use sort_function, add a test for shuffling streams
* Add end date (AutoIDM#28)
* Add optional end date, add unit tests
Co-authored-by: Andy Lu <[email protected]>
* Test functions can't be named run_test apparently
* Rename do_thing
* Extract `get_queries_from_sync` as a function
* Remove unused variable
* Refactor tests to be more explicit
* Mock singer.utils.now to return a specific date
Co-authored-by: Andy Lu <[email protected]>
* add conversion_window test
* fixed conversion window unittests, bug removed
Co-authored-by: dylan-stitch <[email protected]>
Co-authored-by: Andy Lu <[email protected]>
Co-authored-by: kspeer <[email protected]>
* add conversion window test
* add conversion window test
* wip updated tests to worka with currently syncing dev branch [skip ci]
* Revert removal of metric compatibility removal (AutoIDM#29)
* Revert removal of metric compatibility removal
* Whitespace cleanup
* Add currently syncing (AutoIDM#24)
* Sync all customers for a given stream
* Add logging to see when we retry requests
* Update currently_syncing with customerId too. Write state as soon as we update it
* Add the customerId to the bookmark keys
* Add shuffle for customerId and tap_stream_id; add shuffle unit tests
* Bug fix for when currently_syncing is null
* Fix exception handling typeError
* Fix none cases for currently_syncing
* Fix currently_syncing to write a tuple we can read in later
* Add get_customer_ids so we can use it in the tests
* Fix manipulated_state to account for customer_ids
* Update assertion for currently_syncing
* Fix currently syncing assertion
* Move bookmark access into Full Table assertions section Full Table doesn't need the "stream_name and customer id" key logic
* Remove duplicate assertion
* Revert 6db016e7ec29c2b00973b671c1efdf9451aca9c2
* Update bookmark to read stream->customer->replication_key
* Update tap to write bookmarks as stream->customer->replication_key
* Update manipulated state to nest stream->customer->replication_key
* Run bookmark assertions for every customer
* Fix dict comprehension typo
* Fix conflict with main
* Remove `get_state_key` again, use env var instead of hardcoded value
* Add missing dependency
* Move currently-syncing-null-out to the end of sync to prevent gaps
* Sort selected_streams and customers to guarantee consistency across runs
* Don't let the tap write (None, None)
* Sort selected_streams and customers effectively
* Update currently_syncing test assertions
* Add sort functions for streams and customers
* Update `shuffle` to handle a missing value
* Update unit tests to use sort_function, add a test for shuffling streams
* Add end date (AutoIDM#28)
* Add optional end date, add unit tests
Co-authored-by: Andy Lu <[email protected]>
* Test functions can't be named run_test apparently
* Rename do_thing
* Extract `get_queries_from_sync` as a function
* Remove unused variable
* Refactor tests to be more explicit
* Mock singer.utils.now to return a specific date
Co-authored-by: Andy Lu <[email protected]>
* add conversion_window test
* fixed conversion window unittests, bug removed
Co-authored-by: dylan-stitch <[email protected]>
Co-authored-by: Andy Lu <[email protected]>
Co-authored-by: kspeer <[email protected]>
* Bump to v0.2.0, update changelog (AutoIDM#31)
* Bump to v0.2.0, update changelog
* Add link for this PR, fix link syntax
* Update changelog format
* expanded conversion window testing for error case, BUG linked
* parallelism 8 -> 12
* added unittest for start date within conversion window
Co-authored-by: kspeer <[email protected]>
Co-authored-by: Dylan <[email protected]>
Co-authored-by: dylan-stitch <[email protected]>
Co-authored-by: Andy Lu <[email protected]>
fixes #18. Specifically, it fixes the primary key errors mentioned in that issue. This required flattening the JSON structures returned by the API. There were also several instances where query terms were missing from the Google query, and the primary keys were simply incorrect.
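As a hedged illustration of the missing-query-terms point (the GAQL fields below are standard Google Ads API fields, but the class and key names are placeholders, not this tap's actual code): a stream's GAQL has to select every field the tap later declares as a primary key, otherwise the key values never appear in the API response at all.

# Illustration only: the declared keys must be covered by fields in the SELECT clause.
CAMPAIGN_GAQL = """
    SELECT customer.id, campaign.id, campaign.name
    FROM campaign
"""

class CampaignStreamSketch:
    # placeholder names, not the tap's real stream class or key naming
    primary_keys = ["customer_id", "campaign_id"]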