
bug: default test TargetDuplicateRecords failing, records not deduped using key_properties #41

Closed
pnadolny13 opened this issue Jun 5, 2023 · 0 comments · Fixed by #47

@pnadolny13
Contributor

The test data https://github.com/meltano/sdk/blob/main/singer_sdk/testing/target_test_streams/duplicate_records.singer sends 5 records but contains only 2 distinct key property IDs, so later records should update earlier ones rather than being appended. I added an assertion to the test to verify that only 2 records are present in the final table, but there are still 5.

I think #15 attempted to dedupe using temp tables; maybe that was meant to fix this issue.
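
A minimal sketch of what that assertion amounts to, assuming a SQLAlchemy engine pointed at the target warehouse and a hypothetical duplicate_records table with an id key column (this is not the actual TargetDuplicateRecords test code):

```python
import sqlalchemy


def assert_deduped(engine: sqlalchemy.Engine, table: str = "duplicate_records") -> None:
    """Hypothetical check: after loading duplicate_records.singer, an upsert
    on key_properties should leave one row per distinct id."""
    with engine.connect() as conn:
        row_count = conn.execute(
            sqlalchemy.text(f"SELECT COUNT(*) FROM {table}")
        ).scalar_one()
        distinct_keys = conn.execute(
            sqlalchemy.text(f"SELECT COUNT(DISTINCT id) FROM {table}")
        ).scalar_one()
    # 5 records are sent but only 2 distinct ids exist, so only 2 rows should
    # remain; the bug reported here is that all 5 rows end up in the table.
    assert distinct_keys == 2
    assert row_count == distinct_keys
```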

pnadolny13 added a commit that referenced this issue Jun 6, 2023
Closes #28

- implements a fix for the test suites not running properly; also contributed it to the SDK in meltano/sdk#1749
- implements many of the default tests' validate methods to make assertions
- comments out the tests that fail for legitimate bugs
- logged those bugs:
  - #43
  - #41
  - #40
- also logged #42 because I wrote a test to assert the exception, but I'm not sure whether we actually want that behavior

---------

Co-authored-by: Ken Payne <[email protected]>
pnadolny13 self-assigned this Jun 8, 2023
pnadolny13 added a commit that referenced this issue Jun 14, 2023
Closes #41

The challenge is that we're using a merge statement, which successfully deduplicates against what already exists in the target table, but the batch of records in the stage also contains dupes. The test was failing because no data existed in the destination table, so we weren't updating any records, only inserting; and because the staging file contained multiple rows for primary keys 1 and 2, they all got inserted and the result was duplicates in the destination table.

The way I fixed it in this PR is by adding a QUALIFY row_num = 1 clause to deduplicate within the staging file select query. It uses the SEQ8 function, which I've never used before, to order the records by their position in the file, i.e. the bottom of the file takes precedence over the top. It looks to work as expected, but it feels a little sketchy; I wonder whether unsorted streams would have issues where the wrong record gets selected. Ideally the user would tell us a sort-by column so we know how to take the latest record.

---------

Co-authored-by: Ken Payne <[email protected]>
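
For illustration only, the deduplicating merge described in the commit above might look roughly like the following; the table, stage, file format, and column names are hypothetical, and this is a sketch of the query shape rather than the exact SQL the target generates:

```python
# Illustrative sketch: the MERGE source query keeps only the last occurrence
# of each primary key within the staged file, using Snowflake's SEQ8() to
# approximate file order. target_table, my_stage, json_format, id, and name
# are placeholder names.
merge_sql = """
MERGE INTO target_table AS t
USING (
    SELECT
        $1:id::INTEGER   AS id,
        $1:name::VARCHAR AS name
    FROM @my_stage/duplicate_records.jsonl.gz (FILE_FORMAT => 'json_format')
    -- Keep only the last row per primary key within this batch.
    QUALIFY ROW_NUMBER() OVER (PARTITION BY $1:id ORDER BY SEQ8() DESC) = 1
) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.name = s.name
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name)
"""
```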
github-project-automation bot moved this to Planned in Data Team Jun 14, 2023