
fix: duplicate record test #47

Merged
merged 6 commits into main on Jun 14, 2023

Conversation

pnadolny13 (Contributor)

Closes #41

The challenge is that we're using a merge statement, which successfully deduplicates against what already exists in the target table, but the batch of records in the stage also contains dupes. The test was failing because no data existed in the destination table, so we weren't updating any records, only inserting. Within our staging file we had multiple records for primary keys 1 and 2, so they all got inserted and the result was duplicates in the destination table.
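
For illustration, the shape of the failure with hypothetical records (the actual test data comes from the SDK's standard target tests, not from here):

```python
# Hypothetical batch illustrating the bug described above.
staged_batch = [
    {"id": 1, "value": "first"},
    {"id": 1, "value": "second"},  # duplicate primary key within the batch
    {"id": 2, "value": "third"},
]

# The destination table is empty, so no staged row matches an existing
# target row. Every row takes the MERGE's "WHEN NOT MATCHED THEN INSERT"
# branch, and both id=1 rows land in the destination table as duplicates.
assert len(staged_batch) == 3  # after a correct in-batch dedupe: 2 rows
```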

The way I fixed it in this PR is by adding a qualify row_num = 1 to deduplicate within our staging file select query. It uses the SEQ8 function, which I've never used before, to order the records by their position in the file, i.e. the bottom of the table takes precedence over the top. It looks to work as expected, but it feels a little sketchy; I wonder if unsorted streams would have issues where the wrong record gets selected. Ideally the user would give us a sort-by column so we know how to take the latest record.
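
Roughly, the dedup looks like the following sketch; the relation name staging_records and key column id are assumptions for illustration, not the exact identifiers this PR generates:

```python
# Hedged sketch of the QUALIFY-based dedup; "staging_records" and "id" are
# hypothetical names, not the actual query text from this PR.
DEDUPE_SQL = """
SELECT *
FROM staging_records
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY id        -- the stream's primary key
    ORDER BY SEQ8() DESC   -- SEQ8 follows row order, so the last dupe wins
) = 1
"""
```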

pnadolny13 marked this pull request as ready for review June 8, 2023 12:35
kgpayne (Collaborator) commented Jun 8, 2023

@pnadolny13 We currently only do a merge-upsert when key_properties are supplied, which I suppose assumes sorted streams. Adding handling for unsorted streams that have a replication_key supplied (to sort on) makes sense to me. If we always explicitly sort the staged data by replication_key (when it's defined) for dedupe, we can catch both cases and potentially avoid using SEQ8. It doesn't look like the Sink class has any helper attributes that check if a stream is sorted or not, so I'd say using the replication_key when available is our best bet 🤔
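
For illustration, that fallback could look something like this hypothetical helper (not code from this PR or the SDK):

```python
def dedupe_order_by(replication_key: str | None) -> str:
    """Hypothetical helper: choose the ORDER BY for the dedupe window.

    Prefer the stream's replication_key when one is defined; otherwise
    fall back to file position via SEQ8.
    """
    if replication_key is not None:
        # Latest replication-key value wins, regardless of file order.
        return f"ORDER BY {replication_key} DESC"
    # No replication key: assume file order is meaningful (last row wins).
    return "ORDER BY SEQ8() DESC"
```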

kgpayne (Collaborator) commented Jun 8, 2023

Just had a quick scan of the PPW variant and I can't see any indication that they are sorting or explicitly handling unsorted streams. I also don't see any temp-table sorting in target-postgres 🤔 cc @visch

Base automatically changed from fix_camelcase_test to main June 8, 2023 16:34
pnadolny13 (Contributor, Author)

@kgpayne thanks for the feedback! Is there a case where the replication_key is not sortable, though? I think users can set a string hash as key properties, for example; in that case we'd accidentally do an unreliable sort.

> We currently only do a merge-upsert when key_properties are supplied, which I suppose assumes sorted streams

Oh interesting, yeah, I guess if the stream was unsorted then the merge-upsert logic, which is the same in the PPW variant, would have issues too. Maybe that's a safe assumption, given how long the PPW variants have been reliably running in production 🤔.

Especially since the current default target doesn't support this, I'm almost leaning towards deferring it until later if there are no clean assumptions we can make. I would rather see the edge case of duplicate records in Snowflake than accidentally choose the wrong data as latest.

pnadolny13 (Contributor, Author) commented Jun 14, 2023

@kgpayne I figured out where this is handled in the PPW variant.

They iterate over incoming records and collect them in a dictionary for batching. The key they use for the dict is based on the PK if one is supplied; otherwise it's based on the incrementing row counter they keep for metrics. So if two records with the same PK arrive and key properties are set on the stream, the last to arrive wins, overwriting the first record. I think it's safe to replicate this behavior in this target for now while we explore the sorting edge cases.
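
A sketch of that behavior, with illustrative names (not the actual PPW code):

```python
def add_to_batch(
    batch: dict,
    record: dict,
    key_properties: list[str],
    row_counter: int,
) -> None:
    """Collect records into a dict for batching; names are illustrative.

    With key_properties set, a later record with the same PK overwrites an
    earlier one ("last to arrive wins"); without them, every record is kept.
    """
    if key_properties:
        key = tuple(record[k] for k in key_properties)
    else:
        # No primary key: key on the incrementing row counter (unique per row).
        key = ("row", row_counter)
    batch[key] = record
```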

pnadolny13 merged commit 5511ccd into main Jun 14, 2023
pnadolny13 deleted the fix_duplicate_record_test branch June 14, 2023 22:09
Linked issue (#41): bug: default test TargetDuplicateRecords failing, records not deduped using key_properties