Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Poor performance on int_hubspot__contact_merge_adjust #109

Closed
2 of 4 tasks
kcraig-ats opened this issue May 24, 2023 · 0 comments
Closed
2 of 4 tasks

[Bug] Poor performance on int_hubspot__contact_merge_adjust #109

kcraig-ats opened this issue May 24, 2023 · 0 comments

Comments

@kcraig-ats
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Describe the issue

With the changes that came in v0.9.0 our project has seen a increase in run times with the HubSpot package. It looks like int_hubspot__contact_merge_adjust is the issue, and specifically the merge_contacts macro that is driving the decrease in performance. We've seen run times range between 15-35 minutes.

Relevant error log or model output

No response

Expected behavior

I don't expect the query to run as long as it does.

dbt Project configurations

vars:
  hubspot_source:
    hubspot_schema: hubspot_fivetran 
  hubspot_email_event_forward_enabled: false
  hubspot_email_event_print_enabled: false
  hubspot_email_event_spam_report_enabled: false
  hubspot_service_enabled: false
  hubspot_contact_property_enabled: false
  hubspot__pass_through_all_columns: true

Package versions

  - package: fivetran/hubspot
    version: [">=0.9.0", "<0.10.0"]

What database are you using dbt with?

redshift

dbt Version

1.3

Additional Context

My guess is the merge_contacts query is suboptimal in redshift. I played around with a few solutions that sped up my run:

  • I filtered the return set of numbers so that the generated number was <= (select max(json_array_length(json_serialize(split_to_array(calculated_merged_vids, ';')), true)) from contacts). I added an additional subquery to numbers to achieve this, but I also considered doing this by setting the upper_bound variable using the run_query macro.
  • I also filtered contacts where calculated_merged_vids is not null. I'm not sure this achieved a significant boost but calculated_merged_vids is populated for <.01% of our contacts.

The query time averaged a little over a minute with the changes.

Are you willing to open a PR to help address this issue?

  • Yes.
  • Yes, but I will need assistance and will schedule time during our office hours for guidance
  • No.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant