🎉 Source stripe - enhanced performance for streams which run substreams #10359

Merged
merged 6 commits into from
Mar 10, 2022

midavadim
Contributor

@midavadim midavadim commented Feb 15, 2022

Enhanced performance for stripe source

What

Tested on sync stripe -> bigquery (with airbyte creds):

Succeeded
40.75 MB | 18,385 records | 57m 31s | Sync

The problem was with a few streams:

  1. invoice_line_items: 31m / 4985 items
    2022-02-10 16:23:40 source > Syncing stream: invoice_line_items
    2022-02-10 16:30:19 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):281 - Records read: 5000
    2022-02-10 16:37:07 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):281 - Records read: 6000
    2022-02-10 16:43:50 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):281 - Records read: 7000
    2022-02-10 16:49:47 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):281 - Records read: 8000
    2022-02-10 16:54:20 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):281 - Records read: 9000
    2022-02-10 16:54:23 source > Read 4985 records from invoice_line_items stream
    2022-02-10 16:54:23 source > Finished syncing SourceStripe
    2022-02-10 16:54:23 source > SourceStripe runtimes:

  2. subscription_items: 10m / 1995 items
    2022-02-10 16:58:30 source > Syncing stream: subscription_items
    2022-02-10 17:00:22 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):281 - Records read: 15000
    2022-02-10 17:05:57 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):281 - Records read: 16000
    2022-02-10 17:08:44 source > Read 1995 records from subscription_items stream
    2022-02-10 17:08:44 source > Finished syncing SourceStripe
    2022-02-10 17:08:44 source > SourceStripe runtimes:

Reason:
invoice_line_items - stream runs 1 request for each of 4372 invoices (main stream)
subscription_items - stream runs 1 request for each of 1686 subscriptions (main stream)

How

Research shows that records from the main streams already contain the first page of the needed items (invoice_line_items and subscription_items).
In most cases no additional pagination requests are needed, because the items are already fully reported in the main streams' records.
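
The idea can be sketched roughly as follows. This is only an illustration, not the connector's actual code: `read_sub_items` and the `fetch_page` callback are hypothetical names, and the sketch assumes that each parent record embeds a Stripe-style list object (`data` plus `has_more`) under an attribute such as `lines` or `items`.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

# Hypothetical callback: given a parent id and the id of the last item already
# seen, it returns the next Stripe-style page: {"data": [...], "has_more": bool}.
FetchPage = Callable[[str, Optional[str]], Dict[str, Any]]


def read_sub_items(
    parents: Iterable[Dict[str, Any]],
    sub_items_attr: str,
    fetch_page: FetchPage,
) -> Iterator[Dict[str, Any]]:
    """Yield sub-items, reusing the first page embedded in each parent record."""
    for parent in parents:
        page = parent.get(sub_items_attr) or {}
        items = page.get("data", [])
        # The first page costs nothing: it is already part of the parent record.
        yield from items
        # Extra requests happen only when the embedded list says there is more.
        while page.get("has_more") and items:
            page = fetch_page(parent["id"], items[-1]["id"])
            items = page.get("data", [])
            yield from items
```

With this shape, a parent whose embedded list has `has_more` set to false triggers no request at all, which is where the time saving would come from.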

  1. invoice_line_items: 2:20m / 4988 items
    2022-02-15 17:52:30 source > Syncing stream: invoice_line_items
    2022-02-15 17:52:34 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):281 - Records read: 4000
    2022-02-15 17:53:05 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):281 - Records read: 5000
    2022-02-15 17:53:35 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):281 - Records read: 6000
    2022-02-15 17:54:03 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):281 - Records read: 7000
    2022-02-15 17:54:29 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):281 - Records read: 8000
    2022-02-15 17:54:50 source > Read 4988 records from invoice_line_items stream
    2022-02-15 17:54:50 source > Finished syncing SourceStripe

  2. subscription_items: 50s / 1995 items
    2022-02-15 17:58:53 source > Syncing stream: subscription_items
    2022-02-15 17:59:03 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):281 - Records read: 15000
    2022-02-15 17:59:31 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):281 - Records read: 16000
    2022-02-15 17:59:40 source > Read 1995 records from subscription_items stream
    2022-02-15 17:59:40 source > Finished syncing SourceStripe
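
To make the before/after comparison explicit, the stream durations can be recomputed from the first and last timestamps of each log excerpt above. The snippet below is just a small helper for that arithmetic; the names `RUNS`, `elapsed_seconds` and `speedups` are made up for this illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (start, end) timestamps copied from the "Syncing stream" and
# "Read ... records" lines of the log excerpts in this description.
RUNS = {
    "invoice_line_items": {
        "before": ("2022-02-10 16:23:40", "2022-02-10 16:54:23"),
        "after": ("2022-02-15 17:52:30", "2022-02-15 17:54:50"),
    },
    "subscription_items": {
        "before": ("2022-02-10 16:58:30", "2022-02-10 17:08:44"),
        "after": ("2022-02-15 17:58:53", "2022-02-15 17:59:40"),
    },
}


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


def speedups(runs):
    """Map each stream to (seconds before, seconds after, before / after)."""
    result = {}
    for stream, timings in runs.items():
        before = elapsed_seconds(*timings["before"])
        after = elapsed_seconds(*timings["after"])
        result[stream] = (before, after, before / after)
    return result


for stream, (before, after, ratio) in speedups(RUNS).items():
    print(f"{stream}: {before:.0f}s -> {after:.0f}s ({ratio:.1f}x faster)")
```

By these timestamps both streams come out roughly 13 times faster than before.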

Results of manual tests:
Last: new version
Previous: Old version
image

Recommended reading order

  1. x.java
  2. y.python

🚨 User Impact 🚨

Are there any breaking changes? What is the end result perceived by the user? If yes, please merge this PR with the 🚨🚨 emoji so changelog authors can further highlight this if needed.

Pre-merge Checklist

Expand the relevant checklist and delete the others.

New Connector

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/SUMMARY.md
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the connector is published, connector added to connector index as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here
Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the new connector version is published, connector version bumped in the seed directory as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here
Connector Generator
  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • If adding a new generator, add it to the list of scaffold modules being tested
  • The generator test modules (all connectors with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates then checking in your changes
  • Documentation which references the generator is updated as needed

@CLAassistant

CLAassistant commented Feb 15, 2022

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the area/connectors Connector related issues label Feb 15, 2022
@codecov

codecov bot commented Feb 15, 2022

Codecov Report

❗ No coverage uploaded for pull request base (master@342840d). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head c41d934 differs from pull request most recent head bfe9e81. Consider uploading reports for the commit bfe9e81 to get more accurate results

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #10359   +/-   ##
=========================================
  Coverage          ?   70.15%           
=========================================
  Files             ?        3           
  Lines             ?      258           
  Branches          ?        0           
=========================================
  Hits              ?      181           
  Misses            ?       77           
  Partials          ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 342840d...bfe9e81. Read the comment docs.

@midavadim
Contributor Author

midavadim commented Feb 15, 2022

/test connector=connectors/source-stripe

🕑 connectors/source-stripe https://github.com/airbytehq/airbyte/actions/runs/1849046238
❌ connectors/source-stripe https://github.com/airbytehq/airbyte/actions/runs/1849046238
🐛 https://gradle.com/s/np7fmu4qgrcgq

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets February 15, 2022 19:44 Inactive
@midavadim midavadim temporarily deployed to more-secrets February 16, 2022 11:00 Inactive
@midavadim midavadim temporarily deployed to more-secrets February 16, 2022 11:00 Inactive
@midavadim
Contributor Author

midavadim commented Feb 16, 2022

/test connector=connectors/source-stripe

🕑 connectors/source-stripe https://github.com/airbytehq/airbyte/actions/runs/1852384652
✅ connectors/source-stripe https://github.com/airbytehq/airbyte/actions/runs/1852384652
Python tests coverage:

Name                                                 Stmts   Miss  Cover
------------------------------------------------------------------------
source_acceptance_test/__init__.py                       2      0   100%
source_acceptance_test/base.py                          10      4    60%
source_acceptance_test/config.py                        74      6    92%
source_acceptance_test/tests/__init__.py                 4      0   100%
source_acceptance_test/tests/test_core.py              275    106    61%
source_acceptance_test/tests/test_full_refresh.py       52      2    96%
source_acceptance_test/tests/test_incremental.py        69     38    45%
source_acceptance_test/utils/__init__.py                 6      0   100%
source_acceptance_test/utils/asserts.py                 37      2    95%
source_acceptance_test/utils/common.py                  70     17    76%
source_acceptance_test/utils/compare.py                 62     23    63%
source_acceptance_test/utils/connector_runner.py       110     48    56%
source_acceptance_test/utils/json_schema_helper.py     105     13    88%
------------------------------------------------------------------------
TOTAL                                                  876    259    70%
Name                        Stmts   Miss  Cover
-----------------------------------------------
source_stripe/__init__.py       2      0   100%
source_stripe/source.py        22     11    50%
source_stripe/streams.py      237     90    62%
-----------------------------------------------
TOTAL                         261    101    61%

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets February 16, 2022 11:04 Inactive
Collaborator

@bazarnov bazarnov left a comment

Not a big deal, but I think we can optimise the code even more here by reusing some parts. Please read the comments below.

Contributor

@antixar antixar left a comment

Loading of the substreams was implemented without the normal iterable logic

@midavadim midavadim temporarily deployed to more-secrets February 18, 2022 18:54 Inactive
@midavadim midavadim temporarily deployed to more-secrets February 18, 2022 18:54 Inactive
Contributor

@antixar antixar left a comment

Could you add a simple integration test for this new logic? For example, we can compare the records from two flows:

  1. your new logic
  2. the same read with the lines property mocked as an empty array.
    Both lists should be the same.

@bazarnov
Collaborator

bazarnov commented Feb 21, 2022

/test connector=connectors/source-stripe

🕑 connectors/source-stripe https://github.com/airbytehq/airbyte/actions/runs/1876462955
❌ connectors/source-stripe https://github.com/airbytehq/airbyte/actions/runs/1876462955
🐛 https://gradle.com/s/blbpleyvjzska

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets February 21, 2022 13:40 Inactive
@midavadim
Contributor Author

midavadim commented Feb 23, 2022

/test connector=connectors/source-stripe

🕑 connectors/source-stripe https://github.com/airbytehq/airbyte/actions/runs/1886541794
✅ connectors/source-stripe https://github.com/airbytehq/airbyte/actions/runs/1886541794
Python tests coverage:

Name                                                 Stmts   Miss  Cover
------------------------------------------------------------------------
source_acceptance_test/utils/__init__.py                 6      0   100%
source_acceptance_test/tests/__init__.py                 4      0   100%
source_acceptance_test/__init__.py                       2      0   100%
source_acceptance_test/tests/test_full_refresh.py       52      2    96%
source_acceptance_test/utils/asserts.py                 37      2    95%
source_acceptance_test/config.py                        74      6    92%
source_acceptance_test/utils/json_schema_helper.py     105     13    88%
source_acceptance_test/utils/common.py                  70     17    76%
source_acceptance_test/utils/compare.py                 62     23    63%
source_acceptance_test/tests/test_core.py              275    106    61%
source_acceptance_test/base.py                          10      4    60%
source_acceptance_test/utils/connector_runner.py       110     48    56%
source_acceptance_test/tests/test_incremental.py        69     38    45%
------------------------------------------------------------------------
TOTAL                                                  876    259    70%
Name                        Stmts   Miss  Cover
-----------------------------------------------
source_stripe/__init__.py       2      0   100%
source_stripe/streams.py      235     66    72%
source_stripe/source.py        22     11    50%
-----------------------------------------------
TOTAL                         259     77    70%

@midavadim midavadim temporarily deployed to more-secrets February 23, 2022 09:53 Inactive
@midavadim midavadim temporarily deployed to more-secrets February 23, 2022 09:53 Inactive
@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets February 23, 2022 09:55 Inactive
@midavadim
Contributor Author

Could you add a simple integration test for this new logic? For example we can compare records for 2 flows:

  1. your new logic
  2. mock the lines property by an empty array.
    Both lists should be same

  1. It is not possible to write such a test. The whole idea of this update is that when 'lines' is empty, we should not run any additional requests.
  2. I will try to add a unit test for the updated read_records method, with the other methods mocked.
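
A unit test along those lines might look roughly like the sketch below. It does not use the real source_stripe classes: `FakeSubStream` and `request_next_page` are hypothetical stand-ins, and only `unittest.mock.MagicMock` is a real library call. The point is simply to assert that nothing is requested when the parent record already carries all of its items.

```python
from unittest.mock import MagicMock


class FakeSubStream:
    """Hypothetical stand-in for a substream; not the real connector class."""

    sub_items_attr = "lines"

    def __init__(self, parent_records, request_next_page):
        self.parent_records = parent_records
        # In a real stream this would be an HTTP call; here it is injected so
        # that the test can replace it with a mock.
        self.request_next_page = request_next_page

    def read_records(self):
        for parent in self.parent_records:
            page = parent.get(self.sub_items_attr) or {}
            yield from page.get("data", [])
            if page.get("has_more"):
                yield from self.request_next_page(parent["id"])


def test_no_request_when_items_are_fully_embedded():
    request_next_page = MagicMock(return_value=[])
    parents = [
        {"id": "in_1", "lines": {"data": [{"id": "li_1"}], "has_more": False}},
        {"id": "in_2", "lines": {"data": [], "has_more": False}},
    ]
    records = list(FakeSubStream(parents, request_next_page).read_records())
    assert records == [{"id": "li_1"}]
    request_next_page.assert_not_called()


def test_request_only_for_parents_with_more_items():
    request_next_page = MagicMock(return_value=[{"id": "li_2"}])
    parents = [{"id": "in_1", "lines": {"data": [{"id": "li_1"}], "has_more": True}}]
    records = list(FakeSubStream(parents, request_next_page).read_records())
    assert records == [{"id": "li_1"}, {"id": "li_2"}]
    request_next_page.assert_called_once_with("in_1")
```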


# filter out 'bank_account' source items only
if self.filter:
    items = [i for i in items if i.get(self.filter["attr"]) == self.filter["value"]]
Contributor

@ChristopheDuong ChristopheDuong Mar 1, 2022

What is this bank_account filter? Where does it come from?

It seems like the comment may not be in sync with the code, right?
(Is it only used in customers.bank_account?)

Can we add comments on what the generic filter is for? I am guessing bank_account is one example of such usage?
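
For illustration, one possible reading of the generic filter is shown below: a stream declares an attribute name and an expected value, and only sub-items matching that pair are kept. The `apply_filter` helper and the example filter value `{"attr": "object", "value": "bank_account"}` are assumptions made for this sketch, not taken from the diff.

```python
from typing import Any, Dict, List, Optional


def apply_filter(
    items: List[Dict[str, Any]],
    item_filter: Optional[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Keep only the items whose `attr` equals `value`; no filter keeps everything."""
    if not item_filter:
        return items
    return [i for i in items if i.get(item_filter["attr"]) == item_filter["value"]]


# Assumed example: a customer's sources can mix cards and bank accounts,
# and a bank-accounts stream would keep only the latter.
sources = [
    {"id": "card_1", "object": "card"},
    {"id": "ba_1", "object": "bank_account"},
]
bank_accounts = apply_filter(sources, {"attr": "object", "value": "bank_account"})
```

Under that reading, bank_account would be just one use of the mechanism, and a stream without a filter would pass every item through unchanged.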

@midavadim
Contributor Author

@ChristopheDuong I saw that you approved the review, but you also left a few comments, and I have replied to them. Could you please let me know whether I can merge this PR into main?

@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Mar 10, 2022
@midavadim midavadim temporarily deployed to more-secrets March 10, 2022 15:39 Inactive
@midavadim midavadim temporarily deployed to more-secrets March 10, 2022 15:39 Inactive
@midavadim
Contributor Author

midavadim commented Mar 10, 2022

/publish connector=connectors/source-stripe

🕑 connectors/source-stripe https://github.com/airbytehq/airbyte/actions/runs/1964024118
✅ connectors/source-stripe https://github.com/airbytehq/airbyte/actions/runs/1964024118

@midavadim midavadim temporarily deployed to more-secrets March 10, 2022 16:06 Inactive
@midavadim midavadim temporarily deployed to more-secrets March 10, 2022 16:06 Inactive
@midavadim midavadim requested a review from antixar March 10, 2022 16:08
@midavadim midavadim dismissed antixar’s stale review March 10, 2022 16:14

Maksim is not able to finish review

@midavadim midavadim merged commit a1a4bbc into master Mar 10, 2022
@midavadim midavadim deleted the midavadim/9404-stripe-performance-improvement branch March 10, 2022 16:14
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation
Development

Successfully merging this pull request may close these issues.

Source Stripe: improve performance Source Stripe: sync freeze with 80k records
6 participants