Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[source-stripe] Customer_Balance_Transaction records are occasionally getting missed #48857

Closed
1 task done
williamkaper opened this issue Dec 9, 2024 · 10 comments · Fixed by #49811
Closed
1 task done

Comments

@williamkaper
Copy link
Contributor

williamkaper commented Dec 9, 2024

Connector Name

source-stripe

Connector Version

5.8.0

What step the error happened?

None

Relevant information

We are seeing CBT's that aren't getting picked up on created/updated. I would have assumed since there is a created event for customer_balance_transaction (doc here), the updated would catch it if the record was "created" after the last sync but with a created value that was before the last sync. https://github.com/williamkaper/airbyte/blob/e1217e8c7e5ce8400dc7a0eb0c25e2426734c[…]e-integrations/connectors/source-stripe/source_stripe/source.py

If that is not the case, then I'm worried that I might need to change this sync so its more like a normal balance transaction where it only has a created date and can leverage a look-back window to pick it up. Is there anyway I could get someone on the team to take a quick peek at this and weigh in?

Relevant log output

Contribute

  • Yes, I want to contribute
@williamkaper
Copy link
Contributor Author

We are running on community edition so I can share a public example, but I was able to verify the CBT did exist when I manually called Stripe's API through python, so I'm fearing it could be an issue with a way I implemented it and it should have been a created only one..

@williamkaper
Copy link
Contributor Author

williamkaper commented Dec 10, 2024

@marcosmarxm @maxi297 I may have had an epiphany on what is going on, but I'd love for you to check my thinking to make sure it pencils out before I open up a PR fix for it.

In PR 46864 I introduced support for customer balance transactions to pull incrementally instead of requiring a full resync each time. The previous implementation only tried pulling records when a customer's balance was zero, but that missed updated where offsetting CBT transactions occurred raising and then lowering the balance. It was also an intensive pull when a Stripe customer had a lot of customer records because unfortunately, Stripe doesn't offer a list all style operation to get all customer balance transactions for all customers. The API for customer balance transactions needs to be called per customer. (This itself is its own pain if you have millions of customers in your Stripe database due to Stripe not being willing to add a sensible batch retrieve method for all customers).

When I coded this solution, I assumed the way UpdatedCursorIncrementalStripeSubStream would work is it would call it once per customer to pull any incremental updates on the event_type customer_cash_balance_transaction. However, I'm noticing for the customers that I'm seeing CBTs not getting captured, they have never been updated. That is making me wonder if CBT's are only getting checked if the customer first has been incrementally updated.

On top of that, I think the event I used for CBT is wrong and applies to cash balance transactions (which are different than customer balance transactions). I don't see any direct events for CBT.

Before I revert this PR or try to fix it, I want to make sure I'm not fixing one thing and breaking another. Here was the old code in streams.py that was full syncing everything for CBT but only when there was an invoice and the balance was zero (balance zero was a bug that dropped records that created and offsetted between airbyte syncs). streams.py.

Some ideas:

  1. Revert and remove the balance=0 check. It is the most straight forward but it will resync any customer's data every time for all their balance transactions, which can be a lot.
  2. Change the implementation from UpdatedCursorIncrementalStripeSubStream to ParentIncrementalStripeSubStream. This may have the same issue the current implementation has in regards to customers not updating not triggering the substream to pick up, but it would allow me to change the cursor field to created and drop the events.
  3. A new hybrid createdcursorincrementalstripestream that does what the old logic did in checkin catalogs #1, but only incrementally based on the created cursor (not sure how to even start on this one).

I'm leaning toward #1, but I want to make sure I understand #2 correctly.

@williamkaper
Copy link
Contributor Author

williamkaper commented Dec 10, 2024

I updated above to more accurately reflect my current research/troubleshooting. I've also reached out to Stripe to verify there isn't another event we can key on help facilitate a truly incremental sync: Image

@marcosmarxm
Copy link
Member

@maxi297 do you mind taking a look into this issue reported by William?

@williamkaper
Copy link
Contributor Author

williamkaper commented Dec 11, 2024

This is my first attempt at a fix - 5395a98

@maxi297
Copy link
Contributor

maxi297 commented Dec 11, 2024

Just to share my understanding:

  • customer_cash_balance_transaction events return a customer_cash_balance_transaction objects
    • Those objects are different than customer_balance_transaction
    • They can be retrieved through v1/customers/<customer id>/cash_balance_transactions and not through v1/customers/<customer id>/balance_transactions
  • I don't see any events for customer_balance_transaction
  • Based on this documentation, the customer balance transaction is immutable so we can probably retrieve it filtering on created[gte]: Because the credit balance is computed from a ledger — an immutable list of debit and credit transactions — it provides an audit trail of transactions for the customer. These [Customer Balance Transactions](https://docs.stripe.com/api/customer_balance_transactions/object) can refer to the object related to the adjustment

The change you are proposing would change it so that we do the following in incremental mode:

state = start_date
created_query_param = start_date
for each customer_id:
  for each transaction get customer_balance_transactions where created > created_query_param:
    emit transaction
    state = max(state, transaction.created)

That change makes sense to me in the short run but it seems like we could be at risk of losing transactions on the next incremental sync. The example would be if we have the following set of events:

  • T0: Our connector does a request for customer ID customer_id1
  • T1: A transaction is added for customer_id1
  • T2: A transaction is added for customer_id2
  • T3: Our connector does a request for customer ID customer_id2 and a transaction with created = T2 so we update the state to T2

In the case above on the next sync, we will query only for transactions where created > T2 which means that the transaction that happened at T1 for customer_id1 will never be emitted. I assume:

  • All ParentIncrementalStripeSubStream streams can be affected by this issue
  • Either the likelihood of such dataloss is low or we don't have enough users to have this bug being reported to us

Does that reasoning make sense to you? I've created an issue for this problem. In the meanwhile, I think your change makes sense. If you think you might be affected by this dataloss, I suggest that you run things in full_refresh mode for now. We will refine this issue with the team next Tuesday and see what is the priority and how we will tackle this.

@williamkaper
Copy link
Contributor Author

williamkaper commented Dec 11, 2024

@maxi297 That makes sense. I agree, with this change, the risk of data loss is low but non-existent. Full Refresh on CBT is problematic, because of Stripe's bullshit API for CBTs where they don't emit an event on created nor do they have a list-all for all customers CBTs, either of which would solve this.

We have a HubiFi customer with a Stripe Account with over 1.5M customer records. Doing a full refresh means we need make 1.5M calls every full refresh. No only does that take a good 6+ hours to complete, but it spikes Airbtyes memory consumption (I'm guessing all the customer records gets loaded into memory and the customers that are processed aren't being dereferenced thus are getting held onto).

If this change works, for now I'm going to run on a custom build of the stripe connector with the commit above cherry-picked. If you guys aren't getting any complaints and aren't going to look at fixing this more probably, I may put this up as a minor PR to clean up my implementation that uses the wrong event. Just let me know how that goes next week.

@maxi297
Copy link
Contributor

maxi297 commented Dec 12, 2024

@williamkaper we will be looking at this change soon-ish. The expectation is that we will be working on moving source-stripe to our no-code framework in the coming weeks in order to simplify maintenance. It's not only for maintenance though. For example, the problem I mentioned above is already solved for free with this. So I think I would process in two steps:

  • We can apply the changes from 5395a98 (very nit is I think we can remove the legacy_cursor but I'm fine with keeping it). Can you put up a PR for that?
  • Our team will be looking that the dataloss issue while doing the migration to no-code in the following weeks

Let me know if you have concerns around this plan and we'll figure these out together 💪

@williamkaper
Copy link
Contributor Author

williamkaper commented Dec 13, 2024 via email

@williamkaper
Copy link
Contributor Author

@williamkaper we will be looking at this change soon-ish. The expectation is that we will be working on moving source-stripe to our no-code framework in the coming weeks in order to simplify maintenance. It's not only for maintenance though. For example, the problem I mentioned above is already solved for free with this. So I think I would process in two steps:

  • We can apply the changes from 5395a98 (very nit is I think we can remove the legacy_cursor but I'm fine with keeping it). Can you put up a PR for that?
  • Our team will be looking that the dataloss issue while doing the migration to no-code in the following weeks

Let me know if you have concerns around this plan and we'll figure these out together 💪

PR up: #49811

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants