Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automated retries #1776

Open
wants to merge 13 commits into
base: master
Choose a base branch
from
Open

Automated retries #1776

wants to merge 13 commits into from

Conversation

ekzyis
Copy link
Member

@ekzyis ekzyis commented Dec 30, 2024

Description

close #1492 based on #1785, #1787

All failed invoices are returned to every client periodically via a new query failedInvoices. To make sure an invoice is only retried by one client at a time, the retryPaidAction mutation makes sure that only one client is able to transition the invoice from FAILED to RETRY_PENDING.

To stop after three payment attempts (= two retries), a new integer column Invoice.paymentAttempt tracks at which payment attempt we are. This number is increased when we retry an invoice and pass newAttempt: true which we do when we retry these fetched failed invoices. When the number is increased, the payment will start from the beginning with all sender and receiver wallets available.

TODO:

  • retry returned failed invoices on client
  • count retries so we can stop returning them after we hit a limit
  • only show failed invoices that have exhausted retries in notifications
  • don't retry invoices that have been manually cancelled?

added userCancel column, see #1785

  • don't return failed invoices to clients that don't have send wallets

client only polls when it has send wallets

  • don't return intermediate failed invoices that will be retried due to sender or receiver fallbacks

added "cancelledAt" < now() - interval '${WALLET_RETRY_AFTER_MS} milliseconds' filter

  • ⚠️ make sure when this is deployed, this will not start retrying old failed invoices ⚠️

added WALLET_RETRY_BEFORE_MS used in this filter:

AND now()::timestamp <@ tsrange(
  "cancelledAt" + interval '${WALLET_RETRY_AFTER_MS} milliseconds',
  "cancelledAt" + interval '${WALLET_RETRY_BEFORE_MS} milliseconds'
)
  • use lock in retry mutation

Additional Context

  • this will still retry invoices that have been created before deployment and failed if they aren't older than a day
  • if a client detaches all wallets after they tried a payment for the first time, it will never show up in notifications as failed since it wasn't retried often enough. I consider this to be an edge case we don't need to worry about.
  • see https://github.com/stackernews/stacker.news/pull/1776/files#r1907791409

Checklist

Are your changes backwards compatible? Please answer below:

yes

On a scale of 1-10 how well and how have you QA'd this change and any features it might affect? Please answer below:

7. Tested automated retries for posting, replies and zapping with this patch:

diff --git a/api/resolvers/wallet.js b/api/resolvers/wallet.js
index 8e6f3f0e..fb9edd2f 100644
--- a/api/resolvers/wallet.js
+++ b/api/resolvers/wallet.js
@@ -469,7 +469,6 @@ const resolvers = {
         SELECT * FROM "Invoice"
         WHERE "userId" = ${me.id}
         AND "actionState" = 'FAILED'
-        AND "userCancel" = false
         AND "cancelledAt" < now() - ${`${WALLET_RETRY_AFTER_MS} milliseconds`}::interval
         AND "cancelledAt" > now() - ${`${WALLET_RETRY_BEFORE_MS} milliseconds`}::interval
         AND "paymentAttempt" < ${WALLET_MAX_RETRIES}
diff --git a/lib/constants.js b/lib/constants.js
index fe46b71a..09b6981a 100644
--- a/lib/constants.js
+++ b/lib/constants.js
@@ -198,7 +198,7 @@ export const WALLET_CREATE_INVOICE_TIMEOUT_MS = 45_000
 // interval between which failed invoices are returned to a client for automated retries.
 // retry-after must be high enough such that intermediate failed invoices that will already
 // be retried by the client due to sender or receiver fallbacks are not returned to the client.
-export const WALLET_RETRY_AFTER_MS = 60_000 // 1 minute
+export const WALLET_RETRY_AFTER_MS = 5_000 // 1 minute
 export const WALLET_RETRY_BEFORE_MS = 86_400_000 // 24 hours
 // we want to attempt a payment three times so we retry two times
 export const WALLET_MAX_RETRIES = 2
@@ -277,7 +277,7 @@ function RetryHandler ({ children }) {
           console.error('retry poll failed:', err)
         }
         if (!stopped) queuePoll()
-      }, NORMAL_POLL_INTERVAL)
+      }, 1_000)
     }
     const stopPolling = () => {
       stopped = true

Simulate multiple clients with this patch:

diff --git a/wallets/index.js b/wallets/index.js
index 42fb55be..8a9f5f3f 100644
--- a/wallets/index.js
+++ b/wallets/index.js
@@ -255,9 +255,9 @@ function RetryHandler ({ children }) {
         return
       }

-      for (const inv of failedInvoices) {
+      for (const inv of [...failedInvoices, ...failedInvoices]) {
         try {
-          await retry(inv)
+          retry(inv).catch(err => console.error('retry failed:', err))
         } catch (err) {
           // some retries are expected to fail since only one client at a time is allowed to retry
           // these should show up as 'invoice not found' errors

Test p2p zaps with this patch that makes forwards fail and disables the fallback to CCs:

diff --git a/api/paidAction/index.js b/api/paidAction/index.js
index 8feabc30..031e4d7e 100644
--- a/api/paidAction/index.js
+++ b/api/paidAction/index.js
@@ -355,6 +355,7 @@ export async function retryPaidAction (actionType, args, incomingContext) {
       invoiceArgs = { bolt11, wrappedBolt11, wallet, maxFee }
     } catch (err) {
       console.log('failed to retry wrapped invoice, falling back to SN:', err)
+      throw err
     }
   }
diff --git a/worker/paidAction.js b/worker/paidAction.js
index 9b3ecb5a..b9172ebf 100644
--- a/worker/paidAction.js
+++ b/worker/paidAction.js
@@ -268,7 +268,7 @@ export async function paidActionForwarding ({ data: { invoiceId, ...args }, mode
     payViaPaymentRequest({
       lnd,
       request: bolt11,
-      max_fee_mtokens: String(maxFeeMsats),
+      max_fee_mtokens: String(0),
       pathfinding_timeout: LND_PATHFINDING_TIMEOUT_MS,
       confidence: LND_PATHFINDING_TIME_PREF_PPM,
       max_timeout_height: maxTimeoutHeight

For frontend changes: Tested on mobile, light and dark mode? Please answer below:

n/a

Did you introduce any new environment variables? If so, call them out explicitly here:

no

@ekzyis ekzyis added feature new product features that weren't there before wallets labels Dec 30, 2024
@ekzyis ekzyis marked this pull request as draft December 30, 2024 03:28
@ekzyis ekzyis force-pushed the automated-retries branch 4 times, most recently from eb65eb5 to 7c875d1 Compare January 1, 2025 18:02
@ekzyis ekzyis force-pushed the automated-retries branch from 7c875d1 to c0157fc Compare January 1, 2025 18:28
@ekzyis ekzyis force-pushed the automated-retries branch 10 times, most recently from 8757074 to 09cb758 Compare January 8, 2025 18:44
@ekzyis ekzyis marked this pull request as ready for review January 8, 2025 19:54
@ekzyis ekzyis requested a review from huumn January 8, 2025 19:54
api/resolvers/wallet.js Outdated Show resolved Hide resolved
wallets/index.js Outdated Show resolved Hide resolved
wallets/index.js Outdated Show resolved Hide resolved
Comment on lines +209 to +211
<RetryHandler>
{children}
</RetryHandler>
Copy link
Member Author

@ekzyis ekzyis Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RetryHandler isn't a context provider, but I think we should use more providers inside context providers nonetheless. There are too many in pages/_app.js.

For example, we could put all providers for the carousel (price, block height, chain fee) into the same component.

@ekzyis ekzyis force-pushed the automated-retries branch from f79f837 to ccbde0d Compare January 9, 2025 01:22
Comment on lines 490 to 494
SELECT
'unlockInvoice',
jsonb_build_object('id', id),
now() + interval '10 minutes',
now() + interval '15 minutes'
Copy link
Member

@huumn huumn Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a cool construct.

  1. Can you explain your reasoning behind the times for unlocking?
  2. As to why it needs the lock, is this correct: it only exists to prevent races creating the invoice (either wrapped or direct to SN) that the race winner will pay?
  3. If I have (2) correct, did you consider skip locking in retryPaidAction in a separate tx0, before the invoice was created, rather than when it's fetched? Then, it'd be implicitly unlocked in tx1, where the invoice is set to RETRYING? I guess it's possible tx1 fails, and doesn't unlock by setting to RETRYING, but maybe that can be fixed by having tx0 set an expiring lock like lockedAt is null OR lockedAt < now() - interval '2 minutes'
    • I'm just trying to brainstorm ways to avoid the pgboss job (async stuff like this tends to be the most confusing for me) and shorten the time between retries and the notification for retrying

Copy link
Member Author

@ekzyis ekzyis Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain your reasoning behind the times for unlocking?

I didn't think too much about when exactly it should unlock. It should just be high enough such that we can reasonably assume that the client who received and locked this invoice is not going to retry this invoice. I think we can probably safely decrease the unlock time 10 minutes to 1 minute.

As to why it needs the lock, is this correct: it only exists to prevent races creating the invoice (either wrapped or direct to SN) that the race winner will pay?

Mhh, yes, by making sure we only hand out invoices to one client at a time to avoid any such races.

I guess it's possible tx1 fails, and doesn't unlock by setting to RETRYING, but maybe that can be fixed by having tx0 set an expiring lock like lockedAt is null OR lockedAt < now() - interval '2 minutes'

Yeah, an expiring lock would be the alternative to the async stuff.

update: changed it to use locks that expire after one minute in 15c799d

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed you didn't address (3) exactly. I might have been confusing.

I don't think we should lock on query - we should lock in retryPaidAction. ie multiple clients can call retryPaidAction, but only one will successfully get an invoice to retry.

Is that not possible or is the current way better for some reason?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, forgot to reply

I just noticed you didn't address (3) exactly. I might have been confusing.

I didn't address (3) exactly, only the async stuff because of

I'm just trying to brainstorm ways to avoid the pgboss job

so I thought that was the main issue and didn't elaborate on the "lock location"

I don't think we should lock on query - we should lock in retryPaidAction. ie multiple clients can call retryPaidAction, but only one will successfully get an invoice to retry.

Is that not possible or is the current way better for some reason?

I am also using the lock set on query to know if the payment attempt counter should be increased in the retry mutation.

I could increase the payment attempt counter on the client and the client includes it in the retry mutation but letting the client pick that number makes this a little harder to think about for example wrt trusting inputs 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, anytime an invoice is transitioned from FAILED to RETRYING, we want this incremented in the new invoice record, right?

No, during sender and receiver fallbacks, we are also retrying but we want to keep the payment attempt counter the same so we can filter receiver wallets based on that. When we retry a locked invoice, we increment the counter to make all sender/receiver wallets available again and start a "new chain of invoices" where all have the same payment attempt counter.

Anyway, if you're certain that what I'm asking makes that impossible, then I'll accept it is until I've taken a closer look.

Changing the lock+payment attempt counter increment logic is not impossible but I think it's also good enough as it is now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I don't see why splitting this up into separate requests makes this possible in way that separate txs in retryPaidAction wouldn't. I'm mostly suggesting the logic/counting stays the same, but we can get rid of the lock on query by locking before the retry, and if we can't acquire the lock because another client retried it already, returning null/error.

Anyway, I haven't read the code so I'm probably missing something subtle.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mostly suggesting the logic/counting stays the same, but we can get rid of the lock on query by locking before the retry

Maybe I am actually missing something, but I can't tell from this description how you want to know during retry if you should increment the counter if you lock every invoice before retry. If the logic stays the same, every retry would then increment the counter which would break receiver fallbacks.

So basically, the lock on query allows us to distinguish "normal retries" (= same payment counter) from "full retries" (= increment payment counter).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without solving this myself, assuming it can be solved, the easiest way for you to understand might be to ask a riddle: how would you do exactly what you do now without acquiring the lock in the query by acquiring the lock only during the mutation? hint: you can run multiple txs, and logic in between them, in the same mutation.

Copy link
Member Author

@ekzyis ekzyis Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created an intermediate retry state RETRY_PENDING in 89a6b42 to lock during retry:

// make sure only one client at a time can retry by immediately transitioning to an intermediate state
const [invoice] = await models.$queryRaw`
    UPDATE "Invoice"
    SET "actionState" = 'RETRY_PENDING'
    WHERE id in (
      SELECT id FROM "Invoice"
      WHERE id = ${invoiceId} AND "userId" = ${me.id} AND "actionState" = 'FAILED'
      FOR UPDATE
    )
    RETURNING *`
if (!invoice) {
  throw new Error('Invoice not found')
}

This intermediate state avoids that concurrent retries request invoices from attached wallets even though their state transition from FAILED to RETRYING at the end will fail. This makes this a pessimistic lock instead of an optimistic lock though. But I think that's okay because I suspect many cases where the mobile and desktop client of a stacker retries at the same time so we'd request enough unused invoices from attached wallets for no apparent reason that it would be confusing and lead to questions.

I also updated the documentation that FAILED now always transitions to RETRY_PENDING before RETRYING.

@ekzyis ekzyis force-pushed the automated-retries branch from ccbde0d to 15c799d Compare January 13, 2025 10:45
@ekzyis ekzyis force-pushed the automated-retries branch 3 times, most recently from 6651a11 to e4d2570 Compare January 17, 2025 17:34
@ekzyis ekzyis marked this pull request as draft January 21, 2025 08:52
@ekzyis ekzyis force-pushed the automated-retries branch 5 times, most recently from 2d17904 to a604f0b Compare January 22, 2025 11:02
Comment on lines -63 to -67
if (invoice.actionState !== 'FAILED') {
if (invoice.actionState === 'PAID') {
throw new Error('Invoice is already paid')
}
throw new Error(`Invoice is not in failed state: ${invoice.actionState}`)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing these error messages probably makes error messages during bugs worse, for example if retry via notifications fail

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the thing to do in this case is return the pending/paid invoice rather than an error.

We should do the same thing in #1669 - if an attempt is made to vote again, just return the original vote.

It's kind of painting over the bug - which is some kind of clientside cache issue - but it's way better UX.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't this cause the client to retry a paid invoice if the retry mutation doesn't throw?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we'd have to modify the client logic when retrying, or ....

we could just throw a more specific error, catch it, then update the cache appropriately.

Copy link
Member Author

@ekzyis ekzyis Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have put this PR back into review after some Q&A of 9b0bcd3. I think what we discuss here should be done in a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh for sure. I was just riffing on your comment. I keep encountering that poll bug.

@ekzyis ekzyis force-pushed the automated-retries branch 2 times, most recently from 18e7b5a to 89a6b42 Compare January 22, 2025 12:13
@ekzyis ekzyis marked this pull request as ready for review January 23, 2025 11:45
@ekzyis ekzyis force-pushed the automated-retries branch from 89a6b42 to 9b0bcd3 Compare January 23, 2025 11:51
@ekzyis ekzyis force-pushed the automated-retries branch from 9b0bcd3 to 6615d14 Compare January 27, 2025 15:07
...result,
type: paidActionType(invoice.actionType)
}
} catch (err) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if retryPaidAction causes the server to crash and the catch block isn't hit?

Won't the invoice be kept in RETRY_PENDING limbo?

Wouldn't the expiring lock you had previously still work with this retry flow? Or no?

SELECT id FROM "Invoice"
WHERE id = ${invoiceId} AND "userId" = ${me.id} AND "actionState" = 'FAILED'
FOR UPDATE
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use a set query here? Wouldn't this suffice:

          UPDATE "Invoice"
          SET "actionState" = 'RETRY_PENDING'
          WHERE id = ${invoiceId} AND "userId" = ${me.id} AND "actionState" = 'FAILED'
          RETURNING *

id is a unique column so there'd only ever be one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature new product features that weren't there before wallets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

automatic retries of failed paid actions
2 participants