Automated retries #1776

ekzyis · 2024-12-30T03:28:15Z

Description

All failed invoices are returned to every client periodically via a new query failedInvoices. To make sure an invoice is only retried by one client at a time, the retryPaidAction mutation makes sure that only one client is able to transition the invoice from FAILED to RETRY_PENDING.

To stop after three payment attempts (= two retries), a new integer column Invoice.paymentAttempt tracks at which payment attempt we are. This number is increased when we retry an invoice and pass newAttempt: true which we do when we retry these fetched failed invoices. When the number is increased, the payment will start from the beginning with all sender and receiver wallets available.

TODO:

retry returned failed invoices on client
count retries so we can stop returning them after we hit a limit
only show failed invoices that have exhausted retries in notifications
don't retry invoices that have been manually cancelled?

added userCancel column, see #1785

don't return failed invoices to clients that don't have send wallets

client only polls when it has send wallets

don't return intermediate failed invoices that will be retried due to sender or receiver fallbacks

added "cancelledAt" < now() - interval '${WALLET_RETRY_AFTER_MS} milliseconds' filter

⚠️ make sure when this is deployed, this will not start retrying old failed invoices ⚠️

added WALLET_RETRY_BEFORE_MS used in this filter:

AND now()::timestamp <@ tsrange(
  "cancelledAt" + interval '${WALLET_RETRY_AFTER_MS} milliseconds',
  "cancelledAt" + interval '${WALLET_RETRY_BEFORE_MS} milliseconds'
)

use lock in retry mutation

Additional Context

this will still retry invoices that have been created before deployment and failed if they aren't older than a day
if a client detaches all wallets after they tried a payment for the first time, it will never show up in notifications as failed since it wasn't retried often enough. I consider this to be an edge case we don't need to worry about.
~~see https://github.com/stackernews/stacker.news/pull/1776/files#r1907791409~~

Checklist

Are your changes backwards compatible? Please answer below:

yes

On a scale of 1-10 how well and how have you QA'd this change and any features it might affect? Please answer below:

7. Tested automated retries for posting, replies and zapping with this patch:

diff --git a/api/resolvers/wallet.js b/api/resolvers/wallet.js
index 8e6f3f0e..fb9edd2f 100644
--- a/api/resolvers/wallet.js
+++ b/api/resolvers/wallet.js
@@ -469,7 +469,6 @@ const resolvers = {
         SELECT * FROM "Invoice"
         WHERE "userId" = ${me.id}
         AND "actionState" = 'FAILED'
-        AND "userCancel" = false
         AND "cancelledAt" < now() - ${`${WALLET_RETRY_AFTER_MS} milliseconds`}::interval
         AND "cancelledAt" > now() - ${`${WALLET_RETRY_BEFORE_MS} milliseconds`}::interval
         AND "paymentAttempt" < ${WALLET_MAX_RETRIES}
diff --git a/lib/constants.js b/lib/constants.js
index fe46b71a..09b6981a 100644
--- a/lib/constants.js
+++ b/lib/constants.js
@@ -198,7 +198,7 @@ export const WALLET_CREATE_INVOICE_TIMEOUT_MS = 45_000
 // interval between which failed invoices are returned to a client for automated retries.
 // retry-after must be high enough such that intermediate failed invoices that will already
 // be retried by the client due to sender or receiver fallbacks are not returned to the client.
-export const WALLET_RETRY_AFTER_MS = 60_000 // 1 minute
+export const WALLET_RETRY_AFTER_MS = 5_000 // 1 minute
 export const WALLET_RETRY_BEFORE_MS = 86_400_000 // 24 hours
 // we want to attempt a payment three times so we retry two times
 export const WALLET_MAX_RETRIES = 2
@@ -277,7 +277,7 @@ function RetryHandler ({ children }) {
           console.error('retry poll failed:', err)
         }
         if (!stopped) queuePoll()
-      }, NORMAL_POLL_INTERVAL)
+      }, 1_000)
     }
     const stopPolling = () => {
       stopped = true

Simulate multiple clients with this patch:

diff --git a/wallets/index.js b/wallets/index.js
index 42fb55be..8a9f5f3f 100644
--- a/wallets/index.js
+++ b/wallets/index.js
@@ -255,9 +255,9 @@ function RetryHandler ({ children }) {
         return
       }

-      for (const inv of failedInvoices) {
+      for (const inv of [...failedInvoices, ...failedInvoices]) {
         try {
-          await retry(inv)
+          retry(inv).catch(err => console.error('retry failed:', err))
         } catch (err) {
           // some retries are expected to fail since only one client at a time is allowed to retry
           // these should show up as 'invoice not found' errors

Test p2p zaps with this patch that makes forwards fail and disables the fallback to CCs:

diff --git a/api/paidAction/index.js b/api/paidAction/index.js
index 8feabc30..031e4d7e 100644
--- a/api/paidAction/index.js
+++ b/api/paidAction/index.js
@@ -355,6 +355,7 @@ export async function retryPaidAction (actionType, args, incomingContext) {
       invoiceArgs = { bolt11, wrappedBolt11, wallet, maxFee }
     } catch (err) {
       console.log('failed to retry wrapped invoice, falling back to SN:', err)
+      throw err
     }
   }
diff --git a/worker/paidAction.js b/worker/paidAction.js
index 9b3ecb5a..b9172ebf 100644
--- a/worker/paidAction.js
+++ b/worker/paidAction.js
@@ -268,7 +268,7 @@ export async function paidActionForwarding ({ data: { invoiceId, ...args }, mode
     payViaPaymentRequest({
       lnd,
       request: bolt11,
-      max_fee_mtokens: String(maxFeeMsats),
+      max_fee_mtokens: String(0),
       pathfinding_timeout: LND_PATHFINDING_TIMEOUT_MS,
       confidence: LND_PATHFINDING_TIME_PREF_PPM,
       max_timeout_height: maxTimeoutHeight

For frontend changes: Tested on mobile, light and dark mode? Please answer below:

n/a

Did you introduce any new environment variables? If so, call them out explicitly here:

no

api/resolvers/wallet.js

wallets/index.js

components/use-invoice.js

wallets/index.js

ekzyis · 2025-01-08T20:19:22Z

wallets/index.js

+      <RetryHandler>
+        {children}
+      </RetryHandler>


RetryHandler isn't a context provider, but I think we should use more providers inside context providers nonetheless. There are too many in pages/_app.js.

For example, we could put all providers for the carousel (price, block height, chain fee) into the same component.

huumn · 2025-01-10T00:49:21Z

api/resolvers/wallet.js

+          SELECT
+            'unlockInvoice',
+            jsonb_build_object('id', id),
+            now() + interval '10 minutes',
+            now() + interval '15 minutes'


This is a cool construct.

Can you explain your reasoning behind the times for unlocking?

As to why it needs the lock, is this correct: it only exists to prevent races creating the invoice (either wrapped or direct to SN) that the race winner will pay?

If I have (2) correct, did you consider skip locking in retryPaidAction in a separate tx0, before the invoice was created, rather than when it's fetched? Then, it'd be implicitly unlocked in tx1, where the invoice is set to RETRYING? I guess it's possible tx1 fails, and doesn't unlock by setting to RETRYING, but maybe that can be fixed by having tx0 set an expiring lock like lockedAt is null OR lockedAt < now() - interval '2 minutes'

I'm just trying to brainstorm ways to avoid the pgboss job (async stuff like this tends to be the most confusing for me) and shorten the time between retries and the notification for retrying

Can you explain your reasoning behind the times for unlocking?

I didn't think too much about when exactly it should unlock. It should just be high enough such that we can reasonably assume that the client who received and locked this invoice is not going to retry this invoice. I think we can probably safely decrease the unlock time 10 minutes to 1 minute.

As to why it needs the lock, is this correct: it only exists to prevent races creating the invoice (either wrapped or direct to SN) that the race winner will pay?

Mhh, yes, by making sure we only hand out invoices to one client at a time to avoid any such races.

I guess it's possible tx1 fails, and doesn't unlock by setting to RETRYING, but maybe that can be fixed by having tx0 set an expiring lock like lockedAt is null OR lockedAt < now() - interval '2 minutes'

Yeah, an expiring lock would be the alternative to the async stuff.

update: changed it to use locks that expire after one minute in 15c799d

I just noticed you didn't address (3) exactly. I might have been confusing.

I don't think we should lock on query - we should lock in retryPaidAction. ie multiple clients can call retryPaidAction, but only one will successfully get an invoice to retry.

Is that not possible or is the current way better for some reason?

whoops, forgot to reply

I just noticed you didn't address (3) exactly. I might have been confusing.

I didn't address (3) exactly, only the async stuff because of

I'm just trying to brainstorm ways to avoid the pgboss job

so I thought that was the main issue and didn't elaborate on the "lock location"

I don't think we should lock on query - we should lock in retryPaidAction. ie multiple clients can call retryPaidAction, but only one will successfully get an invoice to retry.

Is that not possible or is the current way better for some reason?

I am also using the lock set on query to know if the payment attempt counter should be increased in the retry mutation.

I could increase the payment attempt counter on the client and the client includes it in the retry mutation but letting the client pick that number makes this a little harder to think about for example wrt trusting inputs 🤔

Basically, anytime an invoice is transitioned from FAILED to RETRYING, we want this incremented in the new invoice record, right?

No, during sender and receiver fallbacks, we are also retrying but we want to keep the payment attempt counter the same so we can filter receiver wallets based on that. When we retry a locked invoice, we increment the counter to make all sender/receiver wallets available again and start a "new chain of invoices" where all have the same payment attempt counter.

Anyway, if you're certain that what I'm asking makes that impossible, then I'll accept it is until I've taken a closer look.

Changing the lock+payment attempt counter increment logic is not impossible but I think it's also good enough as it is now

I guess I don't see why splitting this up into separate requests makes this possible in way that separate txs in retryPaidAction wouldn't. I'm mostly suggesting the logic/counting stays the same, but we can get rid of the lock on query by locking before the retry, and if we can't acquire the lock because another client retried it already, returning null/error.

Anyway, I haven't read the code so I'm probably missing something subtle.

I'm mostly suggesting the logic/counting stays the same, but we can get rid of the lock on query by locking before the retry

Maybe I am actually missing something, but I can't tell from this description how you want to know during retry if you should increment the counter if you lock every invoice before retry. If the logic stays the same, every retry would then increment the counter which would break receiver fallbacks.

So basically, the lock on query allows us to distinguish "normal retries" (= same payment counter) from "full retries" (= increment payment counter).

Without solving this myself, assuming it can be solved, the easiest way for you to understand might be to ask a riddle: how would you do exactly what you do now without acquiring the lock in the query by acquiring the lock only during the mutation? hint: you can run multiple txs, and logic in between them, in the same mutation.

I created an intermediate retry state RETRY_PENDING in 89a6b42 to lock during retry:

// make sure only one client at a time can retry by immediately transitioning to an intermediate state const [invoice] = await models.$queryRaw` UPDATE "Invoice" SET "actionState" = 'RETRY_PENDING' WHERE id in ( SELECT id FROM "Invoice" WHERE id = ${invoiceId} AND "userId" = ${me.id} AND "actionState" = 'FAILED' FOR UPDATE ) RETURNING *` if (!invoice) { throw new Error('Invoice not found') }

This intermediate state avoids that concurrent retries request invoices from attached wallets even though their state transition from FAILED to RETRYING at the end will fail. This makes this a pessimistic lock instead of an optimistic lock though. But I think that's okay because I suspect many cases where the mobile and desktop client of a stacker retries at the same time so we'd request enough unused invoices from attached wallets for no apparent reason that it would be confusing and lead to questions.

I also updated the documentation that FAILED now always transitions to RETRY_PENDING before RETRYING.

ekzyis · 2025-01-22T11:10:26Z

api/resolvers/paidAction.js

-      if (invoice.actionState !== 'FAILED') {
-        if (invoice.actionState === 'PAID') {
-          throw new Error('Invoice is already paid')
-        }
-        throw new Error(`Invoice is not in failed state: ${invoice.actionState}`)


removing these error messages probably makes error messages during bugs worse, for example if retry via notifications fail

I think the thing to do in this case is return the pending/paid invoice rather than an error.

We should do the same thing in #1669 - if an attempt is made to vote again, just return the original vote.

It's kind of painting over the bug - which is some kind of clientside cache issue - but it's way better UX.

wouldn't this cause the client to retry a paid invoice if the retry mutation doesn't throw?

Yeah, we'd have to modify the client logic when retrying, or ....

we could just throw a more specific error, catch it, then update the cache appropriately.

I have put this PR back into review after some Q&A of 9b0bcd3. I think what we discuss here should be done in a separate PR.

Oh for sure. I was just riffing on your comment. I keep encountering that poll bug.

huumn · 2025-01-28T19:24:43Z

api/resolvers/paidAction.js

+          ...result,
+          type: paidActionType(invoice.actionType)
+        }
+      } catch (err) {


What happens if retryPaidAction causes the server to crash and the catch block isn't hit?

Won't the invoice be kept in RETRY_PENDING limbo?

Wouldn't the expiring lock you had previously still work with this retry flow? Or no?

huumn · 2025-01-28T19:49:56Z

api/resolvers/paidAction.js

+            SELECT id FROM "Invoice"
+            WHERE id = ${invoiceId} AND "userId" = ${me.id} AND "actionState" = 'FAILED'
+            FOR UPDATE
+          )


Why use a set query here? Wouldn't this suffice:

UPDATE "Invoice" SET "actionState" = 'RETRY_PENDING' WHERE id = ${invoiceId} AND "userId" = ${me.id} AND "actionState" = 'FAILED' RETURNING *

id is a unique column so there'd only ever be one.

ekzyis added feature new product features that weren't there before wallets labels Dec 30, 2024

ekzyis marked this pull request as draft December 30, 2024 03:28

ekzyis force-pushed the automated-retries branch 4 times, most recently from eb65eb5 to 7c875d1 Compare January 1, 2025 18:02

ekzyis mentioned this pull request Jan 1, 2025

Distinguish invoices cancelled by user #1785

Merged

ekzyis force-pushed the automated-retries branch from 7c875d1 to c0157fc Compare January 1, 2025 18:28

ekzyis mentioned this pull request Jan 2, 2025

Allow cancel of own invoices without hmac #1787

Merged

1 task

ekzyis force-pushed the automated-retries branch 10 times, most recently from 8757074 to 09cb758 Compare January 8, 2025 18:44

ekzyis marked this pull request as ready for review January 8, 2025 19:54

ekzyis requested a review from huumn January 8, 2025 19:54

huumn reviewed Jan 8, 2025

View reviewed changes

api/resolvers/wallet.js Outdated Show resolved Hide resolved

ekzyis commented Jan 8, 2025

View reviewed changes

wallets/index.js Outdated Show resolved Hide resolved

ekzyis commented Jan 8, 2025

View reviewed changes

components/use-invoice.js Show resolved Hide resolved

ekzyis commented Jan 8, 2025

View reviewed changes

wallets/index.js Outdated Show resolved Hide resolved

ekzyis commented Jan 8, 2025

View reviewed changes

ekzyis force-pushed the automated-retries branch from f79f837 to ccbde0d Compare January 9, 2025 01:22

huumn reviewed Jan 10, 2025

View reviewed changes

ekzyis force-pushed the automated-retries branch from ccbde0d to 15c799d Compare January 13, 2025 10:45

ekzyis force-pushed the automated-retries branch 3 times, most recently from 6651a11 to e4d2570 Compare January 17, 2025 17:34

ekzyis marked this pull request as draft January 21, 2025 08:52

ekzyis force-pushed the automated-retries branch 5 times, most recently from 2d17904 to a604f0b Compare January 22, 2025 11:02

ekzyis commented Jan 22, 2025

View reviewed changes

ekzyis force-pushed the automated-retries branch 2 times, most recently from 18e7b5a to 89a6b42 Compare January 22, 2025 12:13

ekzyis marked this pull request as ready for review January 23, 2025 11:45

ekzyis force-pushed the automated-retries branch from 89a6b42 to 9b0bcd3 Compare January 23, 2025 11:51

ekzyis added 12 commits January 27, 2025 16:07

Poll failed invoices with visibility timeout

19d90d5

Don't return intermediate failed invoices

a441bf0

Don't retry too old invoices

37d3ffd

Retry invoices on client

5a305fd

Only attempt payment 3 times

fcb547b

Fix fallbacks during last retry

86b6638

Rename retry column to paymentAttempt

a690c86

Fix no index used

71b96b9

Resolve TODOs

ffc051e

Use expiring locks

a082cc1

Better comments for constants

27465c1

Acquire lock during retry

6615d14

ekzyis force-pushed the automated-retries branch from 9b0bcd3 to 6615d14 Compare January 27, 2025 15:07

Merge branch 'master' into automated-retries

0da5cad

huumn reviewed Jan 28, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automated retries #1776

Automated retries #1776

ekzyis commented Dec 30, 2024 •

edited

Loading

ekzyis Jan 8, 2025 •

edited

Loading

huumn Jan 10, 2025 •

edited

Loading

ekzyis Jan 13, 2025 •

edited

Loading

huumn Jan 14, 2025

huumn Jan 16, 2025

ekzyis Jan 17, 2025

ekzyis Jan 18, 2025

huumn Jan 18, 2025

ekzyis Jan 18, 2025

huumn Jan 18, 2025

ekzyis Jan 22, 2025 •

edited

Loading

ekzyis Jan 22, 2025

huumn Jan 22, 2025

ekzyis Jan 23, 2025

huumn Jan 23, 2025

ekzyis Jan 23, 2025 •

edited

Loading

huumn Jan 23, 2025

huumn Jan 28, 2025

huumn Jan 28, 2025

Automated retries #1776

Are you sure you want to change the base?

Automated retries #1776

Conversation

ekzyis commented Dec 30, 2024 • edited Loading

Description

Additional Context

Checklist

ekzyis Jan 8, 2025 • edited Loading

Choose a reason for hiding this comment

huumn Jan 10, 2025 • edited Loading

Choose a reason for hiding this comment

ekzyis Jan 13, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ekzyis Jan 22, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ekzyis Jan 23, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ekzyis commented Dec 30, 2024 •

edited

Loading

ekzyis Jan 8, 2025 •

edited

Loading

huumn Jan 10, 2025 •

edited

Loading

ekzyis Jan 13, 2025 •

edited

Loading

ekzyis Jan 22, 2025 •

edited

Loading

ekzyis Jan 23, 2025 •

edited

Loading