Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add timeout retry strategy to outbound payment #1418

Conversation

kafaichoi
Copy link
Contributor

@kafaichoi kafaichoi commented Apr 12, 2022

Closes #1408

This PR allow us to set timeout(time-based) as an alternative to existing count-based approach in outbound payment retry.

I also have some naming suggestion. I can make another PR(to avoid too much changes in this PR) to fix this if you agree that's improvement

  1. InvoicePayment.payment_cache was actually used to store "retry attempts"(so it can be zero) but now it's used to store payment attempt. Should we rename it to payment_attempts_cache

@codecov-commenter
Copy link

codecov-commenter commented Apr 12, 2022

Codecov Report

Merging #1418 (8d99732) into main (d166044) will increase coverage by 0.02%.
The diff coverage is 88.96%.

❗ Current head 8d99732 differs from pull request most recent head ca4099f. Consider uploading reports for the commit ca4099f to get more accurate results

@@            Coverage Diff             @@
##             main    #1418      +/-   ##
==========================================
+ Coverage   90.88%   90.90%   +0.02%     
==========================================
  Files          76       77       +1     
  Lines       42067    42116      +49     
  Branches    42067    42116      +49     
==========================================
+ Hits        38232    38286      +54     
+ Misses       3835     3830       -5     
Impacted Files Coverage Δ
lightning-invoice/src/lib.rs 87.39% <ø> (ø)
lightning/src/routing/scoring.rs 96.19% <ø> (+1.53%) ⬆️
lightning/src/util/time.rs 81.25% <81.25%> (ø)
lightning-invoice/src/payment.rs 92.90% <94.38%> (+0.14%) ⬆️
lightning-background-processor/src/lib.rs 94.37% <100.00%> (ø)
lightning-invoice/src/utils.rs 96.77% <0.00%> (+0.14%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d166044...ca4099f. Read the comment docs.

@jkczyz jkczyz self-requested a review April 12, 2022 15:36
Copy link
Contributor

@jkczyz jkczyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! Generally speaking, I'm in favor of renaming wherever it makes sense. Left some comments and will put a little more thought into naming as the review proceeds.

/// This should be >= 1 as we only insert PaymentAttempts for a PaymentHash if there is at least
/// one attempt.
counts: usize,
latest_attempted_at: SystemTime,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? Looks like it is set but never read.

Copy link
Contributor Author

@kafaichoi kafaichoi Apr 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. It's not used anywhere now. But it will be useful for improving logging? Should I remove it or use it for improving logging msg(shown both count and total time spent)?

https://github.com/lightningdevkit/rust-lightning/pull/1418/files#diff-5165a75ba263a549c7fb9f4b555e449e44b0b680a80a9715a99d11157802a0d7R452

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we can simply store SystemTime::now() plus the duration from MaxDuration (i.e., when timeout occurs) and do any necessary math when logging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, I would remove it if it's not being used.

.or_insert(PaymentAttempts::new());

if ! self.retry_attempts.is_retryable_now(&PaymentAttempts{
counts: attempts.counts -1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure I follow why the count is incremented above and then decremented here.

Copy link
Contributor Author

@kafaichoi kafaichoi Apr 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the original logic in the code is always add one count(no matter if it's retryable here). But since now the retryable logic is encapsulated into a function that takes a PaymentAttempts and if we simply passed the updated one, then that will be incorrect(as we want to test if it can be tested before adding 1 count).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this be accounted for in is_retryable_now? It's adding 1 back as it is.

Copy link
Contributor Author

@kafaichoi kafaichoi Apr 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the logic in retry_payment is

  1. always add 1 to counts in PaymentAttempts
  2. check if it's retryable
  3. the rest of logic

Retry::is_retryable_now take PaymentAttempts struct and use the counts field to determine if it's retryable if retry strategy is count-based. So in step2, if we don't decrement it by 1, it will be incorrect.

Retry::Attempts(max_retry_count) => max_retry_count + 1 > attempts.counts, the + 1 should be correct. For example,
if max_retry_count is 1 and attempts.count is 1, this means current retry count is 0(the first attempt is not retry attempt) so the + 1 here is required to make this function return true in this case.

Comment on lines 186 to 188
counts: usize,
latest_attempted_at: SystemTime,
first_attempted_at: SystemTime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to store both count and time if only one is used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessary but the advantage is we can have more information in error log if retry fail. related comment #1418 (comment)

And if really want to optimize this, I guess it makes sense to merge payment_cache and retry_attempts fields in InvoicePayer into one enum field to avoid impossible state(where retry strategy is count base but we store time vice versa).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... not sure how we'd do that because you need to know the maximum attempts when inserting into the map. Let me know if I misunderstood or if you have a more concrete idea.

We could also use a type parameter implementing a common trait (in a similar manner to the Time trait) such that there is never an invalid state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried but unable to do the optimization of storing only either count and time(based on Retry). and storing both has a pro for displaying user more info in retry error. So I will put this optimization on pause but please let me know if this is high-priority I can give it another try.

/// one attempt.
counts: usize,
latest_attempted_at: SystemTime,
first_attempted_at: SystemTime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that SystemTime is only available with std feature, so this will need to be conditional. See places in this file where #[cfg(feature = "std")] is used.

With that in mind, RetryAttempts::MaxDuration won't be applicable without the feature, too.

/// Note that this is the number of *path* failures, not full payment retries. For multi-path
/// payments, if this is less than the total number of paths, we will never even retry all of the
/// payment's paths.
/// Retry strategy of retrying payment path failures for an [`Invoice`].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/Retry strategy of retrying/Strategy for retrying

Comment on lines 238 to 274
pub enum RetryAttempts {
/// Count Out. Max number of attempts to retry payment.
///
/// Note that this is the number of *path* failures, not full payment retries. For multi-path
/// payments, if this is less than the total number of paths, we will never even retry all of the
/// payment's paths.
MaxCount(usize),
/// Time Out. Max time spent to retry payment
MaxDuration(Duration),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

enum Retry {
    Attempts(usize),
    Timeout(Duration),
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much more concise! Thanks

assert_eq!(*event_handled.borrow(), false);
assert_eq!(*payer.attempts.borrow(), 2);

std::thread::sleep(Duration::from_millis(101));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sleeps in testing tend to be a bit flaky and slow down test execution. Ideally, we can inject the time into the code such that in testing it can be manually advanced. We do that in the scorer code:

struct SinceEpoch(Duration);

We wouldn't need to expose the Time trait in InvoicePayer's public interface since it is an implementation detail and can be conditioned for tests to use SinceEpoch instead of SystemTime. It would be great if we can refactor the time code into a util::time module such that it could be reused here. We could symlink it into the lightning-invoice crate to re-use across crates.

/// This should be >= 1 as we only insert PaymentAttempts for a PaymentHash if there is at least
/// one attempt.
counts: usize,
latest_attempted_at: SystemTime,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we can simply store SystemTime::now() plus the duration from MaxDuration (i.e., when timeout occurs) and do any necessary math when logging.

Comment on lines 186 to 188
counts: usize,
latest_attempted_at: SystemTime,
first_attempted_at: SystemTime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... not sure how we'd do that because you need to know the maximum attempts when inserting into the map. Let me know if I misunderstood or if you have a more concrete idea.

We could also use a type parameter implementing a common trait (in a similar manner to the Time trait) such that there is never an invalid state.

.or_insert(PaymentAttempts::new());

if ! self.retry_attempts.is_retryable_now(&PaymentAttempts{
counts: attempts.counts -1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this be accounted for in is_retryable_now? It's adding 1 back as it is.

} else {
*retry_count += 1;
let payment_attempts = payment_cache.get_mut(&payment_hash).unwrap();
if self.retry_attempts.is_retryable_now(payment_attempts) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please indent with tabs instead of spaces here and throughout.

@kafaichoi kafaichoi marked this pull request as draft April 14, 2022 02:59
@kafaichoi kafaichoi force-pushed the timeout-outbound-paymnet-retires-instead-of-count-out branch from f47a834 to 7faa2e3 Compare April 14, 2022 03:28
@kafaichoi
Copy link
Contributor Author

Thank you for the thorough review. @jkczyz I have changed the PR status to draft and will change it back once it's ready to review again later.

@kafaichoi kafaichoi force-pushed the timeout-outbound-paymnet-retires-instead-of-count-out branch 2 times, most recently from 4187576 to d23b0c2 Compare April 14, 2022 04:18
Comment on lines 250 to 286
fn is_retryable_now(&self, attempts: &PaymentAttempts) -> bool {
match self {
Retry::Attempts(0) => false,
Retry::Attempts(max_retry_count) => max_retry_count + 1 > attempts.counts,
Retry::Timeout(max_duration) => *max_duration >= SystemTime::now().duration_since(attempts.first_attempted_at).unwrap()
}
}
Copy link
Contributor

@maxgiraldo maxgiraldo Apr 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more of a curiosity, but do we only ever care about the current time, or could this be used to ask if something is retry-able at other times?

Suggested change
fn is_retryable_now(&self, attempts: &PaymentAttempts) -> bool {
match self {
Retry::Attempts(0) => false,
Retry::Attempts(max_retry_count) => max_retry_count + 1 > attempts.counts,
Retry::Timeout(max_duration) => *max_duration >= SystemTime::now().duration_since(attempts.first_attempted_at).unwrap()
}
}
fn is_retryable_at(&self, attempts: &PaymentAttempts, at: SystemTime) -> bool {
match self {
Retry::Attempts(0) => false,
Retry::Attempts(max_retry_count) => max_retry_count + 1 > attempts.counts,
Retry::Timeout(max_duration) => *max_duration >= at.duration_since(attempts.first_attempted_at).unwrap()
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea that was my initial reason adding the latest_attempted_at but since this is not needed anywhere yet so I removed it according to other feedback.


/// A state in which time has no meaning.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Eternity;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really rather not make this pub. One (kinda dirty) hack to avoid that while re-using the module in both crates is to symlink the file file in both crates.

Copy link
Contributor Author

@kafaichoi kafaichoi Apr 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving it lightning::util::test_utils?(see last commit 6b241ed)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'd kinda still prefer we keep the trait sealed (which I think requires the symlink hack, but maybe there's a way to use a foreign sealed trait?).

I'm not sure that we have an immediate use-case for users implementing their own Time, and even if we wanted users to be able to cancel payments on no-std we'd probably want to do it via some kind of "abandon payment now" API, not with reentrancy calling user code on a Time trait. Additionally, we can't capture traits with associated types in bindings, so we'd have to make it non-exported there anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. There is only one utils.rs instead of utils/mod.rs under lightning-invoice, So to avoid too much code change. I simply symlink it as time_utils.rs. Please let me know if it's reasonable?

Comment on lines 15 to 17
/// Returns an instance corresponding to the current moment.
fn now() -> Self;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: EOL whitespace here (and a number of similar places in this file).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. I will fix this after other issues resolved(I need to find a way to config my editor to avoid these)

@kafaichoi kafaichoi force-pushed the timeout-outbound-paymnet-retires-instead-of-count-out branch 5 times, most recently from 99d3203 to 6b241ed Compare April 17, 2022 07:06
counts: usize,
#[cfg(feature = "std")]
/// This field is only used when retry is [`Retry::Timeout`] which is only build with feature std
first_attempted_at: T
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I believe this leaves an unused parameter T in no-std builds. We either need to always include the field and let the ConfiguredTime "turn it off" in no-std or we need have two completely separate types for no-std vs std (which is gonna be a lot of code duplication).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then it's probably better to just always include this field(remove the configuration std condition check)?
Btw how can I replicate the error/warning of unused parameter T in no-std builds in local?(everything look fine when I do cargo build for no-std)

@kafaichoi kafaichoi force-pushed the timeout-outbound-paymnet-retires-instead-of-count-out branch 2 times, most recently from 22f4f76 to 29a381c Compare April 18, 2022 08:03
@kafaichoi kafaichoi marked this pull request as ready for review April 18, 2022 08:08
@kafaichoi kafaichoi force-pushed the timeout-outbound-paymnet-retires-instead-of-count-out branch from 29a381c to 5204952 Compare April 18, 2022 10:12
Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I'm a fan of the high-level structure now, I think.

pub type InvoicePayer<P, R, S, L, E> = InvoicePayerUsingTime::<P, R, S, L, E, ConfiguredTime>;

#[cfg(not(feature = "no-std"))]
type ConfiguredTime = std::time::SystemTime;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use Instant not SystemTime - we need a counter not the current wall clock time.

impl Retry {
fn is_retryable_now<T: Time>(&self, attempts: &PaymentAttempts<T>) -> bool {
match self {
Retry::Attempts(0) => false,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a separate branch? If max_retry_count is 0, we'll always return false when counts is 2, which it will be when we first check for a retry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this branch after switching back to store retry_count instead of total attempt count

fn is_retryable_now<T: Time>(&self, attempts: &PaymentAttempts<T>) -> bool {
match self {
Retry::Attempts(0) => false,
Retry::Attempts(max_retry_count) => max_retry_count + 1 > attempts.counts,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just make this >=?

#[derive(Debug, Clone, Copy)]
struct PaymentAttempts<T: Time> {
/// This should be >= 1 as we only insert PaymentAttempts for a PaymentHash if there is at least
/// /// one attempt.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a double /// here.

impl <T: Time>PaymentAttempts<T> {
fn new() -> Self {
PaymentAttempts{
counts: 1,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we insert at 1 here, that's a behavior change from the original code, which always inserted at 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, also see: #1418 (comment)

I'm not sure I follow why the count logic needs to change if you are just adding another strategy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of previous code is the count we store is the number of retry.

For example, in pay_pubkey, it's set as 0 when there is no entry found in payment_cache.
https://github.com/lightningdevkit/rust-lightning/pull/1418/files#diff-5165a75ba263a549c7fb9f4b555e449e44b0b680a80a9715a99d11157802a0d7L330

However, in this PR, I changed it to store number of attempt(which should be number of retry + 1). And we will only want to create and store PaymentAttempts when there is at least one attempt. That's the reason why count is set as 1 in PaymentAttempts::new.

Copy link
Contributor Author

@kafaichoi kafaichoi Apr 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have actually tried to change storing total attempt to storing retry count(like current behaviour). 145bede

It does look cleaner :). Please let me know if you still find it weird that I have to do - 1 here b08f0ea @jkczyz

The biggest reason is I think Retry::is_retryable_now shouldn't be aware that PaymentAttempts passed has just been added 1 retry count. It should simply take PaymentAttempts and determine if it's retryable right now based on the current field. (My apology if my explanation is still confusing)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Looks much better modulo Matt's comment about incrementing after calling is_retryable_now.

@@ -217,7 +268,28 @@ pub trait Router<S: Score> {
/// payments, if this is less than the total number of paths, we will never even retry all of the
/// payment's paths.
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
pub struct RetryAttempts(pub usize);
pub enum Retry {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The enum-level docs probably need updating to no longer only talk about counts.

let payment_attempts = payment_cache.get_mut(&payment_hash).unwrap();
if self.retry.is_retryable_now(payment_attempts) {
payment_attempts.counts += 1;
Ok(self.pay_internal(params, payment_hash, send_payment)?)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need to drop the lock in the retry branch? Looks like you swapped the branch order but left the lock-drop in the other branch now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry totally my mistake

Comment on lines +256 to 269
use util::time::Eternity;
#[cfg(feature = "no-std")]
type ConfiguredTime = Eternity;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make this a single statement so we don't import Eternity when not configured?

Copy link
Contributor Author

@kafaichoi kafaichoi Apr 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emm in line 255. there is also #[cfg(feature = "no-std")] so we will only import Eternity and ConfiguredTime set as Eternity together when no-std feature is set

I am not sure how to do the import and type declaration in one single statement. and it looks like there is cfg-if crate for providing a helpful macro to avoid duplicate cfg here? Is that what u meant?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, never mind. Forgot that you still would need to import the util module even if you fully-qualified as util::time::Eternity on line 258.

Comment on lines 1294 to 1296
#[test]
fn time_passes_when_advanced() {
let now = SinceEpoch::now();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove these two tests since they were moved.

Comment on lines 7 to 8
//! A very simple serialization framework which is used to serialize/deserialize messages as well
//! as ChannelsManagers and ChannelMonitors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to update the docs.

Comment on lines 116 to 117
//! let invoice_payer = InvoicePayer::new(&payer, router, &scorer, &logger, event_handler,
//! Retry::Attempts(2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to wrap the doc example. Or at very least would need to indent accordingly.

Comment on lines 157 to 158
use core::fmt;
use core::fmt::{Display, Formatter};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move these before core::ops::Deref?

/// payment's paths.
Attempts(usize),
#[cfg(feature = "std")]
/// Time Out. Max time spent to retry payment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The phrasing is a little awkward here. How about "Time elapsed before abandoning retries for a payment."?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yours is much better. Thanks!

impl <T: Time>PaymentAttempts<T> {
fn new() -> Self {
PaymentAttempts{
counts: 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, also see: #1418 (comment)

I'm not sure I follow why the count logic needs to change if you are just adding another strategy.

first_attempted_at: T
}

impl <T: Time>PaymentAttempts<T> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no space after impl but add a space before PaymentAttempts .


impl <T: Time>PaymentAttempts<T> {
fn new() -> Self {
PaymentAttempts{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: space after PaymentAttempts

type InvoicePayerUsingSinceEpoch <P, R, S, L, E> = InvoicePayerUsingTime::<P, R, S, L, E, SinceEpoch>;

let invoice_payer =
InvoicePayerUsingSinceEpoch::new(&payer, router, &scorer, &logger, event_handler, Retry::Timeout(Duration::from_millis(100)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well use something more realistic looking like seconds since no actual sleeps occur.

@kafaichoi kafaichoi force-pushed the timeout-outbound-paymnet-retires-instead-of-count-out branch 6 times, most recently from a7384af to b08f0ea Compare April 23, 2022 06:38
vincenzopalazzo
vincenzopalazzo previously approved these changes May 2, 2022
TheBlueMatt
TheBlueMatt previously approved these changes May 2, 2022
Comment on lines 80 to 99
#[cfg(not(feature = "no-std"))]
impl Time for std::time::SystemTime {
fn now() -> Self {
std::time::SystemTime::now()
}

fn duration_since(&self, earlier: Self) -> Duration {
self.duration_since(earlier).unwrap()
}

fn duration_since_epoch() -> Duration {
use std::time::SystemTime;
SystemTime::now().duration_since(SystemTime::UNIX_EPOCH).unwrap()
}

fn elapsed(&self) -> Duration {
std::time::SystemTime::elapsed(self).unwrap()
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this should have been removed from the first commit since the code was not needed.

Comment on lines 116 to 96

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, this should be in the first commit.

// You may not use this file except in accordance with one or both of these
// licenses.

//! Time Trait and different implementations. Currently, it's mainly used in tests so we can
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and below:

s/Time Trait/[Time] trait

Comment on lines 9 to 10
//! This file is symlinked to lightning_invoice/src/time_utils.rs so we can keep Time Trait sealed
//! while allowing other crates(lightning_invoice) to use it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The symlink is the other way around, no? Just say other crates may symlink this file instead of referencing which files, as otherwise the list may go out of date.

//! while allowing other crates(lightning_invoice) to use it.

use core::ops::Sub;
use core::time::Duration;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add blank line between imports and Time docs

#[cfg(feature = "std")] {
if has_expired(params) {
log_trace!(self.logger, "Invoice expired for payment {}; not retrying (attempts: {})", log_bytes!(payment_hash.0), attempts);
log_trace!(self.logger, "Invoice expired for payment {}; not retrying ({:?})", log_bytes!(payment_hash.0), attempts);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... The attempts won't be accurate here since you've incremented earlier (and in Debug) but no retry occurs here. Similarly below.

Would it simplify things if we made PaymentAttempts::retry_count simply a count representing the number of attempts up until the retry? I believe that is what was done previously and as is still documented on the hash map:

/// Caches the overall attempts at making a payment, which is updated prior to retrying.

Essentially, the attempt is counted only once we get an event back on whether a path was successful or not. So renaming to count, unconditionally incrementing in this function, and not incrementing in Debug should give the correct number. Would need to modify is_retryable_now, too, IIUC.

Then you don't need to worry about the lock since you can increment in the above look-up, clone, and drop the lock implicitly in one statement similarly to the code prior to the change.

Copy link
Contributor Author

@kafaichoi kafaichoi May 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah u are right that the log in this branch(has_expired true) is not accurate. But can we simply move this has_expire check before is_retryable_now? I ran test in local after this change and all test case passed as well.

As per your suggestion, I believed that was my initial approach(got forced-pushed) but I decided to change it to only store retry_count because this part( #1418 (comment)) is confusing to others.

If we want to avoid the confusing part in my previous forced-pushed commit, does it mean we need to change the logic in Retry::is_retryable_now to be aware of the fact count always get incremented by 1 before calling retry_payment. I found that it's actually more confusing as now Retry::is_retryable_now is not a context-free function especially now is_retryable_now is also reused inside PaymentSendFailure::AllFailedRetrySafe

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah u are right that the log in this branch(has_expired true) is not accurate. But can we simply move this has_expire check before is_retryable_now? I ran test in local after this change and all test case passed as well.

Either order should be fine, but arguably you wouldn't attempt to retry if the count had been exceeded already.

As per your suggestion, I believed that was my initial approach(got forced-pushed) but I decided to change it to only store retry_count because this part( #1418 (comment)) is confusing to others.

Ah, I see. Seems like the confusion could have been resolved by maintaining the original behavior, no?

If we want to avoid the confusing part in my previous forced-pushed commit, does it mean, does it mean we need to change the logic in Retry::is_retryable_now to be aware of the fact count always get incremented by 1 before calling retry_payment. I found that it's actually more confusing as now Retry::is_retryable_now is not a context-free function especially now is_retryable_now is also reused inside PaymentSendFailure::AllFailedRetrySafe

Hmm... so the invariant was that the attempt count was incremented only after the result of the attempt is known. For PaymentSendFailure::AllFailedRetrySafe, we receive this result immediately rather than needing to wait for an event, so we never hit the retry_payment path.

I can see your argument as to why a retry count may make more sense. However, such a change should really be a separate commit with the what and why clearly described in the commit message. Admittedly, I've been lax about asking for the rationale in commit messages, which is okay when the commit is obvious. But the atomicity of commits is important, and this one makes two separate behavioral changes. I'd prefer if they were separate so that they can be considered independently.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I believe the original behavior was inconsistent - AFAICT we'll retry automatically once even with the max-retries count set to 0 if the payment fails immediately when we try to send instead of when an HTLC fails. Not that that behavior is really broken, its just confusing, so its all a bit of a mess.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I think it should work as expected. For AllFailedRetrySafe the number of attempts before increment (i.e., retry_count) and max retries are both 0, so an error is returned here:

let retry_count = payment_cache.get_mut(&payment_hash).unwrap();
if *retry_count >= self.retry_attempts.0 {
Err(e)

And for PartialFailure we hit the retry_payment case where the number of attempts is incremented and compared against maximum number of attempts (i.e. max retries plus one).

let max_payment_attempts = self.retry_attempts.0 + 1;
let attempts = *self.payment_cache.lock().unwrap()
.entry(payment_hash)
.and_modify(|attempts| *attempts += 1)
.or_insert(1);
if attempts >= max_payment_attempts {
log_trace!(self.logger, "Payment {} exceeded maximum attempts; not retrying (attempts: {})", log_bytes!(payment_hash.0), attempts);
return Err(());
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, maybe I'm wrong.....its more than a bit confusing, though...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkczyz Thank you so much for breaking it down. I will break the last commit into 2 as per your suggestion so it's easier for us to see if the change make sense or not(And I will also look into if the original approach result in some incorrect log msg) . However, I will probably only have time this weekend🙏🏽

Copy link
Contributor Author

@kafaichoi kafaichoi May 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkczyz I have made changes to restore the previous behavior and store attempts instead of retry attempts. After spending more time understanding the code, I actually think it's cleaner to keep in the original way :)

And I changed it back to impl Display for PaymentAttempts(somehow I can't find your related comment now)

@kafaichoi kafaichoi dismissed stale reviews from TheBlueMatt and vincenzopalazzo via 48329c0 May 4, 2022 13:20
@kafaichoi kafaichoi force-pushed the timeout-outbound-paymnet-retires-instead-of-count-out branch from c04b14b to 48329c0 Compare May 4, 2022 13:20
@TheBlueMatt
Copy link
Collaborator

Oops looks like this needs a rebase, sorry about that.

@kafaichoi kafaichoi force-pushed the timeout-outbound-paymnet-retires-instead-of-count-out branch from 48329c0 to 9245565 Compare May 4, 2022 23:44
@TheBlueMatt
Copy link
Collaborator

TheBlueMatt commented May 13, 2022

Ugh needs rebase again. Really sorry somehow I missed you'd pushed the commits above and didn't manage to review before it got stale again :(

@kafaichoi kafaichoi force-pushed the timeout-outbound-paymnet-retires-instead-of-count-out branch from 9245565 to 14248ee Compare May 14, 2022 04:14
@kafaichoi
Copy link
Contributor Author

Ugh needs rebase again. Really sorry somehow I missed you'd pushed the commits above and didn't manage to review before it got stale again :(

no worry! I actually felt so bad this relative simple PR causing core developers lots of time reviewing on this 😭

@kafaichoi kafaichoi force-pushed the timeout-outbound-paymnet-retires-instead-of-count-out branch 6 times, most recently from ca4099f to 00ee895 Compare May 14, 2022 07:06
@TheBlueMatt
Copy link
Collaborator

no worry! I actually felt so bad this relative simple PR causing core developers lots of time reviewing on this 😭

Nah! Final round review takes a bit of time, but passes before then take 5-10 minutes, don't sweat it :).

TheBlueMatt
TheBlueMatt previously approved these changes May 14, 2022
impl Retry {
fn is_retryable_now<T: Time>(&self, attempts: &PaymentAttempts<T>) -> bool {
match (self, attempts) {
(Retry::Attempts(max_retry_count), PaymentAttempts { count, .. }) => max_retry_count >= &count,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could you wrap at 100 characters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. (wish cargo fmt is integrated here 😬)

} else {
*retry_count += 1;
let payment_attempts = payment_cache.get_mut(&payment_hash).unwrap();
if self.retry.is_retryable_now(payment_attempts) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this break for Retry::Attempts(0)? The cache will have a value of 0 which would cause is_retryable_now to return true. But in this case we wouldn't want to the retry since InvoicePayer is configured for no retries. Perhaps we need a test for this case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was gonna let this one slide, IIUC it's how upstream works now, and not super surprising that we'd not instantly fail and at least try to find a route that we can send over. Though really this should only happen in race conditions and pretty exceedingly rare, if everything is working right.

Dunno if it's worth trying to build a test for but can just move the increment up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let's just move the increment up. Upstream, the check has the operands flipped with the opposite meaning (error), so works as intended.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah good catch. I have moved the increment up.

@kafaichoi kafaichoi force-pushed the timeout-outbound-paymnet-retires-instead-of-count-out branch from 00ee895 to b7d6d0d Compare May 17, 2022 06:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Time Out Outbound Payment Retires instead of Count-Out
6 participants