Penalize failed channels #1144

Merged

jkczyz
Contributor

@jkczyz jkczyz commented Oct 27, 2021

Adds a new payment_path_failed method to routing::Score for penalizing failed channels, which is called by InvoicePayer before retrying failed payments in the course of handling PaymentPathFailed events.

Implements the new method in Scorer by applying a configurable penalty on top of the base penalty. The new penalty decays over time but may be further increased if the channel continues to fail.

TODO: Test new Scorer behavior.
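
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `SketchScorer`, `Params`, and their field names are hypothetical, and the current time is passed in explicitly (an assumption made only so the decay is easy to exercise in a test).

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical parameters; names are illustrative, not the PR's API.
struct Params {
    base_penalty_msat: u64,
    failure_penalty_msat: u64,
    half_life: Duration,
}

struct SketchScorer {
    params: Params,
    /// short_channel_id -> (penalty recorded at last failure, time of last failure)
    channel_failures: HashMap<u64, (u64, Instant)>,
}

impl SketchScorer {
    fn new(params: Params) -> Self {
        Self { params, channel_failures: HashMap::new() }
    }

    /// Halves `penalty_msat` once per full half-life elapsed since `last_failure`.
    fn decayed(&self, penalty_msat: u64, last_failure: Instant, now: Instant) -> u64 {
        let elapsed = now.saturating_duration_since(last_failure);
        match elapsed.as_secs().checked_div(self.params.half_life.as_secs()) {
            Some(decays) if decays < 64 => penalty_msat >> decays,
            // Zero half-life or 64+ halvings: treated as fully decayed (assumption).
            _ => 0,
        }
    }

    /// Base penalty plus whatever remains of the failure penalty.
    fn channel_penalty_msat(&self, short_channel_id: u64, now: Instant) -> u64 {
        let failure_penalty = self.channel_failures.get(&short_channel_id)
            .map_or(0, |&(penalty, at)| self.decayed(penalty, at, now));
        self.params.base_penalty_msat.saturating_add(failure_penalty)
    }

    /// Adds the configured failure penalty on top of the decayed remainder.
    fn payment_path_failed(&mut self, short_channel_id: u64, now: Instant) {
        let remaining = self.channel_failures.get(&short_channel_id)
            .map_or(0, |&(penalty, at)| self.decayed(penalty, at, now));
        let updated = remaining.saturating_add(self.params.failure_penalty_msat);
        self.channel_failures.insert(short_channel_id, (updated, now));
    }
}
```

A channel that keeps failing accumulates penalty faster than it decays, while a channel that stops failing drifts back toward the base penalty.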

@codecov

codecov bot commented Oct 27, 2021

Codecov Report

Merging #1144 (db05a14) into main (59659d3) will decrease coverage by 0.09%.
The diff coverage is 82.88%.

@@            Coverage Diff             @@
##             main    #1144      +/-   ##
==========================================
- Coverage   90.51%   90.42%   -0.10%     
==========================================
  Files          69       70       +1     
  Lines       35865    39419    +3554     
==========================================
+ Hits        32462    35643    +3181     
- Misses       3403     3776     +373     
Impacted Files Coverage Δ
lightning-invoice/src/utils.rs 74.52% <50.00%> (-8.99%) ⬇️
lightning/src/routing/mod.rs 50.00% <50.00%> (ø)
lightning/src/routing/scorer.rs 52.38% <51.35%> (-14.29%) ⬇️
lightning-invoice/src/payment.rs 92.78% <93.58%> (-0.05%) ⬇️
lightning/src/routing/router.rs 92.91% <93.75%> (-2.64%) ⬇️
lightning-background-processor/src/lib.rs 95.94% <100.00%> (+1.71%) ⬆️
lightning/src/ln/channelmanager.rs 87.09% <100.00%> (+3.06%) ⬆️
lightning/src/ln/functional_test_utils.rs 97.01% <100.00%> (+1.90%) ⬆️
lightning/src/ln/functional_tests.rs 97.29% <100.00%> (-0.09%) ⬇️
lightning/src/ln/shutdown_tests.rs 95.89% <100.00%> (ø)
... and 15 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 59659d3...db05a14. Read the comment docs.

@jkczyz jkczyz force-pushed the 2021-10-invoice-payer-scoring branch from 371b971 to 0edc550 Compare October 27, 2021 16:43
@TheBlueMatt TheBlueMatt added this to the 0.0.103 milestone Oct 27, 2021
@jkczyz jkczyz force-pushed the 2021-10-invoice-payer-scoring branch 2 times, most recently from f538e01 to a90a9e5 Compare October 27, 2021 17:07
#[cfg(not(feature = "no-std"))]
fn decay_from(penalty_msat: u64, last_failure: &SystemTime, decay_interval: Duration) -> u64 {
let decays = last_failure.elapsed().ok().map_or(0, |elapsed| {
elapsed.as_secs() / decay_interval.as_secs()
Contributor Author

Note to self: handle divide by zero 😛

where
P::Target: Payer,
R: Router,
S: routing::Score,
Collaborator

As an alternative to an owned Score, should we consider a "two-layer" trait with a Score and LockedScore where router takes the second?

Contributor Author

I kinda prefer this way as it just uses plain references. Could be convinced otherwise, though.

Collaborator

I guess I'm looking at this from two directions:

  • in the case of bindings, we cannot serialize from a reference to a trait - all traits have to be concretized into a dyn Trait and from there you can't (type-safely) go back to the original type to serialize it. We'd have to add support for that, which would be highly language specific and checked at runtime, or make Score require Writeable, which is somewhat awkward, though not insane.
  • in the general case, it's a bit awkward for users to have to get their InvoicePayer to serialize their Score data - it ends up dictating a chunk of user layout, instead of taking a reference which gives the user more flexibility.

Obviously it's somewhat awkward in that we end up forcing users into a two-layer wrapper thingy, but luckily:

  • Our implementation does work against it without the user having to write any additional characters,
  • we can implement the parent trait for Thing<Deref<Target=Mutex<ThingB: subtrait>>> (or implement it for Thing: Deref<Target=subtrait> directly with no-std), making it somewhat transparent at least,
  • bindings users can use the API, at least kinda, though GC waiting to release the lock kinda sucks too.

Contributor Author

> I guess I'm looking at this from two directions:
>
>   • in the case of bindings, we cannot serialize from a reference to a trait - all traits have to be concretized into a dyn Trait and from there you can't (type-safely) go back to the original type to serialize it. We'd have to add support for that, which would be highly language specific and checked at runtime, or make Score require Writeable, which is somewhat awkward, though not insane.

Just a thought, we could have S: routing::Score + Clone and have the accessor return a copy. Seems like it would save a lot of trouble using multiple traits for a small cost. Would this solve the bindings issue, at least as an interim solution?

>   • in the general case, it's a bit awkward for users to have to get their InvoicePayer to serialize their Score data - it ends up dictating a chunk of user layout, instead of taking a reference which gives the user more flexibility.

FWIW, they will already need to use Arc<InvoicePayer> as it needs to be passed to the BackgroundProcessor for event handling in addition to being used to make payments. So having, say, some persister utility wrapping Arc<InvoicePayer> wouldn't be horrible. And, if written in Rust, wouldn't it just be a simple method call to persist not involving any references at the bindings level?

> Obviously it's somewhat awkward in that we end up forcing users into a two-layer wrapper thingy, but luckily:
>
>   • Our implementation does work against it without the user having to write any additional characters,
>   • we can implement the parent trait for Thing<Deref<Target=Mutex<ThingB: subtrait>>> (or implement it for Thing: Deref<Target=subtrait> directly with no-std), making it somewhat transparent at least,
>   • bindings users can use the API, at least kinda, though GC waiting to release the lock kinda sucks too.

I understand the problem and am trying to grok the trait-subtrait solution you're proposing. I guess the point of it is so we can pass find_route a LockedScore. But the relationship between LockedScore and Score (which is the supertrait of which?) and what methods each has isn't so clear. Could you elaborate?

Wonder if using an associated type in some manner would make this simpler?

Collaborator

@TheBlueMatt TheBlueMatt Oct 27, 2021

> Just a thought, we could have S: routing::Score + Clone and have the accessor return a copy.

Yea, that's something I'd been thinking about for a different reason (because Java does that a ton anyway), and I guess it's okay? I'm not really a huge fan of cloning a hashmap that may get an entry for every channel in the graph (which may be the case for users doing probing), though; that could be a lot of data.

> FWIW, they will already need to use Arc<InvoicePayer> as it needs to be passed to the BackgroundProcessor for event handling in addition to being used to make payments. So having, say, some persister utility wrapping Arc<InvoicePayer> wouldn't be horrible.

Ah, that's a good point re: user code complexity.

> And, if written in Rust, wouldn't it just be a simple method call to persist not involving any references at the bindings level?

So we'd make Score : Writeable and add a utility method to InvoicePayer to just write the scores out directly? I'm not 100% sure where you're going with this.

> I understand the problem and am trying to grok the trait-subtrait solution you're proposing. I guess the point of it is so we can pass find_route a LockedScore. But the relationship between LockedScore and Score (which is the supertrait of which?) and what methods each has isn't so clear. Could you elaborate?

Sorry, I was using "supertrait" liberally (and incorrectly). What I was thinking of was (but with better naming):

trait Score {
  type Scorer: LockedScore;
  fn get_scorer(&self) -> Self::Scorer;
}
trait LockedScore {
  fn get_score(&self, scid: u64, ..) -> u64;
}
#[cfg(not(no_std))]
impl<LS: LockedScore, T: Deref<Target=Mutex<LS>>> Score for T {
  ...
}
#[cfg(no_std)]
impl<LS: LockedScore, T: Deref<Target=LS>> Score for T {
  ...
}

Collaborator

Hmmmmm, that could work. I guess it does feel a lot less explicit (in the sense that users lose the flexibility of selecting how locking works) and fairly different from our existing APIs which are built around Derefs.

If we go this route, to make the bindings sane, we'd probably want to create a WriteableScore trait that is just pub trait WriteableScore : Score + Writeable {} and use that, then create a constructor for ScorePersister that mirrors the InvoicePayer constructor and just takes a WriteableScore instead of a Score.

Contributor Author

Tried your solution with some renaming based on our offline discussion. Ran into some lifetime hell while doing so... 😕 Pushed a commit that almost works. Any chance you could see where I've gone wrong?

Contributor Author

FWIW, the MutexGuard is making this tricky

Collaborator

Grrrrrrrrr, right, so the direct thing you want to do here requires GATs, which may land in, like, the next version of rust or something. However, in trying to fix this I learned of a new rust syntax which seems to be exactly the subset of GATs that we want here - HRTB. I pushed a cargo check'ing branch at https://git.bitcoin.ninja/index.cgi?p=rust-lightning;a=log;h=refs/heads/1144-magic
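
For readers unfamiliar with the trick, here is a minimal sketch of how a lifetime on the trait plus a higher-ranked trait bound (`for<'a>`) can stand in for GATs when the locked type borrows from `self`. It is written from the description in this thread, not copied from the linked branch. The trait and method names follow the discussion, while `FailureCounter`, `retry_after_failure`, and the direct `Score` impl for `MutexGuard` are assumptions made to keep the example short.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, MutexGuard};

trait Score {
    fn channel_penalty_msat(&self, short_channel_id: u64) -> u64;
    fn payment_path_failed(&mut self, short_channel_id: u64);
}

/// The lifetime lives on the trait, so `Locked` can borrow from `self`
/// without needing a generic associated type.
trait LockableScore<'a> {
    type Locked: 'a + Score;
    fn lock(&'a self) -> Self::Locked;
}

impl<'a, T: 'a + Score> LockableScore<'a> for Mutex<T> {
    type Locked = MutexGuard<'a, T>;
    fn lock(&'a self) -> MutexGuard<'a, T> {
        Mutex::lock(self).unwrap()
    }
}

// Assumption for brevity: forward `Score` through the guard directly.
impl<'a, T: 'a + Score> Score for MutexGuard<'a, T> {
    fn channel_penalty_msat(&self, scid: u64) -> u64 {
        (**self).channel_penalty_msat(scid)
    }
    fn payment_path_failed(&mut self, scid: u64) {
        (**self).payment_path_failed(scid)
    }
}

/// Hypothetical scorer: 500 msat base plus 1000 msat per recorded failure.
struct FailureCounter {
    failures: HashMap<u64, u64>,
}

impl Score for FailureCounter {
    fn channel_penalty_msat(&self, scid: u64) -> u64 {
        500 + 1_000 * self.failures.get(&scid).copied().unwrap_or(0)
    }
    fn payment_path_failed(&mut self, scid: u64) {
        *self.failures.entry(scid).or_insert(0) += 1;
    }
}

/// `for<'a>` says the scorer can be locked for any borrow lifetime, which is
/// what lets the caller lock once, record the failure, and then score.
fn retry_after_failure<S>(scorer: &S, failed_scid: u64, candidates: &[u64]) -> Option<u64>
where
    S: for<'a> LockableScore<'a>,
{
    let mut locked = scorer.lock();
    locked.payment_path_failed(failed_scid);
    candidates.iter().copied().min_by_key(|scid| locked.channel_penalty_msat(*scid))
}
```

The caller never names the guard type; it only relies on `Locked: Score`, so other lock types could be substituted without changing `retry_after_failure`.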

Contributor Author

Thanks! Got it working now. Only annoying thing about the blanket impl is that I get a conflicting implementation error if I try to implement it with T: Deref<Target=RefCell<S>>, which I was hoping to do for tests instead of using a Mutex. Which I guess means any Deref must reference a Mutex? Might be some way to work around it by introducing another type parameter? Don't need to worry about this now, though.

base_penalty_msat: u64,
params: ScoringParameters,
#[cfg(not(feature = "no-std"))]
channel_failures: HashMap<u64, (u64, SystemTime)>,
Collaborator

This should be an Instant instead, no? We want a monotonic clock.

Contributor Author

Instant is opaque, so I'd imagine it would be difficult to persist, whereas with SystemTime we can get a Duration since the unix epoch. IIUC, any decrease in time as used with elapsed would be before the first decay, so it shouldn't have any effect.

Collaborator

Maybe we can persist the SystemTime::now() - Instant.elapsed()?

Contributor Author

hmmm... but we can't deserialize it back into Instant since the only way to create one is Instant::now or from another Instant.

Collaborator

Oh, right, yuck. Ummmmmmmm Instant::now() - (SystemTime::now() - deserialized_systemtime)? It's gross, but more technically correct...

Contributor Author

Ok, I think I convinced myself this is possible by serializing in terms of a Duration since the UNIX epoch, and then upon deserialization do a similar adjustment.

    let time = std::time::Instant::now();
    let duration_since_epoch = std::time::SystemTime::now().duration_since(std::time::SystemTime::UNIX_EPOCH).unwrap();
    let serialized_time = duration_since_epoch - time.elapsed();
    println!("time: {:?}", time);
    println!("duration_since_epoch: {:?}", duration_since_epoch);
    println!("serialized_time: {:?}", serialized_time);
    
    std::thread::sleep(std::time::Duration::from_secs(1));
    
    let duration_since_epoch = std::time::SystemTime::now().duration_since(std::time::SystemTime::UNIX_EPOCH).unwrap();
    let duration_since_instant = duration_since_epoch - serialized_time;
    let deserialized_time = std::time::Instant::now() - duration_since_instant;
    println!("duration_since_epoch: {:?}", duration_since_epoch);
    println!("duration_since_instant: {:?}", duration_since_instant);
    println!("deserialized_time: {:?}", deserialized_time);
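
The same adjustment can be packaged as a pair of helpers. This is a sketch under stated assumptions: the function names are hypothetical, and the saturating and checked arithmetic is an addition here, intended to avoid a panic if the wall clock has moved backwards or the platform cannot represent an `Instant` that far in the past.

```rust
use std::time::{Duration, Instant, SystemTime, UNIX_EPOCH};

/// Wall-clock time since the UNIX epoch, or zero if the clock reads before it.
fn now_since_epoch() -> Duration {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap_or(Duration::ZERO)
}

/// Converts a monotonic `Instant` into a serializable duration since the epoch.
fn instant_to_epoch_offset(last_failure: Instant) -> Duration {
    now_since_epoch().saturating_sub(last_failure.elapsed())
}

/// Rebuilds an `Instant` from a previously serialized duration since the epoch.
fn epoch_offset_to_instant(serialized: Duration) -> Instant {
    let elapsed = now_since_epoch().saturating_sub(serialized);
    let now = Instant::now();
    // Fall back to `now` if the platform cannot represent the earlier instant.
    now.checked_sub(elapsed).unwrap_or(now)
}
```

The round trip is only as accurate as the wall clock at both ends, so a restored `Instant` should be treated as approximate.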

@jkczyz jkczyz force-pushed the 2021-10-invoice-payer-scoring branch 2 times, most recently from dbedf14 to 1281d49 Compare October 28, 2021 17:31
@@ -117,15 +131,17 @@ use std::sync::Mutex;
use std::time::{Duration, SystemTime};

/// A utility for paying [`Invoice]`s.
-pub struct InvoicePayer<P: Deref, R, L: Deref, E>
+pub struct InvoicePayer<P: Deref, R, S, L: Deref, E>
Collaborator

Should S be a Deref to a LockableScore?

Contributor Author

We implement LockableScore for T: Deref<Target=Mutex<S>>, so I don't think we do unless there is another reason you're thinking of? See last commit for use with BackgroundProcessor.

Collaborator

Hmmmm, right, does it work if we implement LockableScore for Mutex<S> instead? It seems like a Deref here is more consistent with our API.

Contributor Author

Good point. Done. Also, implemented for RefCell now that it is possible.

@jkczyz jkczyz force-pushed the 2021-10-invoice-payer-scoring branch 3 times, most recently from 27e7947 to 21cbd03 Compare October 28, 2021 20:46
@@ -152,8 +168,9 @@ pub trait Payer {
/// A trait defining behavior for routing an [`Invoice`] payment.
pub trait Router {
/// Finds a [`Route`] between `payer` and `payee` for a payment with the given values.
-fn find_route(
-&self, payer: &PublicKey, params: &RouteParameters, first_hops: Option<&[&ChannelDetails]>
+fn find_route<S: routing::Score>(
Collaborator

@TheBlueMatt TheBlueMatt Oct 28, 2021

The S bound should probably be on the trait itself, no? ie if a user always constructs an InvoicePayer with CustomLocalUserRouter then find_route should take a CustomLocalUserRouter and not an S. If the complexity of the type annotations blows up as a result of this, feel free to ignore :).

Contributor Author

Not sure I quite follow. Why would find_route take itself (in addition to self) instead of a Score?

I think you're trying to say that the type S: routing::Score parameter used in Router should be the same Score as required by LockableScore<'a>::Locked. The compiler happily leads me to use the following ugly syntax and to use PhantomData in DefaultRouter.

R: for <'a> Router<<<S as Deref>::Target as routing::LockableScore<'a>>::Locked>,

But it seems I'm just doing what the compiler is already inferring, no? Did I misunderstand? I suppose internally, we could accidentally call find_route with an entirely different Score than what the user parameterized InvoicePayer with. But I don't think a user could do so.

Collaborator

Instead of the current code, it'd be

pub trait Router<S: routing::Score> {
   fn find_route(&self, ...&S)
}

Contributor Author

Yeah, we're on the same page as discussed offline. Also, turns out I didn't need to use PhantomData. I was being overzealous in where I was adding type parameters.
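
A small sketch of the shape agreed on here, with the scorer type as a parameter of the trait rather than of the method. Everything except the `Router`, `Score`, and `find_route` names is hypothetical (`LowestPenaltyRouter`, `SketchPayer`, `AvoidChannel`), and routing is reduced to picking one channel id so the example stays self-contained. Note that the router struct needs no `PhantomData`, because the type parameter can live on the `impl` alone.

```rust
trait Score {
    fn channel_penalty_msat(&self, short_channel_id: u64) -> u64;
}

/// The scorer type is fixed per `Router` implementation, not per call.
trait Router<S: Score> {
    fn find_route(&self, candidates: &[u64], scorer: &S) -> Option<u64>;
}

/// Hypothetical router that works with any scorer; no `PhantomData` needed
/// because `S` appears only on the impl.
struct LowestPenaltyRouter;

impl<S: Score> Router<S> for LowestPenaltyRouter {
    fn find_route(&self, candidates: &[u64], scorer: &S) -> Option<u64> {
        candidates.iter().copied().min_by_key(|scid| scorer.channel_penalty_msat(*scid))
    }
}

/// Hypothetical stand-in for InvoicePayer: the bound `R: Router<S>` ties the
/// router to the same scorer type the payer was constructed with.
struct SketchPayer<R, S> {
    router: R,
    scorer: S,
}

impl<R: Router<S>, S: Score> SketchPayer<R, S> {
    fn pick_channel(&self, candidates: &[u64]) -> Option<u64> {
        self.router.find_route(candidates, &self.scorer)
    }
}

/// Hypothetical scorer that heavily penalizes a single channel.
struct AvoidChannel(u64);

impl Score for AvoidChannel {
    fn channel_penalty_msat(&self, short_channel_id: u64) -> u64 {
        if short_channel_id == self.0 { 1_000_000 } else { 500 }
    }
}
```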

Collaborator

@TheBlueMatt TheBlueMatt left a comment

Other than the above two comments, LGTM.

@jkczyz jkczyz force-pushed the 2021-10-invoice-payer-scoring branch from 21cbd03 to 269a85a Compare October 29, 2021 02:47
@valentinewallace valentinewallace self-requested a review October 29, 2021 15:50
@jkczyz jkczyz mentioned this pull request Oct 29, 2021
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Note that some of the fixups on "Notify scorer of failing payment path" probably should be on "Parameterize InvoicePayer by routing::Score". I'd be totally fine if those both just get squashed into one commit, though.

}
}

impl<S: Score, T: Deref<Target=S> + DerefMut<Target=S>> Score for T {
Collaborator

DerefMut extends Deref so you should be able to drop the Deref bound.
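
To illustrate the point: `Deref` is a supertrait of `DerefMut`, so the `Target` binding can be written on `DerefMut` alone. The sketch below uses a two-method `Score` trait and a hypothetical `FailureCount` scorer and `fail_and_score` helper, none of which are taken from the PR.

```rust
use std::cell::RefCell;
use std::ops::DerefMut;

trait Score {
    fn channel_penalty_msat(&self, short_channel_id: u64) -> u64;
    fn payment_path_failed(&mut self, short_channel_id: u64);
}

// `DerefMut: Deref`, so `Target` can be bound here without naming `Deref`.
impl<S: Score, T: DerefMut<Target = S>> Score for T {
    fn channel_penalty_msat(&self, short_channel_id: u64) -> u64 {
        (**self).channel_penalty_msat(short_channel_id)
    }
    fn payment_path_failed(&mut self, short_channel_id: u64) {
        (**self).payment_path_failed(short_channel_id)
    }
}

/// Hypothetical scorer: 500 msat base plus 1000 msat per failure, any channel.
struct FailureCount(u64);

impl Score for FailureCount {
    fn channel_penalty_msat(&self, _short_channel_id: u64) -> u64 {
        500 + 1_000 * self.0
    }
    fn payment_path_failed(&mut self, _short_channel_id: u64) {
        self.0 += 1;
    }
}

/// Accepts anything that is a `Score`, including smart pointers and guards.
fn fail_and_score<S: Score>(mut scorer: S, short_channel_id: u64) -> u64 {
    scorer.payment_path_failed(short_channel_id);
    scorer.channel_penalty_msat(short_channel_id)
}
```

With the blanket impl in place, `&mut FailureCount`, `Box<FailureCount>`, and a `RefCell` borrow guard can all be passed where a `Score` is expected.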

Contributor

@valentinewallace valentinewallace left a comment

Shaping up IMO! Mostly docs requests

//! let scorer = Scorer::default();
//!
//! // Or use a custom channel penalty.
//! let scorer = Scorer::new(1_000);
//! // Or use custom channel penalties.
Contributor

since this says custom channel "penalties," could we change it to either say "penalty" or have multiple penalties be custom?

Contributor Author

Added another custom penalty.

}
}

/// Creates a new scorer using `penalty_msat` as a fixed channel penalty.
Contributor

Could specify that this will only have a fixed base channel penalty

Contributor Author

Whoops, missed this comment.

Contributor Author

Actually, the other penalty is zero so the overall penalty will be fixed.

@@ -830,6 +873,39 @@ mod tests {
}
}

#[test]
fn scores_failed_channel() {
Contributor

Could we add a high-level comment? I'm a bit confused what this tests

Contributor Author

Added a comment. There is an expectation set on TestScorer upon creation that it is called with a specific short_channel_id. It will fail if it is not called at all or called with a different short_channel_id.
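
For context, an expectation-carrying test double of the kind described might look like the sketch below. It is an illustration only: `ExpectingScorer` and `expect_channel_failure` are hypothetical names, the `Score` trait is reduced to two methods, and checking the unmet expectation in `Drop` is an assumption about how "fails if not called at all" could be implemented.

```rust
trait Score {
    fn channel_penalty_msat(&self, short_channel_id: u64) -> u64;
    fn payment_path_failed(&mut self, short_channel_id: u64);
}

/// Hypothetical test double that expects exactly one failure notification.
struct ExpectingScorer {
    expected_failure: Option<u64>,
}

impl ExpectingScorer {
    fn new() -> Self {
        Self { expected_failure: None }
    }

    /// Sets the short_channel_id that must be reported as failed.
    fn expect_channel_failure(mut self, short_channel_id: u64) -> Self {
        self.expected_failure = Some(short_channel_id);
        self
    }
}

impl Score for ExpectingScorer {
    fn channel_penalty_msat(&self, _short_channel_id: u64) -> u64 {
        0
    }

    fn payment_path_failed(&mut self, actual: u64) {
        match self.expected_failure.take() {
            Some(expected) => assert_eq!(expected, actual, "wrong channel penalized"),
            None => panic!("unexpected failure notification for channel {}", actual),
        }
    }
}

impl Drop for ExpectingScorer {
    fn drop(&mut self) {
        // Skip the check while unwinding to avoid a double panic.
        if std::thread::panicking() {
            return;
        }
        if let Some(expected) = self.expected_failure {
            panic!("expected a failure notification for channel {}", expected);
        }
    }
}
```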

}

#[cfg(not(feature = "no-std"))]
fn decay_from(penalty_msat: u64, last_failure: &Instant, half_life: Duration) -> u64 {
Contributor

High level doc for the return value?

Contributor Author

This is refactored in #1146 into a method called decayed_penalty, so will leave as is to avoid a lengthy rebase process. Happy to address any concerns in that PR. Probably should include units in the name.

Contributor

Gotcha, just noticed a lot of this is rewritten in #1146. Thanks!


#[cfg(not(feature = "no-std"))]
fn decay_from(penalty_msat: u64, last_failure: &Instant, half_life: Duration) -> u64 {
let decays = last_failure.elapsed().as_secs().checked_div(half_life.as_secs());
Contributor

Do we test that it won't decay if less than half_life has passed?

Contributor Author

Nothing is tested at the moment... 🙂 Will do in a follow-up. Confirmed this results in no shifts because checked_div returns Some(0).
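
A sketch of the edge cases discussed in this thread, written so they can be checked without sleeping. The function name `decayed_penalty_msat` and the choice to take the elapsed time as an argument are assumptions for testability, and returning 0 for a zero half-life is one possible policy, not something this thread confirms.

```rust
use std::time::Duration;

/// Halves `penalty_msat` once per full `half_life` contained in `elapsed`.
fn decayed_penalty_msat(penalty_msat: u64, elapsed: Duration, half_life: Duration) -> u64 {
    match elapsed.as_secs().checked_div(half_life.as_secs()) {
        // Less than one half-life gives Some(0), which is a shift by zero.
        Some(decays) if decays < 64 => penalty_msat >> decays,
        // 64 or more halvings leave nothing of a u64.
        Some(_) => 0,
        // Zero-second half-life: treat the penalty as fully decayed (assumption).
        None => 0,
    }
}
```

Because the division is on whole seconds, the penalty steps down at each half-life boundary rather than decaying continuously.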

@jkczyz jkczyz force-pushed the 2021-10-invoice-payer-scoring branch from 269a85a to 5c8466b Compare October 29, 2021 19:04
Upon receiving a PaymentPathFailed event, the failing payment may be
retried on a different path. To avoid using the channel responsible for
the failure, a scorer should be notified of the failure before being
used to find a new route.

Add a payment_path_failed method to routing::Score and call it in
InvoicePayer's event handler. Introduce a LockableScore parameterization
to InvoicePayer so the scorer is locked only once before calling
find_route.

As payments fail, the channel responsible for the failure may be
penalized. Implement Scorer::payment_path_failed to penalize the failed
channel using a configured penalty. As time passes, the penalty is
reduced using exponential decay, though penalties will accumulate if the
channel continues to fail. The decay interval is also configurable.

Proof of concept showing InvoicePayer can be used with an
Arc<ChannelManager> passed to BackgroundProcessor. Likely do not need to
merge this commit.
@jkczyz jkczyz force-pushed the 2021-10-invoice-payer-scoring branch from 5c8466b to db05a14 Compare October 29, 2021 19:28
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Squashed without diff from Val's ACK, will land after CI:

$ git diff-tree -U1 5c8466b4 db05a14a
$
