fix(request-response): Report dial IO errors to the user #5429

Merged
merged 3 commits into libp2p:master from fix/report-io-dial-error
Jun 4, 2024

Conversation


@oblique oblique commented May 29, 2024

Description

This fixes a potential infinite retry loop when dialing bad peers. The error is now reported to the user, who should handle it as they see fit for their use case.
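One way a caller might handle a reported dial error is to cap the number of attempts per peer instead of retrying forever. The sketch below is illustrative only: `DialOutcome`, `RetryPolicy`, and `Decision` are hypothetical names invented for this example and are not part of the libp2p API, and the peer is identified by a plain string for simplicity.

```rust
use std::collections::HashMap;
use std::io;

/// Hypothetical stand-in for the result of a dial attempt.
/// This is NOT the libp2p type; it only models "success" vs. "IO error".
#[derive(Debug)]
enum DialOutcome {
    Connected,
    Io(io::Error),
}

/// What the caller decides to do after seeing an outcome.
#[derive(Debug, PartialEq)]
enum Decision {
    Done,
    Retry,
    GiveUp,
}

/// Counts consecutive dial failures per peer (hypothetical helper).
struct RetryPolicy {
    max_attempts: u32,
    failures: HashMap<String, u32>,
}

impl RetryPolicy {
    fn new(max_attempts: u32) -> Self {
        Self { max_attempts, failures: HashMap::new() }
    }

    /// Records the outcome for `peer` and returns the caller's next step.
    fn on_outcome(&mut self, peer: &str, outcome: &DialOutcome) -> Decision {
        match outcome {
            DialOutcome::Connected => {
                // A successful dial clears the failure history.
                self.failures.remove(peer);
                Decision::Done
            }
            DialOutcome::Io(_) => {
                let counter = self.failures.entry(peer.to_string()).or_insert(0);
                *counter += 1;
                let attempts = *counter;
                if attempts >= self.max_attempts {
                    // Stop retrying and forget the peer so state does not grow.
                    self.failures.remove(peer);
                    Decision::GiveUp
                } else {
                    Decision::Retry
                }
            }
        }
    }
}
```

With `RetryPolicy::new(2)`, the first IO error for a peer yields `Decision::Retry` and the second yields `Decision::GiveUp`. Whether to retry, back off, or drop the peer is an application choice.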

Notes & open questions

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates


@jxs jxs left a comment


Thanks Yianis! LGTM

@jxs jxs requested a review from thomaseizinger May 31, 2024 16:39
@jxs jxs added the send-it label Jun 4, 2024
@mergify mergify bot merged commit 68301b8 into libp2p:master Jun 4, 2024
72 checks passed
@oblique oblique deleted the fix/report-io-dial-error branch June 4, 2024 10:22
TimTinkers pushed a commit to unattended-backpack/rust-libp2p that referenced this pull request Sep 14, 2024
This fixes a potential infinite retry loop when dialing bad peers. The error is now reported to the user, who should handle it as they see fit for their use case.

Pull-Request: libp2p#5429.
github-merge-queue bot pushed a commit to paritytech/polkadot-sdk that referenced this pull request Jan 22, 2025
This PR enforces that outbound requests are finished within the
specified protocol timeout.

The stable2412 version running libp2p 0.52.4 contains a bug which does
not track request timeouts properly:
- libp2p/rust-libp2p#5429

The issue was detected while submitting libp2p -> litep2p requests
on Kusama. This PR checks that pending outbound requests have not
timed out. Although the issue has been fixed in libp2p, there might be
other cases where this may happen. For example:
- libp2p/rust-libp2p#5417

For more context see:
#7076 (comment)
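
A forced timeout of this kind can be sketched as a table of pending requests with deadlines that is swept periodically. This is a minimal illustration under assumed names (`PendingRequests`, `sweep_expired`, and numeric request IDs are hypothetical), not the code from this PR:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical tracker for outbound requests and their deadlines.
struct PendingRequests {
    timeout: Duration,
    deadlines: HashMap<u64, Instant>,
}

impl PendingRequests {
    fn new(timeout: Duration) -> Self {
        Self { timeout, deadlines: HashMap::new() }
    }

    /// Registers a request sent at `now`; it expires at `now + timeout`.
    fn insert(&mut self, id: u64, now: Instant) {
        self.deadlines.insert(id, now + self.timeout);
    }

    /// Marks a request as finished by the network layer.
    /// Returns false if it was unknown or already force-timed-out.
    fn finish(&mut self, id: u64) -> bool {
        self.deadlines.remove(&id).is_some()
    }

    /// Removes and returns (in ascending ID order) every request whose
    /// deadline has passed, regardless of what the network layer reported.
    fn sweep_expired(&mut self, now: Instant) -> Vec<u64> {
        let mut expired: Vec<u64> = self
            .deadlines
            .iter()
            .filter(|(_, deadline)| **deadline <= now)
            .map(|(id, _)| *id)
            .collect();
        expired.sort_unstable();
        for id in &expired {
            self.deadlines.remove(id);
        }
        expired
    }
}
```

A late response for a swept request makes `finish` return false, which corresponds to the case where a second warning may be logged.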


1. Ideally, the force-timeout mechanism in this PR should never be
triggered in production. However, origin/stable2412 occasionally
encounters this issue. When this happens, two warnings may be generated:
- one warning, introduced by this PR, about the force timeout terminating
the request
- possibly one warning when libp2p decides (if at all) to provide
the response back to substrate (as mentioned by @alexggh
[here](https://github.com/paritytech/polkadot-sdk/pull/7222/files#diff-052aeaf79fef3d9a18c2cfd67006aa306b8d52e848509d9077a6a0f2eb856af7L769)
and
[here](https://github.com/paritytech/polkadot-sdk/pull/7222/files#diff-052aeaf79fef3d9a18c2cfd67006aa306b8d52e848509d9077a6a0f2eb856af7L842))

2. This implementation does not propagate the
`RequestFinished { error: .. }` event to the substrate service. That
event is only used internally by substrate to increment metrics, and we
don't have the peer information available to propagate it properly when
we force-timeout the request. Considering this should most likely not
happen in production (origin/master) and that we'll be able to extract
information from the warnings, I would say this is a good tradeoff for
code simplicity:


https://github.com/paritytech/polkadot-sdk/blob/06e3b5c6a7696048d65f1b8729f16b379a16f501/substrate/client/network/src/service.rs#L1543


### Testing

Added a new test to ensure the timeout is reached properly, even if
libp2p does not produce a response in due time.

I've also transitioned the tests to use `tokio::test` due to a
limitation of
[CI](https://github.com/paritytech/polkadot-sdk/actions/runs/12832055737/job/35784043867):

```
--- TRY 1 STDERR:        sc-network request_responses::tests::max_response_size_exceeded ---
thread 'request_responses::tests::max_response_size_exceeded' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/time/interval.rs:139:26:
there is no reactor running, must be called from the context of a Tokio 1.x runtime
```



cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <[email protected]>
Co-authored-by: Bastian Köcher <[email protected]>