
Allow dropping light client RPC query with no results #9318

Merged
merged 18 commits on Sep 12, 2018

Conversation

cheme
Contributor

@cheme cheme commented Aug 8, 2018

This PR helps with #9133.

I observed infinite requesting when sending an RPC call such as getTransactionByHash on the light client.

The underlying problem is that the light client queries random peers indefinitely until it gets a reply (the way the communication is done, closing the query does not end this).
In the case of a query with an incorrect hash, it will just query random peers forever.

In this PR I change this by limiting the number of queries to one per peer before failing; a sketch of the idea follows the list below.
It leaves a few open questions:

  • Should we also use a timeout mechanism for the case where there are not many peers? This PR does break the no_capability test; with a timeout mechanism the test would make sense again.
  • When failing I use the Cancelled error, resulting in {"jsonrpc":"2.0","error":{"code":-32603,"message":"Internal error occurred: on-demand sender cancelled","data":"\"\""},"id":1}. Is that acceptable?
  • If peers are added or removed during the process, some peers may not be queried, or may be queried twice (the hashmap iterator keeps its order, so it is not that bad). Obviously, I could store the previously queried peer ids, but I am not sure it is worth it.
  • When testing on Ropsten, there are definitely very few peers that can reply (one with full history). When an OnDemand query passes for one peer and not for the others, should we exclude the failing peers from the OnDemand list of peers?
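
A minimal sketch of the retry-capping idea, with simplified, hypothetical types (the actual Pending struct in this PR carries many more fields):

struct Pending {
    remaining_query_count: usize, // initialized to the number of known peers
}

impl Pending {
    // Called when a peer returned no usable reply. Consumes self once the
    // budget is exhausted, so the query errors out instead of looping forever.
    fn on_no_reply(mut self) -> Option<Self> {
        self.remaining_query_count = self.remaining_query_count.saturating_sub(1);
        if self.remaining_query_count == 0 { None } else { Some(self) }
    }
}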

cheme added 2 commits August 8, 2018 18:00
All peers known at the time will be queried, and the query fails if all
return no reply.
Returning the failure is done through an empty Vec of replies (the type
of the oneshot channel remains unchanged).
Before this commit the query was sent randomly to any peer until there
was a reply (for a query that got no result this was an issue; for other
queries it meant querying the same peers multiple times).
After this commit the first query is random, but subsequent queries
follow the hashmap iterator order.

Test no_capability was broken by this commit (the pending query was
removed). If some kind of timeout mechanism is added, it could be restored.
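
A sketch of the dispatch order this describes, mirroring the chain/skip/take pattern quoted later in the review (the function and its type parameters are illustrative):

use std::cmp;
use std::collections::HashMap;

// Walk the peer map once, starting from a random offset: the first query
// lands on a random peer, the following ones follow the map's iterator order.
fn query_order<'a, P>(peers: &'a HashMap<usize, P>, offset: usize) -> impl Iterator<Item = (&'a usize, &'a P)> + 'a {
    let num_peers = peers.len();
    peers.iter()
        .chain(peers.iter())
        .skip(offset % cmp::max(num_peers, 1))
        .take(num_peers)
}
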
@cheme cheme added the A0-pleasereview 🤓 Pull request needs code review. label Aug 8, 2018
@parity-cla-bot

It looks like @cheme signed our Contributor License Agreement. 👍

Many thanks,

Parity Technologies CLA Bot

@@ -81,6 +82,9 @@ struct Pending {
required_capabilities: Capabilities,
responses: Vec<Response>,
sender: oneshot::Sender<Vec<Response>>,
first_query: Option<usize>,
nb_query: usize,
Collaborator

What does nb and rem stand for? If nb is for "number", I'd suggest switching to query_count perhaps?

Contributor Author

Yes, not explicit enough. I can switch to:

  • first_query : base_query_index
  • nb_query : total_query_count
  • rem_query : remaining_query_count

@@ -180,6 +185,14 @@ impl Pending {
self.net_requests = builder.build();
self.required_capabilities = capabilities;
}

// return no response, will result in an error
// consume object (similar to drop when no reply found)
Collaborator

s/consume object/consumes self/ maybe?

Contributor Author

Best comment (object is definitely bad wording for Rust):
// returning no response, it will result in an error.
// self is consumed on purpose.
The comment is just there to point out that we do not want self to be used again (the oneshot channel has already been used).

Collaborator

Yep, I think the comment is useful here. I like your suggestion for an update.

@debris debris requested a review from rphmeier August 10, 2018 08:54
@debris debris added the M4-core ⛓ Core client code / Rust. label Aug 10, 2018
@@ -145,7 +149,8 @@ impl Pending {
// if the requests are complete, send the result and consume self.
fn try_complete(self) -> Option<Self> {
if self.requests.is_complete() {
let _ = self.sender.send(self.responses);
self.sender.send(self.responses).map_err(|_|())
.expect("Non used one shot channel");
Contributor

I don't think it should panic here.

fn no_response(self) {
trace!(target: "on_demand", "Dropping a pending query (no reply)");
self.sender.send(Vec::with_capacity(0)).map_err(|_|())
.expect("Non used one shot channel");
Contributor

I think this will panic if the receiving end has hung up. Better to ignore/handle the error than panic.
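
A sketch of the non-panicking variant, reusing the pattern the PR later adopts for try_complete (the debug! message is illustrative):

fn no_response(self) {
    trace!(target: "on_demand", "Dropping a pending query (no reply)");
    // Ignore a send error instead of panicking: if the receiving end has
    // hung up, there is simply nobody left to notify.
    if self.sender.send(Vec::with_capacity(0)).is_err() {
        debug!(target: "on_demand", "Dropped oneshot channel receiver on no response");
    }
}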

self.receiver.poll().map(|async| async.map(T::extract_from))
match self.receiver.poll() {
Ok(Async::Ready(v)) => {
if v.len() == 0 {
Contributor

is_empty

@rphmeier
Contributor

I'm not sure the logic in this PR is robust against the number of peers changing over time, and it still seems that it may request from the same peer in that case. Why not just alter Pending to have an attempts_remaining: usize and an attempted: HashSet<PeerId>? When attempts_remaining == 0, return nothing. We can make the initial attempts_remaining value configurable by CLI or RPC. A reasonable default would be 10.
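
A sketch of the suggested shape; PeerId is aliased here for illustration (the real type comes from the network layer):

use std::collections::HashSet;

type PeerId = usize; // stand-in for the network layer's peer identifier

struct Pending {
    attempts_remaining: usize,  // starts at a configurable value, e.g. 10
    attempted: HashSet<PeerId>, // peers already queried for this request
}

impl Pending {
    // True if the peer should be queried: it has not been tried yet
    // and the attempt budget is not yet spent.
    fn try_peer(&mut self, peer: PeerId) -> bool {
        if self.attempts_remaining == 0 {
            return false; // the caller then returns nothing to the requester
        }
        if self.attempted.insert(peer) {
            self.attempts_remaining -= 1;
            true
        } else {
            false
        }
    }
}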

@cheme
Contributor Author

cheme commented Aug 13, 2018

Hi Rob, I am totally sure it is not robust against the number of peers changing (the third point in my initial PR description).
I did not want to store the queried peers; if you believe it is not an issue to store them for every OnDemand request, it would certainly be more robust.

@rphmeier
Contributor

Storing a HashSet of queried peers per request is probably fine, but if it turns out to be a performance issue in the future we can find some other heuristic.

pub const DEFAULT_NB_RETRY: usize = 10;

/// The default time limit in milliseconds for inactive (no new peer to connect to) OnDemand queries (0 for unlimited)
pub const DEFAULT_QUERY_TIME_LIMIT: u64 = 10000;
Collaborator

Use const Duration here instead!
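
Roughly what is being asked for, assuming a toolchain where Duration::from_millis can be used in const context (it was not a const fn on all toolchains at the time):

use std::time::Duration;

/// The default time limit for inactive (no new peer to connect to) OnDemand queries.
pub const DEFAULT_QUERY_TIME_LIMIT: Duration = Duration::from_millis(10_000);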

@@ -49,7 +50,14 @@ pub mod request;
/// The result of execution
pub type ExecutionResult = Result<Executed, ExecutionError>;

/// The default number of retry for OnDemand queries send to other nodes
Collaborator

The default number of retries for OnDemand queries to send to the other nodes

@@ -145,7 +157,9 @@ impl Pending {
// if the requests are complete, send the result and consume self.
fn try_complete(self) -> Option<Self> {
if self.requests.is_complete() {
let _ = self.sender.send(self.responses);
if self.sender.send(self.responses).is_err() {
debug!(target: "on_demand", "Dropped oneshot channel receiver on complet request");
Collaborator

complete?

@cheme
Contributor Author

cheme commented Aug 15, 2018

I added an error type to be able to have multiple errors:
{"jsonrpc":"2.0","error":{"code":-32042,"message":"On demand query limit reached on query #0"},"id":1} when all queries have occurred, so there are probably no results (for instance a wrong hash).
And {"jsonrpc":"2.0","error":{"code":-32065,"message":"Timeout for On demand query, remaining 4 query attempts on query #0"},"id":1} if the request was dropped but we still have some attempts left (and there was no new peer during the inactive delay): in this case we may suspect that no peer is serving the data.

It is also less hacky, as proper types are used and we no longer rely on an empty vec to carry an error.

In fact, it all depends on the parameters used.

Also, my choice of RPC codes may not be appropriate.

@5chdn 5chdn added this to the 2.1 milestone Aug 23, 2018
@5chdn
Contributor

5chdn commented Aug 30, 2018

What's the status of this?

@@ -1364,12 +1374,19 @@ struct Whisper {
pool_size: Option<usize>,
}

#[derive(Default, Debug, PartialEq, Deserialize)]
Collaborator

Make this Copy and Clone?

@@ -461,6 +595,16 @@ impl Handler for OnDemand {
None => return,
};

if responses.len() == 0 {
Collaborator

use is_empty() instead of len() == 0

@@ -180,7 +180,7 @@ fn no_capabilities() {

harness.inject_peer(peer_id, Peer {
status: dummy_status(),
capabilities: capabilities,
capabilities: capabilities.clone(),
Collaborator

You can remove the clone here, capabilities is Copy!

@@ -444,7 +445,41 @@ pub fn filter_block_not_found(id: BlockId) -> Error {
}
}

pub fn on_demand_error(err: OnDemandError) -> Error {
match err.kind() {
OnDemandErrorKind::ChannelCanceled(ref e) => return on_demand_cancel(e.clone()),
Collaborator

Needless return

Also, is it possible to take ownership of the `future::oneshot::Canceled` instead and not clone it?

Contributor

Should be possible with a match err { Error(OnDemandErrorKind::ChannelCanceled(e), _) => e

Contributor Author

Yes, thanks for the suggestion.
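
Putting both remarks together, a sketch of the ownership-taking match, assuming the Error(kind, state) tuple layout that error-chain generates (the fallback arm is a placeholder):

pub fn on_demand_error(err: OnDemandError) -> Error {
    match err {
        // Moving the kind out avoids both the needless return and the clone.
        OnDemandError(OnDemandErrorKind::ChannelCanceled(cancel), _) => on_demand_cancel(cancel),
        // Placeholder: the PR maps each remaining kind to its own RPC error.
        other => internal("on-demand error", format!("{:?}", other)),
    }
}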

pending.base_query_index = rand;
rand
} else {
pending.base_query_index + history_len
Contributor

perhaps make this an explicit wrapping add?

Contributor Author

I am not sure what you mean by an explicit wrapping add?

Collaborator

@niklasad1 niklasad1 Sep 4, 2018

I think Rob is referring to <integer>::wrapping_add(), see https://doc.rust-lang.org/std/primitive.usize.html#method.wrapping_add

Contributor Author

Oh, it's nice.
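
For reference, a small illustration of the suggestion (the function and parameter names here are mine):

use std::cmp;

// wrapping_add cannot overflow-panic in debug builds; the modulo then
// folds the index back into the current peer range.
fn next_start_index(base_query_index: usize, history_len: usize, num_peers: usize) -> usize {
    base_query_index.wrapping_add(history_len) % cmp::max(num_peers, 1)
}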

rand
} else {
pending.base_query_index + history_len
} % cmp::max(num_peers, 1);
Contributor

If we lose peers along the way we might end up re-querying some of them, but generally the peer set will be decently stable, so maybe not the worst thing.

Contributor Author

I also test the hashset insertion a few lines below, so it should be fine.

@@ -216,7 +216,7 @@ impl<T: LightChainClient + 'static> EthClient<T> {
};

fill_rich(block, score)
}).map_err(errors::on_demand_cancel)),
}).map_err(errors::on_demand_error)),
Collaborator

@niklasad1 niklasad1 Sep 4, 2018

I think these changes affect the error return type and need to be propagated, for example in https://github.com/paritytech/parity-ethereum/blob/master/rpc/src/v1/helpers/light_fetch.rs#L300-#L365!

I guess this is the reason for the compiler error.

Contributor Author

I resolved the merge conflict really badly previously, my fault.

debug!(target: "on_demand", "No more peer to query, waiting for {} seconds until dropping query", query_inactive_time_limit.as_secs());
pending.inactive_time_limit = Some(now + query_inactive_time_limit);
} else {
if now > pending.inactive_time_limit.unwrap() {
Contributor

We don't allow unwrap in the codebase. You should use expect with a proof that it is not going to panic. Maybe rewrite this block with an if let Some(x) = pending.inactive_time_limit?
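
A sketch of that rewrite, using the names from the quoted diff (surrounding logic elided):

if let Some(limit) = pending.inactive_time_limit {
    if now > limit {
        // The inactivity budget is spent: drop the query here
        // instead of unwrapping.
    }
} else {
    debug!(target: "on_demand", "No more peer to query, waiting for {} seconds until dropping query", query_inactive_time_limit.as_secs());
    pending.inactive_time_limit = Some(now + query_inactive_time_limit);
}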

Collaborator

@niklasad1 niklasad1 left a comment

I think this is good enough, especially as it drops the seemingly infinite requests which make the light client almost unusable!

We should investigate whether a priority queue is better than querying random peers, though it might put more load on good peers.

@5chdn 5chdn added A8-looksgood 🦄 Pull request is reviewed well. and removed A0-pleasereview 🤓 Pull request needs code review. labels Sep 6, 2018
Collaborator

@dvdplm dvdplm left a comment

A few questions and typos/formatting to fix.

@@ -49,7 +50,43 @@ pub mod request;
/// The result of execution
pub type ExecutionResult = Result<Executed, ExecutionError>;

/// The default number of retries for OnDemand queries to send to the other nodes
pub const DEFAULT_NB_RETRY: usize = 10;
Collaborator

I'd name this DEFAULT_RETRY_COUNT to be consistent.

}

errors {
#[doc = "Max number of on demand attempt reached without results for a query."]
Collaborator

"Max number of on-demand query attempts reached without result."

errors {
#[doc = "Max number of on demand attempt reached without results for a query."]
MaxAttemptReach(query_index: usize) {
description("On demand query limit reached")
Collaborator

s/On demand/On-demand/

#[doc = "Max number of on demand attempt reached without results for a query."]
MaxAttemptReach(query_index: usize) {
description("On demand query limit reached")
display("On demand query limit reached on query #{}", query_index)
Collaborator

s/On demand/On-demand/


#[doc = "No reply with current peer set, time out occured while waiting for new peers for additional query attempt."]
TimeoutOnNewPeers(query_index: usize, remaining_attempts: usize) {
description("Timeout for On demand query")
Collaborator

"On-demand query timed out"

@@ -147,8 +150,7 @@ impl LightFetch {

Either::B(self.send_requests(reqs, |res|
extract_header(&res, header_ref)
.expect("these responses correspond to requests that header_ref belongs to \
therefore it will not fail; qed")
.expect(WRONG_RESPONSE_AMOUNT_TYPE)
Collaborator

This proof was different before. Was it perhaps a mistake to change it?

@@ -51,6 +52,8 @@ use v1::types::{BlockNumber, CallRequest, Log, Transaction};

const NO_INVALID_BACK_REFS: &str = "Fails only on invalid back-references; back-references here known to be valid; qed";

const WRONG_RESPONSE_AMOUNT_TYPE: &str = "responses correspond directly with requests in amount and type; qed";
Collaborator

The convention seems to be to name these _PROOF.

Collaborator

Not sure why this was "amount" – I think this is better: "the number and type of responses is the same as the number of requests; qed"

// on-demand sender cancelled.
pub fn on_demand_cancel(_cancel: futures::sync::oneshot::Canceled) -> Error {
internal("on-demand sender cancelled", "")
}

pub fn max_attempt_reach(err: &OnDemandError) -> Error {
Collaborator

s/max_attempt_reach/max_attempts_reached/

let rng = rand::random::<usize>() % cmp::max(num_peers, 1);
for (peer_id, peer) in peers.iter().chain(peers.iter()).skip(rng).take(num_peers) {
let history_len = pending.query_id_history.len();
let start = if history_len == 0 {
Collaborator

Would peer_offset be a better name than start? I think .skip(start) below reads a bit weird.

…rs :

 - use standard '-' instead of '_'
 - renaming nb_retry params to 'on-demand-retry-count'
Member

@ordian ordian left a comment

I wonder whether we could use failsafe-rs as an implementation of the CircuitBreaker pattern instead of rolling our own version.

sender: oneshot::Sender<PendingResponse>,
base_query_index: usize,
remaining_query_count: usize,
query_id_history: BTreeSet<PeerId>,
Member

Any reason to use BTreeSet instead of HashSet? It seems we don't rely on the order of the ids, so HashSet might be more appropriate?

Contributor Author

I expect it to be a small set, and I generally favor a BTreeSet in those cases.

@cheme
Contributor Author

cheme commented Sep 7, 2018

Ordian, thanks for pointing out failsafe-rs, it seems like an interesting crate. At the time I was considering a very small change to fix this (not even a timeout); the other configurations were added afterward.

@rphmeier
Contributor

rphmeier commented Sep 7, 2018

Failsafe-rs seems like a fantastic way to proceed. Ideally we can make things like the number of attempts or timeouts user-configurable, although I would place that as a refactoring beyond the scope of this PR.

@5chdn 5chdn modified the milestones: 2.1, 2.2 Sep 11, 2018