Allow dropping light client RPC query with no results #9318
Conversation
All peers known at the time will be queried, and the query fails if all of them return no reply. Returning the failure is done through an empty Vec of replies (the type of the oneshot channel remains unchanged). Before this commit the query was sent randomly to any peer until there was a reply (for a query that got no result this was an issue; for other queries it meant querying the same peers multiple times). After this commit the first query is random, but the next queries follow the hashmap iterator order. Test no_capability was broken by this commit (the pending query was removed). If some kind of timeout mechanism is added, it could be restored.
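The policy described above can be sketched as follows. This is a simplified, hypothetical model of the dispatch loop (names and types are illustrative, not the PR's actual API): start at one peer, walk the peer list in a fixed order, and fail with an empty Vec once every peer has been tried exactly once.

```rust
// Hypothetical sketch: query each known peer once, starting at `start`,
// returning the first reply, or an empty Vec when every peer was tried.
fn query_peers<F>(peers: &[u64], start: usize, mut try_peer: F) -> Vec<u8>
where
    F: FnMut(u64) -> Option<Vec<u8>>,
{
    let n = peers.len();
    for i in 0..n {
        let peer = peers[(start + i) % n];
        if let Some(reply) = try_peer(peer) {
            return reply; // first successful reply wins
        }
    }
    Vec::new() // every peer was tried once: signal failure with an empty Vec
}

fn main() {
    let peers = [10u64, 11, 12];
    // Only peer 12 answers; starting at index 0 we still reach it.
    let reply = query_peers(&peers, 0, |p| if p == 12 { Some(vec![1u8]) } else { None });
    assert_eq!(reply, vec![1u8]);
    // No peer answers: the query fails instead of retrying forever.
    assert!(query_peers(&peers, 1, |_| None).is_empty());
}
```

The key difference from the old behavior is the bounded loop: a query with no obtainable result terminates after one pass over the peer set.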
It looks like @cheme signed our Contributor License Agreement. 👍 Many thanks, Parity Technologies CLA Bot
ethcore/light/src/on_demand/mod.rs
Outdated
@@ -81,6 +82,9 @@ struct Pending {
    required_capabilities: Capabilities,
    responses: Vec<Response>,
    sender: oneshot::Sender<Vec<Response>>,
    first_query: Option<usize>,
    nb_query: usize,
What does nb and rem stand for? If nb is for "number", I'd suggest switching to query_count perhaps?
Yes, not explicit enough. I can switch to:
- first_query: base_query_index
- nb_query: total_query_count
- rem_query: remaining_query_count
ethcore/light/src/on_demand/mod.rs
Outdated
@@ -180,6 +185,14 @@ impl Pending {
    self.net_requests = builder.build();
    self.required_capabilities = capabilities;
}

// return no response, will result in an error
// consume object (similar to drop when no reply found)
s/consume object/consumes self/ maybe?
Better comment ("object" is definitely bad wording for Rust):
// returning no response, it will result in an error.
// self is consumed on purpose.
The comment is just there to point out that we do not want self to be used again (the oneshot channel is already used).
Yep, I think the comment is useful here. I like your suggestion for an update.
ethcore/light/src/on_demand/mod.rs
Outdated
@@ -145,7 +149,8 @@ impl Pending {
    // if the requests are complete, send the result and consume self.
    fn try_complete(self) -> Option<Self> {
        if self.requests.is_complete() {
            let _ = self.sender.send(self.responses);
            self.sender.send(self.responses).map_err(|_|())
                .expect("Non used one shot channel");
I don't think it should panic here.
ethcore/light/src/on_demand/mod.rs
Outdated
fn no_response(self) {
    trace!(target: "on_demand", "Dropping a pending query (no reply)");
    self.sender.send(Vec::with_capacity(0)).map_err(|_|())
        .expect("Non used one shot channel");
I think this will panic if the receiving end has hung up. Better to ignore/handle the error than panic.
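The reviewer's point can be sketched with std's mpsc channel standing in for the futures oneshot (the real code uses `futures::sync::oneshot`; the error-on-hangup behavior is analogous): when the receiver has been dropped, send returns Err, and the sender should log or ignore it rather than unwrap.

```rust
use std::sync::mpsc;

// Sketch: if the receiving end hung up, `send` fails; discard the responses
// instead of panicking.
fn send_or_ignore(tx: mpsc::Sender<Vec<u8>>, responses: Vec<u8>) {
    if tx.send(responses).is_err() {
        // Receiver dropped; nothing useful to do but note it.
        eprintln!("receiver dropped, discarding responses");
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    drop(rx); // simulate the RPC caller going away
    send_or_ignore(tx, vec![1, 2, 3]); // must not panic
}
```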
ethcore/light/src/on_demand/mod.rs
Outdated
self.receiver.poll().map(|async| async.map(T::extract_from))
match self.receiver.poll() {
    Ok(Async::Ready(v)) => {
        if v.len() == 0 {
is_empty
I'm not sure the logic in this PR is robust against the amount of peers changing over time, and it still seems that it may request from the same peer in that case. Why not just alter
Hi Rob, I am totally sure it is not robust against the amount of peers changing (the third point in my initial PR).
Storing a
on demand query in light client.
set number of queries to the minimum of the current peer count and the configured number of queries: that way the capability test was restored.
…into light-noreply
ethcore/light/src/on_demand/mod.rs
Outdated
pub const DEFAULT_NB_RETRY: usize = 10;

/// The default time limit in milliseconds for inactive (no new peer to connect to) OnDemand queries (0 for unlimited)
pub const DEFAULT_QUERY_TIME_LIMIT: u64 = 10000;
Use a const Duration here instead!
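What the reviewer suggests, as a sketch: storing the limit as a `Duration` keeps the unit out of the constant's name and the call sites. `Duration::from_millis` is a const fn, so the constant compiles as-is.

```rust
use std::time::Duration;

// Sketch of the suggested change: a const Duration instead of a bare u64 of
// milliseconds.
const DEFAULT_QUERY_TIME_LIMIT: Duration = Duration::from_millis(10_000);

fn main() {
    assert_eq!(DEFAULT_QUERY_TIME_LIMIT.as_secs(), 10);
}
```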
ethcore/light/src/on_demand/mod.rs
Outdated
@@ -49,7 +50,14 @@ pub mod request;
/// The result of execution
pub type ExecutionResult = Result<Executed, ExecutionError>;

/// The default number of retry for OnDemand queries send to other nodes
The default number of retries for OnDemand queries to send to the other nodes
ethcore/light/src/on_demand/mod.rs
Outdated
@@ -145,7 +157,9 @@ impl Pending {
    // if the requests are complete, send the result and consume self.
    fn try_complete(self) -> Option<Self> {
        if self.requests.is_complete() {
            let _ = self.sender.send(self.responses);
            if self.sender.send(self.responses).is_err() {
                debug!(target: "on_demand", "Dropped oneshot channel receiver on complet request");
complete?
reported at RPC level: the choice of RPC error code might not be right.
…into light-noreply
I added an error type to be able to have multiple errors: it is also less hacky, as proper types are used and we no longer rely on an empty vec to carry an error. In fact it all depends on the parameter used. Also my choice of RPC codes may not be appropriate.
What's the status of this?
@@ -1364,12 +1374,19 @@ struct Whisper {
    pool_size: Option<usize>,
}

#[derive(Default, Debug, PartialEq, Deserialize)]
Make this Copy and Clone?
ethcore/light/src/on_demand/mod.rs
Outdated
@@ -461,6 +595,16 @@ impl Handler for OnDemand {
    None => return,
};

if responses.len() == 0 {
Use is_empty() instead of len() == 0.
ethcore/light/src/on_demand/tests.rs
Outdated
@@ -180,7 +180,7 @@ fn no_capabilities() {

harness.inject_peer(peer_id, Peer {
    status: dummy_status(),
    capabilities: capabilities,
    capabilities: capabilities.clone(),
You can remove clone here, capabilities is Copy!
rpc/src/v1/helpers/errors.rs
Outdated
@@ -444,7 +445,41 @@ pub fn filter_block_not_found(id: BlockId) -> Error {
    }
}

pub fn on_demand_error(err: OnDemandError) -> Error {
    match err.kind() {
        OnDemandErrorKind::ChannelCanceled(ref e) => return on_demand_cancel(e.clone()),
Needless return. Also, is it possible to take ownership of the `future::oneshot::Canceled` instead and not clone it?
Should be possible with a match err { Error(OnDemandErrorKind::ChannelCanceled(e), _) => e
Yes, thanks for the suggestion.
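The ownership point above can be illustrated with stand-in types (these are not the error-chain types from the PR): matching on the error's data lets the inner value be moved out, so no clone is needed.

```rust
// Stand-in for futures::sync::oneshot::Canceled.
#[derive(Debug, PartialEq)]
struct Canceled;

enum ErrorKind {
    ChannelCanceled(Canceled),
    Other,
}

// Stand-in for an error-chain style Error wrapping an ErrorKind.
struct Error(ErrorKind);

// Consumes the error and moves the Canceled value out instead of cloning it.
fn take_cancellation(err: Error) -> Option<Canceled> {
    match err {
        Error(ErrorKind::ChannelCanceled(e)) => Some(e), // moved, not cloned
        Error(ErrorKind::Other) => None,
    }
}

fn main() {
    assert_eq!(take_cancellation(Error(ErrorKind::ChannelCanceled(Canceled))), Some(Canceled));
    assert!(take_cancellation(Error(ErrorKind::Other)).is_none());
}
```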
    pending.base_query_index = rand;
    rand
} else {
    pending.base_query_index + history_len
perhaps make this an explicit wrapping add?
I am not sure what you mean by an explicit wrapping add?
I think Rob is referring to <integer>::wrapping_add(), see https://doc.rust-lang.org/std/primitive.usize.html#method.wrapping_add
Oh, it's nice.
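The method in question: `wrapping_add` performs modular arithmetic instead of panicking on overflow in debug builds, which makes the intended wrap-around in `base_query_index + history_len` explicit.

```rust
fn main() {
    // Overflow wraps around instead of panicking in debug builds.
    assert_eq!(usize::MAX.wrapping_add(1), 0);
    // Non-overflowing additions are unaffected.
    assert_eq!(5usize.wrapping_add(3), 8);
}
```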
    rand
} else {
    pending.base_query_index + history_len
} % cmp::max(num_peers, 1);
If we lose peers along the way we might end up re-querying some of them, but generally the peer set will be decently stable, so maybe not the worst thing.
I also did a test on hashset insertion a few lines later, so it should be fine.
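The deduplication mentioned above works because set insertion reports whether the id was new. As a hedged sketch (the helper name is illustrative): `insert` returns false for a peer already recorded, so an already-queried peer can be skipped even if the index arithmetic lands on it again.

```rust
use std::collections::HashSet;

// Sketch: dispatch each candidate peer at most once, using the set's insert
// result to detect repeats.
fn dedupe_peers(candidates: &[u64]) -> Vec<u64> {
    let mut queried: HashSet<u64> = HashSet::new();
    let mut dispatched = Vec::new();
    for &peer in candidates {
        if queried.insert(peer) {
            dispatched.push(peer); // first time this peer comes up
        }
    }
    dispatched
}

fn main() {
    assert_eq!(dedupe_peers(&[3, 7, 3, 9, 7]), vec![3, 7, 9]);
}
```

`BTreeSet::insert` has the same contract, so the check is unchanged whichever set type the field ends up using.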
@@ -216,7 +216,7 @@ impl<T: LightChainClient + 'static> EthClient<T> {
};

    fill_rich(block, score)
}).map_err(errors::on_demand_cancel)),
}).map_err(errors::on_demand_error)),
I think these changes will affect the error return type and need changes elsewhere, for example in https://github.com/paritytech/parity-ethereum/blob/master/rpc/src/v1/helpers/light_fetch.rs#L300-#L365! I guess this is the reason for the compiler error.
I resolved the merge conflict really badly previously, my fault.
ethcore/light/src/on_demand/mod.rs
Outdated
debug!(target: "on_demand", "No more peer to query, waiting for {} seconds until dropping query", query_inactive_time_limit.as_secs());
    pending.inactive_time_limit = Some(now + query_inactive_time_limit);
} else {
    if now > pending.inactive_time_limit.unwrap() {
We don't allow unwrap in the codebase. You should use expect with a proof that it is not going to panic. Maybe rewrite this block with an if let Some(x) = pending.inactive_time_limit?
I think this is good enough, especially as it drops the seemingly infinite requests which make it almost unusable!
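The suggested rewrite can be sketched like this (plain u64 timestamps stand in for the Instant values in the PR): branching on the Option with if let removes the unwrap entirely.

```rust
// Sketch: decide whether to drop a pending query, without unwrap.
fn should_drop(now: u64, inactive_time_limit: Option<u64>) -> bool {
    if let Some(limit) = inactive_time_limit {
        now > limit
    } else {
        false // no limit recorded yet, keep the query pending
    }
}

fn main() {
    assert!(!should_drop(5, None));
    assert!(!should_drop(5, Some(10)));
    assert!(should_drop(11, Some(10)));
}
```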
We should investigate whether a priority queue is better than querying random peers, but it might put more load on good peers.
A few questions and typos/formatting to fix.
ethcore/light/src/on_demand/mod.rs
Outdated
@@ -49,7 +50,43 @@ pub mod request;
/// The result of execution
pub type ExecutionResult = Result<Executed, ExecutionError>;

/// The default number of retries for OnDemand queries to send to the other nodes
pub const DEFAULT_NB_RETRY: usize = 10;
I'd name this DEFAULT_RETRY_COUNT to be consistent.
ethcore/light/src/on_demand/mod.rs
Outdated
}

errors {
    #[doc = "Max number of on demand attempt reached without results for a query."]
"Max number of on-demand query attempts reached without result."
ethcore/light/src/on_demand/mod.rs
Outdated
errors {
    #[doc = "Max number of on demand attempt reached without results for a query."]
    MaxAttemptReach(query_index: usize) {
        description("On demand query limit reached")
s/On demand/On-demand/
ethcore/light/src/on_demand/mod.rs
Outdated
#[doc = "Max number of on demand attempt reached without results for a query."]
MaxAttemptReach(query_index: usize) {
    description("On demand query limit reached")
    display("On demand query limit reached on query #{}", query_index)
s/On demand/On-demand/
ethcore/light/src/on_demand/mod.rs
Outdated
#[doc = "No reply with current peer set, time out occured while waiting for new peers for additional query attempt."]
TimeoutOnNewPeers(query_index: usize, remaining_attempts: usize) {
    description("Timeout for On demand query")
"On-demand query timed out"
rpc/src/v1/helpers/light_fetch.rs
Outdated
@@ -147,8 +150,7 @@ impl LightFetch {

Either::B(self.send_requests(reqs, |res|
    extract_header(&res, header_ref)
    .expect("these responses correspond to requests that header_ref belongs to \
        therefore it will not fail; qed")
    .expect(WRONG_RESPONSE_AMOUNT_TYPE)
This proof was different before. Was it perhaps a mistake to change it?
rpc/src/v1/helpers/light_fetch.rs
Outdated
@@ -51,6 +52,8 @@ use v1::types::{BlockNumber, CallRequest, Log, Transaction};

const NO_INVALID_BACK_REFS: &str = "Fails only on invalid back-references; back-references here known to be valid; qed";

const WRONG_RESPONSE_AMOUNT_TYPE: &str = "responses correspond directly with requests in amount and type; qed";
The convention seems to be to name these _PROOF.
Not sure why this was "amount" – I think this is better: "the number and type of responses is the same as the number of requests; qed"
rpc/src/v1/helpers/errors.rs
Outdated
// on-demand sender cancelled.
pub fn on_demand_cancel(_cancel: futures::sync::oneshot::Canceled) -> Error {
    internal("on-demand sender cancelled", "")
}

pub fn max_attempt_reach(err: &OnDemandError) -> Error {
s/max_attempt_reach/max_attempts_reached/
ethcore/light/src/on_demand/mod.rs
Outdated
let rng = rand::random::<usize>() % cmp::max(num_peers, 1);
for (peer_id, peer) in peers.iter().chain(peers.iter()).skip(rng).take(num_peers) {
    let history_len = pending.query_id_history.len();
    let start = if history_len == 0 {
Would peer_offset be a better name than start? I think .skip(start) below reads a bit weird.
…rs : - use standard '-' instead of '_' - renaming nb_retry params to 'on-demand-retry-count'
I wonder whether we could use failsafe-rs as an implementation of the CircuitBreaker pattern instead of rolling our own version.
sender: oneshot::Sender<PendingResponse>,
base_query_index: usize,
remaining_query_count: usize,
query_id_history: BTreeSet<PeerId>,
Any reason to use BTreeSet instead of HashSet? We don't rely on id order as it seems, so HashSet would maybe be more appropriate?
I expect it to be a small set, and I generally favor a BTreeSet in those cases.
Ordian, thanks for pointing out
Failsafe-rs seems like a fantastic way to proceed. Ideally we can make things like # of attempts or timeouts user-configurable. Although I would place that as a refactoring beyond the scope of this PR.
This PR helps with #9133.
I did observe infinite requesting when sending an RPC call like getTransactionByHash on the light client.
The initial problem is that the light client queries random peers indefinitely until it gets a reply (the way the communication is done, closing the query does not end this).
For a query with an incorrect hash it will simply query random peers forever.
In this PR I change this by limiting the number of queries to one per peer, before failing.
It leaves a few open questions:
- the no_capability test: with a timeout mechanism the test would make sense again.
- the Cancelled error resulting in {"jsonrpc":"2.0","error":{"code":-32603,"message":"Internal error occurred: on-demand sender cancelled","data":"\"\""},"id":1}, is it acceptable?