
Allow dropping light client RPC query with no results #9318

Merged
merged 18 commits on Sep 12, 2018

Conversation

cheme
Contributor

@cheme cheme commented Aug 8, 2018

This PR helps with #9133.

I observed infinite requesting when sending an RPC call such as getTransactionByHash on the light client.

The underlying problem is that the light client queries random peers indefinitely until it gets a reply (the way the communication is done, closing the query does not end this).
In the case of a query with an incorrect hash, it will just query random peers forever.

In this PR I change this by limiting the number of queries to one per peer before failing; a sketch of the idea follows the list below.
It leaves a few open questions:

  • Should we also use a timeout mechanism for the case where there are not many peers? This PR does break the no_capability test; with a timeout mechanism the test would make sense again.
  • When failing I use the Cancelled error, resulting in {"jsonrpc":"2.0","error":{"code":-32603,"message":"Internal error occurred: on-demand sender cancelled","data":"\"\""},"id":1}. Is that acceptable?
  • If peers are added or removed during the process, some peers may not be queried, or may be queried twice (the hashmap iterator keeps its order, so it is not that bad). Obviously, I could store the previously queried peer ids, but I am not sure it is worth it.
  • When testing on Ropsten, there are definitely very few peers that can reply (one with full history). When an OnDemand query passes for one peer and not for the others, should we exclude the failing peers from the OnDemand list of peers?
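
A minimal sketch of the retry-capping idea, with simplified, hypothetical types (the actual Pending struct in this PR carries many more fields):

struct Pending {
    remaining_query_count: usize, // initialized to the number of known peers
}

impl Pending {
    // Called when a peer returned no usable reply. Consumes self once the
    // budget is exhausted, so the query errors out instead of looping forever.
    fn on_no_reply(mut self) -> Option<Self> {
        self.remaining_query_count = self.remaining_query_count.saturating_sub(1);
        if self.remaining_query_count == 0 { None } else { Some(self) }
    }
}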

cheme added 2 commits August 8, 2018 18:00
All peers known at the time will be queried, and the query fails if all
return no reply.
Returning the failure is done through an empty Vec of replies (the type
of the oneshot channel remains unchanged).
Before this commit the query was sent randomly to any peer until there
was a reply (for a query that got no result this was an issue; for other
queries it meant querying the same peers multiple times).
After this commit the first query is random, but subsequent queries
follow the hashmap iterator order.

Test no_capability was broken by this commit (the pending query was
removed). If some kind of timeout mechanism is added, it could be restored.
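
A sketch of the dispatch order this describes, mirroring the chain/skip/take pattern quoted later in the review (the function and its type parameters are illustrative):

use std::cmp;
use std::collections::HashMap;

// Walk the peer map once, starting from a random offset: the first query
// lands on a random peer, the following ones follow the map's iterator order.
fn query_order<'a, P>(peers: &'a HashMap<usize, P>, offset: usize) -> impl Iterator<Item = (&'a usize, &'a P)> + 'a {
    let num_peers = peers.len();
    peers.iter()
        .chain(peers.iter())
        .skip(offset % cmp::max(num_peers, 1))
        .take(num_peers)
}
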
@cheme cheme added the A0-pleasereview 🤓 Pull request needs code review. label Aug 8, 2018
@parity-cla-bot

It looks like @cheme signed our Contributor License Agreement. 👍

Many thanks,

Parity Technologies CLA Bot

@@ -81,6 +82,9 @@ struct Pending {
required_capabilities: Capabilities,
responses: Vec<Response>,
sender: oneshot::Sender<Vec<Response>>,
first_query: Option<usize>,
nb_query: usize,
Collaborator

What does nb and rem stand for? If nb is for "number", I'd suggest switching to query_count perhaps?

Contributor Author

Yes, not explicit enough. I can switch to:

  • first_query : base_query_index
  • nb_query : total_query_count
  • rem_query : remaining_query_count

@@ -180,6 +185,14 @@ impl Pending {
self.net_requests = builder.build();
self.required_capabilities = capabilities;
}

// return no response, will result in an error
// consume object (similar to drop when no reply found)
Collaborator

s/consume object/consumes self/ maybe?

Contributor Author

Best comment (object is definitely bad wording for Rust):
// returning no response, it will result in an error.
// self is consumed on purpose.
The comment is just there to point out that we do not want self to be used again (the oneshot channel has already been used).

Collaborator

Yep, I think the comment is useful here. I like your suggestion for an update.

@debris debris requested a review from rphmeier August 10, 2018 08:54
@debris debris added the M4-core ⛓ Core client code / Rust. label Aug 10, 2018
@@ -145,7 +149,8 @@ impl Pending {
// if the requests are complete, send the result and consume self.
fn try_complete(self) -> Option<Self> {
if self.requests.is_complete() {
let _ = self.sender.send(self.responses);
self.sender.send(self.responses).map_err(|_|())
.expect("Non used one shot channel");
Contributor

I don't think it should panic here.

fn no_response(self) {
trace!(target: "on_demand", "Dropping a pending query (no reply)");
self.sender.send(Vec::with_capacity(0)).map_err(|_|())
.expect("Non used one shot channel");
Contributor

I think this will panic if the receiving end has hung up. Better to ignore/handle the error than panic.
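
A sketch of the non-panicking variant, reusing the pattern the PR later adopts for try_complete (the debug! message is illustrative):

fn no_response(self) {
    trace!(target: "on_demand", "Dropping a pending query (no reply)");
    // Ignore a send error instead of panicking: if the receiving end has
    // hung up, there is simply nobody left to notify.
    if self.sender.send(Vec::with_capacity(0)).is_err() {
        debug!(target: "on_demand", "Dropped oneshot channel receiver on no response");
    }
}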

self.receiver.poll().map(|async| async.map(T::extract_from))
match self.receiver.poll() {
Ok(Async::Ready(v)) => {
if v.len() == 0 {
Contributor

is_empty

@rphmeier
Contributor

I'm not sure the logic in this PR is robust against the number of peers changing over time, and it still seems that it may request from the same peer in that case. Why not just alter Pending to have an attempts_remaining: usize and an attempted: HashSet<PeerId>? When attempts_remaining == 0, return nothing. We can make the initial attempts_remaining value configurable by CLI or RPC. A reasonable default would be 10.
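
A sketch of the suggested shape; PeerId is aliased here for illustration (the real type comes from the network layer):

use std::collections::HashSet;

type PeerId = usize; // stand-in for the network layer's peer identifier

struct Pending {
    attempts_remaining: usize,  // starts at a configurable value, e.g. 10
    attempted: HashSet<PeerId>, // peers already queried for this request
}

impl Pending {
    // True if the peer should be queried: it has not been tried yet
    // and the attempt budget is not yet spent.
    fn try_peer(&mut self, peer: PeerId) -> bool {
        if self.attempts_remaining == 0 {
            return false; // the caller then returns nothing to the requester
        }
        if self.attempted.insert(peer) {
            self.attempts_remaining -= 1;
            true
        } else {
            false
        }
    }
}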

@cheme
Contributor Author

cheme commented Aug 13, 2018

Hi Rob, I am totally sure it is not robust against the number of peers changing (the third point in my initial PR description).
I did not want to store the queried peers; if you believe it is not an issue to store them for every OnDemand request, it would certainly be more robust.

@rphmeier
Contributor

Storing a HashSet of queried peers per request is probably fine, but if it turns out to be a performance issue in the future we can find some other heuristic.

pub const DEFAULT_NB_RETRY: usize = 10;

/// The default time limit in milliseconds for inactive (no new peer to connect to) OnDemand queries (0 for unlimited)
pub const DEFAULT_QUERY_TIME_LIMIT: u64 = 10000;
Collaborator

Use const Duration here instead!
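
Roughly what is being asked for, assuming a toolchain where Duration::from_millis can be used in const context (it was not a const fn on all toolchains at the time):

use std::time::Duration;

/// The default time limit for inactive (no new peer to connect to) OnDemand queries.
pub const DEFAULT_QUERY_TIME_LIMIT: Duration = Duration::from_millis(10_000);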

@@ -49,7 +50,14 @@ pub mod request;
/// The result of execution
pub type ExecutionResult = Result<Executed, ExecutionError>;

/// The default number of retry for OnDemand queries send to other nodes
Collaborator

The default number of retries for OnDemand queries to send to the other nodes

@@ -145,7 +157,9 @@ impl Pending {
// if the requests are complete, send the result and consume self.
fn try_complete(self) -> Option<Self> {
if self.requests.is_complete() {
let _ = self.sender.send(self.responses);
if self.sender.send(self.responses).is_err() {
debug!(target: "on_demand", "Dropped oneshot channel receiver on complet request");
Collaborator

complete?

@cheme
Contributor Author

cheme commented Aug 15, 2018

I added an error type to be able to have multiple errors:
{"jsonrpc":"2.0","error":{"code":-32042,"message":"On demand query limit reached on query #0"},"id":1} when all queries have occurred, so there are probably no results (for instance a wrong hash).
And {"jsonrpc":"2.0","error":{"code":-32065,"message":"Timeout for On demand query, remaining 4 query attempts on query #0"},"id":1} if the request was dropped but we still have some attempts left (and there was no new peer during the inactive delay): in this case we may suspect that no peer is serving the data.

It is also less hacky, as proper types are used and we no longer rely on an empty vec to carry an error.

In fact, it all depends on the parameters used.

Also, my choice of RPC codes may not be appropriate.

@5chdn 5chdn added this to the 2.1 milestone Aug 23, 2018
@5chdn
Contributor

5chdn commented Aug 30, 2018

What's the status of this?

@@ -1364,12 +1374,19 @@ struct Whisper {
pool_size: Option<usize>,
}

#[derive(Default, Debug, PartialEq, Deserialize)]
Collaborator

Make this Copy and Clone?

@@ -461,6 +595,16 @@ impl Handler for OnDemand {
None => return,
};

if responses.len() == 0 {
Collaborator

use is_empty() instead of len() == 0

@@ -180,7 +180,7 @@ fn no_capabilities() {

harness.inject_peer(peer_id, Peer {
status: dummy_status(),
capabilities: capabilities,
capabilities: capabilities.clone(),
Collaborator

You can remove the clone here, capabilities is Copy!

@@ -444,7 +445,41 @@ pub fn filter_block_not_found(id: BlockId) -> Error {
}
}

pub fn on_demand_error(err: OnDemandError) -> Error {
match err.kind() {
OnDemandErrorKind::ChannelCanceled(ref e) => return on_demand_cancel(e.clone()),
Collaborator

Needless return

Also, is it possible to take ownership of the `future::oneshot::Canceled` instead and not clone it?

Contributor

Should be possible with a match err { Error(OnDemandErrorKind::ChannelCanceled(e), _) => e

Contributor Author

Yes, thanks for the suggestion.
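
Putting both remarks together, a sketch of the ownership-taking match, assuming the Error(kind, state) tuple layout that error-chain generates (the fallback arm is a placeholder):

pub fn on_demand_error(err: OnDemandError) -> Error {
    match err {
        // Moving the kind out avoids both the needless return and the clone.
        OnDemandError(OnDemandErrorKind::ChannelCanceled(cancel), _) => on_demand_cancel(cancel),
        // Placeholder: the PR maps each remaining kind to its own RPC error.
        other => internal("on-demand error", format!("{:?}", other)),
    }
}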

pending.base_query_index = rand;
rand
} else {
pending.base_query_index + history_len
Contributor

perhaps make this an explicit wrapping add?

Contributor Author

I am not sure what you mean by an explicit wrapping add?

Collaborator

@niklasad1 niklasad1 Sep 4, 2018

I think Rob is referring to <integer>::wrapping_add(), see https://doc.rust-lang.org/std/primitive.usize.html#method.wrapping_add

Contributor Author

Oh, it's nice.
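
For reference, a small illustration of the suggestion (the function and parameter names here are mine):

use std::cmp;

// wrapping_add cannot overflow-panic in debug builds; the modulo then
// folds the index back into the current peer range.
fn next_start_index(base_query_index: usize, history_len: usize, num_peers: usize) -> usize {
    base_query_index.wrapping_add(history_len) % cmp::max(num_peers, 1)
}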

rand
} else {
pending.base_query_index + history_len
} % cmp::max(num_peers, 1);
Contributor

If we lose peers along the way we might end up re-querying some of them, but generally the peer set will be decently stable, so maybe not the worst thing.

Contributor Author

I also test the hashset insertion a few lines below, so it should be fine.

@@ -216,7 +216,7 @@ impl<T: LightChainClient + 'static> EthClient<T> {
};

fill_rich(block, score)
}).map_err(errors::on_demand_cancel)),
}).map_err(errors::on_demand_error)),
Collaborator

@niklasad1 niklasad1 Sep 4, 2018

I think these changes affect the error return type and need to be propagated, for example in https://github.com/paritytech/parity-ethereum/blob/master/rpc/src/v1/helpers/light_fetch.rs#L300-#L365!

I guess this is the reason for the compiler error.

Contributor Author

I resolved the merge conflict really badly previously, my fault.

debug!(target: "on_demand", "No more peer to query, waiting for {} seconds until dropping query", query_inactive_time_limit.as_secs());
pending.inactive_time_limit = Some(now + query_inactive_time_limit);
} else {
if now > pending.inactive_time_limit.unwrap() {
Contributor

We don't allow unwrap in the codebase. You should use expect with a proof that it is not going to panic. Maybe rewrite this block with an if let Some(x) = pending.inactive_time_limit?
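
A sketch of that rewrite, using the names from the quoted diff (surrounding logic elided):

if let Some(limit) = pending.inactive_time_limit {
    if now > limit {
        // The inactivity budget is spent: drop the query here
        // instead of unwrapping.
    }
} else {
    debug!(target: "on_demand", "No more peer to query, waiting for {} seconds until dropping query", query_inactive_time_limit.as_secs());
    pending.inactive_time_limit = Some(now + query_inactive_time_limit);
}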

Collaborator

@niklasad1 niklasad1 left a comment

I think this is good enough, especially as it drops the seemingly infinite requests which make the light client almost unusable!

We should investigate whether a priority queue is better than querying random peers, though it might put more load on good peers.

@5chdn 5chdn added A8-looksgood 🦄 Pull request is reviewed well. and removed A0-pleasereview 🤓 Pull request needs code review. labels Sep 6, 2018
Collaborator

@dvdplm dvdplm left a comment

A few questions and typos/formatting to fix.

@@ -49,7 +50,43 @@ pub mod request;
/// The result of execution
pub type ExecutionResult = Result<Executed, ExecutionError>;

/// The default number of retries for OnDemand queries to send to the other nodes
pub const DEFAULT_NB_RETRY: usize = 10;
Collaborator

I'd name this DEFAULT_RETRY_COUNT to be consistent.

}

errors {
#[doc = "Max number of on demand attempt reached without results for a query."]
Collaborator

"Max number of on-demand query attempts reached without result."

errors {
#[doc = "Max number of on demand attempt reached without results for a query."]
MaxAttemptReach(query_index: usize) {
description("On demand query limit reached")
Collaborator

s/On demand/On-demand/

#[doc = "Max number of on demand attempt reached without results for a query."]
MaxAttemptReach(query_index: usize) {
description("On demand query limit reached")
display("On demand query limit reached on query #{}", query_index)
Collaborator

s/On demand/On-demand/


#[doc = "No reply with current peer set, time out occured while waiting for new peers for additional query attempt."]
TimeoutOnNewPeers(query_index: usize, remaining_attempts: usize) {
description("Timeout for On demand query")
Collaborator

"On-demand query timed out"

@@ -147,8 +150,7 @@ impl LightFetch {

Either::B(self.send_requests(reqs, |res|
extract_header(&res, header_ref)
.expect("these responses correspond to requests that header_ref belongs to \
therefore it will not fail; qed")
.expect(WRONG_RESPONSE_AMOUNT_TYPE)
Collaborator

This proof was different before. Was it perhaps a mistake to change it?

@@ -51,6 +52,8 @@ use v1::types::{BlockNumber, CallRequest, Log, Transaction};

const NO_INVALID_BACK_REFS: &str = "Fails only on invalid back-references; back-references here known to be valid; qed";

const WRONG_RESPONSE_AMOUNT_TYPE: &str = "responses correspond directly with requests in amount and type; qed";
Collaborator

The convention seems to be to name these _PROOF.

Collaborator

Not sure why this was "amount" – I think this is better: "the number and type of responses is the same as the number of requests; qed"

// on-demand sender cancelled.
pub fn on_demand_cancel(_cancel: futures::sync::oneshot::Canceled) -> Error {
internal("on-demand sender cancelled", "")
}

pub fn max_attempt_reach(err: &OnDemandError) -> Error {
Collaborator

s/max_attempt_reach/max_attempts_reached/

let rng = rand::random::<usize>() % cmp::max(num_peers, 1);
for (peer_id, peer) in peers.iter().chain(peers.iter()).skip(rng).take(num_peers) {
let history_len = pending.query_id_history.len();
let start = if history_len == 0 {
Collaborator

Would peer_offset be a better name than start? I think .skip(start) below reads a bit weird.

…rs :

 - use standard '-' instead of '_'
 - renaming nb_retry params to 'on-demand-retry-count'
Member

@ordian ordian left a comment

I wonder whether we could use failsafe-rs as an implementation of the CircuitBreaker pattern instead of rolling our own version.

sender: oneshot::Sender<PendingResponse>,
base_query_index: usize,
remaining_query_count: usize,
query_id_history: BTreeSet<PeerId>,
Member

Any reason to use BTreeSet instead of HashSet? It seems we don't rely on the order of the ids, so HashSet might be more appropriate?

Contributor Author

I expect it to be a small set, and I generally favor a BTreeSet in those cases.

@cheme
Contributor Author

cheme commented Sep 7, 2018

Ordian, thanks for pointing out failsafe-rs, it seems like an interesting crate. At the time I was considering a very small change to fix this (not even a timeout); the other configurations were added afterward.

@rphmeier
Contributor

rphmeier commented Sep 7, 2018

Failsafe-rs seems like a fantastic way to proceed. Ideally we can make things like the number of attempts or timeouts user-configurable, although I would place that as a refactoring beyond the scope of this PR.

@5chdn 5chdn modified the milestones: 2.1, 2.2 Sep 11, 2018