-
Notifications
You must be signed in to change notification settings - Fork 377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scenario balance sum flakiness #5078
Comments
I can pick this up. I think this is due to some latency for the assertion. If all payments go through, the sum of all channels of node 0 must be exactly that. |
The way to reproduce this issue is by making sure:
|
Small things to add:
|
That seems to already be the case right? raiden/raiden/tests/scenarios/bf1_basic_functionality.yaml Lines 43 to 52 in 4bed4d7
|
yes |
In order to be able to reproduce and debug this I need releases for raiden, contracts and services that are on the same page and use same contracts. At the moment and since Friday afternoon that is not the case and running this scenario (and I guess any other) with latest develop/dev fails with:
|
Now I am blocked by the PFS not being able to connect to matrix and as such not being able to function. This in turn makes Raiden fail to start ...
|
now it should work again |
but you might be blocked by the matrix presence bug anyway ... |
@Dominik1999 How does the presence bug manifest in a scenario? Route not found I guess? Because this is what I am getting when I run it now. |
exactly |
there must be a route, the scenario is designed like that. It can also manifest in an expired transfer since the Matrix bug is also affecting the local view of the nodes ... they start sending the transfer, but it gets stuck |
As per our talk with @rakanalh during standup I set all transports to server 01 instead of 04 and ran again. Now the scenario failed in a different place:
Is this a new thing that I should investigate or related to what you guys are doing? |
ok if it helps, I had this fail now
All transfers are green but the balance check fails even after 100s (so 31 transfers succeeded but the balance was not updated) |
Now with a second try it failed with an even weirder value |
Some thoughts after analyzing Dominik's logs@Dominik1999 @rakanalh I think we see two different issues here. I took a look at the logs given by Dominik in the comment just above this one here. Looking at past logs and the issue description the error always seems to happen at this line. Let's see what happens in this scenario. The network layout looks like this:
The only payments going out from node 0 are:
The logs of But the scenario logs fail. The error above is: expected: 1_995_000_000_000_000_000 That's really weird since after sending 100 payments the balance should not go down. The assertion uses the following scenario player task: which should iterate all channel states so we should see the sum of all balances for 0, both from channel 0-1 and channel 0-4. Issue 1Which brings me to issue 1. I think the scenario player has a bug in that assertion that it can sometimes fail and not take an entire channel into account. It's very easy to confirm by taking the DB of node 0 from the logs above and checking the balances.
From this I can deduce that the scenario player has a bug at getting the balance sum of all channels of a node. Somehow it can omit to take into account balances of all channels. Issue 2From above , even if you add both balances That's I suspect something is wrong there. But ... with all the changes in fees ... and fees schedules are the amounts set as expected in the scenario correct? |
AddendumHmm wait there is more. The scenarios' expected balances seem wrong to me at exactly the point of the failure
With the above network layout in mind let's look at the timeline of the scenario for Node 0. T0 - Start
T1 - Make 10 transfers from 3 to 0
Scenario assertion for Node0's total balance at time
T2 - Make payments from 0 to 2 after withdraw
Scenario assertion for Node0's total balance at time
T3 - Make 100 payments from 0 to 3
Scenario assertion for Node0's total balance at time
T4 - Make 10 payments from 0 to 3 after restart
Scenario assertion for Node0's total balance at time
So far so good ... The problem!!The assertion for Node0's balance at Sure enough raiden/raiden/tests/scenarios/bf1_basic_functionality.yaml Lines 208 to 221 in 88664b2
The channel 0-4 is closed before Which means the channel should not be counted in the balance calculation and only the channel 0-1 should be ... whose total balance should be SidenoteI am not sure if the transfers will pick the routes I picked for them during my scenario timeline creation. I made the assumption they would always go for the shortest route. |
So if the channel is pending close (or closed, not sure), that means that the API called here: Would be called and the closed channel would be returned along with the the response? If so, which is a more suitable solution?
|
As discussed in the standup the SP should be testing what happens in the user's case. And since the users also see the closed channels this should be returned and also tested in the SP. The question for last log of Dominik then becomes ... why was the closed channel not seen? Either an SP bug, or a timing issue in the scenario where there was a big pause inbetween and the channel ended up also being settled, in which case it did indeed get deleted and not returned in the channels list. |
that might be related
I can locally reprocude the error when all nodes use |
All right I managed to reproduce this in a local scenario run. I found out exactly what is happening but not sure why. What we know so farAs we have seen for far the scenario fails consistently here:
As we have seen above the big discrepancy in the scenario's node 0 balance is due to the channel 0-4 having been closed and settled. Either adjusting the scenario's expected balance or increasing the settlement timeout should fix this. But then the balance should have been My new runRight after closing channel 4 Assertion Failure From 3 to 0:
Node 0: Checking node 3's DatabaseAll Node 3's EventPaymentSentSuccess events have amount Checking node 0's DatabaseOffending payment received is at
It's sent from Node 3 and the Checking node 3's Database againThe
The route is 3 -> 2 -> 1 -> 0 as expected. But then checking the
The route states here have no fees ... but if you check the state change that generated it:
There they do have a fee of The remaining questionThis estimated fee is put in the initial Lines 82 to 100 in f317455
And it only seems to be a non zero number for the 99th out of the 100 payments of the
Check the post below to see that the PFS query fails, and we fall back to internal routing for that particular transfer. |
Checking all PFSFeeUpdates sent from nodesContinuing from the above findings I checked the Node 0
Node 1
Node 2
Node 3
Node 4
Checking conclusionBy checking the So the question seems to lie with the PFS as theorized in the previous post above. |
Reason why node 3 puts a fee on the 99th transferHehe jumped the gun above! The problem does indeed lie with the PFS but it does not ever return non-zero fees. The 99th request to the PFS fails and the node falls back to internal routing which adds a fee estimate because it can't ask the PFS. Here are the relevant log lines:
The exception raised (posted below in a more readable format) shows that the PFS read times out:
Next questionWhy does the 99th request to the PFS in this scenario time out? I may need help from @karlb here. |
I like your bug descriptions @LefterisJP it is a good read |
might be related raiden-network/raiden-services#550 |
Fix raiden-network#5078 When many requests are sent to the PFS at the same time and the pathfinding timeout is only 1 second https://github.com/raiden-network/raiden/blob/1ff6e6e22017b16a22fcef0e23535d5c41c617cc/raiden/constants.py#L153 then one of the requests does not make it within this one second window and is killed by the node before even reaching the PFS. Causing it to revert back to internal routing as explained [here](raiden-network#5078 (comment)). Increasing the request timeout fixes this.
We had a look at the PFS logs with @karlb. Turns out that for this particular request there is not even a log entry at the PFS side. That means that the request never made it to the PFS but timed out instead. The theory for that is that since many requests are sent to the PFS at the same time and the pathfinding timeout is only 1 second Line 153 in 1ff6e6e
then one of the requests does not make it within this one second window and is killed by the node before even reaching the PFS. Causing it to revert back to internal routing as explained here. Increasing the request timeout fixes this for a local scenario run. Finally to make the scenario pass we wait for 40 blocks to make sure that the settlement timeout has passed and the channel0-4 is settled and as such taken out of the balance calculation. And we adjust the asserted balance to |
Fix #5078 When many requests are sent to the PFS at the same time and the pathfinding timeout is only 1 second https://github.com/raiden-network/raiden/blob/1ff6e6e22017b16a22fcef0e23535d5c41c617cc/raiden/constants.py#L153 then one of the requests does not make it within this one second window and is killed by the node before even reaching the PFS. Causing it to revert back to internal routing as explained [here](#5078 (comment)). Increasing the request timeout fixes this.
Intro
When running BF1, the balance sum sometimes fails:
raiden/raiden/tests/scenarios/bf1_basic_functionality.yaml
Line 231 in e34025f
Investigation required:
The text was updated successfully, but these errors were encountered: