Tribler crashes after 4-10 minutes #5220
I recently had this problem when experimenting with threading in asyncio. Could you try to disable all threaded calls? |
Disabling Market, RemoteQuery and GigaChannel communities does not help either. |
Not sure why this is happening. Considering the error, I would expect this to happen at shutdown. If I remember correctly, this is the error you get when a Task destructor is called without the Task having finished. For some reason, I can't seem to reproduce this (on Win10, Python 3.8, libtorrent 1.2.1). I've tried to run from source and using the RC1 binaries. @ichorid Have you tried enabling asyncio's debug mode so that you get some more information? |
@egbertbouman , I'll try it with asyncio debug enabled. |
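For reference, asyncio's debug mode can be enabled with the `PYTHONASYNCIODEBUG=1` environment variable or programmatically; a minimal generic sketch (not Tribler-specific code):

```python
import asyncio
import logging

# Debug mode makes asyncio log callbacks that block the loop for too long
# and attach a creation traceback ('source_traceback') to every Task,
# which is what turns "Task was destroyed but it is pending!" into an
# actionable report.
logging.basicConfig(level=logging.INFO)

async def main():
    return asyncio.get_running_loop().get_debug()

enabled = asyncio.run(main(), debug=True)   # or set PYTHONASYNCIODEBUG=1
print("asyncio debug enabled:", enabled)
```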
After some fiddling with asyncio debugging, one of the logs produced this: [PID:26906] 2020-03-18 19:46:58,810 - ERROR <community:336> DHTDiscoveryCommunity.on_packet(): Exception occurred while handling packet!
Traceback (most recent call last):
File "/home/vader/my_SRC/TRIBLER/tribler_ichorid/src/pyipv8/ipv8/community.py", line 331, in on_packet
result = handler(source_address, data)
File "/home/vader/my_SRC/TRIBLER/tribler_ichorid/src/pyipv8/ipv8/lazy_community.py", line 39, in wrapper
return func(self, Peer(auth.public_key_bin, source_address), *unpacked)
File "/home/vader/my_SRC/TRIBLER/tribler_ichorid/src/pyipv8/ipv8/dht/discovery.py", line 177, in on_connect_peer_response
cache.future.set_result(payload.nodes)
asyncio.base_futures.InvalidStateError: invalid state
[PID:26906] 2020-03-18 19:46:54,123 - ERROR <community:447> DHTDiscoveryCommunity.on_find_response(): Got find-response with unknown identifier, dropping packet This SO post suggests the problem is related to |
The InvalidStateError happens if you try to set the state more than once. This is usually solved by something like:
(this is similar to what we do here) In this instance, I'm not sure what's causing this error though. Is there a timeout in the logs just before this stacktrace? |
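A minimal sketch of both the error and the usual guard (plain asyncio, not the Tribler code itself):

```python
import asyncio

async def main():
    fut = asyncio.get_running_loop().create_future()
    fut.set_result("first")

    # Setting the result a second time raises InvalidStateError ...
    try:
        fut.set_result("second")
    except asyncio.InvalidStateError:
        pass

    # ... so the usual fix is to check the future's state first:
    if not fut.done():
        fut.set_result("second")    # skipped here, the future is already done

    return fut.result()

result = asyncio.run(main())
print(result)   # first
```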
Egbert, DHT is not (the only thing) responsible for the crash. Running Tribler with DHT disabled still produces the crash. Also, during one of the tests, I downloaded a couple of torrents, and when I clicked "Remove with data" Tribler crashed immediately. This points further to |
Correct, the issue in the DHT should not cause the crash since the packet handlers in IPv8 (whether they are async or not) will only log the error. |
So, it is basically either Tunnels or the Libtorrent wrapper (it triggers even without the bootstrap download or any other downloads, though). |
Trying to also debug the new RC1 on Ubuntu. It has been holding steady for a while already, no crash yet. What is this? Some chatter that should be fixed:
|
@synctext Most of these errors are because someone is sending you bad packets (see Tribler/py-ipv8#701). Not sure about the Trustchain error. |
@synctext the |
Channel size counters only updated on channel commits, which happen automatically (if you do not disable this feature in Settings). The added entries will be visible in the channel immediately, though. |
I am at a loss dealing with this. Naturally, this is caused by something in Tunnels or Libtorrent. However, sometimes it will not trigger at all, and there are no real debug messages. |
Stable for 3+ hours already, but not doing anon downloads. Normal downloads are OK. |
@ichorid @egbertbouman @devos50 Please coordinate and try to get something like the application stress tester operational to get a good reproducible crash of this. It might need some repeated downloading. RC1 just crashed on me after 18 hours. No anonymous downloads, just 1 normal download active.
|
@synctext , I spent the last three days debugging this. A stress tester would not help with debugging, but it could have helped with catching this earlier in development. |
Ai. What is your next-steps strategy? Could we enhance our application tester to output more debug info and give us some hints? |
btw Crashed again just after a call to
|
If you enable asyncio debug, we should get more information about this unfinished task. |
Unfortunately, we will not. |
|
Printing full error context yields something useful (formatted manually for readability): ({'message': 'Task was destroyed but it is pending!',
'task':
<Task pending coro=<TunnelCommunity.on_create()
running at /tribler_src/src/pyipv8/ipv8/messaging/anonymization/community.py:725>
wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f9ca46d4590>()]
created at /tribler_src/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py: 206>
cb=[TaskManager.register_task.<locals>.done_cb() at /tribler_src/src/pyipv8/ipv8/taskmanager.py:105]
created at /tribler_src/src/pyipv8/ipv8/messaging/anonymization/community.py:675>,
'source_traceback': [
<FrameSummary file /tribler_src/src/run_tribler.py, line 117 in <module>>,
<FrameSummary file /tribler_src//src/run_tribler.py, line 101 in start_tribler_core>,
<FrameSummary file /home/vader/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py, line 534 in run_forever>,
<FrameSummary file /home/vader/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py, line 1763 in _run_once>,
<FrameSummary file /home/vader/.pyenv/versions/3.7.4/lib/python3.7/asyncio/events.py, line 88 in _run>,
<FrameSummary file /home/vader/.pyenv/versions/3.7.4/lib/python3.7/asyncio/selector_events.py, line 965 in _read_ready>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/messaging/interfaces/udp/endpoint.py, line 28 in datagram_received>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/messaging/interfaces/endpoint.py, line 85 in notify_listeners>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/messaging/interfaces/endpoint.py, line 73 in _deliver_later>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/community.py, line 331 in on_packet>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/messaging/anonymization/community.py, line 660 in on_cell>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/messaging/anonymization/community.py, line 675 in on_packet_from_circuit>]}') |
Flip flopping between subclasses in the stacktrace, it seems that this future: tribler/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py Line 206 in f80d668
Is added to this request cache: tribler/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py Line 207 in f80d668
But also returned: tribler/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py Line 220 in f80d668
This probably leads to a conflict somewhere. |
Fun fact: |
So, on a CreatePayload, what probably then happens is that something cancels the Task while it is still waiting on the Future. |
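This hypothesis can be sketched with plain asyncio (the names below are illustrative stand-ins, not Tribler's actual code): a handler Task suspended on a Future behaves like this when something cancels it:

```python
import asyncio

async def on_create_handler(balance_future):
    # Mimics on_create(): the coroutine suspends on a Future that is
    # supposed to be resolved by some other part of the system.
    return await balance_future

async def main():
    loop = asyncio.get_running_loop()
    balance_future = loop.create_future()
    task = asyncio.ensure_future(on_create_handler(balance_future))
    await asyncio.sleep(0)      # let the task start and block on the future
    task.cancel()               # e.g. a shutdown or cache-cleanup path
    try:
        await task
    except asyncio.CancelledError:
        return "task cancelled while waiting on the future"

outcome = asyncio.run(main())
print(outcome)
```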
@ichorid Yes, I think the issue that's causing the stacktrace is that |
@egbertbouman , but is there a guarantee that |
If I insta-pop the RequestCache I get a different task destroyed error, but this one is caught and logged (no crash!):
|
@ichorid The anonymous Task serving on_create will never time out. @qstokkink You removed the RequestCache, which results in the Future destructor getting called, which results in this crash. Again, I think you just need to set the Future before it gets destroyed. |
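A hedged sketch of that suggestion, with a hypothetical stand-in for the request-cache entry (names are illustrative, not Tribler's actual API): resolve the Future during cleanup so the awaiting Task completes instead of being destroyed while pending.

```python
import asyncio

class BalanceRequestCache:
    """Hypothetical stand-in for a request-cache entry holding a Future."""

    def __init__(self):
        self.future = asyncio.get_running_loop().create_future()

    def on_cleanup(self):
        # Resolve the Future before the entry is dropped, so the Task
        # awaiting it finishes normally instead of being destroyed
        # while still pending.
        if not self.future.done():
            self.future.set_result(False)   # deny the request by default

async def main():
    cache = BalanceRequestCache()
    waiter = asyncio.ensure_future(cache.future)
    cache.on_cleanup()      # e.g. on timeout, pop, or community shutdown
    return await waiter

decision = asyncio.run(main())
print(decision)   # False
```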
@egbertbouman sure, but I want to confirm we're fixing the right thing. We have no way to reproduce it now. This makes a hard-crash happen very quickly for me: diff --git a/src/pyipv8 b/src/pyipv8
index d57f610d6..775b5a40b 160000
--- a/src/pyipv8
+++ b/src/pyipv8
@@ -1 +1 @@
-Subproject commit d57f610d60a45f30535659a2741d2862881a875a
+Subproject commit 775b5a40b8d7b4315203430b9569b74731dfd7a8
diff --git a/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py b/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py
index 6e4d2f650..442f2eadc 100644
--- a/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py
+++ b/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py
@@ -190,17 +190,17 @@ class TriblerTunnelCommunity(HiddenTunnelCommunity):
"""
Check whether we should join a circuit. Returns a future that fires with a boolean.
"""
- if self.settings.max_joined_circuits <= len(self.relay_from_to) + len(self.exit_sockets):
+ """if self.settings.max_joined_circuits <= len(self.relay_from_to) + len(self.exit_sockets):
self.logger.warning("too many relays (%d)", (len(self.relay_from_to) + len(self.exit_sockets)))
- return succeed(False)
+ return succeed(False)"""
# Check whether we have a random open slot, if so, allocate this to this request.
circuit_id = create_payload.circuit_id
- for index, slot in enumerate(self.random_slots):
+ """for index, slot in enumerate(self.random_slots):
if not slot:
self.random_slots[index] = circuit_id
- return succeed(True)
-
+ return succeed(True)"""
+ print("*"*50)
# No random slots but this user might be allocated a competing slot.
# Next, we request the token balance of the circuit initiator.
balance_future = Future()
@@ -217,6 +217,8 @@ class TriblerTunnelCommunity(HiddenTunnelCommunity):
self.directions.pop(circuit_id, None)
self.relay_session_keys.pop(circuit_id, None)
+ self.request_cache.pop(u"balance-request", circuit_id)
+
return balance_future
async def on_payout_block(self, source_address, data): |
@egbertbouman I believe my |
@qstokkink , it crashes for me too almost instantly. This is indeed a good way to (hopefully) test the thing. |
This is exactly the kind of situation |
To prod an open wound: I guess in a more ideal Tribler world we would push code and see it break the overnight application testers. First, a manual revert would be issued. Second, we would create a test which crashes it quicker by magic. Third, we would have a clear idea of which PR broke stability. Finally, a fix is committed. Do we want to start defining and enforcing this more? Or do we keep these seemingly frustrating days of bug hunting in order to maximize roadmap progress? |
@synctext , this bug was introduced during the transition from Twisted to asyncio due to subtle differences in the frameworks' semantics. Also, overnight testers would probably not catch this, as it requires some time to trigger and does not trigger at all on most machines. Proper deployment testing is no silver bullet. Some problems can only be kept in check by adopting a better-suited programming style (think threads vs reactor, spaghetti vs structured code, etc.), doing frequent experimental releases, and producing more useful stack reports. |
The proper way to handle this is to add a callback to set off the However, with the current implementation of |
That would be a pretty big refactoring. For 7.5, I'll try to stick to @egbertbouman's solution and hope it works. |
I've applied @egbertbouman's fix and now testing the thing. If I don't see anything suspicious in a day or so, I'll file a PR. |
Our RC1 crashes on Linux within an hour. I think it should not be possible for this to go undetected in our process. |
Anyways, #4999 does not need to be done in a hurry. |
I completely agree. However, the funny thing is that all this time no one paid any attention to this particular crash due to Tribler crashing constantly because of Libtorrent 1.2.x transition, Python3 transition, Asyncio transition, GUI refactorings... Well, you get it. We're actively working on getting Tribler into a shape where it will actually become testable. |
This happens both on Linux (witnessed myself many times) and on Windows (reported in 7.5.0-rc1). Just run Tribler, do nothing (or do anything), and after about 5 minutes it crashes.
Observations so far:
- `WeakValueDictionary` is not the cause: when I change it back to `dict`, it still happens.
- `DEBUG` level of logging makes it go away (or I was just really lucky to not trigger it in several dozen runs with debug enabled).
- … the `run_tribler_headless` script.