Tribler crashes after 4-10 minutes #5220

Closed · ichorid opened this issue Mar 18, 2020 · 44 comments · Fixed by #5235

@ichorid (Contributor) commented Mar 18, 2020

This happens both on Linux (witnessed myself many times) and on Windows (reported in 7.5.0-rc1). Just run Tribler, do nothing (or do anything), and after about 5 minutes it crashes.

[PID:3451] 2020-03-18 00:43:50,235 - ERROR <tribler_window:107> root.on_exception(): Traceback (most recent call last):
  File "/home/vader/my_SRC/TRIBLER/tribler_ichorid/src/tribler-gui/tribler_gui/event_request_manager.py", line 114, in on_read_data
    raise RuntimeError(json_dict["event"]["text"])
RuntimeError: Task was destroyed but it is pending!

Observations so far:

  • the problem has nothing to do with the recent introduction of WeakValueDictionary: when I change it back to a plain dict, it still happens;
  • the problem is seemingly a Heisenbug: enabling the DEBUG logging level makes it go away (or I was just really lucky not to trigger it in several dozen runs with debug enabled);
  • the problem does not manifest itself when running the core only, through the run_tribler_headless script;
  • the problem happens on both Linux and Windows systems, both with Libtorrent 1.2.x and 1.1.x (as packaged with our installer);
  • the problem seems to be related to Tunnels, as the last log line before the crash typically refers to Tunnels; also, disabling Tunnels seems to prevent the problem from occurring;
  • the problem seems not to be related to the GigaChannel Manager, Popularity Community or Torrent Checker, as the crash still happens with these disabled.
@qstokkink (Contributor):

I recently had this problem when experimenting with threading in asyncio. Could you try to disable all threaded calls?

@ichorid (Contributor, Author) commented Mar 18, 2020

Disabling Market, RemoteQuery and GigaChannel communities does not help either.

@egbertbouman (Member):

Not sure why this is happening. Considering the error, I would expect this to happen at shutdown. If I remember correctly, this is the error you get when a Task's destructor is called without the Task having finished.

For some reason, I can't seem to reproduce this (on Win10, Python 3.8, libtorrent 1.2.1). I've tried running from source and using the RC1 binaries.

@ichorid Have you tried enabling asyncio's debug mode so that you get some more information?
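
For reference, a minimal sketch of the usual ways to turn on asyncio's debug mode (how exactly this would be wired into Tribler's startup is an assumption, not Tribler code):

import asyncio
import logging
import warnings

# Option 1: set the environment variable PYTHONASYNCIODEBUG=1 before starting.
# Option 2: enable it programmatically on the loop that runs the core:
loop = asyncio.get_event_loop()
loop.set_debug(True)

# The extra detail (slow callbacks, never-retrieved exceptions, creation
# tracebacks of destroyed tasks) only shows up if asyncio's logger and
# ResourceWarnings are not silenced:
logging.getLogger("asyncio").setLevel(logging.DEBUG)
warnings.simplefilter("always", ResourceWarning)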

@ichorid (Contributor, Author) commented Mar 18, 2020

@egbertbouman, I'll try it with asyncio debug enabled.

@ichorid (Contributor, Author) commented Mar 19, 2020

After some fiddling with asyncio debugging, one of the logs produced this:

[PID:26906] 2020-03-18 19:46:58,810 - ERROR <community:336> DHTDiscoveryCommunity.on_packet(): Exception occurred while handling packet!
Traceback (most recent call last):
  File "/home/vader/my_SRC/TRIBLER/tribler_ichorid/src/pyipv8/ipv8/community.py", line 331, in on_packet
    result = handler(source_address, data)
  File "/home/vader/my_SRC/TRIBLER/tribler_ichorid/src/pyipv8/ipv8/lazy_community.py", line 39, in wrapper
    return func(self, Peer(auth.public_key_bin, source_address), *unpacked)
  File "/home/vader/my_SRC/TRIBLER/tribler_ichorid/src/pyipv8/ipv8/dht/discovery.py", line 177, in on_connect_peer_response
    cache.future.set_result(payload.nodes)
asyncio.base_futures.InvalidStateError: invalid state


[PID:26906] 2020-03-18 19:46:54,123 - ERROR <community:447> DHTDiscoveryCommunity.on_find_response(): Got find-response with unknown identifier, dropping packet

This SO post suggests the problem is related to the add_done_callback mechanism.

@egbertbouman (Member) commented Mar 19, 2020

The InvalidStateError happens if you try to set the result of a Future more than once. This is usually solved by something like:

if not cache.future.done():
    cache.future.set_result(payload.nodes)

(this is similar to what we do here)

In this instance, I'm not sure what's causing this error though. Is there a timeout in the logs just before this stacktrace?
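
A self-contained illustration of that failure mode and the guard (plain asyncio, not the DHT code):

import asyncio

async def main():
    fut = asyncio.get_running_loop().create_future()

    fut.set_result("first response")
    try:
        fut.set_result("duplicate response")  # setting a result twice is illegal
    except asyncio.InvalidStateError as exc:
        print("second set_result() failed:", exc)

    # The usual guard: only set the result if nobody beat us to it.
    if not fut.done():
        fut.set_result("duplicate response")

asyncio.run(main())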

@ichorid (Contributor, Author) commented Mar 19, 2020

@egbertbouman, the DHT is not (the only thing) responsible for the crash. Running Tribler with the DHT disabled still produces the crash.

Also, during one of the tests, I downloaded a couple of torrents, and when I clicked "Remove with data" Tribler crashed immediately. This points further to the add_done_callback mechanism, because it is used in the @require_handle decorator.

@egbertbouman (Member):

Correct, the issue in the DHT should not cause the crash, since the packet handlers in IPv8 (whether they are async or not) will only log the error.

@ichorid (Contributor, Author) commented Mar 19, 2020

So, it is basically either Tunnels or the Libtorrent wrapper (it triggers even without the bootstrap download or any other downloads, though).

@synctext (Member):

Trying to also debug the new RC1 on Ubuntu. She is holding steady for a while already, no crash yet.

What is this?
ERROR:TrustChainCommunity:Not sending crawl response, the block is invalid. Result (<function ValidationResult.invalid at 0x7f3d4b8c5510>, ['Total down is higher than expected compared to the next block'])

Some chatter that should be fixed:

ERROR:DiscoveryCommunity:Exception occurred while handling packet!
Traceback (most recent call last):
  File "ipv8/messaging/serialization.py", line 314, in unpack_multiple
  File "ipv8/messaging/serialization.py", line 291, in unpack
  File "ipv8/messaging/serialization.py", line 155, in unpack_from
struct.error: unpack_from requires a buffer of at least 80 bytes

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "ipv8/messaging/serialization.py", line 341, in unpack_to_serializables
  File "ipv8/messaging/serialization.py", line 321, in unpack_multiple
ipv8.messaging.serialization.PackError: Could not pack item 1: varlenHx20
error: unpack_from requires a buffer of at least 80 bytes
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "ipv8/community.py", line 331, in on_packet
  File "ipv8/lazy_community.py", line 33, in wrapper
  File "ipv8/messaging/serialization.py", line 360, in ez_unpack_serializables
  File "ipv8/messaging/serialization.py", line 344, in unpack_to_serializables
ipv8.messaging.serialization.PackError: Failed to unserialize SimilarityResponsePayload
PackError: Could not pack item 1: varlenHx20
error: unpack_from requires a buffer of at least 80 bytes
[SNIP]
Traceback (most recent call last):
  File "/home/tribler/jenkins/workspace/Build-Tribler_release/Async-Py3-Build-Ubuntu64/tribler/src/tribler-core/tribler_core/modules/libtorrent/torrentdef.py", line 75, in __init__
RuntimeError: incorrect number of piece hashes in torrent file
[SNIP]
Traceback (most recent call last):
  File "/home/tribler/jenkins/workspace/Build-Tribler_release/Async-Py3-Build-Ubuntu64/tribler/src/tribler-core/tribler_core/modules/libtorrent/download_manager.py", line 760, in load_checkpoint
  File "/home/tribler/jenkins/workspace/Build-Tribler_release/Async-Py3-Build-Ubuntu64/tribler/src/tribler-core/tribler_core/modules/libtorrent/torrentdef.py", line 136, in load_from_dict
  File "/home/tribler/jenkins/workspace/Build-Tribler_release/Async-Py3-Build-Ubuntu64/tribler/src/tribler-core/tribler_core/modules/libtorrent/torrentdef.py", line 77, in __init__
ValueError: incorrect number of piece hashes in torrent file
ERROR:Session:bitcoinlib library cannot be loaded: No module named 'bitcoinlib'

@egbertbouman (Member):

@synctext Most of these errors are because someone is sending you bad packets (see Tribler/py-ipv8#701). Not sure about the Trustchain error.

@synctext (Member):

No refresh after adding dozens of new files to "My Channel" (you have to wait :-)
[Screenshot from 2020-03-19 18-47-09]

@qstokkink (Contributor):

@synctext the ipv8.messaging.serialization.PackError: Failed to unserialize SimilarityResponsePayload error is from the Kotlin implementation; we already informed @MattSkala 😉

@ichorid (Contributor, Author) commented Mar 19, 2020

No refresh after adding dozens of new files to "My Channel" (you have to wait :-)

Channel size counters are only updated on channel commits, which happen automatically (unless you disable this feature in Settings). The added entries will be visible in the channel immediately, though.

@ichorid (Contributor, Author) commented Mar 19, 2020

I am at a loss dealing with this. Naturally, it is caused by something in Tunnels or Libtorrent. However, sometimes it does not trigger at all, and there are no real debug messages.

@synctext (Member):

Stable for 3+ hours already, but not doing anon downloads. Normal downloads are OK.

@synctext (Member):

@ichorid @egbertbouman @devos50 Please coordinate and try to get something like the application stress tester operational, so we get a good, reproducible crash of this. It might need some repeated downloading. RC1 just crashed on me after 18 hours. No anonymous downloads, just one normal download active.

INFO:CreditMiningManager:0 active download(s), 0.000 MB uploaded, 0.000 MB downloaded
INFO:DownloadManager:Failed to retrieve metainfo for 0508c7bb87f531cbfc7ecac1f0ac608eb98d7e96
INFO:DownloadManager:Failed to retrieve metainfo for b03104cc6c5d73ccab0b92b27854d5ec011e7147
INFO:DownloadManager:Failed to retrieve metainfo for b4f0da9cafc13949a9e3ebdd53b8973a772b21fd
INFO:GigaChannelManager:Downloading new channel version 9b10b5fb1391f9d0da0bdbab2a4c82fd50b9f402c9cbdbcdc0552dd8ce3910005ea7de7282851fa932d939197ec88ddc463acbaf73b46574cf4afa432964c7f7 ver 1562671495055->1562671495058
INFO:GigaChannelManager:Downloading new channel version 02ef6751e809bc49c92540962fd8fa54cf89b0004c61bfa1d31b5095ee20127a0bda166852057e15cdf57e3c315b88e67fddaa4d4e41b34b9415163d6062ff00 ver 1562497274819->1583447000002
INFO:TriblerTunnelCommunity:Want 0 data circuits of length 1
INFO:DownloadManager:Trying to fetch metainfo for 0508c7bb87f531cbfc7ecac1f0ac608eb98d7e96
INFO:DownloadManager:Trying to fetch metainfo for b03104cc6c5d73ccab0b92b27854d5ec011e7147
INFO:PopularityCommunity:Received torrent health information for 5 random torrents and 5 checked torrents
ERROR:root:Traceback (most recent call last):
  File "/home/tribler/jenkins/workspace/Build-Tribler_release/Async-Py3-Build-Ubuntu64/tribler/src/tribler-gui/tribler_gui/event_request_manager.py", line 114, in on_read_data
RuntimeError: Task was destroyed but it is pending!

INFO:TorrentChecker:Selected 2 new torrents to check on tracker: http://tracker.zelka.org/announce.php
INFO:Session:Tribler shutdown state notification:Shutting down Credit Mining...
INFO:CreditMiningManager:Shutting down CreditMiningManager
INFO:Session:Tribler shutdown state notification:Shutting down Torrent Checker...
INFO:Session:Tribler shutdown state notification:Shutting down Gigachannel Manager...
INFO:DownloadManager:Failed to retrieve metainfo for 0508c7bb87f531cbfc7ecac1f0ac608eb98d7e96
INFO:DownloadManager:Failed to retrieve metainfo for b03104cc6c5d73ccab0b92b27854d5ec011e7147
INFO:Session:Tribler shutdown state notification:Shutting down Version Checker...
INFO:Session:Tribler shutdown state notification:Shutting down Resource Monitor...
INFO:Session:Tribler shutdown state notification:Unloading Tunnel Community...
INFO:Socks5Connection:Closing session, reason stopping
INFO:Socks5Connection:Closing session, reason unspecified
INFO:TriblerTunnelCommunity:Destroy_exit_socket 2609793190 ('81.171.8.17', 35125)
INFO:TriblerTunnelCommunity:Removing DATA circuit 992103458 unload
INFO:TriblerTunnelCommunity:destroy_circuit 992103458 ('5.79.71.187', 35035)
INFO:TriblerTunnelCommunity:Removing DATA circuit 4170702521 unload
INFO:TriblerTunnelCommunity:destroy_circuit 4170702521 ('5.79.71.187', 35035)
INFO:TriblerTunnelCommunity:Removing DATA circuit 245290361 unload
INFO:TriblerTunnelCommunity:destroy_circuit 245290361 ('81.171.8.19', 35020)
INFO:TriblerTunnelCommunity:Removing DATA circuit 372189226 unload
INFO:TriblerTunnelCommunity:destroy_circuit 372189226 ('81.171.8.17', 35005)
INFO:Session:Tribler shutdown state notification:Shutting down TrustChain Community...
INFO:Session:Tribler shutdown state notification:Shutting down IPv8...
INFO:Session:Tribler shutdown state notification:Saving configuration...
INFO:Session:Tribler shutdown state notification:Checkpointing Downloads...
ERROR:Download:Resume data failed to save: <class 'tribler_core.exceptions.SaveResumeDataError'>: torrent has no metadata
ERROR:Download:Resume data failed to save: <class 'tribler_core.exceptions.SaveResumeDataError'>: torrent has no metadata
INFO:Session:Tribler shutdown state notification:Shutting down Downloads...
INFO:Session:Tribler shutdown state notification:Shutting down Libtorrent Manager...
INFO:Session:Tribler shutdown state notification:Waiting for Libtorrent to finish...
INFO:Session:Tribler shutdown state notification:Shutting down Metadata Store...
INFO:Session:Tribler shutdown state notification:Shutting down API Manager...
INFO:root:Shutting down Tribler

@ichorid (Contributor, Author) commented Mar 20, 2020

@synctext, I spent the last three days debugging this. A stress tester would not help with debugging, but it could have helped catch this earlier in development.
This is the worst kind of bug. There is almost no info on what causes it, and the crash is entirely stochastic.

@synctext (Member):

Ai. What is your next-steps strategy? Could we enhance our application tester to output more debug info and give us some hints?

@synctext (Member):

btw, crashed again, just after a call to PopularityCommunity, for the second time.

INFO:PopularityCommunity:Received torrent health information for 5 random torrents and 5 checked torrents
INFO:CreditMiningManager:Downloading: 0, Uploading: 0, Stopped: 0
INFO:CreditMiningManager:0 active download(s), 0.000 MB uploaded, 0.000 MB downloaded
INFO:TriblerTunnelCommunity:We joined circuit 3007315007 with neighbour ('81.171.27.193', 35030)
INFO:PopularityCommunity:Received torrent health information for 5 random torrents and 5 checked torrents
ERROR:root:Traceback (most recent call last):
  File "/home/tribler/jenkins/workspace/Build-Tribler_release/Async-Py3-Build-Ubuntu64/tribler/src/tribler-gui/tribler_gui/event_request_manager.py", line 114, in on_read_data
RuntimeError: Task was destroyed but it is pending!

@egbertbouman (Member):

If you enable asyncio debug, we should get more information about this unfinished task..

@ichorid (Contributor, Author) commented Mar 20, 2020

If you enable asyncio debug, we should get more information about this unfinished task..

Unfortunately, we will not.
I got, like, three dozen crashes with asyncio debug enabled. Nothing suspicious. 😞

@synctext (Member):

INFO:TriblerTunnelCommunity:Adding first hop 81.171.8.17:35180 to circuit 1231071863
INFO:CreditMiningManager:Downloading: 0, Uploading: 0, Stopped: 0
INFO:CreditMiningManager:0 active download(s), 0.000 MB uploaded, 0.000 MB downloaded
ERROR:root:Traceback (most recent call last):
  File "/home/tribler/jenkins/workspace/Build-Tribler_release/Async-Py3-Build-Ubuntu64/tribler/src/tribler-gui/tribler_gui/event_request_manager.py", line 114, in on_read_data
RuntimeError: Task was destroyed but it is pending!

@ichorid (Contributor, Author) commented Mar 21, 2020

Printing full error context yields something useful (formatted manually for readability):

({'message': 'Task was destroyed but it is pending!', 
'task': 
<Task pending coro=<TunnelCommunity.on_create()
running at /tribler_src/src/pyipv8/ipv8/messaging/anonymization/community.py:725>
wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f9ca46d4590>()]
created at /tribler_src/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py: 206> 
cb=[TaskManager.register_task.<locals>.done_cb() at /tribler_src/src/pyipv8/ipv8/taskmanager.py:105]
created at /tribler_src/src/pyipv8/ipv8/messaging/anonymization/community.py:675>,

 'source_traceback': [
<FrameSummary file /tribler_src/src/run_tribler.py, line 117 in <module>>,
<FrameSummary file /tribler_src//src/run_tribler.py, line 101 in start_tribler_core>, 
<FrameSummary file /home/vader/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py, line 534 in run_forever>,
<FrameSummary file /home/vader/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py, line 1763 in _run_once>,
<FrameSummary file /home/vader/.pyenv/versions/3.7.4/lib/python3.7/asyncio/events.py, line 88 in _run>,
<FrameSummary file /home/vader/.pyenv/versions/3.7.4/lib/python3.7/asyncio/selector_events.py, line 965 in _read_ready>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/messaging/interfaces/udp/endpoint.py, line 28 in datagram_received>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/messaging/interfaces/endpoint.py, line 85 in notify_listeners>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/messaging/interfaces/endpoint.py, line 73 in _deliver_later>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/community.py, line 331 in on_packet>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/messaging/anonymization/community.py, line 660 in on_cell>,
<FrameSummary file /tribler_src/src/pyipv8/ipv8/messaging/anonymization/community.py, line 675 in on_packet_from_circuit>]}')

@qstokkink (Contributor) commented Mar 21, 2020

Flip-flopping between subclasses in the stack trace, it seems that this future is added to this request cache:

self.request_cache.add(BalanceRequestCache(self, circuit_id, balance_future))

but is also returned from should_join_circuit. This probably leads to a conflict somewhere.

@ichorid (Contributor, Author) commented Mar 21, 2020

Fun fact: should_join_circuit is async in the base AnonymizationCommunity, though it does not need to be. But TriblerTunnelCommunity overrides it with a non-async version.

@ichorid (Contributor, Author) commented Mar 21, 2020

So, on a CreatePayload, on_create gets fired, which then runs should_join_circuit and awaits the balance_future created there. Also, on_create is a Task managed by the TaskManager aspect of TriblerTunnelCommunity.

What probably happens then is that something cancels the Task while it is still waiting on the Future.
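
If that is indeed the mechanism, it can be reproduced with plain asyncio (a minimal sketch, not the tunnel code): a task that awaits a future nobody ever resolves, with no remaining references to it, gets garbage-collected while still pending, and asyncio then reports exactly this message:

import asyncio
import gc

async def on_create_like():
    # Mimics on_create(): block on a future that nothing will ever resolve.
    await asyncio.get_running_loop().create_future()

async def main():
    asyncio.ensure_future(on_create_like())  # deliberately keep no reference
    await asyncio.sleep(0.1)                 # let the task start and block
    gc.collect()                             # collect the task/future cycle
    await asyncio.sleep(0.1)

# Should log "Task was destroyed but it is pending!"; debug mode adds the
# task's creation traceback, as in the context dump above.
asyncio.run(main(), debug=True)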

@egbertbouman (Member) commented Mar 21, 2020

@ichorid Yes, should_join_circuit in the base class is async; in TriblerTunnel it is not and returns a Future. However, both are called using await should_join_circuit. I don't really see the problem.

I think the issue that's causing the stacktrace is that BalanceRequestCache.on_timeout should be setting the result of the Future.
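
A hypothetical sketch of that suggestion (the class body and the attribute name balance_future are assumptions, not the actual ipv8/Tribler code):

import asyncio

class BalanceRequestCacheSketch:
    # Stand-in for the real BalanceRequestCache (which builds on ipv8's
    # NumberCache); only the timeout handling is shown.

    def __init__(self, balance_future: asyncio.Future):
        self.balance_future = balance_future

    def on_timeout(self):
        # Resolve the future when the cache times out, so the on_create() task
        # awaiting it finishes normally instead of being destroyed while pending.
        if not self.balance_future.done():
            self.balance_future.set_result(False)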

@ichorid (Contributor, Author) commented Mar 21, 2020

@egbertbouman, but is there a guarantee that BalanceRequestCache will time out before the anonymous Task serving on_create? If it does not, the same problem will reappear.

@qstokkink (Contributor):

If I insta-pop the RequestCache I get a different task destroyed error, but this one is caught and logged (no crash!):

[PID:5282] 2020-03-21 17:00:08,818 - ERROR <base_events:1604> asyncio.default_exception_handler(): Unhandled error in exception handler
context:
 {'message': 'Task was destroyed but it is pending!',
'task':
 <Task pending coro=<TunnelCommunity.on_create() done,
               defined at /home/quinten/Documents/tribler/src/pyipv8/ipv8/messaging/anonymization/community.py:711>
               wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f41081cc490>()]>
               cb=[TaskManager.register_task.<locals>.done_cb() at /home/quinten/Documents/tribler/src/pyipv8/ipv8/taskmanager.py:105]>}
Traceback (most recent call last):
  File "/usr/lib/python3.7/asyncio/base_events.py", line 1639, in call_exception_handler
    self._exception_handler(self, context)
  File "/home/quinten/Documents/tribler/src/tribler-core/tribler_core/session.py", line 379, in unhandled_error_observer
    self.api_manager.get_endpoint('events').on_tribler_exception(text_long)
  File "/home/quinten/Documents/tribler/src/tribler-core/tribler_core/restapi/events_endpoint.py", line 115, in on_tribler_exception
    self.write_data({"type": NTFY.TRIBLER_EXCEPTION.value, "event": {"text": exception_text}})
  File "/home/quinten/Documents/tribler/src/pyipv8/ipv8/taskmanager.py", line 35, in wrapper
    ensure_future(func(self, *args, **kwargs)),
  File "/usr/lib/python3.7/asyncio/tasks.py", line 607, in ensure_future
    loop = events.get_event_loop()
  File "/usr/lib/python3.7/asyncio/events.py", line 644, in get_event_loop
    % threading.current_thread().name)
RuntimeError: There is no current event loop in thread 'ThreadPoolExecutor-0_1'.
/usr/lib/python3.7/asyncio/base_events.py:1655: RuntimeWarning: coroutine 'EventsEndpoint.write_data' was never awaited
  exc_info=True)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
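
As an aside, the RuntimeError at the bottom comes from calling ensure_future() on a thread-pool thread that has no event loop of its own; a minimal sketch (plain asyncio, not the Tribler TaskManager) of the thread-safe way to hand a coroutine back to the main loop:

import asyncio
import concurrent.futures

async def write_data_like(payload: dict):
    # Hypothetical stand-in for EventsEndpoint.write_data().
    await asyncio.sleep(0)
    print("event written:", payload)

def worker(loop: asyncio.AbstractEventLoop):
    # Runs on a ThreadPoolExecutor thread: ensure_future() would raise here
    # because this thread has no event loop. Hand the coroutine to the main
    # loop instead and (optionally) wait for its result.
    fut = asyncio.run_coroutine_threadsafe(write_data_like({"type": "exception"}), loop)
    fut.result(timeout=5)  # blocks only this worker thread, not the loop

async def main():
    loop = asyncio.get_running_loop()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        await loop.run_in_executor(pool, worker, loop)

asyncio.run(main())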

@egbertbouman (Member):

@ichorid The anonymous Task serving on_create will never time out.

@qstokkink You removed the RequestCache, which results in the Future destructor getting called, which results in this crash. Again, I think you just need to set the Future before it gets destroyed.

@qstokkink (Contributor):

@egbertbouman sure, but I want to confirm we're fixing the right thing. We have no way to reproduce it now.

This makes a hard-crash happen very quickly for me:

diff --git a/src/pyipv8 b/src/pyipv8
index d57f610d6..775b5a40b 160000
--- a/src/pyipv8
+++ b/src/pyipv8
@@ -1 +1 @@
-Subproject commit d57f610d60a45f30535659a2741d2862881a875a
+Subproject commit 775b5a40b8d7b4315203430b9569b74731dfd7a8
diff --git a/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py b/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py
index 6e4d2f650..442f2eadc 100644
--- a/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py
+++ b/src/tribler-core/tribler_core/modules/tunnel/community/triblertunnel_community.py
@@ -190,17 +190,17 @@ class TriblerTunnelCommunity(HiddenTunnelCommunity):
         """
         Check whether we should join a circuit. Returns a future that fires with a boolean.
         """
-        if self.settings.max_joined_circuits <= len(self.relay_from_to) + len(self.exit_sockets):
+        """if self.settings.max_joined_circuits <= len(self.relay_from_to) + len(self.exit_sockets):
             self.logger.warning("too many relays (%d)", (len(self.relay_from_to) + len(self.exit_sockets)))
-            return succeed(False)
+            return succeed(False)"""
 
         # Check whether we have a random open slot, if so, allocate this to this request.
         circuit_id = create_payload.circuit_id
-        for index, slot in enumerate(self.random_slots):
+        """for index, slot in enumerate(self.random_slots):
             if not slot:
                 self.random_slots[index] = circuit_id
-                return succeed(True)
-
+                return succeed(True)"""
+        print("*"*50)
         # No random slots but this user might be allocated a competing slot.
         # Next, we request the token balance of the circuit initiator.
         balance_future = Future()
@@ -217,6 +217,8 @@ class TriblerTunnelCommunity(HiddenTunnelCommunity):
         self.directions.pop(circuit_id, None)
         self.relay_session_keys.pop(circuit_id, None)
 
+        self.request_cache.pop(u"balance-request", circuit_id)
+
         return balance_future
 
     async def on_payout_block(self, source_address, data):

@qstokkink (Contributor):

@egbertbouman I believe my patch file does indeed confirm that you are right and the Future destructor is the issue.

@ichorid (Contributor, Author) commented Mar 21, 2020

@qstokkink, it crashes for me too, almost instantly. This is indeed a good way to (hopefully) test the thing.

@ichorid (Contributor, Author) commented Mar 21, 2020

This is exactly the kind of situation trio and cancellation scopes were designed to solve... 🤦‍♂️
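
For context, a minimal trio sketch of what a cancellation scope looks like (illustration only, not a proposal to port Tribler; the coroutine name is made up):

import trio

async def on_create_like():
    # A cancel scope bounds how long this task may wait; leaving the scope
    # (timeout, shutdown, ...) unwinds the await instead of leaving a task pending.
    with trio.move_on_after(0.1):
        await trio.sleep_forever()
    print("balance request timed out, carrying on")

trio.run(on_create_like)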

@synctext (Member):

To prod an open wound: I guess in a more ideal Tribler world we would push code and see that it broke the overnight application testers. First, a manual revert would be issued. Second, we would create a test that crashes it more quickly, by magic. Third, we would have a clear idea which PR broke stability. Finally, the fix would be committed. Do we want to start defining and enforcing this more? Or do we keep these seemingly frustrating days of bug hunting in order to maximize roadmap progress?

@ichorid (Contributor, Author) commented Mar 21, 2020

@synctext, this bug was introduced during the transition from Twisted to asyncio, due to subtle differences in the frameworks' semantics. Also, overnight testers would probably not catch this, as it takes some time to trigger and does not trigger at all on most machines.
Of course, we would like to have more effective deployment testing. However, this kind of bug is really, really hard to hunt down unless you collect full telemetry from users.

Proper deployment testing is no silver bullet. Some problems can only be kept in check by adopting a better-suited programming style (think threads vs. reactor, spaghetti vs. structured, etc.), doing frequent experimental releases, and producing more useful stack reports.

@ichorid (Contributor, Author) commented Mar 21, 2020

The proper way to handle this is to add a callback that resolves the balance_future associated with BalanceRequestCache upon cancellation of the associated Task, and not just set it in on_timeout as @egbertbouman suggested. This would guarantee that whatever the reason for cancelling the task (e.g. Tribler shutdown), balance_future will be resolved.

However, with the current implementation of NumberCache this is impossible: NumberCache does not inherit from asyncio.Task and instead is just a regular object. The proper way to handle cache-like objects would be to refactor NumberCache to inherit from asyncio.Task, implementing its cancel() method. That's what the cancel method is for. 🙈
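
A hypothetical sketch of the callback idea (the coroutine and names are made up, this is not the ipv8 API): the task that awaits balance_future registers a done-callback so that, however the task ends (including cancellation at shutdown), the future does not stay pending:

import asyncio

async def on_create_like(balance_future: asyncio.Future):
    task = asyncio.current_task()
    # Whatever ends this task (normal completion, cancellation at shutdown, ...),
    # make sure the cached future is resolved or cancelled rather than left pending.
    task.add_done_callback(
        lambda _t: balance_future.done() or balance_future.cancel()
    )
    return await balance_future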

@ichorid (Contributor, Author) commented Mar 21, 2020

That would be a pretty big refactoring. For 7.5, I'll try to stick to @egbertbouman's solution and hope it works.

@ichorid (Contributor, Author) commented Mar 21, 2020

I've applied @egbertbouman's fix and am now testing it. If I don't see anything suspicious in a day or so, I'll file a PR.

@synctext (Member):

Our RC1 crashes on Linux within an hour. I think it should not be possible for this to go undetected in our process.

@synctext (Member):

Anyways, #4999 does not need to be done in a hurry.

@ichorid (Contributor, Author) commented Mar 21, 2020

Our RC1 crashes on Linux within an hour. I think it should not be possible for this to go undetected in our process.

I completely agree. However, the funny thing is that all this time no one paid any attention to this particular crash, because Tribler was crashing constantly due to the Libtorrent 1.2.x transition, the Python 3 transition, the asyncio transition, GUI refactorings... well, you get it.
One can only check for regressions on something that is already stable. "Garbage in, garbage out", as they say. Otherwise, tests become merely a nuisance that developers tend to ignore.

We're actively working on getting Tribler into a shape where it will actually become testable.

@synctext (Member):

Being patient for 14 years and 10 months is not always easy, but experimental release 4 and RC1 have solid performance and responsiveness!
btw, did we switch the anon relaying protocol? I'm not earning anything anymore; I'm not part of any relay circuit.
