This repository has been archived by the owner on Jul 5, 2024. It is now read-only.

[BUG] Scrape or some other routine won't exit and runs infinitely #791

Closed
baccccccc opened this issue Jan 21, 2024 · 18 comments
Labels
bug Something isn't working

Comments

@baccccccc

baccccccc commented Jan 21, 2024

I've been trying to download https://simpcity.su/threads/cj-miles.7954 for quite some time. The command line I'm typically using is --output-folder <bla> --log-folder <bla> --ignore-history --download https://simpcity.su/threads/cj-miles.7954

Note 1: The same syntax worked flawlessly with other forum threads before.
Note 2: I've tried removing --ignore-history; the outcome is slightly different, but the main problem seems to be the same.

Basically, the program reaches approximately 2095 files (sometimes a bit fewer due to transient errors that don't seem to matter in the grand scheme of things) and then hangs indefinitely. It's still responsive: it keeps flickering, resizes if I resize the console window, and responds to Ctrl+C. But there are seemingly no ongoing downloads, the scrape won't finish (even if I leave it like this for a day or longer), no new entries appear in downloader.log, and the file counter won't increase anymore.

Now here's a minor difference. If I omit --ignore-history then the scraping part of the screen becomes completely empty.

[screenshot]

If I include --ignore-history then the scraping part always shows ... and 90 links in Scrape queue. Of course, there are other (meaningful) messages before that. But once it reaches ... and 90 links, I know it has entered this "hang" state, where it stays forever and nothing changes anymore.

[screenshot]

I've retried multiple times (probably a couple dozen times in the last two weeks or so), updating the program before each run. The results are very consistent.

Thanks in advance!

@baccccccc baccccccc added the bug Something isn't working label Jan 21, 2024
@baccccccc baccccccc changed the title [BUG] Scrape or some other routine won't exit and run infinitely. [BUG] Scrape or some other routine won't exit and runs infinitely Jan 21, 2024
@Jules-WinnfieldX
Owner

Aware this happens, unsure of the cause. No time currently to figure it out. There also isn't a reliable method to recreate in a small timeframe.

@baccccccc
Author

baccccccc commented Jan 30, 2024

> Aware this happens, unsure of the cause. No time currently to figure it out. There also isn't a reliable method to recreate in a small timeframe.

Thanks. In my case, I have a 100% repro. (Although it recently changed from "90 links" to "20 links" in the queue. But it's 20 links every time, and it's been like this for a week or so.) I can capture any sort of trace or other diagnostic info if that helps.

@baccccccc
Author

Just tried it on a different thread, and it works as expected (i.e., the app eventually exits). So it might be something wrong with this particular forum thread.

@baccccccc
Author

I think you've made some great progress there. With one of the recent updates, it no longer hangs.

Although it now exits prematurely with a 403 error from the forum. I suspect that might be where it hung previously.

INFO     : 2024-02-23 21:29:41,244 : utilities.py:95 : Starting UI...
INFO     : 2024-02-23 21:29:41,446 : utilities.py:95 : Scrape Starting: https://simpcity.su/threads/cj-miles.7954
...
ERROR    : 2024-02-23 21:29:48,343 : utilities.py:95 : Scrape Failed: https://simpcity.su/threads/cj-miles.7954 (403 - HTTP status code 403: Forbidden)
INFO     : 2024-02-23 21:29:48,352 : utilities.py:95 : Scrape Finished: https://simpcity.su/threads/cj-miles.7954

which is weird because there is a bunch of successful activity in between. And I can still access the forum in the browser. And this only happens to me when downloading some very long threads, such as https://simpcity.su/threads/cj-miles.7954 and https://simpcity.su/threads/miss-lexa-misslexa.9116. So maybe there's some sort of rate-limiting feature added to the forum recently.

Anyhow, the original problem seems to be solved. Do you want me to open a new issue, or continue here to discuss how we could potentially address those intermittent 403s?

@Jules-WinnfieldX
Owner

You are right about the rate limiting; however, the hangs are still a problem. I have yet to find out what causes them and need to dedicate more time.

I'll deal with the ratelimiting soon, no need to make an issue for it.
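For illustration, a minimal client-side limiter for a case like this could look like the sketch below. The class and parameter names are hypothetical; this is only the general sliding-window technique, not the project's actual throttling code.

```python
import asyncio
import time

class DomainRateLimiter:
    """Sliding-window limiter: allow at most `rate` requests per `per` seconds."""

    def __init__(self, rate: int, per: float) -> None:
        self.rate = rate
        self.per = per
        self._stamps: list[float] = []
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            now = time.monotonic()
            # forget requests that have fallen out of the window
            self._stamps = [t for t in self._stamps if now - t < self.per]
            if len(self._stamps) >= self.rate:
                # sleep until the oldest request leaves the window
                await asyncio.sleep(self.per - (now - self._stamps[0]))
            self._stamps.append(time.monotonic())

async def demo() -> float:
    limiter = DomainRateLimiter(rate=2, per=0.2)
    start = time.monotonic()
    for _ in range(3):
        await limiter.acquire()  # the third call has to wait ~0.2 s
    return time.monotonic() - start

print(f"{asyncio.run(demo()):.2f}s")
```

Calling `acquire()` before every request to a given domain caps the request rate, which would avoid tripping a server-side 403 like the one in the log above.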

@Jules-WinnfieldX
Owner

@baccccccc, are you running things on Windows?

@baccccccc
Author

yes, Windows.

hold on, I think I might have figured it out.

I noticed a small batch of the following errors in downloader.log. This was unique to this thread; I'd never seen it before. But it repeats a few times every time I try to download this thread.

INFO     : 2024-02-25 13:42:55,717 : utilities.py:95 : Download Starting: https://tbi.sb-cd.com/t/13362571/1/3/w:300/t8-enh/cj-miles-juju-bahreis-luck.jpg
ERROR    : 2024-02-25 13:42:58,495 : utilities.py:95 : Download Failed: https://tbi.sb-cd.com/t/13362571/1/3/w:300/t8-enh/cj-miles-juju-bahreis-luck.jpg with error 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
ERROR    : 2024-02-25 13:42:58,628 : utilities.py:95 : Traceback (most recent call last):
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\downloader\downloader.py", line 34, in wrapper
    return await f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\downloader\downloader.py", line 323, in download
    complete_file, partial_file, proceed, skip_by_config = await self.get_final_file_info(complete_file, partial_file, media_item)
                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\downloader\downloader.py", line 226, in get_final_file_info
    media_item.filesize = await self.client.get_filesize(media_item)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\clients\download_client.py", line 45, in wrapper
    return await func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\clients\download_client.py", line 66, in get_filesize
    await self.client_manager.check_http_status(resp)
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\managers\client_manager.py", line 93, in check_http_status
    response_text = await response.text()
                    ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\aiohttp\client_reqrep.py", line 1147, in text
    return self._body.decode(  # type: ignore[no-any-return,union-attr]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

INFO     : 2024-02-25 13:42:58,629 : utilities.py:95 : Download Finished: https://tbi.sb-cd.com/t/13362571/1/3/w:300/t8-enh/cj-miles-juju-bahreis-luck.jpg

What caught my attention is that those URLs never make it to Download_Error_URLs.csv. So it might be the handling of those errors that goes wrong?
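For context: 0x89 is the first byte of the PNG magic number, so the server apparently returned a binary body for the error response, and decoding it as UTF-8 raises. A standalone sketch of the failure and a defensive decode (not the project's code):

```python
# 0x89 is the first byte of the PNG signature, i.e. the error
# response body was binary image data, not text.
body = b"\x89PNG\r\n\x1a\n"

try:
    body.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # same error as in the log above

# Defensive option: substitute undecodable bytes instead of raising.
text = body.decode("utf-8", errors="replace")
print(text.startswith("\ufffd"))  # the bad byte becomes U+FFFD
```

aiohttp's `ClientResponse.text()` accepts an `errors` argument, so something like `await response.text(errors="replace")` at the call site shown in the traceback would avoid the exception; checking the Content-Type header before decoding is another option.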

I was under the impression you could blocklist individual files in the config, but now I can't find that option. So, to try it out, I just blocked the entire tbi.sb-cd.com domain on my router.

And guess what, the next download attempt finished successfully. No more hangs!

@Jules-WinnfieldX
Owner

I'm unsure if it's as simple as one unhandled exception, but I'll try it. After I figure out why the program won't exit for me...

@baccccccc
Author

baccccccc commented Feb 26, 2024

That must be it.

While I had the tbi.sb-cd.com domain blocked on the router, the app completed successfully, and the following three failures were recorded in downloader.log.

ERROR    : 2024-02-25 19:52:51,703 : utilities.py:95 : Download Failed: https://tbi.sb-cd.com/t/13362571/1/3/w:300/t8-enh/cj-miles-juju-bahreis-luck.jpg with status 1 and message ClientConnectorError(ConnectionKey(host='tbi.sb-cd.com', port=443, is_ssl=True, ssl=False, proxy=URL(''), proxy_auth=None, proxy_headers_hash=-7796739784479401459), gaierror(11001, 'getaddrinfo failed'))

ERROR    : 2024-02-25 19:53:02,800 : utilities.py:95 : Download Failed: https://tbi.sb-cd.com/t/11897232/1/1/w:300/t10-enh/1-pole-2-hoe.jpg with status 1 and message ClientConnectorError(ConnectionKey(host='tbi.sb-cd.com', port=443, is_ssl=True, ssl=False, proxy=URL(''), proxy_auth=None, proxy_headers_hash=-7796739784479401459), gaierror(11001, 'getaddrinfo failed'))

ERROR    : 2024-02-25 19:53:14,953 : utilities.py:95 : Download Failed: https://tbi.sb-cd.com/t/13362571/1/3/w:300/t10-enh/cj-miles-juju-bahreis-luck.jpg with status 1 and message ClientConnectorError(ConnectionKey(host='tbi.sb-cd.com', port=443, is_ssl=True, ssl=False, proxy=URL(''), proxy_auth=None, proxy_headers_hash=-7796739784479401459), gaierror(11001, 'getaddrinfo failed'))

Each was retried a number of times, as configured via download_attempts, and then failed as expected.

And more importantly, three corresponding records made it to Download_Error_URLs.csv.

Subsequently, it all finished with the following stats.

Download Stats:
Downloaded 1 files
Previously Downloaded 1845 files
Skipped By Config 0 files
Failed 289 files

Scrape Failures:
Scrape Failures (404 HTTP Status): 18
Scrape Failures (Unknown): 1
Scrape Failures (DDOS-Guard): 25

Download Failures:
Download Failures (403 HTTP Status): 286
Download Failures (1 HTTP Status): 3

The whole pass, from start to finish, took roughly 12 minutes, and the downloader.log file was 895 KB after it completed.

Now that I unblocked the domain again and restarted the download, it hangs at this point.

Completed                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.00%  0 of 1421 Files
Previously Downloaded    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺ 98.03% 1393 of 1421 Files
Skipped By Configuration ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.00%  0 of 1421 Files
Failed                   ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.83%  26 of 1421 Files

downloader.log is currently 658 KB. The last entry in it was

INFO     : 2024-02-25 20:21:40,562 : utilities.py:95 : Scrape Finished: https://cyberfile.me/o99J

It's been 20 minutes since that record, and no visible progress has been made.

There are only two error records about tbi.sb-cd.com downloads in downloader.log.

INFO     : 2024-02-25 20:20:30,141 : utilities.py:95 : Download Starting: https://tbi.sb-cd.com/t/13362571/1/3/w:300/t8-enh/cj-miles-juju-bahreis-luck.jpg
ERROR    : 2024-02-25 20:20:30,708 : utilities.py:95 : Download Failed: https://tbi.sb-cd.com/t/13362571/1/3/w:300/t8-enh/cj-miles-juju-bahreis-luck.jpg with error 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
ERROR    : 2024-02-25 20:20:30,817 : utilities.py:95 : Traceback (most recent call last):
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\downloader\downloader.py", line 34, in wrapper
    return await f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\downloader\downloader.py", line 323, in download
    complete_file, partial_file, proceed, skip_by_config = await self.get_final_file_info(complete_file, partial_file, media_item)
                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\downloader\downloader.py", line 226, in get_final_file_info
    media_item.filesize = await self.client.get_filesize(media_item)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\clients\download_client.py", line 45, in wrapper
    return await func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\clients\download_client.py", line 66, in get_filesize
    await self.client_manager.check_http_status(resp)
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\managers\client_manager.py", line 93, in check_http_status
    response_text = await response.text()
                    ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\aiohttp\client_reqrep.py", line 1147, in text
    return self._body.decode(  # type: ignore[no-any-return,union-attr]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
INFO     : 2024-02-25 20:20:30,817 : utilities.py:95 : Download Finished: https://tbi.sb-cd.com/t/13362571/1/3/w:300/t8-enh/cj-miles-juju-bahreis-luck.jpg

and

INFO     : 2024-02-25 20:20:30,818 : utilities.py:95 : Download Starting: https://tbi.sb-cd.com/t/11897232/1/1/w:300/t10-enh/1-pole-2-hoe.jpg
ERROR    : 2024-02-25 20:20:32,531 : utilities.py:95 : Download Failed: https://tbi.sb-cd.com/t/11897232/1/1/w:300/t10-enh/1-pole-2-hoe.jpg with error 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
ERROR    : 2024-02-25 20:20:32,533 : utilities.py:95 : Traceback (most recent call last):
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\downloader\downloader.py", line 34, in wrapper
    return await f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\downloader\downloader.py", line 323, in download
    complete_file, partial_file, proceed, skip_by_config = await self.get_final_file_info(complete_file, partial_file, media_item)
                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\downloader\downloader.py", line 226, in get_final_file_info
    media_item.filesize = await self.client.get_filesize(media_item)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\clients\download_client.py", line 45, in wrapper
    return await func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\clients\download_client.py", line 66, in get_filesize
    await self.client_manager.check_http_status(resp)
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\managers\client_manager.py", line 93, in check_http_status
    response_text = await response.text()
                    ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\aiohttp\client_reqrep.py", line 1147, in text
    return self._body.decode(  # type: ignore[no-any-return,union-attr]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
INFO     : 2024-02-25 20:20:32,533 : utilities.py:95 : Download Finished: https://tbi.sb-cd.com/t/11897232/1/1/w:300/t10-enh/1-pole-2-hoe.jpg

and then

INFO     : 2024-02-25 20:20:32,534 : utilities.py:95 : Download Starting: https://tbi.sb-cd.com/t/13362571/1/3/w:300/t10-enh/cj-miles-juju-bahreis-luck.jpg

with no further info (e.g., no errors or successes for this URL).

But there's nothing mentioning tbi.sb-cd.com in Download_Error_URLs.csv.

@nuvibes

nuvibes commented Feb 27, 2024

I've been having a similar issue to @baccccccc's. I have CDL installed on 3 separate Windows 10 VMs using 3 separate URLs.txt files, and now all of them are having the same issue. I've narrowed down which links are causing the hang-up in 2 of the 3 machines' URLs.txt files: https://nudostar.com/forum/threads/ashley-alban.31217 for machine1 and https://nudostar.com/forum/threads/missalexapearl-alexapearl-videos.4288 for machine2. I'm working on isolating the link from machine3.

CDL seems to completely freeze/hang after it reaches a certain point, and it consistently freezes even after restarting the program multiple times. It typically has a blank Scraping UI section, and it will always say ... And ___ Files In Download Queue under Downloads. I've seen the number of stuck downloads range anywhere from as low as ~500 all the way up to ~40k+.

machine1
[screenshot]

machine2
[screenshot]

machine3: this is what it looks like when the Scraping section is not blank. The number of links gets stuck and does not increment.
[screenshot]

No field in the UI from Files, Scrape Failures, or Download Failures increments upward, so it's unlikely it's just a visual bug. To further support that, I've left it running for hours and nothing changes.

The last entry in machine1's Downloader.log file is:

INFO     : 2024-02-27 00:13:12,810 : utilities.py:95 : Download Starting: https://nudostar.com/forum/attachments/nudostar-com77-8-jpg.1550811/

The last entry in machine2's Downloader.log file is:

INFO     : 2024-02-27 00:15:34,756 : utilities.py:95 : Scrape Finished: https://bunkr.sk/a/Kl9xviJo

I'm happy to provide any further info that could be useful.

Edit: I found another link that makes the program consistently freeze: https://nudostar.com/forum/threads/ashley-tervort.100415

@Jules-WinnfieldX
Owner

So @baccccccc, you are correct that URLs failing in that manner aren't added to the failed download log file; however, they aren't hanging up the program. Unsure why it started working for you after blocking that domain.

I'm wondering if some failure types are leaving client sessions open or something of that nature.
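One concrete way an unhandled exception like the one above can hang an asyncio pipeline: if a worker task dies before calling `task_done()`, `queue.join()` never returns. A minimal sketch with hypothetical worker names, not the project's actual code:

```python
import asyncio

async def broken_worker(queue: asyncio.Queue) -> None:
    while True:
        item = await queue.get()
        if item == b"\x89PNG":
            # unhandled error: the task dies and task_done() is never called
            raise ValueError("decode failed")
        queue.task_done()

async def fixed_worker(queue: asyncio.Queue) -> None:
    while True:
        item = await queue.get()
        try:
            if item == b"\x89PNG":
                raise ValueError("decode failed")
        except ValueError:
            pass  # error is contained; the worker keeps running
        finally:
            queue.task_done()

async def run(worker) -> str:
    queue: asyncio.Queue = asyncio.Queue()
    for item in (b"ok", b"\x89PNG", b"ok"):
        queue.put_nowait(item)
    task = asyncio.create_task(worker(queue))
    try:
        # join() only returns once task_done() has been called for every item
        await asyncio.wait_for(queue.join(), timeout=0.2)
        return "finished"
    except asyncio.TimeoutError:
        return "hung"
    finally:
        task.cancel()
        try:
            await task
        except (ValueError, asyncio.CancelledError):
            pass

print(asyncio.run(run(broken_worker)))  # hung
print(asyncio.run(run(fixed_worker)))   # finished
```

If the download workers follow a similar queue pattern, a single escaped exception would strand every remaining item, which matches the "no progress, still responsive" symptom described above.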

I'm going to try some of the above threads with my hyper logged version and see if it can give me any insights.

@Jules-WinnfieldX
Owner

To make it more fun, none of the links @nuvibes suggested above actually freeze for me. Some bunkr files fail out on retries, but that's about it for waits.

@Jules-WinnfieldX
Owner

@baccccccc I figured it out. Roundabout way to get there, though.

I should have an update coming later today for you to test.

@Jules-WinnfieldX
Owner

5.1.73 just went up, that should solve this.

If it doesn't feel free to reopen the issue.

@baccccccc
Author

Thanks! Running it right now. Is it expected that it's now constantly flickering around the top of the window, saying something about file locks?

@Jules-WinnfieldX
Owner

Damn. No, I'll sort that.

@Jules-WinnfieldX
Owner

5.1.74 should solve that.

@baccccccc
Author

OK, looks like the hangs are indeed fixed, yay!

Now I gotta rant about DDoS-Guard on bunkr.

3 participants