www.immobilienscout24.de crawl_immobilienscout.py|ERROR ]: Index error occurred #214
Comments
Yes - that's very likely. Crawling immoscout without 2captcha / imagetyperz support is expected to fail. Does it work if you configure the captcha solving?
@codders is there any difference between 2captcha and imagetyperz? Thanks.
There isn't much difference. For a while we had problems with 2captcha, so we integrated imagetyperz as a backup, but 2captcha has been working fine again for a while. I would suggest 2captcha for now - that's what I'm using, so at least if it breaks there is someone trying to fix it for you :)
@codders sure, thanks. Still getting a weird error :(
Can you set verbose_logging to true in your config and try again? Also might be good to clear out the webdriver-manager cache (/home/flathunter/.wdm).
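For anyone following along, clearing that cache can also be done from Python. This is only a minimal sketch: the helper name `clear_wdm_cache` is made up for illustration, and it assumes the cache lives in the `.wdm` directory mentioned above.

```python
import shutil
from pathlib import Path


def clear_wdm_cache(cache_dir: Path) -> bool:
    """Delete the given webdriver-manager cache directory if it exists.

    Returns True if a directory was removed, False if there was nothing
    to remove (for example, because no driver was ever downloaded).
    """
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True
```

Calling `clear_wdm_cache(Path.home() / ".wdm")` as the flathunter user would target /home/flathunter/.wdm. A return value of False just means the directory is not there.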
@codders I turned verbose logging on, but I can't see that folder for the webdriver-manager cache. Do I have to install the webdriver-manager cache? flathunter@docker-base:~/flathunter$ ls /home/flathunter/.wdm The error is here:
Seems like you're not the only person with this issue: What can you tell me about your execution environment? Is Google Chrome / Chromium definitely installed? Are you running inside any kind of container or virtualisation?
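One quick way to answer the "is Chrome / Chromium installed" question is to look for a browser executable on PATH. The sketch below is an assumption-laden example, not part of flathunter: the function name is hypothetical, and the list of executable names covers common Linux packages only.

```python
import shutil
from typing import Optional, Sequence

# Common executable names on Linux. This list is an assumption and not exhaustive.
CANDIDATES = ("google-chrome", "google-chrome-stable", "chromium", "chromium-browser")


def find_browser(candidates: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Return the full path of the first executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print(find_browser() or "no Chrome/Chromium executable found on PATH")
```

If this prints a path, a browser binary is at least present for the user running the script. It says nothing about whether the matching driver version is installed.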
@codders running it as a Linux user on an Ubuntu server. Thanks.
@codders okay, I installed chrome driver and chromium browser and executed the code as follows: flathunter@heap-virtual-machine:~/flathunter$ /home/flathunter/.local/bin/pipenv run python flathunt.py And that's the error: `Traceback (most recent call last): `
Maybe the issue is that Ubuntu doesn't have Chrome but only Chromium?
Similar issue?
I run GNOME; executed it via gnome ... and the browser opened.
The configuration regarding Telegram might be confusing? The receiver_ids value is a negative number... so in the configuration I assume it should be set as follows: receiver_ids: Correct? Also, when I run the script now, I am getting logs like the one below. Is that correct behavior? [2022/09/10 00:57:18|_common.py |INFO ]: Backing off get_soup_from_url(...) for 0.4s (selenium.common.exceptions.TimeoutException: Message:
Hey @heapxor, sorry you're having some troubles here. It would certainly make sense for us to update the documentation based on the spots that didn't make sense for you. Incidentally, if you are looking for a flat in Berlin, you might also have success just using the hosted version at https://flathunter.codders.io - you can just log in there with Telegram and set a (basic) filter. But otherwise, I hope to have some time in the next few days to look at your issues, or else maybe someone else can support you.
@heapxor , I don't think your Telegram ID should be a negative number. How did you get that? To make chrome work, maybe try these driver-arguments:
@codders, where can I put these driver arguments? Thanks!
@codders cool, will try the arguments! Thanks. Just wondering ... is that something that has to be analyzed further, or is that okay? Yes, I will try to add more RAM to that machine.
The CAPCHA_NOT_READY message is very normal. That happens every time a captcha is solved. CaptchaUnsolvableError also happens from time to time. Sometimes, 2captcha just can't solve the captcha. When you get these errors, it's best just to restart. I want to change the code soon so that it retries if it gets a CaptchaUnsolvableError. But if you see that message every time, it is probably a problem with 2captcha (or something on the ImmoScout website has changed). For now, you can either run the code as a cron job (set it to run every 10 minutes and then quit, by disabling the 'loop' option), or you can run it as a systemd service (there is some documentation around that). Systemd will restart it when it exits. TimeoutException is also possible. It's not a bad thing if that happens - the system will retry.
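The "retry if it gets a CaptchaUnsolvableError" idea could look roughly like the sketch below. This is not flathunter's code: the exception class is a local stand-in with the same name so the example runs on its own, and `retry_on_unsolvable`, `attempts` and `delay_seconds` are hypothetical names chosen for illustration.

```python
import time


class CaptchaUnsolvableError(Exception):
    """Local stand-in for the error named above, so the sketch runs alone."""


def retry_on_unsolvable(solve, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call solve() until it succeeds or `attempts` tries have failed.

    Only CaptchaUnsolvableError triggers a retry; any other exception
    propagates immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return solve()
        except CaptchaUnsolvableError:
            if attempt == attempts:
                raise  # give up and let cron/systemd restart the process
            sleep(delay_seconds)
```

Capping the number of attempts matters here because each attempt presumably costs a 2captcha request; after the last failure the error is re-raised, so the cron or systemd restart described above still applies.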
@codders, sounds cool. Okay, disabling loop and executing via cron makes sense; that way I can avoid the issue with the captcha and it should be safe. Where do you set that TimeoutException? Or is it planned to be developed? Thanks!
Running more quickly is also okay for Immoscout. With eBay Kleinanzeigen you can get an IP block if you crawl too quickly. Just be aware there is no locking / concurrency control, so if the previous run didn't finish after 4 mins, you will have two flathunters at once, which will have weird effects. For the timeouts and other errors, there is no plan right now. People who want it to be different make pull requests :)
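Since there is no locking, a cron setup could guard against overlapping runs itself. Below is a minimal sketch using an advisory file lock. It assumes Linux (the `fcntl` module is Unix-only), and the function name and any lock path you pass in are hypothetical, not something flathunter provides.

```python
import fcntl


def acquire_single_instance_lock(lock_path):
    """Try to take an exclusive, non-blocking advisory lock on lock_path.

    Returns the open file object on success; keep a reference to it for
    the life of the process, because closing it releases the lock.
    Returns None if another process already holds the lock.
    """
    lock_file = open(lock_path, "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    return lock_file
```

A wrapper script would call this first and exit straight away when it returns None, so a run that starts while the previous one is still crawling does nothing. The lock is released automatically when the process exits, even after a crash.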
@codders, any idea why?
Hm, so I commented it out and it works.
@heapxor Can we mark this issue as closed? Do you want to add some information to the README about your tips for making it work successfully?
These driver arguments did the trick for me. Currently running on Ubuntu. I also had to install Chrome as suggested by @heapxor.
Okay. I'll mark this as closed. If you want to make a PR to update the documentation about the Chrome requirement for 2captcha support, that would be very welcome :)
Hello,
I am using the following URL in the config:
urls:
- https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten?numberofrooms=2.5-&price=-1000.0&livingspace=65.0-&pricetype=rentpermonth&sorting=2
After execution I am getting the following error. Is that because 2captcha is missing in the config file?
flathunter@docker-base:~/flathunter$ /home/flathunter/.local/bin/pipenv run python flathunt.py
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
[2022/09/05 15:41:58|config.py |INFO ]: Using config path /home/flathunter/flathunter/config.yaml
[2022/09/05 15:41:58|crawl_immobilienscout.py|ERROR ]: Index error occurred
^CTraceback (most recent call last):
  File "/home/flathunter/flathunter/flathunt.py", line 110, in <module>
    main()
  File "/home/flathunter/flathunter/flathunt.py", line 106, in main
    launch_flat_hunt(config, heartbeat)
  File "/home/flathunter/flathunter/flathunt.py", line 36, in launch_flat_hunt
    time.sleep(config.loop_period_seconds())
KeyboardInterrupt
thanks!