[Request] Add option for sleep interval between page crawls to avoid captchas/rate limits #131

Closed
Fs00 opened this issue Mar 24, 2022 · 2 comments · Fixed by #257
Fs00 commented Mar 24, 2022

Hello!
I'm trying to crawl a huge website that starts asking for captchas after a few hundred pages have been crawled in a short amount of time.
Since setting workers=1 is not enough to avoid hitting the captcha "rate limit", I'd like to request an option to specify a custom sleep interval (e.g. 5 seconds) that makes the crawler wait for that long before crawling the next page.
youtube-dl has a similar option, and in my experience it has been useful in similar circumstances.
Thanks!
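
For illustration, the requested behavior can be sketched as a crawl loop that pauses after each page. This is a minimal sketch, not the crawler's code: `crawl_with_delay`, `fetch`, and the injectable `sleep` parameter are hypothetical names introduced here, and the 5-second default only mirrors the example value in the request.

```python
import time
from typing import Callable, Iterable, List


def crawl_with_delay(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in order, pausing after every page.

    All names are hypothetical. `sleep` is injectable so the loop can be
    exercised without actually waiting.
    """
    results = []
    for url in urls:
        results.append(fetch(url))
        # Do nothing for the configured interval before the next page.
        if delay_seconds > 0:
            sleep(delay_seconds)
    return results
```

Passing a recording function as `sleep` makes it easy to check that exactly one pause is issued per page.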

ikreymer (Member) commented

Yeah, that makes sense and is easy to add. Are you thinking it would sleep after every page, or after every N pages?

Fs00 commented Mar 24, 2022

I think sleeping after every page should be good enough.
With N workers that each sleep after every page, the behavior is similar to sleeping after every N pages.
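
A rough way to see the effect on request rate: if each worker spends some time on a page and then sleeps, the steady-state throughput scales with the number of workers. The sketch below assumes every page takes the same time and that workers run fully in parallel. Both are simplifications, and `pages_per_minute` is a hypothetical helper, not part of the crawler.

```python
def pages_per_minute(workers: int, page_seconds: float, delay_seconds: float) -> float:
    """Approximate steady-state crawl rate.

    Assumes each worker repeats: load a page (page_seconds), then sleep
    (delay_seconds), with all workers running in parallel.
    """
    cycle = page_seconds + delay_seconds
    return workers * 60.0 / cycle


# One worker, 1 s per page, 5 s delay: one page every 6 s.
single = pages_per_minute(1, 1.0, 5.0)
# Four workers with the same settings: a burst of four pages every 6 s.
quad = pages_per_minute(4, 1.0, 5.0)
```

So with a per-page delay, the number of workers sets the size of each burst, which is why a separate "every N pages" option adds little.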

@SuaYoo SuaYoo moved this from Triage to Todo in Webrecorder Projects Mar 8, 2023
@tw4l tw4l moved this from Todo to Ready for Dev in Webrecorder Projects Mar 8, 2023
@tw4l tw4l moved this from Ready for Dev to Dev In Progress in Webrecorder Projects Mar 21, 2023
@tw4l tw4l moved this from Dev In Progress to PR In Review in Webrecorder Projects Mar 21, 2023
ikreymer pushed a commit that referenced this issue Mar 22, 2023

* Add --pageExtraDelay option to add extra delay/wait time after every page (fixes #131)

* Store total page time in 'maxPageTime', include pageExtraDelay

* Rename timeout->pageLoadTimeout

* cleanup:
  - store seconds for most interval checks, convert to ms only for api calls, remove most sec<->ms conversions
  - add secondsElapsed() utility function to help checking time elapsed
  - cleanup comments

---------
Co-authored-by: Ilya Kreymer
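
The pattern described in the commit message (keep intervals in seconds, convert to milliseconds only at the API boundary, and use a small elapsed-time helper) might look roughly like the following. This is a Python sketch of the idea, not the crawler's actual code: the function names are adaptations, the injectable `now` parameter is added here for testability, and the formula for the total page time is an assumption, since the commit only says that it includes the extra delay.

```python
import time
from typing import Optional


def seconds_elapsed(start_time: float, now: Optional[float] = None) -> float:
    """Return seconds elapsed since start_time (both in seconds)."""
    if now is None:
        now = time.monotonic()
    return now - start_time


def to_millis(seconds: float) -> int:
    """Convert to milliseconds only where an API requires it."""
    return int(round(seconds * 1000))


def max_page_time(page_load_timeout: float, page_extra_delay: float) -> float:
    """Total time budget for one page, in seconds.

    Assumption: modeled here as load timeout plus extra delay; the real
    calculation may include other terms.
    """
    return page_load_timeout + page_extra_delay
```

Keeping a single unit internally means interval checks such as `seconds_elapsed(start) > max_page_time(90, 5)` need no conversions.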
@github-project-automation github-project-automation bot moved this from PR In Review to Done! in Webrecorder Projects Mar 22, 2023