[Request] Add option for sleep interval between page crawls to avoid captchas/rate limits #131

Fs00 · 2022-03-24T18:06:05Z

Hello!
I'm trying to crawl a huge website that starts asking for captchas after a few hundred pages have been crawled in a short amount of time.
Since setting workers=1 is not enough to avoid hitting the captcha "rate limit", I'm here to ask for the addition of an option to specify a custom sleep interval (e.g. 5 seconds) which makes the crawler do nothing for the specified amount of time before crawling the next page.
Youtube-dl has a similar option too, and in my experience it has been useful in other similar circumstances.
Thanks!

ikreymer · 2022-03-24T18:58:33Z

Yeah, that makes sense and is easy to add. Are you thinking it would sleep after every page, or after every N pages?

Fs00 · 2022-03-24T19:53:35Z

I think that sleeping after every page should be good enough.
Having N workers that each sleep after every page gives behavior similar to sleeping after every N pages.
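
For illustration, the per-page sleep discussed above could be sketched roughly as follows. This is a minimal sketch in Python, not the crawler's own code: `crawl_with_delay`, `fetch`, `page_extra_delay`, and the injectable `sleep` parameter are all hypothetical names chosen for the example.

```python
import time
from typing import Callable, Iterable, List


def crawl_with_delay(
    urls: Iterable[str],
    fetch: Callable[[str], None],
    page_extra_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing after every page.

    `fetch` stands in for whatever loads and processes a page
    (hypothetical). `sleep` is injectable so the pause can be
    replaced in tests instead of actually waiting.
    """
    crawled: List[str] = []
    for url in urls:
        fetch(url)
        crawled.append(url)
        # Do nothing for the configured interval before the next page.
        if page_extra_delay > 0:
            sleep(page_extra_delay)
    return crawled
```

With a single worker, a page that takes t seconds to load followed by a d-second pause yields roughly one page every t + d seconds. N workers running this loop in parallel would request about N pages in the same window, which is why a per-page sleep with N workers resembles sleeping after every N pages.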

* Add --pageExtraDelay option to add extra delay/wait time after every page (fixes #131)
* Store total page time in 'maxPageTime', include pageExtraDelay
* Rename timeout->pageLoadTimeout
* cleanup:
  - store seconds for most interval checks, convert to ms only for api calls, remove most sec<->ms conversions
  - add secondsElapsed() utility function to help checking time elapsed
  - cleanup comments

---------

Co-authored-by: Ilya Kreymer <[email protected]>
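
The commit message above mentions keeping intervals in seconds, converting to milliseconds only at API boundaries, and a `secondsElapsed()` helper. The idea might look something like this in Python. It is only a sketch under assumptions: `seconds_elapsed` and `to_ms` are hypothetical names, and the actual helper in the crawler may take different arguments or behave differently.

```python
import time
from typing import Optional


def seconds_elapsed(start: float, now: Optional[float] = None) -> float:
    """Return the seconds elapsed since `start`.

    `start` is assumed to be an earlier time.monotonic() reading.
    `now` can be passed explicitly, which keeps the function testable.
    """
    if now is None:
        now = time.monotonic()
    return now - start


def to_ms(seconds: float) -> int:
    """Convert seconds to whole milliseconds, for APIs that expect ms."""
    return int(round(seconds * 1000))
```

Keeping every stored interval in one unit and converting in a single place avoids the scattered sec<->ms conversions that the cleanup item describes.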

ikreymer mentioned this issue Feb 24, 2023

Crawl Config Limits: Additional Time Limits webrecorder/browsertrix#636

Closed

SuaYoo added this to Webrecorder Projects Mar 8, 2023

github-project-automation bot moved this to Triage in Webrecorder Projects Mar 8, 2023

SuaYoo assigned tw4l Mar 8, 2023

SuaYoo moved this from Triage to Todo in Webrecorder Projects Mar 8, 2023

tw4l moved this from Todo to Ready for Dev in Webrecorder Projects Mar 8, 2023

tw4l moved this from Ready for Dev to Dev In Progress in Webrecorder Projects Mar 21, 2023

tw4l mentioned this issue Mar 21, 2023

Add option for sleep interval after behaviors run #257

Merged

tw4l moved this from Dev In Progress to PR In Review in Webrecorder Projects Mar 21, 2023

ikreymer closed this as completed in #257 Mar 22, 2023

github-project-automation bot moved this from PR In Review to Done! in Webrecorder Projects Mar 22, 2023
