Resume failed browsertrix crawls #436

Open
benoit74 opened this issue Nov 26, 2024 · 0 comments
@benoit74 (Collaborator)

Every now and then, we have very long crawls to perform.

E.g. https://farm.openzim.org/recipes/shamela.ws_ar_al-tafsir-3 has ~500k pages to grab. Or https://farm.openzim.org/recipes/ubuntuforums.org_en_all, which has already discovered ~400k pages.

This poses two challenges for Browsertrix Crawler (warc2zim is always "quite fast"): the duration of the crawl and its stability. To reduce the duration, we usually run multiple workers (typically 4) in parallel, but this seems to have a detrimental impact on the stability of the crawl. Or at least, it often happens that the crawl fails with a browser crash, "disconnected", "execution context destroyed", ...

I think we could enhance zimit by automatically restarting the crawl after a failure. I know Browsertrix Cloud is capable of doing this, probably based on https://crawler.docs.browsertrix.com/user-guide/common-options/#saving-crawl-state-interrupting-and-restarting-the-crawl

The most difficult part will of course be knowing when it is "worth it" to restart the crawler.
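
For illustration, here is a minimal sketch of what such an automatic-restart loop could look like, assuming the crawler is run through Docker with the `--saveState`, `--saveStateInterval`, `--statsFilename` and `--config` options described in the linked documentation. The saved-state location, the stats-file keys, the 5% progress threshold and the retry count are assumptions made up for this example, not existing zimit behaviour:

```python
import glob
import json
import os
import subprocess

MAX_RETRIES = 3
MIN_PROGRESS = 0.05  # arbitrary: only restart if at least 5% of known pages were crawled

def run_crawl(collection: str, seed_url: str, state_file: str | None = None) -> int:
    """Run one crawl attempt, resuming from a saved crawl state when one is given."""
    cmd = [
        "docker", "run", "-v", f"{os.getcwd()}/crawls:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--collection", collection,
        "--workers", "4",
        "--saveState", "always",       # persist crawl state periodically, not only on interrupt
        "--saveStateInterval", "300",
        "--statsFilename", "/crawls/stats.json",
    ]
    if state_file:
        cmd += ["--config", state_file]  # resume the crawl from the saved state YAML
    else:
        cmd += ["--url", seed_url]
    return subprocess.run(cmd).returncode

def latest_saved_state(collection: str) -> str | None:
    """Most recent saved-state YAML (location under the collection dir is assumed)."""
    states = sorted(glob.glob(f"crawls/collections/{collection}/crawls/*.yaml"))
    if not states:
        return None
    # translate the host path into the path seen inside the container
    return f"/crawls/collections/{collection}/crawls/{os.path.basename(states[-1])}"

def worth_restarting() -> bool:
    """Naive "worth it" heuristic based on the crawler stats file (keys assumed)."""
    try:
        with open("crawls/stats.json") as fh:
            stats = json.load(fh)
    except (OSError, ValueError):
        return False
    total = stats.get("total", 0)
    return total > 0 and stats.get("crawled", 0) / total >= MIN_PROGRESS

def crawl_with_retries(collection: str, seed_url: str) -> None:
    for attempt in range(1 + MAX_RETRIES):
        state = latest_saved_state(collection) if attempt > 0 else None
        if attempt > 0 and (state is None or not worth_restarting()):
            break  # nothing to resume from, or not enough progress to justify a retry
        if run_crawl(collection, seed_url, state) == 0:
            return
    raise RuntimeError(f"crawl of {seed_url} did not complete after {MAX_RETRIES} retries")
```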
