Resume failed browsertrix crawls #436

Open
benoit74 opened this issue Nov 26, 2024 · 0 comments
@benoit74 (Collaborator)

Every now and then, we have very long crawls to perform.

E.g. https://farm.openzim.org/recipes/shamela.ws_ar_al-tafsir-3 has ~500k pages to grab. Or https://farm.openzim.org/recipes/ubuntuforums.org_en_all, which has already discovered ~400k pages.

This poses two challenges for Browsertrix Crawler (warc2zim is always "quite fast"): the duration of the crawl and its stability. To reduce the duration, we usually run multiple workers (typically 4) in parallel, but this seems to have a detrimental impact on the stability of the crawl. Or at least, it often happens that the crawl fails with a browser crash, "disconnected", "execution context destroyed", ...

I think we could enhance zimit by automatically restarting the crawl after a failure. I know Browsertrix Cloud is capable of doing this, probably based on https://crawler.docs.browsertrix.com/user-guide/common-options/#saving-crawl-state-interrupting-and-restarting-the-crawl

The most difficult part will of course be knowing when it is "worth it" to restart the crawler.
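
For illustration, here is a minimal sketch of what such an automatic-restart loop could look like, assuming the crawler is run through Docker with the `--saveState`, `--saveStateInterval`, `--statsFilename` and `--config` options described in the linked documentation. The saved-state location, the stats-file keys, the 5% progress threshold and the retry count are assumptions made up for this example, not existing zimit behaviour:

```python
import glob
import json
import os
import subprocess

MAX_RETRIES = 3
MIN_PROGRESS = 0.05  # arbitrary: only restart if at least 5% of known pages were crawled

def run_crawl(collection: str, seed_url: str, state_file: str | None = None) -> int:
    """Run one crawl attempt, resuming from a saved crawl state when one is given."""
    cmd = [
        "docker", "run", "-v", f"{os.getcwd()}/crawls:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--collection", collection,
        "--workers", "4",
        "--saveState", "always",       # persist crawl state periodically, not only on interrupt
        "--saveStateInterval", "300",
        "--statsFilename", "/crawls/stats.json",
    ]
    if state_file:
        cmd += ["--config", state_file]  # resume the crawl from the saved state YAML
    else:
        cmd += ["--url", seed_url]
    return subprocess.run(cmd).returncode

def latest_saved_state(collection: str) -> str | None:
    """Most recent saved-state YAML (location under the collection dir is assumed)."""
    states = sorted(glob.glob(f"crawls/collections/{collection}/crawls/*.yaml"))
    if not states:
        return None
    # translate the host path into the path seen inside the container
    return f"/crawls/collections/{collection}/crawls/{os.path.basename(states[-1])}"

def worth_restarting() -> bool:
    """Naive "worth it" heuristic based on the crawler stats file (keys assumed)."""
    try:
        with open("crawls/stats.json") as fh:
            stats = json.load(fh)
    except (OSError, ValueError):
        return False
    total = stats.get("total", 0)
    return total > 0 and stats.get("crawled", 0) / total >= MIN_PROGRESS

def crawl_with_retries(collection: str, seed_url: str) -> None:
    for attempt in range(1 + MAX_RETRIES):
        state = latest_saved_state(collection) if attempt > 0 else None
        if attempt > 0 and (state is None or not worth_restarting()):
            break  # nothing to resume from, or not enough progress to justify a retry
        if run_crawl(collection, seed_url, state) == 0:
            return
    raise RuntimeError(f"crawl of {seed_url} did not complete after {MAX_RETRIES} retries")
```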
