Every now and then, we have a very long crawl to perform.
E.g. https://farm.openzim.org/recipes/shamela.ws_ar_al-tafsir-3 has ~500k pages to grab, and https://farm.openzim.org/recipes/ubuntuforums.org_en_all has already discovered ~400k pages.
This poses two challenges to Browsertrix Crawler (warc2zim is always "quite fast"): the duration of the crawl, and its stability. To reduce the duration, we usually run multiple workers in parallel (typically 4), but this seems to have a detrimental impact on the stability of the crawl: it often fails with errors like browser crash, disconnected, or execution context destroyed.
I think we could enhance zimit by automatically restarting the crawl after a failure. I know Browsertrix Cloud is capable of doing this, probably based on https://crawler.docs.browsertrix.com/user-guide/common-options/#saving-crawl-state-interrupting-and-restarting-the-crawl
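As a rough sketch of what such a retry wrapper could look like in zimit (Python): it relies on the `--saveState` / `--config` options described in the docs linked above, and assumes the crawl state YAML lands in the collection's `crawls/` folder; `MAX_RETRIES` is a made-up policy, not anything zimit has today:

```python
import glob
import subprocess

MAX_RETRIES = 3  # hypothetical cap; zimit would need to pick a real policy


def run_crawl_with_restarts(crawl_args: list[str], collection_dir: str) -> int:
    """Run Browsertrix Crawler, restarting from saved state after a failure."""
    config: str | None = None
    for _attempt in range(1 + MAX_RETRIES):
        cmd = ["crawl", "--saveState", "always", *crawl_args]
        if config:
            # Resume from the state file written by the previous attempt.
            cmd += ["--config", config]
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return 0
        # Per the docs, crawl state YAMLs are written under the collection's
        # crawls/ subdirectory; the glob pattern here is an assumption.
        states = sorted(glob.glob(f"{collection_dir}/crawls/crawl-*.yaml"))
        if not states:
            break  # nothing to resume from: give up
        config = states[-1]
    return result.returncode
```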
The most difficult part will of course be knowing when it is "worth it" to restart the crawler.
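One possible heuristic: only restart if the failed attempt still made meaningful progress compared to the previous one. The sketch below assumes the saved state YAML exposes a list of finished pages; the `state.finished` field name and the `MIN_NEW_PAGES` threshold are guesses, not the documented schema:

```python
import yaml  # PyYAML

MIN_NEW_PAGES = 100  # hypothetical threshold of progress justifying a restart


def count_done(state_path: str) -> int:
    # "state.finished" is an assumed field name; the real schema of the
    # Browsertrix Crawler state YAML would need to be checked.
    with open(state_path) as fh:
        state = yaml.safe_load(fh)
    return len(state.get("state", {}).get("finished", []))


def worth_restarting(prev_state: str | None, new_state: str) -> bool:
    """Restart only if the failed attempt made meaningful progress."""
    done_before = count_done(prev_state) if prev_state else 0
    done_after = count_done(new_state)
    return done_after - done_before >= MIN_NEW_PAGES
```

A stall detector like this would also naturally stop the retry loop when the crawl keeps dying on the same pages without advancing.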