Expand use of failOnFailedSeed option #360

ldko · 2023-08-29T20:55:00Z

I was looking at where --failOnFailedSeed was added in #300 and am wondering whether it would generally make sense (it does for my use case) for that option to also apply when a seed can't be parsed. Right now, if I crawl with a seed list (generated automatically from URLs extracted from a document) that contains the URL https://doi.org10.1901/jaba.2012.45-85, the crawl aborts. That URL is missing a slash after the .org and is presumably read as having a TLD of .1901. Perhaps if failOnFailedSeed is false, such a problem could be logged at a level lower than fatal, so the crawl doesn't abort?


tw4l · 2023-09-15T17:55:39Z

Perhaps we could have the following behavior for seeds that aren't parsed successfully:

- If it's the only seed or --failOnFailedSeed is set, fail the crawl.
- If there are multiple seeds, not all of them are invalid, and --failOnFailedSeed is not set, log an error and continue the crawl.

tw4l · 2023-09-15T17:57:11Z

In simpler terms, if --failOnFailedSeed is not set, we'll want to check that we have at least one valid seed and continue on if so or fail if not.
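
The rule described above could be sketched roughly as follows. This is only an illustration of the proposed decision, not the crawler's actual code: the names `partitionSeeds` and `shouldContinueCrawl` are hypothetical, and the sketch assumes a seed counts as invalid whenever the runtime's built-in `URL` constructor rejects it.

```typescript
// Hypothetical helper types and names, used for illustration only.
type SeedCheck = {
  valid: string[];
  invalid: string[];
};

// Split seeds into those the built-in URL parser accepts and those it rejects.
function partitionSeeds(seeds: string[]): SeedCheck {
  const valid: string[] = [];
  const invalid: string[] = [];
  for (const seed of seeds) {
    try {
      new URL(seed); // throws a TypeError when the seed cannot be parsed
      valid.push(seed);
    } catch {
      invalid.push(seed);
    }
  }
  return { valid, invalid };
}

// Returns true if the crawl should proceed, false if it should fail.
function shouldContinueCrawl(
  check: SeedCheck,
  failOnFailedSeed: boolean,
): boolean {
  // With no valid seed at all there is nothing to crawl, regardless of the flag.
  if (check.valid.length === 0) {
    return false;
  }
  // With the flag set, any invalid seed fails the crawl.
  if (check.invalid.length > 0 && failOnFailedSeed) {
    return false;
  }
  // Otherwise the caller would log the invalid seeds as errors and continue.
  return true;
}
```

With a list such as `["https://example.com/", "not a url"]`, the second entry ends up in `invalid`, and the crawl continues only when `failOnFailedSeed` is false. Whether the malformed DOI URL from the original report is rejected depends on the URL parser of the runtime in use.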

ldko · 2023-09-15T17:57:29Z

That logic makes sense to me. :)

github-project-automation bot added this to Webrecorder Projects Aug 29, 2023

github-project-automation bot moved this to Triage in Webrecorder Projects Aug 29, 2023

tw4l moved this from Triage to Todo in Webrecorder Projects Sep 15, 2023

tw4l self-assigned this Sep 15, 2023

tw4l mentioned this issue Sep 15, 2023

Add --failOnFailedSeed crawler arg as option webrecorder/browsertrix#1180

Closed

tw4l moved this from Todo to Ready for Dev in Webrecorder Projects Sep 21, 2023

tw4l moved this from Ready for Dev to Dev In Progress in Webrecorder Projects Sep 25, 2023

tw4l mentioned this issue Sep 26, 2023

Set new logic for invalid seeds #395

Merged

tw4l moved this from Dev In Progress to PR In Review in Webrecorder Projects Sep 26, 2023

tw4l closed this as completed in #395 Sep 29, 2023

github-project-automation bot moved this from PR In Review to Done! in Webrecorder Projects Sep 29, 2023

ikreymer mentioned this issue Feb 7, 2025

Retry Improvements + Rate Limit Support #758

Open
