Expand use of failOnFailedSeed option #360

Closed
ldko opened this issue Aug 29, 2023 · 3 comments · Fixed by #395

@ldko

ldko commented Aug 29, 2023

I was looking at where --failOnFailedSeed was added in #300, and I'm wondering whether it would make sense for the option to also apply when a seed can't be parsed (it would for my use case). Right now, if I crawl with a seed list (generated automatically from URLs extracted from a document) that contains the URL https://doi.org10.1901/jaba.2012.45-85, which is missing a slash after the .org and is presumably read as having a TLD of .1901, the crawl aborts. Perhaps if failOnFailedSeed is false, such a problem could be logged at a level lower than fatal so that the crawl doesn't abort?

@tw4l
Member

tw4l commented Sep 15, 2023

Perhaps we could have the following behavior for seeds that aren't parsed successfully:

  • If it's the only seed or --failOnFailedSeed is set, fail the crawl
  • If there are multiple seeds, not all of them are invalid, and --failOnFailedSeed is not set, log an error and continue the crawl

@tw4l
Member

tw4l commented Sep 15, 2023

In simpler terms, if --failOnFailedSeed is not set, we'll want to check that we have at least one valid seed: if so, continue the crawl; if not, fail.
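
A rough sketch of that check, to make the proposal concrete. This is not the crawler's actual code: `checkSeeds` and `SeedCheck` are hypothetical names, and using the standard `URL` constructor as the validity test is an assumption about how seeds would be parsed.

```typescript
// Hypothetical helper illustrating the proposed behavior; not taken from the crawler.
type SeedCheck = {
  valid: string[];      // seeds that parsed successfully
  invalid: string[];    // seeds that failed to parse (would be logged as errors)
  shouldAbort: boolean; // true if the crawl should fail
};

function checkSeeds(seeds: string[], failOnFailedSeed: boolean): SeedCheck {
  const valid: string[] = [];
  const invalid: string[] = [];

  for (const seed of seeds) {
    try {
      // Assumption: a seed counts as valid if the URL constructor accepts it.
      new URL(seed);
      valid.push(seed);
    } catch {
      invalid.push(seed);
    }
  }

  // Fail if nothing usable is left, or if the flag is set and any seed is bad.
  const shouldAbort =
    valid.length === 0 || (failOnFailedSeed && invalid.length > 0);

  return { valid, invalid, shouldAbort };
}
```

With the flag unset, a list holding one good seed and one unparseable seed would continue with the good one; with the flag set, or with no good seeds at all, `shouldAbort` is true.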

@ldko
Author

ldko commented Sep 15, 2023

That logic makes sense to me. :)

@tw4l tw4l moved this from Triage to Todo in Webrecorder Projects Sep 15, 2023
@tw4l tw4l self-assigned this Sep 15, 2023
@tw4l tw4l moved this from Todo to Ready for Dev in Webrecorder Projects Sep 21, 2023
@tw4l tw4l moved this from Ready for Dev to Dev In Progress in Webrecorder Projects Sep 25, 2023
@tw4l tw4l moved this from Dev In Progress to PR In Review in Webrecorder Projects Sep 26, 2023
@tw4l tw4l closed this as completed in #395 Sep 29, 2023
@github-project-automation github-project-automation bot moved this from PR In Review to Done! in Webrecorder Projects Sep 29, 2023