Catch 4xx and 5xx page.goto() responses to mark invalid URLs as failed #300

tw4l · 2023-04-25T20:58:19Z

Fixes #286

Also fixes #207 by introducing a --failOnFailedSeed CLI option which, when enabled, will fail the crawl with a status code of 1 if there is a page load error on one of the initial seeds.

Resolves issue #207
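
For readers following along, here is a minimal sketch of the behavior described above. It is not the code from this PR. The helper name `loadPage`, its options object, and the returned fields are hypothetical. The only real API assumed is that Puppeteer's `page.goto()` resolves to a response object (or `null`) with a `status()` method.

```javascript
// Hypothetical helper illustrating the idea; names and return shape are assumptions.
// `page` is expected to behave like a Puppeteer Page: goto(url) resolves to a
// response exposing status(), or to null.
async function loadPage(page, url, { isSeed = false, failOnFailedSeed = false } = {}) {
  let status = null;
  let failed = false;

  try {
    const resp = await page.goto(url);
    status = resp ? resp.status() : null;
    // Treat any 4xx or 5xx response as a page load error.
    failed = status !== null && status >= 400;
  } catch (e) {
    // Navigation errors (DNS failure, timeout, etc.) also count as failures.
    failed = true;
  }

  // A failed seed is only fatal when the option is enabled.
  const fatal = failed && isSeed && failOnFailedSeed;
  return { url, status, failed, fatal };
}
```

A caller could then end the crawl with `process.exit(1)` when `fatal` is true, which would match the exit status of 1 described for `--failOnFailedSeed`.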

tw4l · 2023-04-25T21:00:03Z

@rgaudin, curious to get your eyes on the new proposed CLI option to see if it'll work for your use case!

rgaudin · 2023-04-26T09:10:45Z

@tw4l, thank you. That looks good 👍

crawler.js

util/argParser.js

tw4l added 3 commits April 25, 2023 16:19

Catch 400 pywb errors on page load and mark page failed

c44e6e3

Add --failOnFailedSeed option to fail crawl if seed doesn't load

818d4c9

Resolves issue #207

Handle 4xx or 5xx page.goto responses as page load errors

26d83df

tw4l requested a review from ikreymer April 25, 2023 20:58

ikreymer reviewed Apr 26, 2023

crawler.js (outdated review comment, resolved)

ikreymer reviewed Apr 26, 2023

util/argParser.js (outdated review comment, resolved)

Code review changes

fa52432

tw4l requested a review from ikreymer April 26, 2023 21:05

Merge branch 'main' into issue-286-fail-invalid-url

839e51f

ikreymer merged commit d4bc9e8 into main Apr 26, 2023

ikreymer deleted the issue-286-fail-invalid-url branch April 26, 2023 23:49

ldko mentioned this pull request Aug 29, 2023

Expand use of failOnFailedSeed option #360

Closed
