Success status code on failure #207

rgaudin · 2023-01-23T10:27:04Z

In a zimit run, we started a crawl of https://journals.openedition.org/bibnum/, which failed after about 15 s, yet the crawler reported a success status code (0).

The crawler logged: Page Load Failed: https://journals.openedition.org/bibnum/, Reason: Error: Page crashed!

I am not sure about the exact behavior of the crawler on errors:

Does it halt?
If so, on all errors or only certain kinds?
If not, is there some logic that decides whether the crawl was successful? If the error is on the source URL, chances are the output will be unusable, because the homepage will not be present and/or no links will have been gathered.

Full log follows.

Running browsertrix-crawler crawl: crawl --workers 6 --newContext page --waitUntil load,networkidle0 --depth -1 --timeout 90 --behaviors autoplay,autofetch,siteSpecific --behaviorTimeout 90 --url https://journals.openedition.org/bibnum/ --userAgentSuffix +Zimit [email protected] --cwd /output/.tmp677hzgz2 --statsFilename /output/crawl.json
Note: The newContext argument is deprecated in 0.8.0. Values passed to this option will be ignored
Window context being used to support >1 workers
Set netIdleWait to 2 seconds
Storing state in memory

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:31.921 (running for 503.0 ms)
== Progress:  0 / 0 (100.00%), errors: 0 (0.00%)
== Remaining: 0.0 ms (@ 0 pages/second)
== Sys. load: 0.0% CPU / 0.0% memory
== Workers:   0
Text Extraction: Disabled

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:32.420 (running for 1.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 11.8% CPU / 3.3% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:32.920 (running for 1.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 8.2% CPU / 3.3% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:33.421 (running for 2.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 7.3% CPU / 3.3% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:33.920 (running for 2.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 6.8% CPU / 3.3% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:34.420 (running for 3.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 7.3% CPU / 3.6% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:34.921 (running for 3.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 7.3% CPU / 3.6% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:35.420 (running for 4.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 7.1% CPU / 3.6% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:35.920 (running for 4.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 7.6% CPU / 3.7% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:36.421 (running for 5.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 9.1% CPU / 3.6% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:36.920 (running for 5.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 10.1% CPU / 3.7% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:37.420 (running for 6.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 10.6% CPU / 3.7% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/
Load Error: https://journals.openedition.org/bibnum/: Navigation failed because browser has disconnected!

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:37.920 (running for 6.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 12.0% CPU / 3.1% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:38.421 (running for 7.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 11.4% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:38.920 (running for 7.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 11.2% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:39.421 (running for 8.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 10.6% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:39.921 (running for 8.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 10.2% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/
note: waitForNetworkIdle timed out, ignoring

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:40.422 (running for 9.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 9.7% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:40.923 (running for 9.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 8.9% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:41.423 (running for 10.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 7.0% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:41.924 (running for 10.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 5.4% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:42.424 (running for 11.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 3.9% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:42.924 (running for 11.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 2.4% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:43.424 (running for 12.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 2.5% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:43.924 (running for 12.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 2.4% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:44.425 (running for 13.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 2.3% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:44.925 (running for 13.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 2.2% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/
Error: Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed.
    at CDPSession.send (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:281:35)
    at ExecutionContext._ExecutionContext_evaluate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/ExecutionContext.js:206:46)
    at ExecutionContext.evaluate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/ExecutionContext.js:103:113)
    at IsolatedWorld.evaluate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/IsolatedWorld.js:171:24)
Starting repair
Page Load Failed: https://journals.openedition.org/bibnum/, Reason: Error: Page crashed!

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:45.424 (running for 14.0 seconds)
== Progress:  1 / 1 (100.00%), errors: 1 (100.00%)
== Remaining: 0.0 ms (@ 0.07 pages/second)
== Sys. load: 2.6% CPU / 3.1% memory
== Workers:   1
   #0 IDLE

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:45.439 (running for 14.0 seconds)
== Progress:  1 / 1 (100.00%), errors: 1 (100.00%)
== Remaining: 0.0 ms (@ 0.07 pages/second)
== Sys. load: 2.6% CPU / 3.1% memory
== Workers:   1
   #0 IDLE
Waiting to ensure pending data is written to WARCs...
done


tw4l · 2023-02-02T21:03:32Z

@rgaudin In the v0.8.0 beta, browsertrix-crawler returns 0 unless a fatal error is encountered (something that triggers the Logger class's fatal method).

Examples of that would include no WARC files being created, WACZ generation failing, being unable to connect to Redis, or giving invalid arguments to the crawler. In any of those cases, the crawler will exit 1.

A failed page is not considered a fatal error. As long as some data is captured and written to a WARC during the crawl, it only logs an error message (in JSON as of the beta).

(Edited for clarity)
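
The exit-code policy described above can be summarized in a short sketch. This is only an illustrative Python model of the behavior as described in this comment, not browsertrix-crawler's actual code (the crawler is a Node.js application); the `CrawlOutcome` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CrawlOutcome:
    """Hypothetical summary of a finished crawl (field names are illustrative)."""
    fatal_error: bool  # e.g. no WARC written, WACZ generation failed, Redis unreachable, invalid arguments
    failed_pages: int  # pages that only logged a load error


def exit_code(outcome: CrawlOutcome) -> int:
    """Model of the described policy: only a fatal error changes the exit status."""
    # Failed pages are logged as errors but, on their own, leave the status at 0.
    return 1 if outcome.fatal_error else 0
```

Under this model, the crawl reported in this issue (one failed page, some data written to a WARC, no fatal error) maps to exit status 0.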

rgaudin · 2023-02-06T12:51:44Z

OK, I understand, although I'd suggest raising a fatal error when the --url itself fails, as I don't think the output can be useful in that scenario.

In this particular case, 8 entries were added to the WARC (resources from the page) before the initial URL crashed, which is why it returned 0.
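
Until the crawler treats a failed seed as fatal, a calling tool could check for it on its own. The following is a minimal sketch, assuming the plain-text log format shown in this issue (the beta logs in JSON, so the match would need adapting); the `seed_failed` helper is hypothetical and not part of zimit or browsertrix-crawler.

```python
def seed_failed(log_text: str, seed_url: str) -> bool:
    """Return True if the log reports a page load failure for the seed URL.

    Assumes lines of the form seen in this issue:
        Page Load Failed: <url>, Reason: <message>
    """
    # Including the ", Reason:" suffix keeps a seed such as https://a/ from
    # matching a failure reported for a longer URL such as https://a/b.
    marker = f"Page Load Failed: {seed_url}, Reason:"
    return marker in log_text
```

A wrapper could then treat the run as failed whenever `seed_failed(...)` is true, even if the crawler itself exited with 0.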

tw4l · 2023-02-06T16:15:38Z

@rgaudin that is a good suggestion, thank you. I'll look into how best to implement it.


rgaudin mentioned this issue Jan 23, 2023

Several new ZIMs appearing in the zimit directory on download.kiwix.org are too small openzim/zimit#168

Closed

tw4l self-assigned this Feb 6, 2023

rgaudin mentioned this issue Apr 14, 2023

Crawler doesn't mark invalid URL as failed #286

Closed

tw4l added a commit that referenced this issue Apr 25, 2023

Add --failOnFailedSeed option to fail crawl if seed doesn't load

1f93942

Resolves issue #207

tw4l added a commit that referenced this issue Apr 25, 2023

Add --failOnFailedSeed option to fail crawl if seed doesn't load

818d4c9

Resolves issue #207

tw4l mentioned this issue Apr 25, 2023

Catch 4xx and 5xx page.goto() responses to mark invalid URLs as failed #300

Merged

ikreymer closed this as completed in d4bc9e8 Apr 26, 2023

ikreymer closed this as completed in #300 Apr 26, 2023
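
With the `--failOnFailedSeed` option from the commits above, a caller can rely on the exit status again. The sketch below shows one way a wrapper might pass the flag and check the result. It assumes the `crawl` command is on the PATH (as in the log at the top of this issue) and that the option is a plain switch taking no value, which this thread does not confirm; the function names are hypothetical.

```python
import subprocess
from typing import List


def build_crawl_command(seed_url: str, fail_on_failed_seed: bool = True) -> List[str]:
    """Build an argument list for the crawler's `crawl` command."""
    cmd = ["crawl", "--url", seed_url]
    if fail_on_failed_seed:
        # Flag name taken from the commit message in this thread;
        # treating it as a value-less switch is an assumption.
        cmd.append("--failOnFailedSeed")
    return cmd


def run_crawl(seed_url: str) -> bool:
    """Run the crawl and report success based on the process exit status."""
    result = subprocess.run(build_crawl_command(seed_url), check=False)
    return result.returncode == 0
```

Keeping the command construction separate from the `subprocess.run` call makes the argument list easy to inspect without launching a crawl.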
