Success status code on failure #207

Closed
rgaudin opened this issue Jan 23, 2023 · 3 comments · Fixed by #300
@rgaudin
Contributor

rgaudin commented Jan 23, 2023

In a zimit run, we started a crawl of https://journals.openedition.org/bibnum/. The crawl failed after about 15 seconds, but the crawler still exited with a success status code (0).

The crawler logged: Page Load Failed: https://journals.openedition.org/bibnum/, Reason: Error: Page crashed!

I am not sure exactly how the crawler behaves on errors:

  • Does it halt?
  • If so, on all errors or only on certain kinds?
  • If not, is there any logic that decides whether the crawl was successful? If the error is on the source URL, the output is likely unusable, because the homepage will be missing and/or no links will have been gathered.

Full log follows.

Running browsertrix-crawler crawl: crawl --workers 6 --newContext page --waitUntil load,networkidle0 --depth -1 --timeout 90 --behaviors autoplay,autofetch,siteSpecific --behaviorTimeout 90 --url https://journals.openedition.org/bibnum/ --userAgentSuffix +Zimit [email protected] --cwd /output/.tmp677hzgz2 --statsFilename /output/crawl.json
Note: The newContext argument is deprecated in 0.8.0. Values passed to this option will be ignored
Window context being used to support >1 workers
Set netIdleWait to 2 seconds
Storing state in memory

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:31.921 (running for 503.0 ms)
== Progress:  0 / 0 (100.00%), errors: 0 (0.00%)
== Remaining: 0.0 ms (@ 0 pages/second)
== Sys. load: 0.0% CPU / 0.0% memory
== Workers:   0
Text Extraction: Disabled

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:32.420 (running for 1.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 11.8% CPU / 3.3% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:32.920 (running for 1.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 8.2% CPU / 3.3% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:33.421 (running for 2.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 7.3% CPU / 3.3% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:33.920 (running for 2.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 6.8% CPU / 3.3% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:34.420 (running for 3.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 7.3% CPU / 3.6% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:34.921 (running for 3.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 7.3% CPU / 3.6% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:35.420 (running for 4.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 7.1% CPU / 3.6% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:35.920 (running for 4.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 7.6% CPU / 3.7% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:36.421 (running for 5.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 9.1% CPU / 3.6% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:36.920 (running for 5.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 10.1% CPU / 3.7% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:37.420 (running for 6.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 10.6% CPU / 3.7% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/
Load Error: https://journals.openedition.org/bibnum/: Navigation failed because browser has disconnected!

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:37.920 (running for 6.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 12.0% CPU / 3.1% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:38.421 (running for 7.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 11.4% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:38.920 (running for 7.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 11.2% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:39.421 (running for 8.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 10.6% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:39.921 (running for 8.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 10.2% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/
note: waitForNetworkIdle timed out, ignoring

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:40.422 (running for 9.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 9.7% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:40.923 (running for 9.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 8.9% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:41.423 (running for 10.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 7.0% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:41.924 (running for 10.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 5.4% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:42.424 (running for 11.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 3.9% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:42.924 (running for 11.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 2.4% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:43.424 (running for 12.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 2.5% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:43.924 (running for 12.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 2.4% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:44.425 (running for 13.0 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 2.3% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:44.925 (running for 13.5 seconds)
== Progress:  0 / 1 (0.00%), errors: 0 (0.00%)
== Remaining: unknown (@ 0 pages/second)
== Sys. load: 2.2% CPU / 3.0% memory
== Workers:   1
   #0 WORK https://journals.openedition.org/bibnum/
Error: Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed.
    at CDPSession.send (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:281:35)
    at ExecutionContext._ExecutionContext_evaluate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/ExecutionContext.js:206:46)
    at ExecutionContext.evaluate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/ExecutionContext.js:103:113)
    at IsolatedWorld.evaluate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/IsolatedWorld.js:171:24)
Starting repair
Page Load Failed: https://journals.openedition.org/bibnum/, Reason: Error: Page crashed!

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:45.424 (running for 14.0 seconds)
== Progress:  1 / 1 (100.00%), errors: 1 (100.00%)
== Remaining: 0.0 ms (@ 0.07 pages/second)
== Sys. load: 2.6% CPU / 3.1% memory
== Workers:   1
   #0 IDLE

== Start:     2023-01-21 23:05:31.418
== Now:       2023-01-21 23:05:45.439 (running for 14.0 seconds)
== Progress:  1 / 1 (100.00%), errors: 1 (100.00%)
== Remaining: 0.0 ms (@ 0.07 pages/second)
== Sys. load: 2.6% CPU / 3.1% memory
== Workers:   1
   #0 IDLE
Waiting to ensure pending data is written to WARCs...
done
@tw4l
Member

tw4l commented Feb 2, 2023

@rgaudin in v0.8.0 beta, browsertrix-crawler will return 0 unless a fatal error is encountered (something that triggers the Logger class's fatal method).

Examples of that would include no WARC files being created, WACZ generation failing, being unable to connect to Redis, or giving invalid arguments to the crawler. In any of those cases, the crawler will exit 1.

A failed page is not considered a fatal error. Instead, the crawler logs an error message (in JSON as of the beta) and still exits 0, as long as some data is captured and written to a WARC during the crawl.

(Edited for clarity)
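
Seen from the caller's side, the exit-code policy described above could be handled roughly as follows. This is only a minimal sketch, assuming a wrapper that launches the crawler as a subprocess; `run_crawler` and `CrawlOutcome` are hypothetical names that exist in neither browsertrix-crawler nor zimit. The only facts taken from this thread are that the crawler exits 1 on a fatal error and 0 otherwise.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CrawlOutcome:
    returncode: int
    fatal: bool   # True when the exit code is non-zero (fatal error, per this thread)
    output: str   # combined stdout/stderr, kept so the caller can inspect it later


def run_crawler(cmd: list[str]) -> CrawlOutcome:
    """Run a crawler command (hypothetical wrapper) and classify its exit code."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout to get a single log
        text=True,
    )
    return CrawlOutcome(
        returncode=proc.returncode,
        fatal=proc.returncode != 0,
        output=proc.stdout,
    )
```

With this policy, an exit code of 0 only says that no fatal error occurred. It says nothing about whether individual pages, including the seed page, were captured.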

@rgaudin
Contributor Author

rgaudin commented Feb 6, 2023

OK, I understand, although I'd suggest raising a fatal error when the --url itself fails, as I don't think the output can be useful in that scenario.

In this particular case, 8 entries (resources from the page) were added to the WARC before the initial URL crashed, which is why it returned 0.
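
Until something like this exists in the crawler, a wrapper could approximate the suggestion on its own side by scanning the crawler output for a failure of the seed URL. The sketch below is an illustration under assumptions, not the crawler's behavior: it relies on the `Page Load Failed: <url>, Reason: <reason>` line format seen in the log above (the beta logs JSON instead, so the pattern would need adapting), and `seed_failed` and `effective_exit_code` are hypothetical helper names.

```python
import re

# Line format taken from the log in this issue; newer versions log JSON instead.
PAGE_FAILED = re.compile(r"Page Load Failed: (?P<url>\S+), Reason: (?P<reason>.*)")


def seed_failed(log_text: str, seed_url: str) -> bool:
    """Return True if the log reports a page load failure for the seed URL."""
    for line in log_text.splitlines():
        match = PAGE_FAILED.search(line)
        if match and match.group("url") == seed_url:
            return True
    return False


def effective_exit_code(crawler_code: int, log_text: str, seed_url: str) -> int:
    """Keep the crawler's non-zero code; otherwise return 1 when the seed failed."""
    if crawler_code != 0:
        return crawler_code
    return 1 if seed_failed(log_text, seed_url) else 0
```

Applied to the run above, the crawler's own code is 0, but the seed URL appears in a `Page Load Failed` line, so the wrapper would report failure.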

@tw4l
Member

tw4l commented Feb 6, 2023

@rgaudin that is a good suggestion, thank you. I'll look into how best to implement it.
