
Use new browser-based archiving mechanism instead of pywb proxy #424

Merged
75 commits merged into dev-1.0.0 from recorder-work on Nov 8, 2023

Conversation

ikreymer
Member

@ikreymer ikreymer commented Nov 4, 2023

This PR is a major refactoring of Browsertrix Crawler to generate WARC files (and CDX) while the crawl is running, directly via the Chrome DevTools Protocol (CDP). This allows for more flexibility and accuracy when dealing with HTTP/2.x sites.
Fixes #343.

- WARC files are generated with the TS-based warcio library.
- Also includes experimental on-the-fly CDXJ generation, though py-wacz is still used to generate the WACZ.
- This also removes pywb / uwsgi.

This work targets the upcoming 1.x release of Browsertrix Crawler, while the 0.x line will stay on the pywb-based approach.
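For readers new to the approach, here is a minimal sketch, assuming recent Puppeteer and the warcio npm package, of the core idea: capture one response body over CDP and serialize it as a WARC record. It is an illustration only, not the crawler's actual recorder, which streams large bodies and handles many edge cases this sketch omits:

```ts
// Minimal sketch: capture one response body over CDP and serialize it
// as a gzipped WARC response record with warcio.
import puppeteer, { Protocol } from "puppeteer";
import { WARCRecord, WARCSerializer } from "warcio";

async function captureOne(url: string): Promise<Uint8Array> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const cdp = await page.createCDPSession();
  await cdp.send("Network.enable");

  // Remember response metadata; the body is only safe to read once
  // Network.loadingFinished fires for the request.
  const responses = new Map<string, Protocol.Network.Response>();
  cdp.on("Network.responseReceived", (p) => responses.set(p.requestId, p.response));

  const done = new Promise<Uint8Array>((resolve, reject) => {
    cdp.on("Network.loadingFinished", async (p) => {
      const resp = responses.get(p.requestId);
      if (!resp || resp.url !== url) return;
      try {
        const { body, base64Encoded } = await cdp.send("Network.getResponseBody", {
          requestId: p.requestId,
        });
        const payload = base64Encoded
          ? Uint8Array.from(Buffer.from(body, "base64"))
          : new TextEncoder().encode(body);

        const record = await WARCRecord.create(
          {
            url,
            date: new Date().toISOString(),
            type: "response",
            httpHeaders: resp.headers,
            // statusText may be empty for HTTP/2 responses
            statusline: `HTTP/1.1 ${resp.status} ${resp.statusText || "OK"}`,
          },
          (async function* () { yield payload; })(),
        );
        resolve(await WARCSerializer.serialize(record, { gzip: true }));
      } catch (e) {
        reject(e);
      }
    });
  });

  await page.goto(url, { waitUntil: "networkidle0" });
  const warc = await done;
  await browser.close();
  return warc;
}
```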

ikreymer added 30 commits March 23, 2023 20:37
store pageid as 'WARC-Page-ID'
disable proxy
add skip check
drop empty queued reqresp
use takeStream and return empty buffer (experimental)
- more efficient retry of pending URLs after a crawler crash
- pending urls are marked with <crawl>:p:<url> to indicate they are currently being rendered
- when a crawler restarts, check if <crawl>:p:<url> is set to its unique id and remove the pending marker, allowing the URL to be retried as it is no longer actively being rendered (a minimal sketch of this scheme follows the commit list below)
- use streaming branch of warcio.js for writing larger records!
- fix dedup, don't dedup post requests!
- if using take-response-as-stream, wait for async fetch to finish, then rewrite + fulfill.
- if fetching, continue and fetch async
- only use take-response if length is unknown; only return the response if the total buffered is within the in-memory size limit, otherwise fulfill with an empty buffer
- set in-memory size limit to 2MB
remove pendingrequests improvements: remove from asyncfetcher always, ignore ids after removal
fix tempdir init
eslint: update to latest to catch variables used before decl
logging: add additional logging details for loadNetwork loader errors
- don't clear swUrls / swFrameIds after every page, only when service-worker target disconnects
- reject SW urls that are from unknown frames, frame list should now be more accurate
- commit fetch response directly in handleFetchResponse() for service worker requests, as no network messages will be received
- general: don't use cdp network loader for very large requests, as page may close before they're finished, just use asyncfetch instead
- if HEAD succeeds, do a direct fetch of the non-HTML resource (also sketched after the commit list below)
- add filter to AsyncFetcher: reject if non-200 response or the response sets cookies
- set loadState to 'full page loaded' (2) for direct-fetched pages
- also set mime type to better differentiate non-HTML pages, and lower loadState
- AsyncFetcher dupe handling: load() returns "fetched", "dupe", or "notfetched" to differentiate dupe vs. failed loading
- response async loading: if 'dupe', don't attempt to load again
- direct fetch: add ignoreDupe to ignore dupe check: if loading as page, always load again, even if previously loaded as a non-page resource
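For reference, a minimal sketch of the pending-URL marker scheme described in the commit messages above, assuming the crawl state lives in Redis accessed via ioredis; crawlId, crawlerUid, and the function names are illustrative assumptions, not the crawler's actual crawlState API:

```ts
import Redis from "ioredis";

const redis = new Redis();
const crawlId = "mycrawl";                                  // assumed crawl identifier
const crawlerUid = process.env.CRAWLER_UID ?? "crawler-1";  // unique per crawler instance

// "<crawl>:p:<url>" marks a URL as currently being rendered.
const pendingKey = (url: string) => `${crawlId}:p:${url}`;

// Called when a crawler starts rendering a URL.
async function markPending(url: string): Promise<void> {
  await redis.set(pendingKey(url), crawlerUid);
}

// Called when rendering completes normally.
async function clearPending(url: string): Promise<void> {
  await redis.del(pendingKey(url));
}

// On restart after a crash, remove markers left by this crawler's
// previous run so those URLs can be retried; markers owned by other
// still-running crawlers are left in place.
async function recoverPendingOnRestart(pendingUrls: string[]): Promise<void> {
  for (const url of pendingUrls) {
    if ((await redis.get(pendingKey(url))) === crawlerUid) {
      await redis.del(pendingKey(url));
    }
  }
}
```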
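Likewise, a hedged sketch of the HEAD-then-direct-fetch heuristic from the commits above, using Node 18+'s global fetch; the eligibility rules shown are simplified assumptions, not the actual AsyncFetcher filter:

```ts
// Probe with HEAD first; only stream the full body directly (outside the
// browser) for non-HTML, 200 responses that don't set cookies.
async function directFetchIfEligible(
  url: string,
): Promise<ReadableStream<Uint8Array> | null> {
  const head = await fetch(url, { method: "HEAD" });

  const contentType = head.headers.get("content-type") ?? "";
  const setsCookies = head.headers.get("set-cookie") !== null;

  // Skip HTML (needs in-browser rendering), non-200 responses, and
  // responses that set cookies, which may depend on browser state.
  if (head.status !== 200 || setsCookies || contentType.startsWith("text/html")) {
    return null;
  }

  // Eligible: fetch the full body directly.
  const resp = await fetch(url);
  return resp.body;
}
```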
@ikreymer ikreymer changed the base branch from main to dev-1.0.0 November 4, 2023 01:49
@ikreymer ikreymer marked this pull request as ready for review November 4, 2023 02:48
@ikreymer ikreymer requested a review from tw4l November 4, 2023 02:48
Member

@tw4l tw4l left a comment


Testing very well! Looks good to merge to a dev branch for further testing/TS conversion.

I left a bunch of comments just cleaning up commented-out bits; feel free to accept/ignore since we're not merging to main yet. I didn't leave suggestions on a few that seemed potentially useful for further testing.

let opts = {};
let redisStdio;

if (this.params.logging.includes("pywb")) {
Will want to remove pywb as an option in the description for --logging in argParser

const { stream, headers, httpStatusCode, success, netError, netErrorName } = result.resource;

if (!success || !stream) {
//await this.recorder.crawlState.removeDupe(ASYNC_FETCH_DUPE_KEY, url);
Suggested change (remove this line):
//await this.recorder.crawlState.removeDupe(ASYNC_FETCH_DUPE_KEY, url);

Seems out of place?

ikreymer (Member, Author) replied:

Maybe keep for now, as we still want to revisit how dupes are handled. The idea was that if it failed, we wouldn't track it as a dupe.
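For illustration, a minimal sketch of that idea, assuming the crawl state is kept in Redis via ioredis; apart from ASYNC_FETCH_DUPE_KEY, which appears in the code under review, the key value and function names are assumptions:

```ts
import Redis from "ioredis";

const redis = new Redis();
const ASYNC_FETCH_DUPE_KEY = "crawl:fetch:dupe"; // assumed key name

// Returns true if this URL was already fetched (a dupe).
// SADD returns 0 when the member already existed in the set.
async function checkAndMarkDupe(url: string): Promise<boolean> {
  return (await redis.sadd(ASYNC_FETCH_DUPE_KEY, url)) === 0;
}

// On a failed fetch, drop the marker so a later attempt
// is not skipped as a dupe.
async function removeDupe(url: string): Promise<void> {
  await redis.srem(ASYNC_FETCH_DUPE_KEY, url);
}
```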

@ikreymer ikreymer merged commit 877d9f5 into dev-1.0.0 Nov 8, 2023
ikreymer added a commit that referenced this pull request Nov 9, 2023
Follows #424. Converts the upcoming 1.0.0 branch, based on native browser-based traffic capture and recording, to TypeScript. Fixes #426.

---------
Co-authored-by: Tessa Walsh <[email protected]>
Co-authored-by: emma <[email protected]>
@ikreymer ikreymer deleted the recorder-work branch November 10, 2023 01:14
Successfully merging this pull request may close this issue: Switch to archiving directly via CDP protocol instead of MITM proxy via pywb (#343).