
Use new browser-based archiving mechanism instead of pywb proxy #424

Merged
75 commits merged into dev-1.0.0 from recorder-work on Nov 8, 2023

Conversation

ikreymer
Member

@ikreymer ikreymer commented Nov 4, 2023

This PR is a major refactoring of Browsertrix Crawler to generate WARC files (and CDX) while the crawl is running, directly via the Chrome DevTools Protocol (CDP). This allows for more flexibility and accuracy when dealing with HTTP/2.x sites.
Fixes #343.

- WARC files are generated with the TS-based warcio library.
- Also includes experimental on-the-fly CDXJ generation, though py-wacz is still used to generate the WACZ.
- This also removes pywb / uwsgi.

This work targets the upcoming 1.x release of Browsertrix Crawler, while the 0.x line will stay on the pywb-based approach.
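For readers new to the approach, here is a minimal sketch, assuming recent Puppeteer and the warcio npm package, of the core idea: capture one response body over CDP and serialize it as a WARC record. It is an illustration only, not the crawler's actual recorder, which streams large bodies and handles many edge cases this sketch omits:

```ts
// Minimal sketch: capture one response body over CDP and serialize it
// as a gzipped WARC response record with warcio.
import puppeteer, { Protocol } from "puppeteer";
import { WARCRecord, WARCSerializer } from "warcio";

async function captureOne(url: string): Promise<Uint8Array> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const cdp = await page.createCDPSession();
  await cdp.send("Network.enable");

  // Remember response metadata; the body is only safe to read once
  // Network.loadingFinished fires for the request.
  const responses = new Map<string, Protocol.Network.Response>();
  cdp.on("Network.responseReceived", (p) => responses.set(p.requestId, p.response));

  const done = new Promise<Uint8Array>((resolve, reject) => {
    cdp.on("Network.loadingFinished", async (p) => {
      const resp = responses.get(p.requestId);
      if (!resp || resp.url !== url) return;
      try {
        const { body, base64Encoded } = await cdp.send("Network.getResponseBody", {
          requestId: p.requestId,
        });
        const payload = base64Encoded
          ? Uint8Array.from(Buffer.from(body, "base64"))
          : new TextEncoder().encode(body);

        const record = await WARCRecord.create(
          {
            url,
            date: new Date().toISOString(),
            type: "response",
            httpHeaders: resp.headers,
            // statusText may be empty for HTTP/2 responses
            statusline: `HTTP/1.1 ${resp.status} ${resp.statusText || "OK"}`,
          },
          (async function* () { yield payload; })(),
        );
        resolve(await WARCSerializer.serialize(record, { gzip: true }));
      } catch (e) {
        reject(e);
      }
    });
  });

  await page.goto(url, { waitUntil: "networkidle0" });
  const warc = await done;
  await browser.close();
  return warc;
}
```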

ikreymer added 30 commits March 23, 2023 20:37
store pageid as 'WARC-Page-ID'
disable proxy
add skip check
drop empty queued reqresp
use takeStream and return empty buffer (experimental)
- more efficient retry of pending URLs after a crawler crash
- pending urls are marked with <crawl>:p:<url> to indicate they are currently being rendered
- when a crawler restarts, check if <crawl>:p:<url> is set to its unique id and remove the pending marker, allowing the URL to be retried as it is no longer actively being rendered (a minimal sketch of this scheme follows the commit list below)
- use streaming branch of warcio.js for writing larger records!
- fix dedup, don't dedup post requests!
- if using take-response-as-stream, wait for async fetch to finish, then rewrite + fulfill.
- if fetching, continue and fetch async
- only use take-response if length is unknown; only return the response if the total buffered is within the in-memory size limit, otherwise fulfill with an empty buffer
- set in-memory size limit to 2MB
remove pendingrequests improvements: remove from asyncfetcher always, ignore ids after removal
fix tempdir init
eslint: update to latest to catch variables used before decl
logging: add additional logging details for loadNetwork loader errors
- don't clear swUrls / swFrameIds after every page, only when service-worker target disconnects
- reject SW urls that are from unknown frames, frame list should now be more accurate
- commit fetch response directly in handleFetchResponse() for service worker requests, as no network messages will be received
- general: don't use cdp network loader for very large requests, as page may close before they're finished, just use asyncfetch instead
- if HEAD succeeds, do a direct fetch of the non-HTML resource (also sketched after the commit list below)
- add filter to AsyncFetcher: reject if non-200 response or the response sets cookies
- set loadState to 'full page loaded' (2) for direct-fetched pages
- also set mime type to better differentiate non-HTML pages, and lower loadState
- AsyncFetcher dupe handling: load() returns "fetched", "dupe", or "notfetched" to differentiate dupe vs. failed loading
- response async loading: if 'dupe', don't attempt to load again
- direct fetch: add ignoreDupe to ignore dupe check: if loading as page, always load again, even if previously loaded as a non-page resource
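For reference, a minimal sketch of the pending-URL marker scheme described in the commit messages above, assuming the crawl state lives in Redis accessed via ioredis; crawlId, crawlerUid, and the function names are illustrative assumptions, not the crawler's actual crawlState API:

```ts
import Redis from "ioredis";

const redis = new Redis();
const crawlId = "mycrawl";                                  // assumed crawl identifier
const crawlerUid = process.env.CRAWLER_UID ?? "crawler-1";  // unique per crawler instance

// "<crawl>:p:<url>" marks a URL as currently being rendered.
const pendingKey = (url: string) => `${crawlId}:p:${url}`;

// Called when a crawler starts rendering a URL.
async function markPending(url: string): Promise<void> {
  await redis.set(pendingKey(url), crawlerUid);
}

// Called when rendering completes normally.
async function clearPending(url: string): Promise<void> {
  await redis.del(pendingKey(url));
}

// On restart after a crash, remove markers left by this crawler's
// previous run so those URLs can be retried; markers owned by other
// still-running crawlers are left in place.
async function recoverPendingOnRestart(pendingUrls: string[]): Promise<void> {
  for (const url of pendingUrls) {
    if ((await redis.get(pendingKey(url))) === crawlerUid) {
      await redis.del(pendingKey(url));
    }
  }
}
```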
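Likewise, a hedged sketch of the HEAD-then-direct-fetch heuristic from the commits above, using Node 18+'s global fetch; the eligibility rules shown are simplified assumptions, not the actual AsyncFetcher filter:

```ts
// Probe with HEAD first; only stream the full body directly (outside the
// browser) for non-HTML, 200 responses that don't set cookies.
async function directFetchIfEligible(
  url: string,
): Promise<ReadableStream<Uint8Array> | null> {
  const head = await fetch(url, { method: "HEAD" });

  const contentType = head.headers.get("content-type") ?? "";
  const setsCookies = head.headers.get("set-cookie") !== null;

  // Skip HTML (needs in-browser rendering), non-200 responses, and
  // responses that set cookies, which may depend on browser state.
  if (head.status !== 200 || setsCookies || contentType.startsWith("text/html")) {
    return null;
  }

  // Eligible: fetch the full body directly.
  const resp = await fetch(url);
  return resp.body;
}
```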
@ikreymer ikreymer changed the base branch from main to dev-1.0.0 November 4, 2023 01:49
@ikreymer ikreymer marked this pull request as ready for review November 4, 2023 02:48
@ikreymer ikreymer requested a review from tw4l November 4, 2023 02:48
Member

@tw4l tw4l left a comment


Testing very well! Looks good to merge to a dev branch for further testing/TS conversion.

I left a bunch of comments just cleaning up commented-out bits; feel free to accept/ignore since we're not merging to main yet. I didn't leave suggestions on a few that seemed potentially useful for further testing.

let opts = {};
let redisStdio;

if (this.params.logging.includes("pywb")) {
Will want to remove pywb as an option in the description for --logging in argParser

const { stream, headers, httpStatusCode, success, netError, netErrorName } = result.resource;

if (!success || !stream) {
//await this.recorder.crawlState.removeDupe(ASYNC_FETCH_DUPE_KEY, url);
Suggested change (remove this line):
//await this.recorder.crawlState.removeDupe(ASYNC_FETCH_DUPE_KEY, url);

Seems out of place?

ikreymer (Member, Author) replied:

Maybe keep for now, as we still want to revisit how dupes are handled. The idea was that if it failed, we wouldn't track it as a dupe.
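For illustration, a minimal sketch of that idea, assuming the crawl state is kept in Redis via ioredis; apart from ASYNC_FETCH_DUPE_KEY, which appears in the code under review, the key value and function names are assumptions:

```ts
import Redis from "ioredis";

const redis = new Redis();
const ASYNC_FETCH_DUPE_KEY = "crawl:fetch:dupe"; // assumed key name

// Returns true if this URL was already fetched (a dupe).
// SADD returns 0 when the member already existed in the set.
async function checkAndMarkDupe(url: string): Promise<boolean> {
  return (await redis.sadd(ASYNC_FETCH_DUPE_KEY, url)) === 0;
}

// On a failed fetch, drop the marker so a later attempt
// is not skipped as a dupe.
async function removeDupe(url: string): Promise<void> {
  await redis.srem(ASYNC_FETCH_DUPE_KEY, url);
}
```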

@ikreymer ikreymer merged commit 877d9f5 into dev-1.0.0 Nov 8, 2023
ikreymer added a commit that referenced this pull request Nov 9, 2023
Follows #424. Converts the upcoming 1.0.0 branch, based on native browser-based traffic capture and recording, to TypeScript. Fixes #426.

---------
Co-authored-by: Tessa Walsh <[email protected]>
Co-authored-by: emma <[email protected]>
@ikreymer ikreymer deleted the recorder-work branch November 10, 2023 01:14
Successfully merging this pull request may close this issue: Switch to archiving directly via CDP protocol instead of MITM proxy via pywb (#343).