Use new browser-based archiving mechanism instead of pywb proxy #424
Conversation
store pageid as 'WARC-Page-ID'
disable proxy, add skip check
… events to finish
drop empty queued reqresp
use takeStream and return empty buffer (experimental)
…m, fetch up to 6 responses at a time
- more efficient retry of pending URLs after crawler crash
- pending URLs are marked with <crawl>:p:<url> to indicate they are currently being rendered
- when a crawler restarts, check if <crawl>:p:<url> is set to its unique id and remove the pending marker, allowing the URL to be retried, since it's no longer actively being rendered
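A minimal sketch of this pending-marker scheme, assuming an ioredis client; the class and method names here are illustrative, not the crawler's actual API:

```ts
import Redis from "ioredis";

// Illustrative sketch of the <crawl>:p:<url> pending-marker scheme.
class PendingTracker {
  constructor(
    private redis: Redis,
    private crawlId: string,
    private uid: string, // this crawler instance's unique id
  ) {}

  // Mark a URL as actively being rendered by this instance.
  async markPending(url: string): Promise<void> {
    await this.redis.set(`${this.crawlId}:p:${url}`, this.uid);
  }

  // On restart, remove markers left by this instance's previous run,
  // so those URLs become eligible for retry.
  async releaseStalePending(urls: string[]): Promise<void> {
    for (const url of urls) {
      const key = `${this.crawlId}:p:${url}`;
      if ((await this.redis.get(key)) === this.uid) {
        await this.redis.del(key);
      }
    }
  }
}
```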
- use streaming branch of warcio.js for writing larger records!
- fix dedup: don't dedup POST requests!
… configurable in recorder
- if using take-response-as-stream, wait for async fetch to finish, then rewrite + fulfill
- if fetching, continue and fetch async
- only use take-response if length is unknown; only return the response if the total buffered is within the in-memory size limit, otherwise fulfill with an empty buffer
- set in-memory size limit to 2MB
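A hedged sketch of that flow over CDP, assuming a puppeteer CDPSession; the 2MB limit comes from the commit above, while the function shape and the WARC-writing hook are illustrative assumptions:

```ts
import { CDPSession } from "puppeteer-core";

const MAX_BUFFER = 2 * 1024 * 1024; // 2MB in-memory limit

async function takeStreamAndFulfill(cdp: CDPSession, requestId: string) {
  const { stream } = await cdp.send("Fetch.takeResponseBodyAsStream", { requestId });

  const chunks: Buffer[] = [];
  let size = 0;
  let overLimit = false;
  let eof = false;

  while (!eof) {
    const resp = await cdp.send("IO.read", { handle: stream });
    eof = resp.eof;
    const chunk = Buffer.from(resp.data, resp.base64Encoded ? "base64" : "utf-8");
    size += chunk.length;
    // each chunk would also be streamed into the WARC record here
    if (!overLimit) {
      chunks.push(chunk);
      if (size > MAX_BUFFER) {
        overLimit = true;
        chunks.length = 0; // too large to buffer: the page gets an empty body
      }
    }
  }
  await cdp.send("IO.close", { handle: stream });

  await cdp.send("Fetch.fulfillRequest", {
    requestId,
    responseCode: 200, // the real status code would be passed through here
    body: (overLimit ? Buffer.alloc(0) : Buffer.concat(chunks)).toString("base64"),
  });
}
```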
remove pendingrequests; improvements: remove from asyncfetcher always, ignore ids after removal; fix tempdir init
eslint: update to latest to catch variables used before decl
… execution context
logging: add additional logging details for loadNetwork loader errors
- don't clear swUrls / swFrameIds after every page, only when the service-worker target disconnects
- reject SW urls that are from unknown frames; frame list should now be more accurate
- commit fetch response directly in handleFetchResponse() for service worker requests, as no network messages will be received
- general: don't use cdp network loader for very large requests, as the page may close before they're finished; just use asyncfetch instead
- if HEAD succeeds, do a direct fetch of non-HTML resource
- add filter to AsyncFetcher: reject if non-200 response or response sets cookies
- set loadState to 'full page loaded' (2) for direct-fetched pages
- also set mime type to better differentiate non-HTML pages, and lower loadState
- AsyncFetcher dupe handling: load() returns "fetched", "dupe" or "notfetched" to differentiate dupe vs. failed loading
- response async loading: if 'dupe', don't attempt to load again
- direct fetch: add ignoreDupe to ignore dupe check; if loading as a page, always load again, even if previously loaded as a non-page resource
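A sketch of the dupe-aware load flow described above; the three-way return value and the status/cookie filter come from the commit, while the class shape and the in-memory seen set are illustrative assumptions (the real crawler tracks dupes in shared crawl state):

```ts
// Hypothetical sketch: the three-way result lets callers distinguish
// "already captured" from "failed to fetch". Uses Node 18+ global fetch.
type FetchResult = "fetched" | "dupe" | "notfetched";

class AsyncFetcher {
  private seen = new Set<string>();

  async load(url: string, ignoreDupe = false): Promise<FetchResult> {
    if (this.seen.has(url) && !ignoreDupe) {
      return "dupe";
    }
    this.seen.add(url);

    try {
      const resp = await fetch(url);
      // Reject non-200 responses and responses that set cookies,
      // per the filter described above.
      if (resp.status !== 200 || resp.headers.get("set-cookie")) {
        return "notfetched";
      }
      // ... stream resp.body into a WARC record here ...
      return "fetched";
    } catch {
      return "notfetched";
    }
  }
}
```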
Testing very well! Looks good to merge to a dev branch for further testing/TS conversion.
I left a bunch of comments just cleaning up commented-out bits; feel free to accept/ignore since we're not merging to main yet. Didn't leave suggestions on a few that seemed potentially useful for further testing.
let opts = {};
let redisStdio;

if (this.params.logging.includes("pywb")) {
Will want to remove pywb as an option in the description for --logging in argParser.
const { stream, headers, httpStatusCode, success, netError, netErrorName } = result.resource;

if (!success || !stream) {
  //await this.recorder.crawlState.removeDupe(ASYNC_FETCH_DUPE_KEY, url);
//await this.recorder.crawlState.removeDupe(ASYNC_FETCH_DUPE_KEY, url);
Seems out of place?
Maybe keep for now, as we still want to revisit how dupes are handled. The idea was that if it failed, we wouldn't track it as a dupe.
Co-authored-by: Tessa Walsh <[email protected]>
- only keep py-wacz
- use cdxj-indexer for --generateCDX
Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426 --------- Co-authored-by: Tessa Walsh <[email protected]> Co-authored-by: emma <[email protected]>
This PR is a major refactoring of Browsertrix Crawler to generate WARC files (and CDX) while the crawl is running, directly via the Chrome DevTools Protocol (CDP). This allows for more flexibility and accuracy when dealing with HTTP/2.x sites.
Fixes #343.
WARC files are generated with the TS-based warcio library.
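For illustration, creating and serializing one record with warcio.js might look like the following; the page-id header ties back to the 'WARC-Page-ID' commit above, and the field values are illustrative, not the crawler's actual output:

```ts
import { WARCRecord, WARCSerializer } from "warcio";

// Minimal sketch: build one response record and serialize it as gzipped
// WARC bytes, ready to append to a .warc.gz file.
async function writeExampleRecord(pageId: string): Promise<Uint8Array> {
  async function* payload() {
    yield new TextEncoder().encode("<html>example</html>");
  }

  const record = await WARCRecord.create(
    {
      url: "https://example.com/",
      date: new Date().toISOString(),
      type: "response",
      warcVersion: "WARC/1.1",
      warcHeaders: { "WARC-Page-ID": pageId }, // custom header, per the commit above
      httpHeaders: { "Content-Type": "text/html" },
    },
    payload(),
  );

  return await WARCSerializer.serialize(record, { gzip: true });
}
```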
Also includes experimental on-the-fly CDXJ generation, though still using py-wacz to generate the WACZ.
This also removes pywb / uwsgi.
This will be towards the upcoming 1.x release of Browsertrix Crawler, while the 0.x line will stay on pywb.