Use new browser-based archiving mechanism instead of pywb proxy #424

Merged
merged 75 commits into main from recorder-work
Nov 8, 2023
Changes from all commits (75 commits)
7524688
recorder work!
ikreymer Mar 24, 2023
af95ad9
remove dep
ikreymer Mar 24, 2023
ccb5549
fix
ikreymer Mar 24, 2023
5e6a9d2
rewriting work, wait for requests to finish
ikreymer Mar 24, 2023
31e2371
tweaks, attempt to determine issues on local build
ikreymer Mar 24, 2023
f0e648a
work
ikreymer Mar 24, 2023
288d2cd
logging, skip 206
ikreymer Mar 24, 2023
d10a7e8
add concurrent
ikreymer Mar 24, 2023
02c9755
move recording to recorder
ikreymer Mar 24, 2023
c8d2ffa
tweak logging
ikreymer Mar 24, 2023
380da7f
logging improvements
ikreymer Mar 25, 2023
6a610aa
refactor: also track Network events to get security details, wait for…
ikreymer Mar 25, 2023
86117b6
use brave image
ikreymer Mar 26, 2023
32b18e2
keep response data
ikreymer Mar 26, 2023
ee5804e
large files: add streaming to tmp dir in current collection
ikreymer Mar 27, 2023
4a94c1d
stream WARC writing, fix dedup
ikreymer Mar 27, 2023
b7bb59b
logging: group network-related logging into separate call which can b…
ikreymer Mar 27, 2023
53adc94
add separate async fetch handler separate from browser response strea…
ikreymer Mar 28, 2023
ba57a0c
async fetch work, check for empty response
ikreymer Mar 29, 2023
ba58c0d
pending reset:
ikreymer Mar 29, 2023
8fe343d
streaming fix:
ikreymer Mar 30, 2023
dd07fc6
Merge branch 'unmark-pending-on-restart' into recorder-work
ikreymer Mar 30, 2023
c88ff5e
lower concurrency, add support for takeResponseBodyAsStream vs fetch,…
ikreymer Mar 30, 2023
270c52c
update extraOpts, set max in mem to 10MB
ikreymer Mar 31, 2023
e4d5e54
fix --generateCDX to fix tests
ikreymer Mar 31, 2023
7a387a9
fix streaming logic!
ikreymer Mar 31, 2023
c5f6fff
Merge branch 'main' into recorder-work
ikreymer Mar 31, 2023
dfa86ae
refactor into AsyncFetcher and ResponseStreamAsyncFetcher
ikreymer Apr 1, 2023
df8fbff
recorder: init dirs on load, init file on use
ikreymer Apr 1, 2023
af39d40
don't store partial records, always remove after async fetch
ikreymer Apr 1, 2023
7d392e9
ensure filehandle is inited
ikreymer Apr 1, 2023
b010407
Merge branch 'main' into recorder-work
ikreymer Apr 1, 2023
7af413e
deps: update to latest warcio.js serializer branch
ikreymer Apr 2, 2023
becc195
warcwriter: move writing to warcwriter
ikreymer Apr 2, 2023
f6e3551
add writeCdx
ikreymer Apr 3, 2023
0f9ba83
Merge branch 'main' into recorder-work
ikreymer Apr 14, 2023
3aad61a
Merge branch 'main' into recorder-work
ikreymer Apr 20, 2023
361f765
Merge branch 'main' into recorder-work
ikreymer Apr 26, 2023
a359f3a
recorder: attempt to ensure serviceworkers are also captured:
ikreymer Apr 28, 2023
134695b
browser: switch to latest chrome
ikreymer Apr 28, 2023
40cba63
tweaks:
ikreymer Apr 28, 2023
0ef763c
refactor:
ikreymer Apr 29, 2023
6951fc0
disable expected size check if content-encoding is present!
ikreymer Apr 30, 2023
107ac23
headers: ensure content-encoding and transfer-encoding are rewritten …
ikreymer Apr 30, 2023
e332ee4
Merge branch 'main' into recorder-work
ikreymer Jun 17, 2023
cf53a51
Merge branch 'main' into recorder-work
ikreymer Jul 27, 2023
ed127a9
update to latest warcio with stream-serializer
ikreymer Aug 5, 2023
2420896
Merge branch 'main' into recorder-work
ikreymer Aug 9, 2023
147a13b
update to warcio 2.2.0!
ikreymer Aug 11, 2023
123762f
fix for warcio update: buffer takestream if read in memory for rewriting
ikreymer Aug 12, 2023
b35a33d
Merge branch 'main' into recorder-work
ikreymer Aug 13, 2023
0985f96
recorder fixes:
ikreymer Aug 17, 2023
5b7d46c
fix log msg
ikreymer Aug 17, 2023
b924ae1
fix typo in asyncLoad check!
ikreymer Aug 17, 2023
60aec17
Merge branch 'main' into recorder-work
ikreymer Aug 23, 2023
759f950
fix redirect handling:
ikreymer Aug 26, 2023
529a3cd
error handling: better error detection for loadNetworkResource() path
ikreymer Sep 1, 2023
0925c3c
tweak error message
ikreymer Sep 1, 2023
7b0de11
Merge branch 'main' into recorder-work
ikreymer Sep 19, 2023
7415ac1
update yarn.lock
ikreymer Sep 19, 2023
2e76fb4
revert to older version of puppeteer-core due to changes in accessing…
ikreymer Sep 19, 2023
9f43f3c
fix header access
ikreymer Sep 19, 2023
a2b4f8b
add url to shouldSkip check, only include http/https URLs
ikreymer Sep 20, 2023
bc30d5a
Merge branch 'main' into recorder-work
ikreymer Sep 21, 2023
6f07377
Merge branch 'main' to 'recorder-work', switching to Brave
ikreymer Oct 4, 2023
6e9a1be
state: add pending-wait state when waiting for crawl to finish
ikreymer Oct 5, 2023
0a1e6df
service worker handling fixes:
ikreymer Oct 21, 2023
60cf313
reenable HEAD check + direct (non-browser) fetch of non-HTML pages:
ikreymer Oct 21, 2023
ccff712
Merge branch 'main' into recorder-work
ikreymer Nov 1, 2023
53cfd39
Merge branch 'main' (0.12.1 release) into recorder-work
ikreymer Nov 4, 2023
e7a850c
Apply suggestions from code review, remove commented out code
ikreymer Nov 8, 2023
988bf7a
remove unused code, remove references to pywb
ikreymer Nov 8, 2023
034de9a
fix warcinfo test after version update
ikreymer Nov 8, 2023
468a009
logging: reenable logging for timed out pending requests for now
ikreymer Nov 8, 2023
868cd7a
remove pywb dependency
ikreymer Nov 8, 2023
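Taken together, these commits replace the pywb recording proxy with in-browser capture over the Chrome DevTools Protocol: responses are intercepted via the Fetch domain, buffered or streamed, and written to WARCs with warcio.js (see e.g. commit c88ff5e, "add support for takeResponseBodyAsStream vs fetch"). A minimal sketch of that interception pattern, assuming cdp is a puppeteer-core CDPSession and writeWarcRecord is a hypothetical hand-off to the WARC writer:

// pause every response in the Fetch domain so the body can be copied
await cdp.send("Fetch.enable", {
  patterns: [{ urlPattern: "*", requestStage: "Response" }],
});

cdp.on("Fetch.requestPaused", async ({ requestId, request, responseStatusCode, responseHeaders }) => {
  try {
    // small responses: copy the whole body in one call; the PR also adds a
    // Fetch.takeResponseBodyAsStream path for bodies too large to buffer
    const { body, base64Encoded } = await cdp.send("Fetch.getResponseBody", { requestId });
    const payload = Buffer.from(body, base64Encoded ? "base64" : "utf8");
    await writeWarcRecord(request.url, responseStatusCode, responseHeaders, payload);
  } catch (e) {
    // body unavailable (e.g. redirects): fall through and let the page continue
  }
  await cdp.send("Fetch.continueRequest", { requestId });
});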
4 changes: 4 additions & 0 deletions .eslintrc.cjs
@@ -30,6 +30,10 @@ module.exports = {
"no-constant-condition": [
"error",
{"checkLoops": false }
],
"no-use-before-define": [
"error",
{"variables": true, "functions": false, "classes": false, "allowNamedExports": true}
]
}
};
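For context, the added no-use-before-define settings report variables that are read before their declaration while allowing hoisted functions and classes, e.g.:

// allowed ("functions": false): the call precedes the hoisted definition
startCrawl();
function startCrawl() {}

// reported ("variables": true): the read precedes the declaration
// console.log(collName);  // no-use-before-define
// const collName = "crawl";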
1 change: 0 additions & 1 deletion Dockerfile
@@ -20,7 +20,6 @@ ENV PROXY_HOST=localhost \
WORKDIR /app

ADD requirements.txt /app/
RUN pip install 'uwsgi==2.0.21'
RUN pip install -U setuptools; pip install -r requirements.txt

ADD package.json /app/
202 changes: 90 additions & 112 deletions crawler.js
@@ -6,7 +6,6 @@ import fsp from "fs/promises";

import { RedisCrawlState, LoadState, QueueState } from "./util/state.js";
import Sitemapper from "sitemapper";
import { v4 as uuidv4 } from "uuid";
import yaml from "js-yaml";

import * as warcio from "warcio";
@@ -103,8 +102,9 @@ export class Crawler {

this.emulateDevice = this.params.emulateDevice || {};

this.captureBasePrefix = `http://${process.env.PROXY_HOST}:${process.env.PROXY_PORT}/${this.params.collection}/record`;
this.capturePrefix = process.env.NO_PROXY ? "" : this.captureBasePrefix + "/id_/";
//this.captureBasePrefix = `http://${process.env.PROXY_HOST}:${process.env.PROXY_PORT}/${this.params.collection}/record`;
//this.capturePrefix = "";//process.env.NO_PROXY ? "" : this.captureBasePrefix + "/id_/";
this.captureBasePrefix = null;

this.gotoOpts = {
waitUntil: this.params.waitUntil,
@@ -213,13 +213,36 @@
return new ScreenCaster(transport, this.params.workers);
}

async bootstrap() {
const initRes = child_process.spawnSync("wb-manager", ["init", this.params.collection], {cwd: this.params.cwd});
launchRedis() {
let redisStdio;

if (this.params.logging.includes("redis")) {
const redisStderr = fs.openSync(path.join(this.logDir, "redis.log"), "a");
redisStdio = [process.stdin, redisStderr, redisStderr];

} else {
redisStdio = "ignore";
}

if (initRes.status) {
logger.info("wb-manager init failed, collection likely already exists");
let redisArgs = [];
if (this.params.debugAccessRedis) {
redisArgs = ["--protected-mode", "no"];
}

return child_process.spawn("redis-server", redisArgs,{cwd: "/tmp/", stdio: redisStdio});
}

async bootstrap() {
const subprocesses = [];

subprocesses.push(this.launchRedis());

//const initRes = child_process.spawnSync("wb-manager", ["init", this.params.collection], {cwd: this.params.cwd});

//if (initRes.status) {
// logger.info("wb-manager init failed, collection likely already exists");
//}

fs.mkdirSync(this.logDir, {recursive: true});
this.logFH = fs.createWriteStream(this.logFilename);
logger.setExternalLogStream(this.logFH);
@@ -246,42 +269,8 @@
this.customBehaviors = this.loadCustomBehaviors(this.params.customBehaviors);
}

let opts = {};
let redisStdio;

if (this.params.logging.includes("pywb")) {
Reviewer comment (Member): Will want to remove pywb as an option in the description for --logging in argParser.
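A sketch of that follow-up; the argParser definition is not part of this diff, so the option shape and wording below are assumed:

// util/argParser.js (hypothetical shape): "pywb" dropped from the
// enumerated logging targets now that the proxy process is gone
logging: {
  describe: "Logging options for crawler, can include: stats, jserrors, debug, redis",
  type: "string",
  default: "stats",
},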

const pywbStderr = fs.openSync(path.join(this.logDir, "pywb.log"), "a");
const stdio = [process.stdin, pywbStderr, pywbStderr];

const redisStderr = fs.openSync(path.join(this.logDir, "redis.log"), "a");
redisStdio = [process.stdin, redisStderr, redisStderr];

opts = {stdio, cwd: this.params.cwd};
} else {
opts = {stdio: "ignore", cwd: this.params.cwd};
redisStdio = "ignore";
}

this.headers = {"User-Agent": this.configureUA()};

const subprocesses = [];

let redisArgs = [];
if (this.params.debugAccessRedis) {
redisArgs = ["--protected-mode", "no"];
}

subprocesses.push(child_process.spawn("redis-server", redisArgs, {cwd: "/tmp/", stdio: redisStdio}));

opts.env = {
...process.env,
COLL: this.params.collection,
ROLLOVER_SIZE: this.params.rolloverSize,
DEDUP_POLICY: this.params.dedupPolicy
};

subprocesses.push(child_process.spawn("uwsgi", [new URL("uwsgi.ini", import.meta.url).pathname], opts));

process.on("exit", () => {
for (const proc of subprocesses) {
proc.kill();
@@ -472,7 +461,7 @@ self.__bx_behaviors.selectMainBehavior();
async crawlPage(opts) {
await this.writeStats();

const {page, cdp, data, workerid, callbacks} = opts;
const {page, cdp, data, workerid, callbacks, directFetchCapture} = opts;
data.callbacks = callbacks;

const {url} = data;
@@ -481,6 +470,38 @@
data.logDetails = logDetails;
data.workerid = workerid;

data.isHTMLPage = await timedRun(
this.isHTML(url, logDetails),
FETCH_TIMEOUT_SECS,
"HEAD request to determine if URL is HTML page timed out",
logDetails,
"fetch",
true
);

if (!data.isHTMLPage && directFetchCapture) {
try {
const {fetched, mime} = await timedRun(
directFetchCapture(url),
FETCH_TIMEOUT_SECS,
"Direct fetch capture attempt timed out",
logDetails,
"fetch",
true
);
if (fetched) {
data.loadState = LoadState.FULL_PAGE_LOADED;
if (mime) {
data.mime = mime;
}
logger.info("Direct fetch successful", {url, ...logDetails}, "fetch");
return true;
}
} catch (e) {
// ignore failed direct fetch attempt, do browser-based capture
}
}

// run custom driver here
await this.driver({page, data, crawler: this});
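For non-HTML URLs, crawlPage now tries a direct (non-browser) capture via the injected directFetchCapture callback instead of routing a request through the pywb /record endpoint. Its implementation is not part of this file; a minimal sketch of the shape implied by the call site above, reusing the status and Set-Cookie checks from the old directFetchCapture removed further down:

// hypothetical standalone sketch; in the PR the real capture is wired into
// the new recorder and also writes the fetched response into a WARC record
async function directFetchCapture(url) {
  const resp = await fetch(url, { redirect: "manual" });
  const mime = (resp.headers.get("Content-Type") || "").split(";")[0];
  // non-200 or Set-Cookie: bail out and load the page in the browser instead
  if (resp.status !== 200 || resp.headers.get("set-cookie")) {
    return { fetched: false, mime };
  }
  // stream resp.body to the WARC writer here (not shown)
  return { fetched: true, mime };
}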

@@ -660,9 +681,8 @@ self.__bx_behaviors.selectMainBehavior();
async getInfoString() {
const packageFileJSON = JSON.parse(await fsp.readFile("../app/package.json"));
const warcioPackageJSON = JSON.parse(await fsp.readFile("/app/node_modules/warcio/package.json"));
const pywbVersion = child_process.execSync("pywb -V", {encoding: "utf8"}).trim().split(" ")[1];

return `Browsertrix-Crawler ${packageFileJSON.version} (with warcio.js ${warcioPackageJSON.version} pywb ${pywbVersion})`;
return `Browsertrix-Crawler ${packageFileJSON.version} (with warcio.js ${warcioPackageJSON.version})`;
}

async createWARCInfo(filename) {
@@ -872,7 +892,7 @@ self.__bx_behaviors.selectMainBehavior();
headless: this.params.headless,
emulateDevice: this.emulateDevice,
chromeOptions: {
proxy: !process.env.NO_PROXY,
proxy: false,
userAgent: this.emulateDevice.userAgent,
extraArgs: this.extraChromeArgs()
},
@@ -882,9 +902,10 @@
}
});


// --------------
// Run Crawl Here!
await runWorkers(this, this.params.workers, this.maxPageTime);
await runWorkers(this, this.params.workers, this.maxPageTime, this.collDir);
// --------------

await this.serializeConfig(true);
@@ -898,8 +919,6 @@

await this.writeStats();

// extra wait for all resources to land into WARCs
await this.awaitPendingClear();

// if crawl has been stopped, mark as final exit for post-crawl tasks
if (await this.crawlState.isCrawlStopped()) {
@@ -916,8 +935,19 @@

if (this.params.generateCDX) {
logger.info("Generating CDX");
await fsp.mkdir(path.join(this.collDir, "indexes"), {recursive: true});
await this.crawlState.setStatus("generate-cdx");
const indexResult = await this.awaitProcess(child_process.spawn("wb-manager", ["reindex", this.params.collection], {cwd: this.params.cwd}));

const warcList = await fsp.readdir(path.join(this.collDir, "archive"));
const warcListFull = warcList.map((filename) => path.join(this.collDir, "archive", filename));

//const indexResult = await this.awaitProcess(child_process.spawn("wb-manager", ["reindex", this.params.collection], {cwd: this.params.cwd}));
const params = [
"-o",
path.join(this.collDir, "indexes", "index.cdxj"),
...warcListFull
];
const indexResult = await this.awaitProcess(child_process.spawn("cdxj-indexer", params, {cwd: this.params.cwd}));
if (indexResult === 0) {
logger.debug("Indexing complete, CDX successfully created");
} else {
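Note: this swaps pywb's wb-manager reindex for the standalone cdxj-indexer CLI; with the params assembled above, the spawned process is equivalent to running cdxj-indexer -o <collDir>/indexes/index.cdxj <collDir>/archive/*.warc.gz over the collection's WARCs.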
@@ -1136,34 +1166,6 @@ self.__bx_behaviors.selectMainBehavior();

const failCrawlOnError = ((depth === 0) && this.params.failOnFailedSeed);

let isHTMLPage = await timedRun(
this.isHTML(url),
FETCH_TIMEOUT_SECS,
"HEAD request to determine if URL is HTML page timed out",
logDetails,
"fetch",
true
);

if (!isHTMLPage) {
try {
const captureResult = await timedRun(
this.directFetchCapture(url),
FETCH_TIMEOUT_SECS,
"Direct fetch capture attempt timed out",
logDetails,
"fetch",
true
);
if (captureResult) {
logger.info("Direct fetch successful", {url, ...logDetails}, "fetch");
return;
}
} catch (e) {
// ignore failed direct fetch attempt, do browser-based capture
}
}

let ignoreAbort = false;

// Detect if ERR_ABORTED is actually caused by trying to load a non-page (eg. downloadable PDF),
@@ -1172,6 +1174,8 @@ self.__bx_behaviors.selectMainBehavior();
ignoreAbort = shouldIgnoreAbort(req);
});

let isHTMLPage = data.isHTMLPage;

if (isHTMLPage) {
page.once("domcontentloaded", () => {
data.loadState = LoadState.CONTENT_LOADED;
@@ -1441,9 +1445,12 @@ self.__bx_behaviors.selectMainBehavior();
}
}

async writePage({url, depth, title, text, loadState, favicon}) {
const id = uuidv4();
const row = {id, url, title, loadState};
async writePage({pageid, url, depth, title, text, loadState, mime, favicon}) {
const row = {id: pageid, url, title, loadState};

if (mime) {
row.mime = mime;
}

if (depth === 0) {
row.seed = true;
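writePage now reuses the worker-assigned pageid rather than minting a fresh uuidv4 (hence the uuid import removed at the top of this file), and records a mime field for non-HTML captures. An illustrative row in the pages list (all values assumed):

{"id": "ab3c9d7e-0000-4c2a-9b1f-illustrative", "url": "https://example.com/report.pdf", "title": "", "loadState": 2, "mime": "application/pdf", "seed": true}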
@@ -1469,23 +1476,23 @@ self.__bx_behaviors.selectMainBehavior();
return urlParsed.protocol === "https:" ? HTTPS_AGENT : HTTP_AGENT;
}

async isHTML(url) {
async isHTML(url, logDetails) {
try {
const resp = await fetch(url, {
method: "HEAD",
headers: this.headers,
agent: this.resolveAgent
});
if (resp.status !== 200) {
logger.debug(`Skipping HEAD check ${url}, invalid status ${resp.status}`);
logger.debug("HEAD response code != 200, loading in browser", {status: resp.status, ...logDetails});
return true;
}

return this.isHTMLContentType(resp.headers.get("Content-Type"));

} catch(e) {
// can't confirm not html, so try in browser
logger.debug("HEAD request failed", {...e, url});
logger.debug("HEAD request failed", {...errJSON(e), ...logDetails});
return true;
}
}
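isHTMLContentType is called above but is unchanged by this diff; a plausible shape, for reference (an assumption, not code from the PR):

// assumed helper: compares the bare MIME type, ignoring charset parameters
isHTMLContentType(contentType) {
  const mime = contentType ? contentType.split(";")[0].trim() : "";
  return mime === "text/html" || mime === "application/xhtml+xml";
}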
@@ -1505,35 +1512,6 @@ self.__bx_behaviors.selectMainBehavior();
return false;
}

async directFetchCapture(url) {
const abort = new AbortController();
const signal = abort.signal;
const resp = await fetch(this.capturePrefix + url, {signal, headers: this.headers, redirect: "manual"});
abort.abort();
return resp.status === 200 && !resp.headers.get("set-cookie");
}

async awaitPendingClear() {
logger.info("Waiting to ensure pending data is written to WARCs...");
await this.crawlState.setStatus("pending-wait");

const redis = await initRedis("redis://localhost/0");

while (!this.interrupted) {
try {
const count = Number(await redis.get(`pywb:${this.params.collection}:pending`) || 0);
if (count <= 0) {
break;
}
logger.debug("Waiting for pending requests to finish", {numRequests: count});
} catch (e) {
break;
}

await sleep(1);
}
}

async parseSitemap(url, seedId, sitemapFromDate) {
// handle sitemap last modified date if passed
let lastmodFromTimestamp = null;
15 changes: 1 addition & 14 deletions create-login-profile.js
@@ -11,7 +11,6 @@ import yargs from "yargs";

import { logger } from "./util/logger.js";

import { sleep } from "./util/timing.js";
import { Browser } from "./util/browser.js";
import { initStorage } from "./util/storage.js";

@@ -144,26 +143,14 @@ async function main() {
]);
}

let useProxy = false;

if (params.proxy) {
child_process.spawn("wayback", ["--live", "--proxy", "live"], {stdio: "inherit", cwd: "/tmp"});

logger.debug("Running with pywb proxy");

await sleep(3000);

useProxy = true;
}

const browser = new Browser();

await browser.launch({
profileUrl: params.profile,
headless: params.headless,
signals: true,
chromeOptions: {
proxy: useProxy,
proxy: false,
extraArgs: [
"--window-position=0,0",
`--window-size=${params.windowSize}`,