
Commit 71b618f

Switch back to Puppeteer from Playwright (#301)
- reduced memory usage, avoids memory leak issues caused by using playwright (see #298)
- browser: split Browser into Browser and BaseBrowser
- browser: puppeteer-specific functions added to Browser for additional flexibility if need to change again later
- browser: use defaultArgs from playwright
- browser: attempt to recover if initial target is gone
- logging: add debug logging from process.memoryUsage() after every page
- request interception: use priorities for cooperative request interception
- request interception: move to setupPage() to run once per page, enable if any of blockrules, adblockrules or originOverrides are used
- request interception: fix originOverrides enabled check, fix to work with catch-all request interception
- default args: set --waitUntil back to 'load,networkidle2'
- Update README with changes for puppeteer
- tests: fix extra hops depth test to ensure more than one page crawled

Co-authored-by: Tessa Walsh <[email protected]>
1 parent d4e222f commit 71b618f

12 files changed: +561 −204 lines

README.md

+21 −14
@@ -1,6 +1,6 @@
 # Browsertrix Crawler
 
-Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses [Playwright](https://github.com/microsoft/playwright) to control one or more browser windows in parallel.
+Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses [Puppeteer](https://github.com/puppeteer/puppeteer) to control one or more browser windows in parallel.
 
 ## Features
 
@@ -14,7 +14,7 @@ Thus far, Browsertrix Crawler supports:
 - Screencasting: Ability to watch crawling in real-time (experimental).
 - Screenshotting: Ability to take thumbnails, full page screenshots, and/or screenshots of the initial page view.
 - Optimized (non-browser) capture of non-HTML resources.
-- Extensible Playwright driver script for customizing behavior per crawl or page.
+- Extensible Puppeteer driver script for customizing behavior per crawl or page.
 - Ability to create and reuse browser profiles interactively or via automated user/password login using an embedded browser.
 - Multi-platform support -- prebuilt Docker images available for Intel/AMD and Apple Silicon (M1/M2) CPUs.
 
@@ -69,13 +69,14 @@ Options:
       --crawlId, --id       A user provided ID for this crawl or crawl
                             configuration (can also be set via CRAWL_ID env var)
-                            [string] [default: "454230b33b8f"]
+                            [string] [default: "97792ef37eaf"]
       --newContext          Deprecated as of 0.8.0, any values passed will be
                             ignored  [string] [default: null]
-      --waitUntil           Playwright page.goto() condition to wait for before
-                            continuing  [default: "load"]
+      --waitUntil           Puppeteer page.goto() condition to wait for before
+                            continuing, can be multiple separated by ','
+                            [default: "load,networkidle2"]
       --depth               The depth of the crawl for all seeds
                             [number] [default: -1]
       --extraHops           Number of extra 'hops' to follow, be
@@ -150,10 +151,9 @@ Options:
                             o process.cwd()  [string] [default: "/crawls"]
       --mobileDevice        Emulate mobile device by name from:
-                            https://github.com/microsoft/playwright/blob/main/
-                            packages/playwright-core/src/server/
-                            deviceDescriptorsSource.json  [string]
+                            https://github.com/puppeteer/puppeteer/blob/main/
+                            src/common/DeviceDescriptors.ts  [string]
       --userAgent           Override user-agent with specified string  [string]
       --userAgentSuffix     Append suffix to existing browser us
@@ -240,6 +240,13 @@ Options:
       --description, --desc If set, write supplied description into WACZ
                             datapackage.json metadata  [string]
+      --originOverride      if set, will redirect requests from each origin in
+                            key to origin in the value, eg. --originOverride
+                            https://host:port=http://alt-host:alt-port
+                            [array] [default: []]
+      --logErrorsToRedis    If set, write error messages to redis
+                            [boolean] [default: false]
       --config              Path to YAML config file
 
 ```
@@ -250,9 +257,9 @@ Options:
 
 One of the key nuances of browser-based crawling is determining when a page is finished loading. This can be configured with the `--waitUntil` flag.
 
-The default is `load`, which waits until page load, but for static sites, `--wait-until domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load, for example). `--waitUntil networkidle` may make sense for sites where absolutely all requests must complete before proceeding.
+The default is `load,networkidle2`, which waits until page load and until no more than two network requests remain in flight, but for static sites, `--wait-until domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load, for example). `--waitUntil networkidle0` may make sense for sites where absolutely all requests must complete before proceeding.
 
-See [page.goto waitUntil options](https://playwright.dev/docs/api/class-page#page-goto-option-wait-until) in the Playwright docs for more info on the options that can be used with this flag.
+See [page.goto waitUntil options](https://pptr.dev/api/puppeteer.page.goto#remarks) in the Puppeteer docs for more info on the options that can be used with this flag.
 
 The `--pageLoadTimeout`/`--timeout` option sets the timeout in seconds for page load, defaulting to 90 seconds. Behaviors will run on the page once either the page load condition or the page load timeout is met, whichever happens first.

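The comma-separated `--waitUntil` value maps directly onto the array form of Puppeteer's `waitUntil` option. A minimal sketch of the equivalent call (the URL is a placeholder):

```js
// Equivalent of --waitUntil load,networkidle2: goto() resolves once the
// load event has fired AND no more than 2 network connections have been
// active for at least 500 ms.
await page.goto("https://example.com/", {
  waitUntil: ["load", "networkidle2"],
});
```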
@@ -543,11 +550,11 @@ The webhook URL can be an HTTP URL which receives a JSON POST request OR a Redis
 
 </details>
 
-### Configuring Chromium / Playwright / pywb
+### Configuring Chromium / Puppeteer / pywb
 
 There are a few environment variables you can set to configure chromium and pywb:
 
-- CHROME_FLAGS will be split by spaces and passed to Chromium (via `args` in Playwright). Note that setting some options is not supported, such as `--proxy-server`, since they are set by browsertrix itself.
+- CHROME_FLAGS will be split by spaces and passed to Chromium (via `args` in Puppeteer). Note that setting some options is not supported, such as `--proxy-server`, since they are set by browsertrix itself.
 - SOCKS_HOST and SOCKS_PORT are read by pywb to proxy upstream traffic
 
 Here are some example use cases:

crawler.js

+40 −26
@@ -355,6 +355,24 @@ export class Crawler {
   async setupPage({page, cdp, workerid}) {
     await this.browser.setupPage({page, cdp});
 
+    if ((this.adBlockRules && this.params.blockAds) ||
+        this.blockRules || this.originOverride) {
+
+      await page.setRequestInterception(true);
+
+      if (this.adBlockRules && this.params.blockAds) {
+        await this.adBlockRules.initPage(this.browser, page);
+      }
+
+      if (this.blockRules) {
+        await this.blockRules.initPage(this.browser, page);
+      }
+
+      if (this.originOverride) {
+        await this.originOverride.initPage(this.browser, page);
+      }
+    }
+
     if (this.params.logging.includes("jserrors")) {
       page.on("console", (msg) => {
         if (msg.type() === "error") {
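The hunk above only switches interception on; the block rules, ad-block rules, and origin overrides then register their own request handlers in modules not shown in this diff. Puppeteer's cooperative interception mode (the "priorities" mentioned in the commit message) lets several handlers share a page: each handler resolves every request with an explicit priority, and the highest-priority resolution wins. A minimal sketch of the pattern — the handler logic and URL check are illustrative, not the crawler's actual BlockRules code:

```js
// Cooperative request interception sketch. Assumes
// page.setRequestInterception(true) was already called in setupPage().
function initBlockingSketch(page) {
  page.on("request", (request) => {
    if (request.url().endsWith(".mp4")) {
      // Passing a priority (2nd arg) opts into cooperative mode: another
      // handler can still override this decision with a higher priority.
      request.abort("blockedbyclient", 1);
    } else {
      // Defer to other handlers by continuing at the default priority 0.
      request.continue(request.continueRequestOverrides(), 0);
    }
  });
}
```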
@@ -374,7 +392,7 @@ export class Crawler {
 
     if (this.params.behaviorOpts) {
       await page.exposeFunction(BEHAVIOR_LOG_FUNC, (logdata) => this._behaviorLog(logdata, page.url(), workerid));
-      await page.addInitScript(behaviors + `;\nself.__bx_behaviors.init(${this.params.behaviorOpts});`);
+      await this.browser.addInitScript(page, behaviors + `;\nself.__bx_behaviors.init(${this.params.behaviorOpts});`);
     }
   }
 
@@ -404,7 +422,7 @@ export class Crawler {
       logger.debug("Skipping screenshots for non-HTML page", logDetails);
     }
     const archiveDir = path.join(this.collDir, "archive");
-    const screenshots = new Screenshots({page, url, directory: archiveDir});
+    const screenshots = new Screenshots({browser: this.browser, page, url, directory: archiveDir});
     if (this.params.screenshot.includes("view")) {
       await screenshots.take();
     }
@@ -430,7 +448,7 @@ export class Crawler {
       logger.info("Skipping behaviors for slow page", logDetails, "behavior");
     } else {
       const res = await timedRun(
-        this.runBehaviors(page, data.filteredFrames, logDetails),
+        this.runBehaviors(page, cdp, data.filteredFrames, logDetails),
         this.params.behaviorTimeout,
         "Behaviors timed out",
         logDetails,
@@ -495,16 +513,14 @@ export class Crawler {
     }
   }
 
-  async runBehaviors(page, frames, logDetails) {
+  async runBehaviors(page, cdp, frames, logDetails) {
     try {
       frames = frames || page.frames();
 
-      const context = page.context();
-
       logger.info("Running behaviors", {frames: frames.length, frameUrls: frames.map(frame => frame.url()), ...logDetails}, "behavior");
 
       return await Promise.allSettled(
-        frames.map(frame => this.browser.evaluateWithCLI(context, frame, "self.__bx_behaviors.run();", logDetails, "behavior"))
+        frames.map(frame => this.browser.evaluateWithCLI(page, frame, cdp, "self.__bx_behaviors.run();", logDetails, "behavior"))
       );
 
     } catch (e) {
@@ -711,7 +727,7 @@ export class Crawler {
 
     this.screencaster = this.initScreenCaster();
 
-    if (this.params.originOverride) {
+    if (this.params.originOverride.length) {
       this.originOverride = new OriginOverride(this.params.originOverride);
     }
 
@@ -905,6 +921,14 @@ export class Crawler {
     });
   }
 
+  logMemory() {
+    const memUsage = process.memoryUsage();
+    const { heapUsed, heapTotal } = memUsage;
+    this.maxHeapUsed = Math.max(this.maxHeapUsed || 0, heapUsed);
+    this.maxHeapTotal = Math.max(this.maxHeapTotal || 0, heapTotal);
+    logger.debug("Memory", {maxHeapUsed: this.maxHeapUsed, maxHeapTotal: this.maxHeapTotal, ...memUsage}, "memory");
+  }
+
   async writeStats(toFile=false) {
     if (!this.params.logging.includes("stats")) {
       return;
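For reference, the fields logged by `logMemory()` come straight from Node's standard `process.memoryUsage()` API; all values are byte counts:

```js
// Standard Node.js API, no imports needed; all values are byte counts.
const { rss, heapTotal, heapUsed, external, arrayBuffers } = process.memoryUsage();
console.log({ rss, heapTotal, heapUsed, external, arrayBuffers });
// e.g. { rss: 123456789, heapTotal: 45678901, heapUsed: 34567890, ... }
// (values illustrative)
```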
@@ -926,6 +950,7 @@ export class Crawler {
     };
 
     logger.info("Crawl statistics", stats, "crawlStatus");
+    this.logMemory();
 
     if (toFile && this.params.statsFilename) {
       try {
@@ -965,18 +990,6 @@ export class Crawler {
       }
     }
 
-    if (this.adBlockRules && this.params.blockAds) {
-      await this.adBlockRules.initPage(page);
-    }
-
-    if (this.blockRules) {
-      await this.blockRules.initPage(page);
-    }
-
-    if (this.originOverride) {
-      await this.originOverride.initPage(page);
-    }
-
     let ignoreAbort = false;
 
     // Detect if ERR_ABORTED is actually caused by trying to load a non-page (eg. downloadable PDF),
@@ -998,7 +1011,7 @@ export class Crawler {
     try {
       const resp = await page.goto(url, gotoOpts);
 
-      const contentType = await resp.headerValue("content-type");
+      const contentType = await this.browser.responseHeader(resp, "content-type");
 
       isHTMLPage = this.isHTMLContentType(contentType);
 
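The new `browser.responseHeader()` helper is defined outside this diff. In Puppeteer, `HTTPResponse.headers()` returns a plain object keyed by lower-cased header names, so a plausible implementation is a one-liner (a sketch, not the actual code):

```js
// Hypothetical sketch of the Browser helper: Puppeteer lower-cases
// header names, so a direct property lookup suffices.
responseHeader(resp, header) {
  return resp.headers()[header];
}
```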
@@ -1068,7 +1081,7 @@ export class Crawler {
     await sleep(0.5);
 
     try {
-      await page.waitForLoadState("networkidle", {timeout: this.params.netIdleWait * 1000});
+      await this.browser.waitForNetworkIdle(page, {timeout: this.params.netIdleWait * 1000});
     } catch (e) {
       logger.debug("waitForNetworkIdle timed out, ignoring", details);
       // ignore, continue
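Puppeteer ships `page.waitForNetworkIdle()` natively, so the `browser.waitForNetworkIdle()` wrapper presumably just delegates to it; a sketch under that assumption (the actual wrapper is outside this diff):

```js
// Hypothetical Browser method; options here is {timeout} as passed above,
// and page.waitForNetworkIdle() also accepts an idleTime option.
async waitForNetworkIdle(page, options) {
  return await page.waitForNetworkIdle(options);
}
```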
@@ -1095,7 +1108,7 @@ export class Crawler {
     try {
       const linkResults = await Promise.allSettled(
         frames.map(frame => timedRun(
-          frame.evaluate(loadFunc, {selector: selector, extract: extract}),
+          frame.evaluate(loadFunc, {selector, extract}),
          PAGE_OP_TIMEOUT_SECS,
          "Link extraction timed out",
          logDetails,
@@ -1152,9 +1165,10 @@ export class Crawler {
     try {
       logger.debug("Check CF Blocking", logDetails);
 
-      const cloudflare = page.locator("div.cf-browser-verification.cf-im-under-attack");
-
-      while (await cloudflare.waitFor({timeout: PAGE_OP_TIMEOUT_SECS})) {
+      while (await timedRun(
+        page.$("div.cf-browser-verification.cf-im-under-attack"),
+        PAGE_OP_TIMEOUT_SECS
+      )) {
         logger.debug("Cloudflare Check Detected, waiting for reload...", logDetails);
         await sleep(5.5);
       }

create-login-profile.js

+7 −10
@@ -158,10 +158,8 @@ async function main() {
 
   const browser = new Browser();
 
-  const profileDir = await browser.loadProfile(params.profile);
-
   await browser.launch({
-    dataDir: profileDir,
+    profileUrl: params.profile,
     headless: params.headless,
     signals: true,
     chromeOptions: {
@@ -191,18 +189,17 @@ async function main() {
     params.password = await promptInput("Enter password: ", true);
   }
 
-  const { page, cdp } = await browser.getFirstPageWithCDP();
+  const { page, cdp } = await browser.newWindowPageWithCDP();
 
   const waitUntil = "load";
 
-  //await page.setCacheEnabled(false);
-  await cdp.send("Network.setCacheDisabled", {cacheDisabled: true});
+  await page.setCacheEnabled(false);
 
   if (!params.automated) {
     await browser.setupPage({page, cdp});
 
     // for testing, inject browsertrix-behaviors
-    await page.addInitScript(behaviors + ";\nself.__bx_behaviors.init();");
+    await browser.addInitScript(page, behaviors + ";\nself.__bx_behaviors.init();");
   }
 
   logger.info(`Loading page: ${params.url}`);
@@ -384,7 +381,7 @@ class InteractiveBrowser {
       return;
     }
 
-    const cookies = await this.browser.context.cookies(url);
+    const cookies = await this.browser.getCookies(this.page, url);
     for (const cookie of cookies) {
       cookie.expires = (new Date().getTime() / 1000) + this.params.cookieDays * 86400;
       delete cookie.size;
@@ -396,7 +393,7 @@ class InteractiveBrowser {
         cookie.url = url;
       }
     }
-    await this.browser.context.addCookies(cookies);
+    await this.browser.setCookies(this.page, cookies);
   } catch (e) {
     logger.error("Save Cookie Error: ", e);
   }
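The `getCookies`/`setCookies` wrappers replace Playwright's context-level cookie API with Puppeteer's page-level one. Their implementation is outside this diff; a plausible sketch, assuming they delegate to `page.cookies()` and `page.setCookie()`:

```js
// Hypothetical sketches of the Browser cookie wrappers.
async getCookies(page, url) {
  // page.cookies(url) returns the cookies visible to that URL
  return await page.cookies(url);
}

async setCookies(page, cookies) {
  // page.setCookie() accepts one or more cookie objects
  await page.setCookie(...cookies);
}
```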

package.json

+1 −1
@@ -17,7 +17,7 @@
     "ioredis": "^4.27.1",
     "js-yaml": "^4.1.0",
     "minio": "7.0.26",
-    "playwright-core": "^1.31.2",
+    "puppeteer-core": "^19.11.1",
     "sitemapper": "^3.1.2",
     "uuid": "8.3.2",
     "warcio": "^1.6.0",

tests/extra_hops_depth.test.js

+6 −2
@@ -16,7 +16,8 @@ test("check that URLs are crawled 2 extra hops beyond depth", async () => {
     console.log(error);
   }
 
-  const crawled_pages = fs.readFileSync("test-crawls/collections/extra-hops-beyond/pages/pages.jsonl", "utf8");
+  const crawledPages = fs.readFileSync("test-crawls/collections/extra-hops-beyond/pages/pages.jsonl", "utf8");
+  const crawledPagesArray = crawledPages.trim().split("\n");
 
   const expectedPages = [
     "https://webrecorder.net/",
@@ -28,7 +29,10 @@ test("check that URLs are crawled 2 extra hops beyond depth", async () => {
     "https://webrecorder.net/faq",
   ];
 
-  for (const page of crawled_pages.trim().split("\n")) {
+  // the first line of pages.jsonl is the header, not a page, hence the -1
+  expect(expectedPages.length).toEqual(crawledPagesArray.length - 1);
+
+  for (const page of crawledPagesArray) {
     const url = JSON.parse(page).url;
     if (!url) {
       continue;
