- reduced memory usage, avoiding memory leak issues caused by using playwright (see #298)
- browser: split Browser into Browser and BaseBrowser
- browser: puppeteer-specific functions added to Browser for additional flexibility in case the implementation needs to change again later
- browser: use defaultArgs from playwright
- browser: attempt to recover if initial target is gone
- logging: add debug logging from process.memoryUsage() after every page
- request interception: use priorities for cooperative request interception (see the sketch after this commit message)
- request interception: move to setupPage() to run once per page; enable if any of blockrules, adblockrules or originOverrides are used
- request interception: fix originOverrides enabled check, fix to work with catch-all request interception
- default args: set --waitUntil back to 'load,networkidle2'
- Update README with changes for puppeteer
- tests: fix extra hops depth test to ensure more than one page crawled
---------
Co-authored-by: Tessa Walsh <[email protected]>
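The cooperative request interception and per-page memory logging described in the bullets above could look roughly like the sketch below. It is illustrative only, not the crawler's actual code: the priority values, block patterns, and function names (`setupPage`, `logMemoryUsage`) are assumptions, while the cooperative intercept mode itself (`request.continue()`/`request.abort()` with an explicit priority) is a real Puppeteer feature.

```ts
import type { HTTPRequest, Page } from "puppeteer";

// Puppeteer's default intercept resolution priority is 0; a higher value
// lets the blocking handler win over the catch-all continue below.
const DEFAULT_PRIORITY = 0;
const BLOCK_PRIORITY = 1;

// Hypothetical setupPage(): enable interception once per page and register
// cooperative handlers, rather than re-registering on every navigation.
async function setupPage(page: Page, blockPatterns: RegExp[]): Promise<void> {
  await page.setRequestInterception(true);

  // Block-rules handler: abort matching requests at a higher priority.
  page.on("request", (req: HTTPRequest) => {
    if (blockPatterns.some((re) => re.test(req.url()))) {
      void req.abort("blockedbyclient", BLOCK_PRIORITY);
    }
  });

  // Catch-all handler: continue everything at the default priority so no
  // request is left unresolved if no other handler acts on it.
  page.on("request", (req: HTTPRequest) => {
    void req.continue({}, DEFAULT_PRIORITY);
  });
}

// Debug log of process memory after each crawled page.
function logMemoryUsage(pageUrl: string): void {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  console.debug(
    `memory after ${pageUrl}: rss=${rss} heapUsed=${heapUsed} heapTotal=${heapTotal}`
  );
}
```

Because both handlers supply a priority, the request is resolved only after all handlers have run, with the highest-priority resolution winning.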
Changes to README.md (+21 −14):
@@ -1,6 +1,6 @@
 # Browsertrix Crawler
 
-Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses [Playwright](https://github.com/microsoft/playwright) to control one or more browser windows in parallel.
+Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses [Puppeteer](https://github.com/puppeteer/puppeteer) to control one or more browser windows in parallel.

 - Screencasting: Ability to watch crawling in real-time (experimental).
 - Screenshotting: Ability to take thumbnails, full page screenshots, and/or screenshots of the initial page view.
 - Optimized (non-browser) capture of non-HTML resources.
-- Extensible Playwright driver script for customizing behavior per crawl or page.
+- Extensible Puppeteer driver script for customizing behavior per crawl or page.
 - Ability to create and reuse browser profiles interactively or via automated user/password login using an embedded browser.
 - Multi-platform support -- prebuilt Docker images available for Intel/AMD and Apple Silicon (M1/M2) CPUs.
@@ -69,13 +69,14 @@ Options:
       --crawlId, --id                  A user provided ID for this crawl or
                                        crawl configuration (can also be se
                                        t via CRAWL_ID env var)
-                                       [string] [default: "454230b33b8f"]
+                                       [string] [default: "97792ef37eaf"]
       --newContext                     Deprecated as of 0.8.0, any values p
                                        assed will be ignored
                                        [string] [default: null]
-      --waitUntil                      Playwright page.goto() condition to
-                                       wait for before continuing
-                                       [default: "load"]
+      --waitUntil                      Puppeteer page.goto() condition to w
+                                       ait for before continuing, can be mu
+                                       ltiple separated by ','
+                                       [default: "load,networkidle2"]
       --depth                          The depth of the crawl for all seeds
                                        [number] [default: -1]
       --extraHops                      Number of extra 'hops' to follow, be
@@ -150,10 +151,9 @@ Options:
                                        o process.cwd()
                                        [string] [default: "/crawls"]
       --mobileDevice                   Emulate mobile device by name from:
-                                       https://github.com/microsoft/playwri
-                                       ght/blob/main/packages/playwright-co
-                                       re/src/server/deviceDescriptorsSourc
-                                       e.json [string]
+                                       https://github.com/puppeteer/puppete
+                                       er/blob/main/src/common/DeviceDescri
+                                       ptors.ts [string]
       --userAgent                      Override user-agent with specified s
                                        tring [string]
       --userAgentSuffix                Append suffix to existing browser us
@@ -240,6 +240,13 @@ Options:
       --description, --desc           If set, write supplied description i
                                       nto WACZ datapackage.json metadata
                                       [string]
+      --originOverride                if set, will redirect requests from
+                                      each origin in key to origin in the
+                                      value, eg. --originOverride https://
+                                      host:port=http://alt-host:alt-port
+                                      [array] [default: []]
+      --logErrorsToRedis              If set, write error messages to redi
+                                      s [boolean] [default: false]
       --config                        Path to YAML config file
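One way to picture the new `--originOverride` option is a request-interception handler that rewrites matching URLs to the alternate origin before continuing. This is only a sketch of the idea under cooperative interception, not the crawler's actual implementation; the origins in the map and the function name are hypothetical.

```ts
import type { HTTPRequest } from "puppeteer";

// Hypothetical mapping parsed from --originOverride values of the form
// https://host:port=http://alt-host:alt-port (placeholder entries here).
const originOverrides = new Map<string, string>([
  ["https://example.com", "http://localhost:8080"],
]);

// Assumes page.setRequestInterception(true) is already enabled and that
// this runs alongside other cooperative handlers (priority 0 = default).
function handleOriginOverride(req: HTTPRequest): void {
  const url = new URL(req.url());
  const altOrigin = originOverrides.get(url.origin);
  if (altOrigin) {
    // Swap the origin but keep path and query, then continue the request.
    void req.continue({ url: altOrigin + url.pathname + url.search }, 0);
  } else {
    void req.continue({}, 0);
  }
}
```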
@@ -250,9 +257,9 @@ Options:
 
 One of the key nuances of browser-based crawling is determining when a page is finished loading. This can be configured with the `--waitUntil` flag.
 
-The default is `load`, which waits until page load, but for static sites, `--wait-until domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load for example). The `--waitUntil networkidle` may make sense for sites where absolutely all requests must be waited until before proceeding.
+The default is `load,networkidle2`, which waits until page load and <=2 requests remain, but for static sites, `--wait-until domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load for example). `--waitUntil networkidle0` may make sense for sites where absolutely all requests must be waited until before proceeding.
 
-See [page.goto waitUntil options](https://playwright.dev/docs/api/class-page#page-goto-option-wait-until) for more info on the options that can be used with this flag from the Playwright docs.
+See [page.goto waitUntil options](https://pptr.dev/api/puppeteer.page.goto#remarks) for more info on the options that can be used with this flag from the Puppeteer docs.
 
 The `--pageLoadTimeout`/`--timeout` option sets the timeout in seconds for page load, defaulting to 90 seconds. Behaviors will run on the page once either the page load condition or the page load timeout is met, whichever happens first.
 
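In Puppeteer terms, a comma-separated `--waitUntil` value maps onto the `waitUntil` option of `page.goto()`, which accepts an array of lifecycle events and treats navigation as successful only after all of them have fired. A minimal sketch, in which the URL and timeout are placeholders and the crawler's real navigation code is more involved:

```ts
import puppeteer from "puppeteer";
import type { PuppeteerLifeCycleEvent } from "puppeteer";

(async () => {
  // "load,networkidle2" -> ["load", "networkidle2"]
  const waitUntil = "load,networkidle2".split(",") as PuppeteerLifeCycleEvent[];

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigation resolves only once every listed event has fired.
  await page.goto("https://example.com/", { waitUntil, timeout: 90 * 1000 });

  await browser.close();
})();
```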
@@ -543,11 +550,11 @@ The webhook URL can be an HTTP URL which receives a JSON POST request OR a Redis
 
 </details>
 
-### Configuring Chromium / Playwright / pywb
+### Configuring Chromium / Puppeteer / pywb
 
 There is a few environment variables you can set to configure chromium and pywb:
 
-- CHROME_FLAGS will be split by spaces and passed to Chromium (via `args` in Playwright). Note that setting some options is not supported such as `--proxy-server` since they are set by browsertrix itself.
+- CHROME_FLAGS will be split by spaces and passed to Chromium (via `args` in Puppeteer). Note that setting some options is not supported such as `--proxy-server` since they are set by browsertrix itself.
 - SOCKS_HOST and SOCKS_PORT are read by pywb to proxy upstream traffic
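The CHROME_FLAGS behavior described above amounts to splitting the variable on spaces and appending the pieces to the browser launch arguments, roughly as in the sketch below. This is not the crawler's actual launch code; the example flag values are placeholders.

```ts
import puppeteer from "puppeteer";

(async () => {
  // e.g. CHROME_FLAGS="--disable-gpu --lang=en-US" (placeholder values)
  const chromeFlags = (process.env.CHROME_FLAGS || "")
    .split(" ")
    .filter((f) => f.length > 0);

  // Flags managed by the crawler itself (such as --proxy-server) would be
  // set elsewhere and are not meant to be overridden via CHROME_FLAGS.
  const browser = await puppeteer.launch({ args: chromeFlags });

  // ... crawl ...
  await browser.close();
})();
```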