Remove puppeteer-cluster #219

Merged 5 commits on Mar 9, 2023

tw4l commented Feb 2, 2023

Connected to #214

tw4l added 2 commits February 7, 2023 11:24
This commit removes puppeteer-cluster as a dependency in favor of
a simpler concurrency implementation, using p-queue to limit
concurrency to the number of available workers. As part of the
refactor, the custom window concurrency model in windowconcur.js
is removed and its logic implemented in the new Worker class's
initPage method.
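
The concurrency model described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: a small hand-rolled `TaskQueue` stands in for p-queue so the example has no dependencies, and `TaskQueue`, `crawlAll`, `crawlPage`, and `numWorkers` are hypothetical names chosen for the sketch.

```javascript
// Hypothetical stand-in for p-queue: runs async tasks with at most
// `concurrency` of them in flight at once. Not the crawler's actual code.
class TaskQueue {
  constructor(concurrency) {
    this.concurrency = concurrency;
    this.running = 0;
    this.pending = [];
  }

  // Queue a task (a function returning a promise); resolves with its result.
  add(task) {
    return new Promise((resolve, reject) => {
      this.pending.push({ task, resolve, reject });
      this._next();
    });
  }

  // Start queued tasks while there are free slots.
  _next() {
    while (this.running < this.concurrency && this.pending.length > 0) {
      const { task, resolve, reject } = this.pending.shift();
      this.running++;
      Promise.resolve()
        .then(task)
        .then(resolve, reject)
        .finally(() => {
          // Free the slot and pick up the next queued task, if any.
          this.running--;
          this._next();
        });
    }
  }
}

// Crawl a list of URLs with one queue slot per available worker (assumed model).
async function crawlAll(urls, numWorkers, crawlPage) {
  const queue = new TaskQueue(numWorkers);
  return Promise.all(urls.map((url) => queue.add(() => crawlPage(url))));
}
```

With p-queue itself, the equivalent setup would be `new PQueue({ concurrency: numWorkers })` followed by `queue.add(() => crawlPage(url))`. How the PR's Worker class actually hands pages to the queue is not shown in this conversation.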
tw4l force-pushed the issue-214-remove-puppeteer-cluster branch from d4dbaf9 to 2aad268 February 7, 2023 16:24
tw4l marked this pull request as ready for review February 7, 2023 17:04
tw4l requested a review from ikreymer February 7, 2023 17:04
ikreymer added 2 commits March 2, 2023 13:23
logging: log info string / version as first line
logging: improve logging of error stack traces
interruption: support interrupting crawl directly with 'interrupt' check which stops the job queue
interruption: don't repair if interrupting, wait for queue to be idle

tw4l commented Mar 3, 2023

Looks like the latest changes are causing a failure to write the CDX index: https://github.com/webrecorder/browsertrix-crawler/actions/runs/4319906519/jobs/7539596680#step:6:179

* log text extraction
update puppeteer-core

* iframe filtering:
- fix filtering for about:blank iframes, support non-async shouldProcessFrame()
- filter iframes both for behaviors and for link extraction
- add 5-second timeout to link extraction, to avoid link extraction holding up crawl!
- cache filtered frames
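
The timeout and cached frame filtering in the list above might look something like this. It is a sketch under stated assumptions, not the code from this PR: `withTimeout`, `filterFrame`, and `frameDecisions` are hypothetical names, and it assumes that `about:blank` frames are always skipped and that decisions are cached per frame id.

```javascript
// Hypothetical helper (not the crawler's code): settle with `promise`,
// or reject once `ms` milliseconds have passed.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical cache of per-frame decisions, keyed by an assumed frame id.
const frameDecisions = new Map();

// `shouldProcessFrame` may be sync or async; `await` handles both.
// Assumption: about:blank frames are never processed.
async function filterFrame(frameId, frameUrl, shouldProcessFrame) {
  if (frameDecisions.has(frameId)) {
    return frameDecisions.get(frameId);
  }
  const keep =
    frameUrl !== "about:blank" && Boolean(await shouldProcessFrame(frameUrl));
  frameDecisions.set(frameId, keep);
  return keep;
}
```

A 5-second cap on link extraction would then read `await withTimeout(extractLinks(page), 5000, "link extraction")`, where `extractLinks` is again a placeholder for whatever the crawler calls.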

* logging: adjust info->debug logging
tests: bail on first failure

* init order: ensure wb-manager init called first, then logs created

* healthcheck/worker reuse:
- refactor healthchecker into separate class
- increment healthchecker (if provided) if new page load fails
- remove experimental repair functionality for now
- add healthcheck
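
As a rough illustration of the health check items above: a separate class can track page-load failures, and the page loader can report to it only when one was provided. This is a guess at the shape, not the PR's implementation; `HealthChecker`, `incError`, `resetErrors`, `isHealthy`, `errorThreshold`, and `loadPage` are hypothetical names, and resetting the count on a successful load is an assumption.

```javascript
// Hypothetical health checker: counts page-load failures and reports
// unhealthy once a threshold is reached. Names are illustrative only.
class HealthChecker {
  constructor(errorThreshold) {
    this.errorThreshold = errorThreshold;
    this.errorCount = 0;
  }

  incError() {
    this.errorCount++;
  }

  resetErrors() {
    this.errorCount = 0;
  }

  isHealthy() {
    return this.errorCount < this.errorThreshold;
  }
}

// Load a page, updating the health checker only if one was provided.
// Assumption: a successful load resets the error count.
async function loadPage(url, loadFn, healthChecker = null) {
  try {
    const page = await loadFn(url);
    if (healthChecker) healthChecker.resetErrors();
    return page;
  } catch (err) {
    if (healthChecker) healthChecker.incError();
    throw err;
  }
}
```

Keeping the counter in its own class means the same object could back an HTTP health endpoint without the worker code needing to know about it.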

* remove unused arg

* Log no jobs available as debug

---------

Co-authored-by: Tessa Walsh <[email protected]>
ikreymer merged commit 1bee46b into main Mar 9, 2023
ikreymer deleted the issue-214-remove-puppeteer-cluster branch March 9, 2023 02:31
ikreymer mentioned this pull request Mar 9, 2023