Healthcheck + Logging Fixes + Iframe Filtering Fixes #238

Merged

@ikreymer ikreymer commented Mar 6, 2023

  • Refactor healthchecker into its own class, increment healthcheck for new page load failures
  • Reuse pages (up to 5 page loads per page) before creating a new page (similar to the old window-concurrency system)
  • iframes: ensure blank iframes are filtered out correctly for both behaviors and link extraction
  • adblock: add sync isAdUrl() check
  • link extraction: set timeout to avoid hanging in link extraction
  • logging: set debug logging globally
  • deps: bump puppeteer-core to latest
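
The link extraction timeout mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `withTimeout` and `extractLinks` are hypothetical names, and `extractLinks` is only a stand-in that simulates a slow extraction step. The idea is to race the extraction against a timer so a hanging frame cannot hold up the crawl.

```javascript
// Hypothetical helper (not from the PR): resolves with the task's result,
// or with `fallback` if the task has not settled within `ms` milliseconds.
function withTimeout(task, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Whichever promise settles first wins; the timer is always cleared.
  return Promise.race([task, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical stand-in for a link extraction step that takes `delayMs`
// to complete and then yields `links`.
function extractLinks(delayMs, links) {
  return new Promise((resolve) => setTimeout(() => resolve(links), delayMs));
}
```

Note that racing does not cancel the underlying work; it only stops the caller from waiting on it. A call such as `await withTimeout(extractLinks(10000, []), 5000, [])` would give up after 5 seconds and continue with an empty list.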

ikreymer added 5 commits March 3, 2023 13:49
update puppeteer-core
- fix filtering for about:blank iframes, support non-async shouldProcessFrame()
- filter iframes both for behaviors and for link extraction
- add 5-second timeout to link extraction, to avoid link extraction holding up crawl!
- cache filtered frames
tests: bail on first failure
- refactor healthchecker into separate class
- increment healthchecker (if provided) if new page load fails
- remove experimental repair functionality for now
- add healthcheck
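
For readers unfamiliar with the pattern, a dedicated health checker that is incremented on failed page loads might look something like the sketch below. It is an assumption-laden illustration, not the class merged here: the method names, the error threshold, the reset-on-success behavior, and the `openPage` helper are all hypothetical.

```javascript
// Hypothetical sketch; names and threshold semantics are assumptions,
// not the implementation from this PR.
class HealthChecker {
  constructor(errorThreshold) {
    this.errorThreshold = errorThreshold;
    this.errorCount = 0;
  }

  // Record one failure, e.g. when a new page fails to load.
  incError() {
    this.errorCount++;
  }

  // Clear the failure count, e.g. after a successful page load.
  resetErrors() {
    this.errorCount = 0;
  }

  isHealthy() {
    return this.errorCount < this.errorThreshold;
  }

  // Status code an HTTP health endpoint could return.
  statusCode() {
    return this.isHealthy() ? 200 : 503;
  }
}

// Hypothetical helper: try to create a page, and count the failure
// only if a health checker was provided.
async function openPage(createPage, healthChecker) {
  try {
    return await createPage();
  } catch (e) {
    if (healthChecker) {
      healthChecker.incError();
    }
    return null;
  }
}
```

Keeping the counter in its own class means the crawler only needs to pass the checker around and call `incError()` at failure points, while whatever serves the health endpoint reads `statusCode()`.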
@ikreymer ikreymer requested a review from tw4l March 6, 2023 23:10

@tw4l tw4l left a comment

One small comment, otherwise looks great. I like the dedicated HealthChecker class, and thanks for fixing the ad block frame issues!

@ikreymer ikreymer merged commit af577b4 into issue-214-remove-puppeteer-cluster Mar 9, 2023
@ikreymer ikreymer mentioned this pull request Mar 9, 2023
@tw4l tw4l deleted the remove-pptr-cluster-fixes branch March 14, 2023 16:05
@tw4l tw4l restored the remove-pptr-cluster-fixes branch March 14, 2023 16:05
@tw4l tw4l deleted the remove-pptr-cluster-fixes branch March 14, 2023 16:05
2 participants