-
-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Catch loading issues #255
Catch loading issues #255
Conversation
when goto() times out, detect if 'domcontentloaded' at least happened - if not, fail page immediately - if yes, at least extract links, but don't run behaviors for frames: - detect if frame actually not from a frame tag (eg. OBJECT) tag, and skip as well screencaster: - set Nth frame to 1 to avoid no screencast data when page is static behaviors: - update to 0.5.0-beta.0 (includes autoscroll improvements)
decouple worker new page creation vs page completed/failed status - worker ids: just use 0, 1 indexes - worker: only keeps track of crash state to recreate page - screencaster: allow reusing caster slots with fixed ids - abort timedCrawlPage() if 'crash' event happens - crawler: pageFinished() callback when page finishes - page considered 'finished' if it got past pageLoaded, even if behaviors timed out - todo: figure out how to log page load status in pages list
…orker is empty to send 'close' message when no more work work is expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Left a minor comment or two but tested and working well, and I really like these changes.
One note I don't have a clear answer for but was thinking about: in pages.jsonl, would we maybe want to log the description of the state rather than the number? e.g. BEHAVIORS_DONE
rather than 4
? or is the numeric value maybe more useful for post-processing/analysis?
@@ -3,6 +3,38 @@ import { logger } from "./logger.js"; | |||
import { MAX_DEPTH } from "./constants.js"; | |||
|
|||
|
|||
// ============================================================================ | |||
export const LoadState = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we could store enums, but yeah, was thinking the numeric value could be useful for processing / filtering later?
One issue is that non-HTML pages can only go as high as 2 (loaded), so should perhaps also store content-type, eg. a PDF with loadState 2 would be considered success, while an HTML page with loadState 2 would mean the loading timed out...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, maybe its worth doing something like "00-failed"
, "01-content-loaded"
, "02-full-page-loaded"
so can have more user friendly status and be able to do comparison?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving as is for now, can reevaluate in a separate PR
Fixes for various loading issues
when goto() times out, detect if 'domcontentloaded' at least happened
for frames:
screencaster:
behaviors: