Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch loading issues #255

Merged
merged 8 commits into from
Mar 21, 2023
Merged

Catch loading issues #255

merged 8 commits into from
Mar 21, 2023

Conversation

ikreymer
Copy link
Member

Fixes for various loading issues

when goto() times out, detect if 'domcontentloaded' at least happened

  • if not, fail page immediately
  • if yes, at least extract links, but don't run behaviors

for frames:

  • detect if frame actually not from a frame tag (eg. OBJECT) tag, and skip as well

screencaster:

  • set Nth frame to 1 to avoid no screencast data when page is static

behaviors:

  • update to 0.5.0-beta.0 (includes autoscroll improvements)

when goto() times out, detect if 'domcontentloaded' at least happened
- if not, fail page immediately
- if yes, at least extract links, but don't run behaviors

for frames:
- detect if frame actually not from a frame tag (eg. OBJECT) tag, and skip as well

screencaster:
- set Nth frame to 1 to avoid no screencast data when page is static

behaviors:
- update to 0.5.0-beta.0 (includes autoscroll improvements)
decouple worker new page creation vs page completed/failed status
- worker ids: just use 0, 1 indexes
- worker: only keeps track of crash state to recreate page
- screencaster: allow reusing caster slots with fixed ids
- abort timedCrawlPage() if 'crash' event happens
- crawler: pageFinished() callback when page finishes
- page considered 'finished' if it got past pageLoaded, even if behaviors timed out
- todo: figure out how to log page load status in pages list
…orker is empty to send 'close' message when no more work work is expected
@ikreymer ikreymer requested a review from tw4l March 20, 2023 17:05
Copy link
Member

@tw4l tw4l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Left a minor comment or two but tested and working well, and I really like these changes.

One note I don't have a clear answer for but was thinking about: in pages.jsonl, would we maybe want to log the description of the state rather than the number? e.g. BEHAVIORS_DONE rather than 4? or is the numeric value maybe more useful for post-processing/analysis?

@@ -3,6 +3,38 @@ import { logger } from "./logger.js";
import { MAX_DEPTH } from "./constants.js";


// ============================================================================
export const LoadState = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could store enums, but yeah, was thinking the numeric value could be useful for processing / filtering later?
One issue is that non-HTML pages can only go as high as 2 (loaded), so should perhaps also store content-type, eg. a PDF with loadState 2 would be considered success, while an HTML page with loadState 2 would mean the loading timed out...

Copy link
Member Author

@ikreymer ikreymer Mar 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, maybe its worth doing something like "00-failed", "01-content-loaded", "02-full-page-loaded" so can have more user friendly status and be able to do comparison?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving as is for now, can reevaluate in a separate PR

@ikreymer ikreymer merged commit 02fb137 into main Mar 21, 2023
@ikreymer ikreymer deleted the catch-loading-issues branch March 21, 2023 01:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants