Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

various edge-case loading optimizations: #709

Merged
merged 8 commits into from
Oct 31, 2024
Merged

Conversation

ikreymer
Copy link
Member

  • if too many range requests for same URL are being made, try skipping/failing right away to reduce load
  • assume main browser context is used not just for service workers, always enable
  • check false positive 'net-aborted' error that may actually be ok for media, as well as documents
  • improve logging
  • possible fix for issues in Browser disconnected (crashed?) #706
  • interrupt any pending requests (that may be loading via browser context) after page timeout, log dropped requests

@tw4l
Copy link
Member

tw4l commented Oct 28, 2024

I'm wondering if we should also move the "Large payload written to WARC, but not returned to browser (would require rereading into memory)" log messages to debug since users can interpret it as an error rather than expected behavior

Copy link
Member

@tw4l tw4l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, a few small comments. I tested against a few of the examples in #706 (comment) and am not seeing any browser crashes, though these were admittedly smaller-scoped crawls.

@ikreymer ikreymer requested a review from tw4l October 31, 2024 17:06
@ikreymer
Copy link
Member Author

This is ready for re-review - though probably want to merge test fixes from #710 first

@ikreymer
Copy link
Member Author

Main changes are actually streaming 206 responses (!) which were before always loaded into memory, also failing duplicate 206 responses that are not from 0- to avoid additional load.

- if too many range requests for same URL are being made, try skipping/failing right away to reduce load
- assume main browser context is used not just for service workers, always enable
- check false positive 'net-aborted' error that may actually be ok for media, as well as documents
- improve logging
- possible fix for issues in #706
- interrupt any pending requests (that may be loading via browser context) after page timeout, log dropped requests
…amed by only checking 200, not 206 (ack!) status code

new logic:
- if content length > 25MB (text rewrite limit), always stream, won't do any rewriting (should be fairly rare)
- if content length > 5MB, always stream, unless essential resource (eg. html/js/css) which requires in-memory fetch for possible rewriting
- if content length is unknown, stream if non-error code <300, as error codes likely aren't very large.
- add fetchContinued to avoid double-handling requestPaused if intercepted both in browser context and page context
- rename isServiceWorker -> isBrowserContext, write record in fetch response if no page context
- don't write 'no payload' requests in browser context, as they may be redirects reusing same requestId, and 204 will get skipped anyway
- don't auto-attempt aborted media, already handled via behavior fetch
- interrupt pending requests when page is finished, so pageinfo record is written after
- add pageFinished flag to recorder, remove unused 'skipping' flag
- renable attempt refetch, should be using dedup
@tw4l tw4l force-pushed the range-load-optimizations branch from 5cacea5 to 5ea30d3 Compare October 31, 2024 17:25
Copy link
Member

@tw4l tw4l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Great catch on the 206 streaming issue. Just left one small suggestion

Co-authored-by: Tessa Walsh <[email protected]>
@ikreymer ikreymer merged commit e5bab8e into main Oct 31, 2024
4 checks passed
@ikreymer ikreymer deleted the range-load-optimizations branch October 31, 2024 21:06
ikreymer added a commit that referenced this pull request Nov 13, 2024
- fix: prefer streaming current response via takeStream, not only when size is unknown
- ensure partial range requests are not async fetched, only full responses
- don't serialize zero-payload responses
- don't serialize 206 responses if there is size mismatch
ikreymer added a commit that referenced this pull request Nov 14, 2024
various fixes for streaming, especially related to range requests
- follow up to #709
- fix: prefer streaming current response via takeStream, not only when
size is unknown
- don't serialize async responses prematurely
- don't serialize 206 responses if there is size mismatch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants