various edge-case loading optimizations: #709

ikreymer · 2024-10-27T21:36:57Z

if too many range requests for same URL are being made, try skipping/failing right away to reduce load
assume main browser context is used not just for service workers, always enable
check false positive 'net-aborted' error that may actually be ok for media, as well as documents
improve logging
possible fix for issues in Browser disconnected (crashed?) #706
interrupt any pending requests (that may be loading via browser context) after page timeout, log dropped requests
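The first item (failing early when too many range requests hit the same URL) could be sketched roughly as below. This is only an illustration: `RangeRequestTracker`, `allow` and the limit of 5 are hypothetical names and values, not taken from the crawler's code.

```typescript
// Hypothetical limit; the actual threshold used by the crawler is not stated in this thread.
const MAX_RANGE_REQUESTS_PER_URL = 5;

class RangeRequestTracker {
  private counts = new Map<string, number>();

  // Record one more range request for `url`.
  // Returns true if it may proceed, false if it should be failed right away.
  allow(url: string): boolean {
    const count = (this.counts.get(url) ?? 0) + 1;
    this.counts.set(url, count);
    return count <= MAX_RANGE_REQUESTS_PER_URL;
  }
}
```

A caller would check `allow(url)` before doing any work for a range request and fail the request immediately when it returns false.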

tw4l · 2024-10-28T14:47:50Z

I'm wondering if we should also move the "Large payload written to WARC, but not returned to browser (would require rereading into memory)" log messages to debug, since users can interpret them as errors rather than expected behavior.

tw4l

Looks good, a few small comments. I tested against a few of the examples in #706 (comment) and am not seeing any browser crashes, though these were admittedly smaller-scoped crawls.

src/util/browser.ts

src/util/recorder.ts

ikreymer · 2024-10-31T17:07:29Z

This is ready for re-review - though probably want to merge test fixes from #710 first

ikreymer · 2024-10-31T17:08:28Z

The main changes are actually streaming 206 responses (!), which were previously always loaded into memory, and also failing duplicate 206 responses that are not from 0-, to avoid additional load.
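As a rough sketch of the "duplicate 206 not from 0-" check: the code below parses the start offset of a `Range` request header and rejects a repeated range request for a URL unless it starts at byte 0. The function names and the exact policy are assumptions made for illustration, not the recorder's actual logic.

```typescript
// Parse the start offset of a single-range "bytes=START-END" header value.
// Returns null for suffix ranges ("bytes=-500"), multi-range or malformed values.
function parseRangeStart(range: string): number | null {
  const m = /^bytes=(\d+)-(\d*)$/.exec(range.trim());
  return m ? Number(m[1]) : null;
}

// Hypothetical policy: the first range request for a URL is let through;
// a repeat is failed unless it starts from byte 0.
function shouldFailDuplicateRange(
  url: string,
  range: string,
  seen: Set<string>,
): boolean {
  const start = parseRangeStart(range);
  const isDuplicate = seen.has(url);
  seen.add(url);
  return isDuplicate && start !== 0;
}
```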


…amed by only checking 200, not 206 (ack!) status code

new logic:
- if content length > 25MB (text rewrite limit), always stream, won't do any rewriting (should be fairly rare)
- if content length > 5MB, always stream, unless essential resource (eg. html/js/css) which requires in-memory fetch for possible rewriting
- if content length is unknown, stream if non-error code <300, as error codes likely aren't very large.
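The three rules above can be read as a single decision function. The sketch below is one possible reading: the constant and function names are made up for illustration, "MB" is assumed to mean 1,000,000 bytes, and responses at or below 5MB are assumed to be fetched in memory.

```typescript
// Assumed byte values for the limits named in the commit message.
const TEXT_REWRITE_LIMIT = 25 * 1000 * 1000; // "25MB (text rewrite limit)"
const IN_MEMORY_LIMIT = 5 * 1000 * 1000; // "5MB"

// contentLength is null when the size is unknown.
// isEssential marks resources such as html/js/css that may need rewriting.
function shouldStream(
  contentLength: number | null,
  status: number,
  isEssential: boolean,
): boolean {
  if (contentLength === null) {
    // Unknown size: stream non-error responses, since error bodies are likely small.
    return status < 300;
  }
  if (contentLength > TEXT_REWRITE_LIMIT) {
    // Too large to rewrite at all, so always stream.
    return true;
  }
  if (contentLength > IN_MEMORY_LIMIT) {
    // Stream unless the resource must be held in memory for possible rewriting.
    return !isEssential;
  }
  return false;
}
```

For example, under these assumptions a 10MB video would be streamed, while a 10MB JavaScript file would still be fetched in memory.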

- add fetchContinued to avoid double-handling requestPaused if intercepted both in browser context and page context
- rename isServiceWorker -> isBrowserContext, write record in fetch response if no page context
- don't write 'no payload' requests in browser context, as they may be redirects reusing same requestId, and 204 will get skipped anyway
- don't auto-attempt aborted media, already handled via behavior fetch
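The idea behind a `fetchContinued` flag can be sketched as a "first handler wins" guard keyed by request id. Only the flag name comes from the commit message; the `RequestState` shape, the `requests` map and `claimPausedRequest` are hypothetical.

```typescript
// Hypothetical minimal request state; a real request info object would hold more fields.
interface RequestState {
  requestId: string;
  fetchContinued: boolean;
}

const requests = new Map<string, RequestState>();

// Returns true if the caller should handle this paused request,
// false if another context (browser or page) already continued it.
function claimPausedRequest(requestId: string): boolean {
  let state = requests.get(requestId);
  if (!state) {
    state = { requestId, fetchContinued: false };
    requests.set(requestId, state);
  }
  if (state.fetchContinued) {
    return false;
  }
  state.fetchContinued = true;
  return true;
}
```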

- interrupt pending requests when page is finished, so pageinfo record is written after
- add pageFinished flag to recorder, remove unused 'skipping' flag
- re-enable attempt refetch, should be using dedup
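A minimal sketch of interrupting pending requests once a page is finished, assuming each in-flight request registers a cancel callback. Apart from the `pageFinished` flag, every name here is hypothetical, and the returned list stands in for the "log dropped requests" step.

```typescript
class PendingRequests {
  pageFinished = false;
  private pending = new Map<string, () => void>();

  // Register an in-flight request. If the page is already finished,
  // cancel it immediately and report that it was not accepted.
  add(requestId: string, cancel: () => void): boolean {
    if (this.pageFinished) {
      cancel();
      return false;
    }
    this.pending.set(requestId, cancel);
    return true;
  }

  // Mark a request as completed normally.
  done(requestId: string): void {
    this.pending.delete(requestId);
  }

  // Mark the page as finished, cancel everything still pending,
  // and return the dropped request ids so they can be logged.
  finishPage(): string[] {
    this.pageFinished = true;
    const dropped = Array.from(this.pending.keys());
    this.pending.forEach((cancel) => cancel());
    this.pending.clear();
    return dropped;
  }
}
```

With this shape, a pageinfo-style record can be written right after `finishPage()` returns, since nothing is left in flight.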

tw4l

Nice! Great catch on the 206 streaming issue. Just left one small suggestion

src/util/recorder.ts

Co-authored-by: Tessa Walsh

src/util/recorder.ts

- fix: prefer streaming current response via takeStream, not only when size is unknown
- ensure partial range requests are not async fetched, only full responses
- don't serialize zero-payload responses
- don't serialize 206 responses if there is size mismatch

various fixes for streaming, especially related to range requests
- follow up to #709
- fix: prefer streaming current response via takeStream, not only when size is unknown
- don't serialize async responses prematurely
- don't serialize 206 responses if there is size mismatch
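One way to read "size mismatch" for a 206 response is that the payload length does not match the span declared in the `Content-Range` header. The sketch below checks that, and also rejects zero-length payloads. Both function names are hypothetical, and the actual check in the recorder may compare different values.

```typescript
interface ContentRange {
  start: number;
  end: number;
  total: number | null; // null when the total is given as "*"
}

// Parse a "Content-Range: bytes START-END/TOTAL" header value.
// Returns null for unsatisfied-range ("bytes */1000") or malformed values.
function parseContentRange(value: string): ContentRange | null {
  const m = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(value.trim());
  if (!m) {
    return null;
  }
  return {
    start: Number(m[1]),
    end: Number(m[2]),
    total: m[3] === "*" ? null : Number(m[3]),
  };
}

// Assumed rule: a 206 payload is only serialized if it is non-empty and
// its size equals the inclusive byte span declared in Content-Range.
function canSerialize206(contentRange: string, payloadSize: number): boolean {
  const range = parseContentRange(contentRange);
  if (!range || payloadSize === 0) {
    return false;
  }
  return range.end - range.start + 1 === payloadSize;
}
```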

ikreymer requested a review from tw4l October 27, 2024 21:37

benoit74 mentioned this pull request Oct 28, 2024

Browser disconnected (crashed?) #706

Closed

tw4l approved these changes Oct 28, 2024


ikreymer requested a review from tw4l October 31, 2024 17:06

ikreymer added 5 commits October 31, 2024 13:25

fixes from code review

e084e13

further cleanup:

5ea30d3


tw4l force-pushed the range-load-optimizations branch from 5cacea5 to 5ea30d3 Compare October 31, 2024 17:25

tw4l approved these changes Oct 31, 2024


Update src/util/recorder.ts

b6b4557

Co-authored-by: Tessa Walsh

tw4l reviewed Oct 31, 2024


tw4l and others added 2 commits October 31, 2024 15:12

Fix typo

856448d

format fix

ce4da11

ikreymer merged commit e5bab8e into main Oct 31, 2024
4 checks passed

ikreymer deleted the range-load-optimizations branch October 31, 2024 21:06

ikreymer mentioned this pull request Nov 13, 2024

Ensure partial responses are not written #721

Merged
