
Enhance scraper reliability with improved page load detection #80

Merged 1 commit into typesense:master on Feb 4, 2025

Conversation

tharropoulos
Contributor

Change Summary

What is this?

This PR addresses the inconsistent page scraping behavior where the number of pages found and processed varies between runs. The issue stems from race conditions during page loading, where content might not be fully rendered before scraping begins. By implementing a more robust page load detection mechanism, we ensure consistent scraping results across multiple runs.

Changes

Code Changes:

  1. In custom_downloader_middleware.py:

    • Implement dynamic page load detection using `WebDriverWait` instead of static delays
    • Add proper error handling for timeout scenarios
    • Improve code organization and import structure

    Changes in detail:

    • Replace `time.sleep(spider.js_wait)` with `WebDriverWait` to actively check document readiness
    • Fall back to the original static timeout behavior if the dynamic check fails
    • Import `WebDriverWait` and the relevant Selenium exceptions
    • Reorganize imports for readability
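The steps above can be sketched as follows. This is a minimal illustration, not the PR's actual code: a hand-rolled polling loop stands in for Selenium's `WebDriverWait`, a custom `PageLoadTimeout` stands in for Selenium's `TimeoutException`, and `FakeDriver`, `wait_with_fallback`, and `ready_after_calls` are hypothetical names so the sketch runs without a browser. The `js_wait` parameter comes from the PR description.

```python
import time


class PageLoadTimeout(Exception):
    """Stand-in for selenium's TimeoutException in this sketch."""


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver: reports the page
    as 'complete' after a fixed number of readyState polls."""

    def __init__(self, ready_after_calls):
        self.calls = 0
        self.ready_after_calls = ready_after_calls

    def execute_script(self, script):
        self.calls += 1
        return "complete" if self.calls >= self.ready_after_calls else "loading"


def wait_for_ready_state(driver, timeout, poll_interval=0.05):
    """Poll document.readyState until it is 'complete' or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if driver.execute_script("return document.readyState") == "complete":
            return
        if time.monotonic() >= deadline:
            raise PageLoadTimeout(f"page not ready after {timeout}s")
        time.sleep(poll_interval)


def wait_with_fallback(driver, js_wait):
    """Try the dynamic wait first; on timeout, fall back to the
    original static delay, mirroring the fallback the PR describes."""
    try:
        wait_for_ready_state(driver, timeout=js_wait)
    except PageLoadTimeout:
        time.sleep(js_wait)  # original behavior as a safety net


driver = FakeDriver(ready_after_calls=3)
wait_with_fallback(driver, js_wait=2.0)
print(driver.calls)  # 3
```

In the real middleware, the equivalent of `wait_for_ready_state` is Selenium's `WebDriverWait(...).until(...)` with a condition that evaluates `document.readyState`.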

Demo

Running the scraper now produces consistent results across multiple runs for the same website. The scraper properly waits for the DOM content to be fully loaded before extracting content.

Context

This change addresses issue #75 where the scraper was producing inconsistent results due to race conditions in page loading. The number of pages found would vary between runs because some dynamic content might not have been fully loaded when the scraping occurred.

Previously, the scraper used a static delay which could be:

  • Too short: Missing dynamically loaded content
  • Too long: Unnecessarily slowing down the scraping process

The new implementation waits only as long as each page actually needs to load, up to the configured timeout.
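The latency benefit can be illustrated with a small timing sketch. All names here are hypothetical (`SlowPage`, `dynamic_wait`, `JS_WAIT`), and the polling loop merely approximates what Selenium's `WebDriverWait` does: a page that becomes ready after ~0.2 s no longer costs the full static delay.

```python
import time


class SlowPage:
    """Hypothetical fake driver whose page becomes 'complete'
    ready_at seconds after construction."""

    def __init__(self, ready_at):
        self.start = time.monotonic()
        self.ready_at = ready_at

    def execute_script(self, script):
        elapsed = time.monotonic() - self.start
        return "complete" if elapsed >= self.ready_at else "loading"


def dynamic_wait(driver, timeout, poll_interval=0.05):
    """Return True as soon as readyState is 'complete', False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.execute_script("return document.readyState") == "complete":
            return True
        time.sleep(poll_interval)
    return False


JS_WAIT = 2.0  # hypothetical static delay the scraper used to sleep

driver = SlowPage(ready_at=0.2)
t0 = time.monotonic()
ready = dynamic_wait(driver, timeout=JS_WAIT)
elapsed = time.monotonic() - t0

# A static sleep would always cost the full JS_WAIT seconds;
# the dynamic wait returns shortly after the page is ready.
print(ready, elapsed < JS_WAIT)
```

The same loop also covers the "too short" failure mode: a slow page is polled until the deadline rather than being scraped prematurely.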

PR Checklist

- use `WebDriverWait` to ensure complete page load before scraping
- replace static delay with dynamic ready state check
- prevent race conditions in DOM content extraction
@jasonbosco jasonbosco merged commit 27407d7 into typesense:master Feb 4, 2025
1 check passed
@jasonbosco
Member

This PR is now available in 0.12.0.rc6
