
Enhance scraper reliability with improved page load detection #80

Merged 1 commit into typesense:master on Feb 4, 2025

Conversation

tharropoulos
Contributor

Change Summary

What is this?

This PR addresses the inconsistent page scraping behavior where the number of pages found and processed varies between runs. The issue stems from race conditions during page loading, where content might not be fully rendered before scraping begins. By implementing a more robust page load detection mechanism, we ensure consistent scraping results across multiple runs.

Changes

Code Changes:

  1. In custom_downloader_middleware.py:

    • Implement dynamic page load detection using `WebDriverWait` instead of static delays
    • Add proper error handling for timeout scenarios
    • Improve code organization and import structure

    Changes in detail:

    • Replace `time.sleep(spider.js_wait)` with `WebDriverWait` to actively check document readiness
    • Fall back to the original static timeout behavior if the dynamic check fails
    • Import `WebDriverWait` and the relevant Selenium exceptions
    • Reorganize imports for readability
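The steps above can be sketched as follows. This is a minimal illustration, not the PR's actual code: a hand-rolled polling loop stands in for Selenium's `WebDriverWait`, a custom `PageLoadTimeout` stands in for Selenium's `TimeoutException`, and `FakeDriver`, `wait_with_fallback`, and `ready_after_calls` are hypothetical names so the sketch runs without a browser. The `js_wait` parameter comes from the PR description.

```python
import time


class PageLoadTimeout(Exception):
    """Stand-in for selenium's TimeoutException in this sketch."""


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver: reports the page
    as 'complete' after a fixed number of readyState polls."""

    def __init__(self, ready_after_calls):
        self.calls = 0
        self.ready_after_calls = ready_after_calls

    def execute_script(self, script):
        self.calls += 1
        return "complete" if self.calls >= self.ready_after_calls else "loading"


def wait_for_ready_state(driver, timeout, poll_interval=0.05):
    """Poll document.readyState until it is 'complete' or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if driver.execute_script("return document.readyState") == "complete":
            return
        if time.monotonic() >= deadline:
            raise PageLoadTimeout(f"page not ready after {timeout}s")
        time.sleep(poll_interval)


def wait_with_fallback(driver, js_wait):
    """Try the dynamic wait first; on timeout, fall back to the
    original static delay, mirroring the fallback the PR describes."""
    try:
        wait_for_ready_state(driver, timeout=js_wait)
    except PageLoadTimeout:
        time.sleep(js_wait)  # original behavior as a safety net


driver = FakeDriver(ready_after_calls=3)
wait_with_fallback(driver, js_wait=2.0)
print(driver.calls)  # 3
```

In the real middleware, the equivalent of `wait_for_ready_state` is Selenium's `WebDriverWait(...).until(...)` with a condition that evaluates `document.readyState`.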

Demo

Running the scraper now produces consistent results across multiple runs for the same website. The scraper properly waits for the DOM content to be fully loaded before extracting content.

Context

This change addresses issue #75 where the scraper was producing inconsistent results due to race conditions in page loading. The number of pages found would vary between runs because some dynamic content might not have been fully loaded when the scraping occurred.

Previously, the scraper used a static delay which could be:

  • Too short: Missing dynamically loaded content
  • Too long: Unnecessarily slowing down the scraping process

The new implementation waits only as long as each page actually needs to load, up to the configured timeout.
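The latency benefit can be illustrated with a small timing sketch. All names here are hypothetical (`SlowPage`, `dynamic_wait`, `JS_WAIT`), and the polling loop merely approximates what Selenium's `WebDriverWait` does: a page that becomes ready after ~0.2 s no longer costs the full static delay.

```python
import time


class SlowPage:
    """Hypothetical fake driver whose page becomes 'complete'
    ready_at seconds after construction."""

    def __init__(self, ready_at):
        self.start = time.monotonic()
        self.ready_at = ready_at

    def execute_script(self, script):
        elapsed = time.monotonic() - self.start
        return "complete" if elapsed >= self.ready_at else "loading"


def dynamic_wait(driver, timeout, poll_interval=0.05):
    """Return True as soon as readyState is 'complete', False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.execute_script("return document.readyState") == "complete":
            return True
        time.sleep(poll_interval)
    return False


JS_WAIT = 2.0  # hypothetical static delay the scraper used to sleep

driver = SlowPage(ready_at=0.2)
t0 = time.monotonic()
ready = dynamic_wait(driver, timeout=JS_WAIT)
elapsed = time.monotonic() - t0

# A static sleep would always cost the full JS_WAIT seconds;
# the dynamic wait returns shortly after the page is ready.
print(ready, elapsed < JS_WAIT)
```

The same loop also covers the "too short" failure mode: a slow page is polled until the deadline rather than being scraped prematurely.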

PR Checklist

- use `WebDriverWait` to ensure complete page load before scraping
- replace static delay with dynamic ready state check
- prevent race conditions in DOM content extraction
@jasonbosco jasonbosco merged commit 27407d7 into typesense:master Feb 4, 2025
1 check passed
@jasonbosco
Member

This PR is now available in 0.12.0.rc6
