Enhance scraper reliability with improved page load detection #80
Change Summary
What is this?
This PR addresses the inconsistent page scraping behavior where the number of pages found and processed varies between runs. The issue stems from race conditions during page loading, where content might not be fully rendered before scraping begins. By implementing a more robust page load detection mechanism, we ensure consistent scraping results across multiple runs.
Changes
Code Changes:
In custom_downloader_middleware.py: use WebDriverWait instead of static delays.
Changes in detail:
Replace time.sleep(spider.js_wait) with WebDriverWait to actively check document readiness.
Demo
Running the scraper now produces consistent results across multiple runs for the same website. The scraper properly waits for the DOM content to be fully loaded before extracting content.
Context
This change addresses issue #75 where the scraper was producing inconsistent results due to race conditions in page loading. The number of pages found would vary between runs because some dynamic content might not have been fully loaded when the scraping occurred.
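Previously, the scraper used a static delay, which could be too short for slow pages, so content was missed, or longer than necessary for fast ones.
The new implementation waits only until each page reports that it has finished loading, rather than for a fixed interval.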
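To illustrate the difference between a static delay and an active readiness check, here is a minimal sketch. It is not the code from this PR. It polls document.readyState through the driver's execute_script method until the page reports "complete" or a timeout expires, which is roughly what a Selenium WebDriverWait does internally. The names wait_for_page_ready, PageLoadTimeout and FakeDriver, and the timeout and poll interval values, are hypothetical and chosen only so the example runs without a browser.

```python
import time


class PageLoadTimeout(Exception):
    """Raised when the page does not report readiness in time (hypothetical name)."""


def wait_for_page_ready(driver, timeout=10.0, poll_interval=0.5):
    """Poll document.readyState until it is 'complete' or the timeout expires.

    `driver` only needs an execute_script(script) method, which Selenium
    WebDriver objects provide. Returns the number of polls performed.

    With Selenium itself, the equivalent explicit wait would be:
        WebDriverWait(driver, timeout).until(
            lambda d: d.execute_script("return document.readyState") == "complete"
        )
    """
    deadline = time.monotonic() + timeout
    polls = 0
    while True:
        polls += 1
        state = driver.execute_script("return document.readyState")
        if state == "complete":
            return polls
        if time.monotonic() >= deadline:
            raise PageLoadTimeout(f"page still {state!r} after {timeout}s")
        time.sleep(poll_interval)


class FakeDriver:
    """Stand-in for a WebDriver so the sketch runs without a browser."""

    def __init__(self, states):
        self._states = list(states)

    def execute_script(self, script):
        # Return the next scripted state; stay on the last one afterwards.
        if len(self._states) > 1:
            return self._states.pop(0)
        return self._states[0]


if __name__ == "__main__":
    driver = FakeDriver(["loading", "interactive", "complete"])
    # The fake page becomes ready on the third poll.
    print(wait_for_page_ready(driver, timeout=2.0, poll_interval=0.01))  # 3
```

Unlike time.sleep(spider.js_wait), the loop returns as soon as the page is ready and fails loudly if it never is. Note that a readyState of "complete" does not by itself guarantee that content injected later by JavaScript has finished rendering, so some sites may need an additional condition.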
PR Checklist