Organize Celery tasks #248
CREC scraper
url example -
When we run the Django command above, it calls the scraper and stores the data into the database.

Statements of Administration Policy Scraper
When we run the Django command above, it calls the scraper and stores the data into the database.

CBO Scraper
Before running, the Django command automatically deletes all the CBO instances in the database.
CRS Scraper
Daily Updates
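Since this issue is about organizing the Celery tasks, one way the daily runs could be scheduled is with a Celery beat schedule. A minimal sketch, shown as a plain dict so it stays self-contained; the dotted task paths are hypothetical placeholders, not the project's real task names, and in the Django settings this dict would be assigned to `app.conf.beat_schedule`:

```python
# Hypothetical Celery beat schedule for the daily scraper runs.
# Task paths are illustrative placeholders only.
from datetime import timedelta

beat_schedule = {
    "run-crec-scraper-daily": {
        "task": "scrapers.tasks.run_crec_scraper",  # hypothetical task path
        "schedule": timedelta(days=1),
    },
    "run-cbo-scraper-daily": {
        "task": "scrapers.tasks.run_cbo_scraper",   # hypothetical task path
        "schedule": timedelta(days=1),
    },
    "run-crs-scraper-daily": {
        "task": "scrapers.tasks.run_crs_scraper",   # hypothetical task path
        "schedule": timedelta(days=1),
    },
}
```

Each entry would become one periodic Celery task, so the scrapers run once a day without a separate cron setup.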
We will need to integrate Scrapy with Django. Here is the schema:
1. Client sends a request with a URL to crawl.
2. Django triggers Scrapy to run a spider to crawl that URL.
3. Django returns a response to tell the client that crawling just started.
4. Scrapy completes crawling and saves the extracted data into a database.
5. Django fetches that data from the database and returns it to the client.
In this way, we don't need to store data into JSON files in Scrapy anymore.
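The five steps above can be sketched with stand-in functions: an in-memory dict plays the database, and a direct synchronous call stands in for the real asynchronous Scrapy trigger. All names here are hypothetical, not the project's actual views or spiders:

```python
# Stand-in for the Django-managed database table.
DATABASE = {}

def run_spider(url):
    """Stand-in for the Scrapy spider (step 4): crawl `url` and
    save the extracted data into the database."""
    DATABASE[url] = {"url": url, "status": "crawled"}

def start_crawl_view(url):
    """Stand-in for the Django view (steps 1-3): trigger the spider,
    then immediately acknowledge the client."""
    run_spider(url)  # in production this would be asynchronous (e.g. a Celery task)
    return {"status": "crawling started", "url": url}

def fetch_results_view(url):
    """Step 5: Django reads the extracted data back from the database
    and returns it to the client (None if nothing was crawled yet)."""
    return DATABASE.get(url)
```

In the real setup the spider would run out of process (e.g. via a Celery task or a scrapyd daemon) so the Django view can respond before crawling finishes.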
Notes:
For the scrapy scrapers, we also want to:
Also, for the CREC scraper, I believe there is code that creates the crec_detail_urls.json file.
For the CBO scraper, we don't need to delete the data and recreate it. For the CRS scraper: that way, we can avoid duplicates in Scrapy.
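Instead of delete-and-recreate, the scrapers could upsert on a natural key, which also avoids duplicates. A minimal sketch using a dict as a stand-in for the table; with Django this role would be played by `Model.objects.update_or_create`, and the key and field names here are hypothetical:

```python
# Stand-in table keyed on a natural key (e.g. the source URL), so
# re-running a scraper updates rows in place instead of duplicating
# or wholesale deleting them.
TABLE = {}

def upsert(url, fields):
    """Insert a new row, or update the existing row with the same key.
    Returns (row, created), mirroring Django's update_or_create."""
    created = url not in TABLE
    row = TABLE.setdefault(url, {"url": url})
    row.update(fields)
    return row, created
```

Re-running the scraper with the same key then refreshes the existing row rather than creating a second copy.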
We have a number of scrapers and processing tasks. We need to make sure they run efficiently and get only the latest changes, if possible.
This is first an issue of analyzing the current scrapers and how they work. We need to know:
@ayeshamk, I'm asking Wei to organize this. He may have questions about individual scrapers.
We now have the following scrapers and processors:
crec_loader