This project demonstrates web scraping with Scrapy, extracting data from multiple websites: a specified subreddit on Reddit, the Steam gaming platform, and the Inshorts news site. The extracted data is then processed and saved in various formats, including CSV and PDF.
The project consists of three main parts:

- **Reddit Scraper**: A Scrapy spider designed to scrape posts and comments from a specified subreddit on [old.reddit.com](https://old.reddit.com).
  - Extracts post titles, links, and comments, storing them in a structured format.
- **Steam Scraper**: Scrapes top-selling game data from the Steam platform.
  - Extracts data such as the game name, game URL, image URL, release date, price, and review summary.
  - Saves the extracted data into a CSV file.
  - Converts the CSV data into a formatted PDF.
- **Inshorts Scraper**: Scrapes news articles from Inshorts.
  - Extracts article titles, content, author, and timestamp.
  - Saves the extracted data into a CSV file.
**Clone the Repository**

```bash
git clone https://github.com/ManikSinghSarmaal/Web-Scraping
cd Web-Scraping
```

**Create and Activate a Virtual Environment**

```bash
# On macOS and Linux
python3 -m venv venv
source venv/bin/activate

# On Windows
python -m venv venv
venv\Scripts\activate
```
**Install Required Packages**

```bash
pip install -r requirements.txt
```

**Configure Scrapy Settings**

Ensure you have the correct settings in `settings.py` for each Scrapy spider.
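As an illustration, a minimal `settings.py` for one of the spiders might look like the following. The values here are assumptions for the sketch, not the repo's actual configuration:

```python
# Example Scrapy settings -- illustrative values, adjust per spider.
BOT_NAME = "subreddit"

# Be polite to the target site.
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1  # seconds between requests

# Identify the crawler with a browser-like user agent string.
USER_AGENT = "Mozilla/5.0 (compatible; WebScrapingDemo/1.0)"

# Keep non-ASCII characters readable in CSV exports.
FEED_EXPORT_ENCODING = "utf-8"
```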
**Navigate to the Reddit Scraper Directory**

```bash
cd subreddit
```

**Run the Scraper**

```bash
scrapy crawl subreddit_data -o data.csv
```
If you wish to use a rotating proxy to avoid being banned as the request count increases, create an account on ScrapeOps, add your API key, and uncomment the relevant lines. For more information on using `scrapeops-scrapy-proxy-sdk`, see the [integration guide](https://github.com/ScrapeOps/scrapeops-scrapy-proxy-sdk#integrating-into-your-scrapy-project).
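Based on the SDK's integration guide, enabling the proxy typically amounts to adding something like the following to `settings.py`. The middleware path and priority are taken from the SDK's README and may change between versions:

```python
# ScrapeOps rotating-proxy settings -- sketch based on the SDK's README.
SCRAPEOPS_API_KEY = "YOUR_API_KEY"  # placeholder, substitute your own key
SCRAPEOPS_PROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    "scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk": 725,
}
```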
**Navigate to the Steam Scraper Directory**

```bash
cd steam_scraper
```

**Run the Scraper**

```bash
scrapy crawl infinite_scroll -o steam_best_sellers.csv
```

**Convert CSV to PDF**

```bash
python csv_to_pdf.py
```
**Navigate to the Inshorts Scraper Directory**

```bash
cd inshorts_scraper
```

**Run the Scraper**

```bash
scrapy crawl inshorts -o inshorts_news.csv
```
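Once the crawl finishes, the export can be sanity-checked with the standard library's `csv` module. The helper below is illustrative; the column names correspond to the article fields the spider extracts:

```python
import csv

def preview_rows(path: str, limit: int = 3) -> list:
    # Return the first `limit` rows of a Scrapy CSV export as dicts,
    # keyed by the header row (e.g. title, content, author, timestamp).
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [row for _, row in zip(range(limit), reader)]
```

Calling `preview_rows("inshorts_news.csv")` returns the first few scraped articles as dictionaries for a quick inspection.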
To understand the directory structure, clone this repo and run `tree` in your terminal.
- **Fork the Repository**
- **Create a New Branch**

  ```bash
  git checkout -b feature-branch
  ```

- **Make Changes and Commit**

  ```bash
  git add .
  git commit -m "Description of changes"
  ```

- **Push to Your Fork**

  ```bash
  git push origin feature-branch
  ```

- **Create a Pull Request**
For any questions or suggestions, feel free to contact me at [[email protected]].