Web Scraping Project

This project demonstrates web scraping with Scrapy to extract data from multiple websites, including Reddit, the Steam gaming platform, and the Inshorts news site. The extracted data is then processed and saved in various formats, including CSV and PDF.

Project Description

The project consists of the following main parts:

  1. Reddit Scraper: Scrapes posts, comments, and metadata from a specified subreddit.
  2. Steam Scraper: Scrapes top-selling game data from the Steam platform.
  3. Inshorts Scraper: Scrapes news articles from Inshorts.

Reddit Scraper

  • A Scrapy spider designed to scrape posts and comments from a specified subreddit on https://old.reddit.com.
  • The spider extracts post titles, links, and comments, storing them in a structured format.
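
For reference, the sketch below shows the general shape of such a spider. It is a minimal illustration, not the spider shipped in this repository: the CSS selectors are assumptions about old.reddit.com's markup and the subreddit argument is hypothetical, so adjust both to match the actual code.

    # Minimal sketch of a subreddit spider (illustrative; selectors are
    # assumptions about old.reddit.com's markup and may need adjusting).
    import scrapy

    class SubredditDataSpider(scrapy.Spider):
        name = "subreddit_data"

        def __init__(self, subreddit="python", *args, **kwargs):
            # "subreddit" is a hypothetical argument; pass it with -a subreddit=<name>
            super().__init__(*args, **kwargs)
            self.start_urls = [f"https://old.reddit.com/r/{subreddit}/"]

        def parse(self, response):
            # Each post on old.reddit.com sits in a div with class "thing"
            for post in response.css("div.thing"):
                yield {
                    "title": post.css("a.title::text").get(),
                    "link": response.urljoin(post.css("a.title::attr(href)").get()),
                    "comments_link": post.css("a.comments::attr(href)").get(),
                }
            # Follow pagination if a "next" button is present
            next_page = response.css("span.next-button a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)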

Steam Scraper

  • Extracts data such as game name, game URL, image URL, release date, price, and review summary.
  • Saves the extracted data into a CSV file.
  • Converts the CSV data into a formatted PDF.
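
The fields above map naturally onto a Scrapy Item. The definition below only illustrates that mapping; the field names are assumptions, and the repository's own items.py may define them differently.

    # Illustrative Scrapy Item for the fields listed above (field names are
    # assumptions; the repo's items.py may differ).
    import scrapy

    class SteamGameItem(scrapy.Item):
        game_name = scrapy.Field()
        game_url = scrapy.Field()
        image_url = scrapy.Field()
        release_date = scrapy.Field()
        price = scrapy.Field()
        review_summary = scrapy.Field()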

Inshorts Scraper

  • Extracts news articles including titles, content, author, and timestamp.
  • Saves the extracted data into a CSV file.
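
A parse callback for these fields might look like the sketch below. The CSS selectors are assumptions about inshorts.com's markup and are not taken from this repository, so treat them as a starting point only.

    # Illustrative Inshorts spider (selectors are assumptions about
    # inshorts.com's markup and may not match the repo's spider).
    import scrapy

    class InshortsSpider(scrapy.Spider):
        name = "inshorts"
        start_urls = ["https://inshorts.com/en/read"]

        def parse(self, response):
            # Each headline on the page is rendered as a "news card"
            for card in response.css("div.news-card"):
                yield {
                    "title": card.css("[itemprop='headline']::text").get(),
                    "content": card.css("[itemprop='articleBody']::text").get(),
                    "author": card.css("span.author::text").get(),
                    "timestamp": card.css("span.time::text").get(),
                }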

Overview of How a Web Scraper Works

Flow diagram of a Scraping bot

Setup Instructions

  1. Clone the Repository

    git clone https://github.com/ManikSinghSarmaal/Web-Scraping
    cd Web-Scraping
  2. Create and Activate a Virtual Environment

    # On macOS and Linux
    python3 -m venv venv
    source venv/bin/activate
    
    # On Windows
    python -m venv venv
    venv\Scripts\activate
  3. Install Required Packages

    pip install -r requirements.txt
  4. Configure Scrapy Settings

    • Ensure settings.py is configured correctly for each Scrapy spider; a typical set of options is sketched below.
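
The values below are examples of commonly tuned Scrapy settings, not the repository's actual configuration; adjust them per spider.

    # Example settings.py values (illustrative, not the repo's actual config)
    BOT_NAME = "steam_scraper"

    ROBOTSTXT_OBEY = True                  # respect robots.txt
    DOWNLOAD_DELAY = 1                     # throttle requests to be polite
    CONCURRENT_REQUESTS_PER_DOMAIN = 4     # keep per-site concurrency low
    USER_AGENT = "Mozilla/5.0 (compatible; WebScrapingBot/1.0)"

    FEED_EXPORT_ENCODING = "utf-8"         # keep CSV output readable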

Usage

Running the Subreddit Scraper

  1. Navigate to the Reddit Scraper Directory

    cd subreddit
  2. Run the Scraper

    scrapy crawl subreddit_data -o data.csv

Note

If you want to use a rotating proxy to avoid getting banned as the request count increases, create an account on ScrapeOps, add your API key, and uncomment the relevant lines. For more information on using scrapeops-scrapy-proxy-sdk, see https://github.com/ScrapeOps/scrapeops-scrapy-proxy-sdk#integrating-into-your-scrapy-project
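
For reference, the integration described in the linked SDK README amounts to a few settings.py entries along the lines of the snippet below; treat the linked documentation as authoritative if the package has changed.

    # settings.py additions for the ScrapeOps proxy SDK (see the linked README
    # for the authoritative, up-to-date integration steps)
    SCRAPEOPS_API_KEY = "YOUR_API_KEY"     # replace with your own key
    SCRAPEOPS_PROXY_ENABLED = True

    DOWNLOADER_MIDDLEWARES = {
        "scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk": 725,
    }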

Running the Steam Scraper

  1. Navigate to the Steam Scraper Directory

    cd steam_scraper
  2. Run the Scraper

    scrapy crawl infinite_scroll -o steam_best_sellers.csv
  3. Convert CSV to PDF

    python csv_to_pdf.py
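
The repository's csv_to_pdf.py handles this conversion. As an illustration of the idea only (not the actual script, which may use a different library or layout), a CSV-to-PDF conversion with reportlab can look like this:

    # Sketch of a CSV-to-PDF conversion (illustrative; assumes reportlab is
    # installed: pip install reportlab)
    import csv

    from reportlab.lib import colors
    from reportlab.lib.pagesizes import A4, landscape
    from reportlab.platypus import SimpleDocTemplate, Table, TableStyle

    def csv_to_pdf(csv_path, pdf_path):
        # Read every row from the CSV, header included
        with open(csv_path, newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f))

        # Lay the rows out as a simple grid in a landscape A4 PDF
        doc = SimpleDocTemplate(pdf_path, pagesize=landscape(A4))
        table = Table(rows, repeatRows=1)
        table.setStyle(TableStyle([
            ("GRID", (0, 0), (-1, -1), 0.25, colors.grey),
            ("BACKGROUND", (0, 0), (-1, 0), colors.lightgrey),
        ]))
        doc.build([table])

    if __name__ == "__main__":
        csv_to_pdf("steam_best_sellers.csv", "steam_best_sellers.pdf")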

Running the Inshorts Scraper

  1. Navigate to the Inshorts Scraper Directory

    cd inshorts_scraper
  2. Run the Scraper

    scrapy crawl inshorts -o inshorts_news.csv

File Structure

Clone this repo and run the tree command in your terminal to view the directory structure.

Demo

Sample of the scraped CSV data, as in steam_scraper/steam_bestsellers_ALL.csv.

Contribution Guidelines

  1. Fork the Repository
  2. Create a New Branch
    git checkout -b feature-branch
  3. Make Changes and Commit
    git commit -m "Description of changes"
  4. Push to Your Fork
    git push origin feature-branch
  5. Create a Pull Request

Contact

For any questions or suggestions, feel free to contact me at [[email protected]].
