This project is a Reddit scraper that retrieves posts from a given subreddit. The retrieved data includes key details such as the post title, author, number of comments, timestamp, link, and score.
- Scrapes posts from a specified subreddit.
- Retrieves key details such as title, author, number of comments, timestamp, link, and score.
- Uses direct HTTP requests without relying on browser automation tools or the official Reddit API.
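
The fields listed above could be held in a small record type. The sketch below is only an assumption about how one scraped post might be represented; the `Post` class and its field names are hypothetical and are not taken from the project's source.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class Post:
    """One scraped post (hypothetical shape; field names are assumptions)."""
    title: str
    author: str
    num_comments: int
    timestamp: datetime
    link: str
    score: int


# Example record with made-up values, for illustration only.
example = Post(
    title="Example title",
    author="example_user",
    num_comments=12,
    timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
    link="https://www.reddit.com/r/python/comments/example/",
    score=42,
)

# Convert to a plain dict, which is convenient for CSV or JSON output.
row = asdict(example)
```
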
- Python 3.x
- `requests` library
- `beautifulsoup4` library
- Clone the repository and change into the directory it creates:

      git clone https://github.com/Korkii/RedditScraper
      cd RedditScraper

- Create a virtual environment:

      python -m venv venv

- Activate the virtual environment:
  - On Windows:

        venv\Scripts\activate

  - On macOS/Linux:

        source venv/bin/activate

- Install the required packages:

      pip install -r requirements.txt

- Run the script with the desired arguments:

      python src/main.py <subreddit> <sort> <max_posts>

  - `<subreddit>`: The subreddit to scrape (e.g., `python`)
  - `<sort>`: The sorting order of posts (`new`, `hot`, or `old`)
  - `<max_posts>`: The maximum number of posts to scrape (e.g., `100`)

Example:
    python src/main.py vim hot 100
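
The three positional arguments could be read with the standard `argparse` module. This is a minimal sketch, not the project's actual `src/main.py`; the `parse_args` helper is a hypothetical name, and the allowed `sort` values are taken from the list above.

```python
import argparse


def parse_args(argv=None):
    """Parse <subreddit> <sort> <max_posts> (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Scrape posts from a subreddit.")
    parser.add_argument("subreddit", help="subreddit to scrape, e.g. python")
    parser.add_argument("sort", choices=["new", "hot", "old"],
                        help="sorting order of posts")
    parser.add_argument("max_posts", type=int,
                        help="maximum number of posts to scrape")
    return parser.parse_args(argv)


# Equivalent to: python src/main.py vim hot 100
args = parse_args(["vim", "hot", "100"])
print(args.subreddit, args.sort, args.max_posts)  # prints: vim hot 100
```

Using `choices` makes `argparse` reject an unsupported sort order with a usage message, and `type=int` does the same for a non-numeric `max_posts`.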
- This scraper uses direct HTTP requests to fetch the HTML content of the subreddit pages.
- The `BeautifulSoup` library is used to parse the HTML and extract the required data.
- The script handles pagination to retrieve the specified number of posts from the subreddit.
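
The pagination described in the notes above can be sketched as a loop that keeps requesting pages until enough posts are collected. This is an illustration under assumptions, not the project's implementation: `collect_posts`, `PageFetcher`, and the cursor-based interface are hypothetical. In a real run, the fetcher would presumably wrap an HTTP request and the `BeautifulSoup` parsing step; here it is replaced by fake data so the loop can be tried offline.

```python
from typing import Callable, List, Optional, Tuple

# A page fetcher takes a cursor (None for the first page) and returns
# (posts_on_page, next_cursor). next_cursor is None when no pages remain.
PageFetcher = Callable[[Optional[str]], Tuple[List[dict], Optional[str]]]


def collect_posts(fetch_page: PageFetcher, max_posts: int) -> List[dict]:
    """Follow next-page cursors until max_posts are collected or pages run out."""
    posts: List[dict] = []
    cursor: Optional[str] = None
    while len(posts) < max_posts:
        page, cursor = fetch_page(cursor)
        posts.extend(page)
        # Stop on an empty page or when there is no further page to request.
        if not page or cursor is None:
            break
    # The last page may overshoot, so trim to the requested count.
    return posts[:max_posts]


# Stand-in fetcher with made-up data (three pages, five posts in total).
FAKE_PAGES = {
    None: ([{"title": "a"}, {"title": "b"}], "p2"),
    "p2": ([{"title": "c"}, {"title": "d"}], "p3"),
    "p3": ([{"title": "e"}], None),
}


def fake_fetch(cursor: Optional[str]) -> Tuple[List[dict], Optional[str]]:
    return FAKE_PAGES[cursor]


first_three = collect_posts(fake_fetch, 3)
print([p["title"] for p in first_three])  # prints: ['a', 'b', 'c']
```

Passing the fetcher in as a parameter keeps the loop independent of how pages are downloaded and parsed, which also makes it easy to test without network access.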