Crawler | Web Scraping | Browser Automation

Below are the details and instructions for setting up and using the project.

Objective

The objective of this TypeScript (TS) assignment is to demonstrate web scraping techniques using Puppeteer to extract data from various websites.

Technologies Used

Node.js
TypeScript (TS)
Puppeteer

Description

This TypeScript (TS) script uses Puppeteer, a Node.js library for browser automation, to scrape data from different websites. It employs headless browsing and request interception to reduce resource usage and bypass security measures.

The script performs the following tasks:

Launches a headless browser instance using Puppeteer.
Navigates to multiple websites to extract specific data.
Filters out unnecessary resources to improve efficiency.
Saves the extracted data to a text file in an organized, human-readable format.
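
As a rough illustration of the resource-filtering step, the sketch below shows one way request interception can drop requests that text extraction does not need. The list of blocked resource types and the names `BLOCKED_TYPES`, `shouldBlock`, `handleRequest`, and `InterceptableRequest` are assumptions made for this example; the actual logic in scrap.tsx may differ.

```typescript
// Resource types that rarely matter for text extraction.
// This particular set is an assumption, not the project's actual list.
const BLOCKED_TYPES: ReadonlySet<string> = new Set(["image", "stylesheet", "font", "media"]);

// Minimal structural view of a Puppeteer request: only the three methods
// used here, so the sketch type-checks without importing puppeteer.
interface InterceptableRequest {
  resourceType(): string;
  abort(): Promise<void>;
  continue(): Promise<void>;
}

// Decide whether a request of the given type should be dropped.
function shouldBlock(resourceType: string): boolean {
  return BLOCKED_TYPES.has(resourceType);
}

// Abort blocked resource types and let everything else through.
function handleRequest(request: InterceptableRequest): Promise<void> {
  return shouldBlock(request.resourceType()) ? request.abort() : request.continue();
}

// Wiring with Puppeteer (requires the puppeteer package):
//   const browser = await puppeteer.launch(); // headless by default
//   const page = await browser.newPage();
//   await page.setRequestInterception(true);
//   page.on("request", handleRequest);
```

Keeping the decision in a small pure function (`shouldBlock`) makes the blocked list easy to adjust and to check without launching a browser.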

Project Structure

crawler/: Project folder.

dist/: Contains the compiled JavaScript file (scrap.js).

package.json: Project metadata and dependencies.

posts.txt: Text file to store scraped data.

README.md: Documentation file (this file).

scrap.tsx: TypeScript file for the web scraping logic.

tsconfig.json: TypeScript configuration file.

Installation

Clone the project repository to your local machine (if you are accessing this from GitHub).
Navigate to the project directory in your terminal.

cd crawler

Install dependencies using npm.

npm install

Running the Project

Compile the TypeScript file to JavaScript using TypeScript Compiler (tsc).

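tsc scrap.tsx --outDir dist

This will generate the compiled JavaScript file scrap.js in the dist/ folder. When a file is named on the command line, tsc ignores tsconfig.json, so the output directory is passed explicitly.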

Run the compiled JavaScript file.

node dist/scrap.js

This will execute the web scraping script and write the scraped data to posts.txt.
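
To give a sense of what organized, human-readable output in posts.txt could look like, here is a minimal sketch that groups scraped items by site and renders them as numbered lists. The `ScrapedPost` shape, the `formatPosts` name, and the layout are all hypothetical; the real script may collect different fields and use a different format.

```typescript
// Hypothetical shape of one scraped item; the real script may differ.
interface ScrapedPost {
  source: string; // site the post came from
  title: string;
  url: string;
}

// Group posts by source (in first-seen order) and render each group
// as a headed, numbered list.
function formatPosts(posts: ScrapedPost[]): string {
  const bySource = new Map<string, ScrapedPost[]>();
  for (const post of posts) {
    const group = bySource.get(post.source) ?? [];
    group.push(post);
    bySource.set(post.source, group);
  }

  const sections: string[] = [];
  bySource.forEach((group, source) => {
    const lines = group.map((p, i) => `${i + 1}. ${p.title.trim()}\n   ${p.url}`);
    sections.push(`== ${source} ==\n${lines.join("\n")}`);
  });
  return sections.join("\n\n") + "\n";
}
```

The resulting string can then be written in one call, for example with `writeFileSync("posts.txt", formatPosts(posts))` from Node's `fs` module.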

Usage

The scrap.tsx file contains the web scraping logic using Puppeteer. The posts.txt file stores the scraped data. Modify the scrap.tsx file to adjust the scraping logic as needed. Run the project to fetch data from specified websites and store it in posts.txt. Feel free to reach out if you need further assistance!
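
One possible way to keep the scraping logic easy to adjust is to describe each site as data and reuse a single extraction function. The sketch below assumes that approach; the `Target` shape, the example URL and selector, and the `collectText` helper are placeholders for illustration, not values taken from scrap.tsx.

```typescript
// Hypothetical target list: names, URLs, and selectors are placeholders.
interface Target {
  name: string;
  url: string;
  selector: string; // CSS selector matching the elements to extract
}

const targets: Target[] = [
  { name: "Example Blog", url: "https://example.com/blog", selector: "article h2" },
];

// Intended to be passed to page.$$eval, which runs it inside the page,
// so it must not reference variables from the surrounding Node.js scope.
// Collapses whitespace in each element's text and drops empty results.
function collectText(elements: { textContent: string | null }[]): string[] {
  return elements
    .map((el) => (el.textContent ?? "").replace(/\s+/g, " ").trim())
    .filter((text) => text.length > 0);
}

// Wiring with Puppeteer, for each target:
//   await page.goto(target.url, { waitUntil: "domcontentloaded" });
//   const titles = await page.$$eval(target.selector, collectText);
```

With this layout, adding a site means appending one entry to `targets` rather than editing the extraction code.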
