Below are the details and instructions for setting up and using the project.
This TypeScript (TS) assignment demonstrates web scraping techniques, using Puppeteer to extract data from various websites.
- Node.js
- TypeScript (TS)
- Puppeteer
This TypeScript script uses Puppeteer, a Node.js library, to scrape data from different websites. It runs a headless browser and intercepts requests to cut bandwidth and speed up page loads.
The script performs the following tasks:
- Launches a headless browser instance using Puppeteer.
- Navigates to multiple websites to extract specific data.
- Filters out unnecessary resources to improve efficiency.
- Saves the extracted data to a text file in an organized, human-readable format.
crawler/: Project folder.
dist/: Contains the compiled JavaScript file (scrap.js) after compilation.
package.json: Project metadata and dependencies.
posts.txt: Text file to store scraped data.
README.md: Documentation file (this file).
scrap.tsx: TypeScript file for the web scraping logic.
tsconfig.json: TypeScript configuration file.
- Clone the project repository to your local machine (if you are viewing this on GitHub).
- Navigate to the project directory in your terminal.
cd crawler
- Install dependencies using npm.
npm install
- Compile the TypeScript source to JavaScript using the TypeScript Compiler (tsc). Run tsc without a file argument so the settings in tsconfig.json (including the dist/ output directory) are applied.
tsc
This will generate the compiled JavaScript file scrap.js in the dist/ folder. (Passing a file name directly, as in tsc scrap.tsx, makes tsc ignore tsconfig.json, so the output would not land in dist/.)
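For reference, a tsconfig.json along these lines would direct compiled output to dist/; the exact settings used in this project may differ:

```json
{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "outDir": "dist",
    "strict": true,
    "esModuleInterop": true
  },
  "include": ["scrap.tsx"]
}
```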
- Run the compiled JavaScript file.
node dist/scrap.js
This will execute the web scraping script and write the scraped data to posts.txt.
The scrap.tsx file contains the web scraping logic built on Puppeteer, and posts.txt stores the scraped data. Modify scrap.tsx to adjust the scraping logic as needed, then recompile and run the project to fetch data from the specified websites into posts.txt. Feel free to reach out if you need further assistance!
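When adjusting the logic, a small formatting helper keeps the posts.txt output organized. The Post shape and field names below are hypothetical, not the actual structure used in scrap.tsx:

```typescript
import { appendFileSync } from 'node:fs';

// Hypothetical shape of one scraped item; adapt to what scrap.tsx extracts.
interface Post {
  title: string;
  url: string;
}

// Render one post as a readable block of text.
export function formatPost(post: Post): string {
  return `Title: ${post.title}\nURL: ${post.url}\n---\n`;
}

// Append formatted posts to the output file (posts.txt by default).
export function savePosts(posts: Post[], file = 'posts.txt'): void {
  appendFileSync(file, posts.map(formatPost).join(''), 'utf8');
}
```

Appending rather than overwriting lets data scraped from several websites accumulate in one file, with a separator line between entries.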