Java WebCrawler

This is a multithreaded web crawler written in Java that uses the JSoup library to fetch and parse web pages. Each crawler starts from a URL and follows links up to a maximum depth of 5, printing the title of each page it visits. Several crawlers run in parallel, each on its own thread, to speed up the process.

Features

Multithreaded: Each web crawler runs on its own thread, allowing for parallel web scraping.
Depth-limited crawling: The crawler stops after reaching a specified depth of 5 to avoid infinite loops.
Page Title Extraction: The crawler prints the title of each web page it visits.
Unique URL Visits: Each crawler keeps a list of the URLs it has visited, so it does not visit the same URL twice in a session.

Project Structure

The project contains two main classes:

WebCrawl: Implements the crawling logic. Each instance runs on its own thread and follows links up to a depth of 5.
Main: The entry point of the program. It creates multiple web crawlers and manages their execution.

Installation

Prerequisites

Java Development Kit (JDK) version 8 or higher
Maven or Gradle (optional, for dependency management)
JSoup Library, version 1.13.1 or higher

Steps

Clone the repository or copy the source code:

git clone https://github.com/yourusername/JavaWebCrawler.git

Download JSoup: If using Maven, add the following dependency in your pom.xml file:
```
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
```
If using Gradle, add this line to your build.gradle:
```
implementation 'org.jsoup:jsoup:1.13.1'
```
Alternatively, download the JSoup JAR manually and add it to your project’s classpath.

Compile and run the program:

javac -cp jsoup-1.13.1.jar WebCrawler/*.java
java -cp .:jsoup-1.13.1.jar WebCrawler.Main

On Windows, use ; instead of : as the classpath separator in the java command.

How It Works

WebCrawl Class

Constructor:
- Takes the starting URL and a unique identifier for the crawler.
- Starts a new thread to run the crawling process.
crawl(int level, String url):
- Recursively crawls through the links found on the given URL until the maximum depth (5) is reached.
request(String url):
- Fetches the document from the given URL using JSoup.
- If the page was fetched successfully, prints the status code and the title, and adds the URL to the visited list.
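
The recursion described above can be sketched without any network access. This is a minimal illustration, not the project's actual code: the class name DepthLimitedCrawlSketch and the in-memory link graph are assumptions that stand in for WebCrawl and the JSoup fetch, and the sketch assumes the starting page counts as level 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for WebCrawl: an in-memory link graph replaces the
// JSoup fetch so the sketch runs without network access.
class DepthLimitedCrawlSketch {
    static final int MAX_DEPTH = 5; // depth limit stated in this README

    private final Map<String, List<String>> links; // page URL -> URLs it links to
    private final List<String> visited = new ArrayList<>();

    DepthLimitedCrawlSketch(Map<String, List<String>> links) {
        this.links = links;
    }

    // Visits url, then recurses into its links while the depth limit allows.
    void crawl(int level, String url) {
        if (level > MAX_DEPTH || visited.contains(url)) {
            return; // too deep, or already seen by this crawler
        }
        visited.add(url); // a real request(url) would fetch the page and print its title here
        for (String next : links.getOrDefault(url, Collections.<String>emptyList())) {
            crawl(level + 1, next);
        }
    }

    List<String> visited() {
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("a", Arrays.asList("b", "c"));
        graph.put("b", Arrays.asList("a")); // link back to the start page
        DepthLimitedCrawlSketch crawler = new DepthLimitedCrawlSketch(graph);
        crawler.crawl(1, "a");
        System.out.println(crawler.visited()); // prints [a, b, c]
    }
}
```

The visited check is what stops the a -> b -> a cycle here; the depth check alone would only cut it off after five levels.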

Main Class

Main Method:
- Initializes multiple web crawlers with different starting URLs.
- Calls join() on each crawler's thread so that the main thread waits for every crawler to finish before exiting.
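
The start-then-join pattern described above might look roughly like the following. It is a sketch under stated assumptions: CrawlerThreadSketch, getThread() and isFinished() are hypothetical names, and run() only prints a line instead of crawling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for WebCrawl: run() only prints, it does not fetch anything.
class CrawlerThreadSketch implements Runnable {
    private final String startUrl;
    private final int id;
    private final Thread thread;
    private volatile boolean finished = false;

    CrawlerThreadSketch(String startUrl, int id) {
        this.startUrl = startUrl;
        this.id = id;
        this.thread = new Thread(this);
        this.thread.start(); // the constructor starts the thread, as described above
    }

    @Override
    public void run() {
        System.out.println("Bot ID: " + id + " starting at " + startUrl);
        finished = true;
    }

    Thread getThread() {
        return thread;
    }

    boolean isFinished() {
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        List<CrawlerThreadSketch> bots = new ArrayList<>();
        bots.add(new CrawlerThreadSketch("https://www.wikipedia.org/", 1));
        bots.add(new CrawlerThreadSketch("https://timesofindia.indiatimes.com/", 2));
        bots.add(new CrawlerThreadSketch("https://www.cricbuzz.com/", 3));
        for (CrawlerThreadSketch bot : bots) {
            bot.getThread().join(); // block until this crawler's thread has ended
        }
        System.out.println("All crawlers finished");
    }
}
```

Because all three threads are started before the first join() call, the crawlers run concurrently and the order of their output lines can vary between runs.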

Example

In the Main class, three instances of WebCrawl are created, each starting at a different website:

https://www.wikipedia.org/
https://timesofindia.indiatimes.com/
https://www.cricbuzz.com/

Each crawler will explore up to a depth of 5 and print the titles of the web pages it visits.

Output Example:

WebCrawler Created Successfully
WebCrawler Created Successfully
WebCrawler Created Successfully

Bot ID: 1 Recieved webpage link at : https://www.wikipedia.org/
Wikipedia

Bot ID: 1 Recieved webpage link at : https://www.wikibooks.org/
Wikibooks

Bot ID: 2 Recieved webpage link at : https://timesofindia.indiatimes.com/
Times of India

Bot ID: 3 Recieved webpage link at : https://www.cricbuzz.com/
Cricbuzz

Customization

To customize the starting URLs or add more web crawlers:

Open the Main.java file.

Add more instances of WebCrawl with the desired URLs:

bot.add(new WebCrawl("https://example.com/", 4));

Limitations

Visited URLs are not shared between crawlers, so different crawlers may fetch the same page or follow the same loop of links.
The program may fail on sites that block web crawlers (for example, with CAPTCHAs or IP rate limiting).
External links (URLs outside the base domain) are not filtered.
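
As an illustration of how the first and third limitations could be approached, the sketch below uses one thread-safe set shared by all crawlers and a same-host check. None of this is in the current code: SharedVisitedSketch, claim and sameHost are hypothetical names, and comparing hosts exactly is only one possible definition of "base domain".

```java
import java.net.URI;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of the project: shared de-duplication and host filtering.
class SharedVisitedSketch {
    // One thread-safe set shared by every crawler thread.
    private static final Set<String> VISITED = ConcurrentHashMap.newKeySet();

    // Returns true only for the first caller that claims the URL.
    static boolean claim(String url) {
        return VISITED.add(url);
    }

    // True when candidate has exactly the same host as base (case-insensitive).
    static boolean sameHost(String base, String candidate) {
        try {
            String baseHost = URI.create(base).getHost();
            String host = URI.create(candidate).getHost();
            return baseHost != null && baseHost.equalsIgnoreCase(host);
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: treat it as external
        }
    }

    public static void main(String[] args) {
        String base = "https://www.wikipedia.org/";
        System.out.println(claim(base));                                  // true
        System.out.println(claim(base));                                  // false
        System.out.println(sameHost(base, "https://www.wikibooks.org/")); // false
    }
}
```

A crawler would call claim(url) before fetching and skip the URL when it returns false; because add on the concurrent set is atomic, two threads cannot both claim the same URL.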

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

JSoup Library for easy HTML parsing and web scraping in Java.
