Wikipedia Web Crawler

A Node.js tool that crawls Wikipedia articles to build a knowledge graph and extract key information. It uses JigsawStack's AI Web Scraper to intelligently parse Wikipedia articles and follow related links, creating a connected map of knowledge around a topic.

Features

  • 🔍 Smart Content Extraction - Extracts article titles, introductions, key concepts, and image captions
  • 🕸️ Knowledge Graph Building - Creates a graph of related concepts by following relevant links
  • 🧠 Intelligent Link Selection - Prioritizes links most relevant to the current article
  • 🌐 English Wikipedia Focus - Filters links so that only English Wikipedia articles are processed (see the sketch after this list)
  • ⚡ Efficient Crawling - Configurable depth and breadth limits, with built-in delays to respect Wikipedia's servers
  • 🔄 Error Handling - Retries failed requests with simplified parameters
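
The English-only filter can be implemented as a simple check on each link's host and path. The helper below is a minimal sketch of that idea; the function name and the namespace check are illustrative and not taken from the crawler's actual code.

// Minimal sketch of an English-Wikipedia link filter (illustrative, not the
// crawler's actual implementation). Keeps only article links on
// en.wikipedia.org and drops special namespaces such as File: or Category:.
function isEnglishWikipediaArticle(url) {
  try {
    const { hostname, pathname } = new URL(url);
    if (hostname !== "en.wikipedia.org") return false;
    if (!pathname.startsWith("/wiki/")) return false;
    const title = decodeURIComponent(pathname.slice("/wiki/".length));
    return !title.includes(":"); // skip File:, Category:, Help:, etc.
  } catch {
    return false; // not a valid URL at all
  }
}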

Installation

Clone the repository:

git clone https://github.com/yourusername/wikipedia-knowledge-crawler.git
cd wikipedia-knowledge-crawler

Install dependencies:

npm install jigsawstack

Configuration

Before using the crawler, you need to sign up for a JigsawStack API key at jigsawstack.com.

Create a .env file in the project root:

JIGSAWSTACK_API_KEY=your_api_key_here
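
If you load the key yourself in code (for example with the dotenv package, which is an extra install not listed above), initialization might look like the sketch below. The JigsawStack({ apiKey }) factory shape is assumed from the jigsawstack SDK and should be checked against its current documentation.

// Sketch: read the API key from .env and create a JigsawStack client.
// Assumes the dotenv package is installed (npm install dotenv) and that the
// jigsawstack SDK exposes a JigsawStack({ apiKey }) factory - verify both
// against the documentation for your installed versions.
require("dotenv").config();
const { JigsawStack } = require("jigsawstack");

const jigsaw = JigsawStack({
  apiKey: process.env.JIGSAWSTACK_API_KEY,
});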

Usage

JavaScript API

// Basic usage
const crawler = require('./wikipedia-crawler');

// Start crawling from a specific Wikipedia article
crawler.start({
  seedUrl: "https://en.wikipedia.org/wiki/Machine_learning",
  maxDepth: 1,             // How many links deep to crawl
  maxArticlesPerLevel: 3,  // How many links to follow from each article
  outputFile: "output.json" // Optional: save results to a file
});

Command Line Usage

Run with default settings (Machine Learning as seed topic):

node index.js

Run with custom topic:

node index.js --seed="https://en.wikipedia.org/wiki/Artificial_intelligence" --depth=2 --breadth=5

Example Output

The crawler generates detailed output, including:

  • Article Summaries - Introduction and key concepts for each article
  • Knowledge Graph - Connections between related articles
  • Crawl Statistics - Number of articles, depth, crawl time

Example console output:

=== WIKIPEDIA KNOWLEDGE CRAWLER RESULTS ===

Total articles crawled: 4
Seed article: https://en.wikipedia.org/wiki/Machine_learning
Crawl time: February 26, 2025, 3:45:12 PM

=== ARTICLE SUMMARIES ===

1. Machine learning (Depth: 0)
   URL: https://en.wikipedia.org/wiki/Machine_learning
   Introduction: Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data...
   Key Concepts: Neural networks, supervised learning, unsupervised learning, reinforcement learning

2. Artificial intelligence (Depth: 1)
   URL: https://en.wikipedia.org/wiki/Artificial_intelligence
   Introduction: Artificial intelligence (AI) is the intelligence of machines or software, as opposed to the intelligence of humans or animals. It is a field of study in computer science that develops and studies intelligent machines...
   Key Concepts: Machine learning, neural networks, natural language processing, computer vision

3. Neural network (Depth: 1)
   URL: https://en.wikipedia.org/wiki/Neural_network
   Introduction: A neural network is a network or circuit of biological neurons, or, in a modern sense, an artificial neural network, composed of artificial neurons or nodes...
   Image: A simple neural network with two inputs, one hidden layer with two nodes, and one output

=== KNOWLEDGE GRAPH STATISTICS ===

Nodes: 4
Connections: 3

Saved Output Format

When saving to a file, the crawler produces the following JSON structure:

{
  "articles": [
    {
      "title": "Machine learning",
      "url": "https://en.wikipedia.org/wiki/Machine_learning",
      "introduction": "Machine learning (ML) is a field of study...",
      "keyConcepts": "Neural networks, supervised learning...",
      "articleSubject": "Machine_learning",
      "depth": 0,
      "crawlDate": "2025-02-26T15:45:12.000Z"
    },
    ...
  ],
  "knowledgeGraph": {
    "nodes": [
      {"id": "https://en.wikipedia.org/wiki/Machine_learning", "label": "Machine learning", "depth": 0},
      ...
    ],
    "edges": [
      {"source": "https://en.wikipedia.org/wiki/Machine_learning", "target": "https://en.wikipedia.org/wiki/Artificial_intelligence"},
      ...
    ]
  },
  "metadata": {
    "seedUrl": "https://en.wikipedia.org/wiki/Machine_learning",
    "crawlTime": "2025-02-26T15:45:12.000Z",
    "maxDepth": 1,
    "maxArticlesPerLevel": 3
  }
}
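
Because the saved file is plain JSON, downstream scripts can consume it directly. A minimal sketch, assuming the crawl was saved to output.json with the structure shown above:

// Load a saved crawl and print each knowledge-graph edge as "source -> target".
const fs = require("fs");

const crawl = JSON.parse(fs.readFileSync("output.json", "utf8"));

console.log(`Articles crawled: ${crawl.articles.length}`);
for (const edge of crawl.knowledgeGraph.edges) {
  console.log(`${edge.source} -> ${edge.target}`);
}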

Advanced Configuration

The crawler accepts several configuration options:

{
  seedUrl: "https://en.wikipedia.org/wiki/Machine_learning", // Starting point
  maxDepth: 2,             // How many links deep to follow
  maxArticlesPerLevel: 5,  // Maximum links to follow from each article
  delayBetweenRequests: 3000, // Milliseconds to wait between requests
  timeout: 12000,          // Milliseconds before timing out a request
  outputFile: "output.json", // File to save results (optional)
  retryFailedRequests: true // Whether to retry failed requests
}
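
These options are presumably passed to crawler.start the same way as in the basic example above; a fuller invocation might look like this (values are illustrative):

// Passing the full option set to crawler.start (values are illustrative).
const crawler = require('./wikipedia-crawler');

crawler.start({
  seedUrl: "https://en.wikipedia.org/wiki/Machine_learning",
  maxDepth: 2,
  maxArticlesPerLevel: 5,
  delayBetweenRequests: 3000, // be polite to Wikipedia's servers
  timeout: 12000,
  outputFile: "output.json",
  retryFailedRequests: true
});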

Limitations

  • The crawler only works with English Wikipedia articles
  • Performance depends on JigsawStack's AI web scraper capabilities
  • Wikipedia's structure may change, potentially affecting extraction quality
  • Users should respect Wikipedia's terms of service and avoid excessive crawling

Dependencies

  • jigsawstack - the JigsawStack Node.js SDK for the AI Web Scraper (installed above via npm install jigsawstack)
