
Webpage Scraper Tool

Overview

The Webpage Scraper Tool is a utility within the Atomic Agents ecosystem designed for scraping web content and converting it to markdown format. It includes features for extracting metadata and cleaning up the content for better readability.

Prerequisites and Dependencies

  • Python 3.9 or later
  • atomic-agents (see the main Atomic Agents README for installation instructions)
  • pydantic
  • requests
  • beautifulsoup4
  • markdownify
  • readability-lxml

Installation

You can install the tool using any of the following options:

  1. Using the CLI tool that comes with Atomic Agents: simply run atomic and select this tool from the list of available tools. You will then be asked for a target directory to download the tool into.
  2. Good old-fashioned copy/paste: just like any other tool inside the Atomic Forge, you can copy the code from this repo directly into your own project, provided you already have atomic-agents installed according to the instructions in the main README.

Configuration

Parameters

  • user_agent (str): User agent string to use for HTTP requests. Defaults to a Chrome/Windows user agent string.
  • timeout (int): Timeout in seconds for HTTP requests. Defaults to 30.
  • max_content_length (int): Maximum content length in bytes to process. Defaults to 1,000,000.

Example

config = WebpageScraperToolConfig(
    user_agent="Custom User Agent String",
    timeout=60,
    max_content_length=2_000_000
)
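
The resulting configuration object is passed to the tool when it is constructed (see the Usage section below for the full flow):

scraper = WebpageScraperTool(config=config)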

Input & Output Structure

Input Schema

  • url (HttpUrl): URL of the webpage to scrape.
  • include_links (bool): Whether to preserve hyperlinks in the markdown output. Defaults to True.

Output Schema

  • content (str): The scraped content in markdown format.
  • metadata (WebpageMetadata): Metadata about the scraped webpage, including:
    • title (str): The title of the webpage
    • author (Optional[str]): The author of the webpage content
    • description (Optional[str]): Meta description of the webpage
    • site_name (Optional[str]): Name of the website
    • domain (str): Domain name of the website
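
The exact schema classes live in the tool's source code. As a rough sketch, the fields above map onto pydantic models roughly as follows (WebpageMetadata is the name used by the tool; the input and output class names here are illustrative only):

from typing import Optional
from pydantic import BaseModel, Field, HttpUrl

class WebpageMetadata(BaseModel):
    # Metadata extracted from the scraped page
    title: str = Field(..., description="The title of the webpage")
    author: Optional[str] = Field(None, description="The author of the webpage content")
    description: Optional[str] = Field(None, description="Meta description of the webpage")
    site_name: Optional[str] = Field(None, description="Name of the website")
    domain: str = Field(..., description="Domain name of the website")

class WebpageScraperInput(BaseModel):
    # Input: the page to scrape and whether to keep hyperlinks
    url: HttpUrl = Field(..., description="URL of the webpage to scrape")
    include_links: bool = Field(True, description="Whether to preserve hyperlinks in the markdown output")

class WebpageScraperOutput(BaseModel):
    # Output: markdown content plus page metadata
    content: str = Field(..., description="The scraped content in markdown format")
    metadata: WebpageMetadata = Field(..., description="Metadata about the scraped webpage")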

Usage

Here's an example of how to use the Webpage Scraper Tool:

from tool.webpage_scraper import WebpageScraperTool, WebpageScraperToolConfig

# Initialize the tool
scraper = WebpageScraperTool(config=WebpageScraperToolConfig())

# Define input data
input_data = WebpageScraperTool.input_schema(
    url="https://example.com/article",
    include_links=True
)

# Perform the scraping
result = scraper.run(input_data)
print(f"Title: {result.metadata.title}")
print(f"Content: {result.content[:200]}...")  # Preview first 200 chars

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a new feature branch.
  3. Commit your changes with clear messages.
  4. Open a pull request detailing your changes.

Please ensure you follow the project's coding standards and include tests for any new features or bug fixes.

License

This project is licensed under the same license as the main Atomic Agents project. See the LICENSE file in the repository root for more details.