go-wayback is a high-performance command-line tool written in Go that queries the Wayback Machine API to retrieve archived URLs and related data for a given website. Version 1.0.4 introduces concurrent processing, advanced filtering options, and multiple output formats for exploring historical snapshots of web content efficiently.
- Wayback URLs Retrieval: Fetch original URLs archived by the Wayback Machine
- Browsable Archive Links: Generate direct links to archived versions
- Subdomain Extraction: Identify and list unique subdomains
- Multiple Output Formats: Support for plain text, CSV, JSON, and XML
- Concurrent Processing: Fast retrieval through parallel processing
- Date Range Filtering: Filter archives by specific time periods
- Rate Limiting: Control request rates to prevent API throttling
- Regex Filtering: Filter URLs using regular expressions
- Result Limiting: Control the number of results returned
- Batch Processing: Process multiple URLs from an input file
go install -v github.com/Abhinandan-Khurana/go-wayback/[email protected]
- Prerequisites: Go 1.16 or higher
- Clone and Build:
git clone https://github.com/Abhinandan-Khurana/go-wayback.git
cd go-wayback
go build -o go-wayback main.go
./go-wayback [options]
- -wayback-only: Get only wayback URLs
- -browsable: Get wayback browsable links
- -subdomain: Extract unique subdomains
- -unique-urls: Remove duplicate URLs
- -save-wayback-csv: Output as CSV
- -o [file]: Specify output file (optional, defaults to stdout)
- -v: Enable verbose output
- -h: Display help information
- -version: Show version information
- -start-date: Start date for filtering (YYYY-MM-DD)
- -end-date: End date for filtering (YYYY-MM-DD)
- -format: Output format (text/json/xml/csv)
- -input-file: File containing URLs to process
- -filter: Regex pattern to filter URLs
- -rate-limit: Maximum requests per second (default: 10)
- -max-results: Maximum number of results (0 for unlimited)
- -concurrent: Number of concurrent processors (default: 10)
- -timeout: Request timeout in seconds (default: 30)
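As an illustration of how the -filter, -unique-urls, and -max-results options could combine, here is a minimal sketch in Go. The function name filterURLs and the order of the steps are assumptions made for this example; the tool's actual implementation may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// filterURLs (hypothetical helper) keeps URLs matching pattern, drops exact
// duplicates, and stops once limit results are collected (limit <= 0 means
// unlimited, mirroring "0 for unlimited").
func filterURLs(urls []string, pattern string, limit int) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid filter %q: %w", pattern, err)
	}
	seen := make(map[string]struct{})
	var out []string
	for _, u := range urls {
		if !re.MatchString(u) {
			continue
		}
		if _, dup := seen[u]; dup {
			continue
		}
		seen[u] = struct{}{}
		out = append(out, u)
		if limit > 0 && len(out) >= limit {
			break
		}
	}
	return out, nil
}

func main() {
	urls := []string{
		"http://example.com/a.pdf",
		"http://example.com/index.html",
		"http://example.com/a.pdf",
		"http://example.com/b.pdf",
	}
	matched, err := filterURLs(urls, `.*\.pdf$`, 100)
	if err != nil {
		panic(err)
	}
	for _, u := range matched {
		fmt.Println(u)
	}
}
```

Go's regexp package uses RE2 syntax, so patterns relying on backreferences or lookahead will fail to compile.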
./go-wayback example.com
./go-wayback -o results.txt example.com
./go-wayback -subdomain example.com
./go-wayback -start-date 2020-01-01 -end-date 2023-12-31 -format json example.com
./go-wayback -input-file urls.txt -rate-limit 5 -concurrent 20
./go-wayback -filter ".*\.pdf$" -max-results 100 example.com
./go-wayback -browsable -timeout 45 example.com
./go-wayback -save-wayback-csv -o archive_data.csv example.com
./go-wayback -format json -o data.json example.com
./go-wayback -format xml -o data.xml example.com
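The -subdomain example above returns unique subdomains. One way such extraction can be done is sketched below; uniqueSubdomains is a hypothetical name, and the matching rule (host equals the domain or ends with a dot followed by the domain) is an assumption rather than the tool's documented behavior.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// uniqueSubdomains (hypothetical helper) returns the sorted, de-duplicated
// hostnames in urls that equal domain or end in "."+domain.
func uniqueSubdomains(urls []string, domain string) []string {
	seen := make(map[string]struct{})
	for _, raw := range urls {
		u, err := url.Parse(raw)
		if err != nil {
			continue // skip malformed entries
		}
		host := strings.ToLower(u.Hostname()) // Hostname strips any port
		if host == domain || strings.HasSuffix(host, "."+domain) {
			seen[host] = struct{}{}
		}
	}
	hosts := make([]string, 0, len(seen))
	for h := range seen {
		hosts = append(hosts, h)
	}
	sort.Strings(hosts)
	return hosts
}

func main() {
	urls := []string{
		"http://blog.example.com:80/post",
		"https://example.com/",
		"http://API.example.com/v1",
		"http://blog.example.com/other",
		"http://notexample.com/",
	}
	for _, h := range uniqueSubdomains(urls, "example.com") {
		fmt.Println(h)
	}
}
```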
- Plain text: one URL per line; simple and grep-friendly
- CSV: headers URL, LENGTH, TIMESTAMP; includes metadata for each archive
{
  "results": [
    {
      "url": "http://example.com",
      "length": "12345",
      "timestamp": "20230101120000",
      "date": "2023-01-01T12:00:00Z"
    }
  ]
}
<results>
  <result>
    <url>http://example.com</url>
    <length>12345</length>
    <timestamp>20230101120000</timestamp>
    <date>2023-01-01T12:00:00Z</date>
  </result>
</results>
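The XML output can be read the same way with encoding/xml. This is a small sketch assuming the element names shown above; the Go type names and the decodeXML helper are hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Result mirrors one <result> element.
type Result struct {
	URL       string `xml:"url"`
	Length    string `xml:"length"`
	Timestamp string `xml:"timestamp"`
	Date      string `xml:"date"`
}

// Results mirrors the <results> root element.
type Results struct {
	XMLName xml.Name `xml:"results"`
	Items   []Result `xml:"result"`
}

// decodeXML (hypothetical helper) parses go-wayback XML output.
func decodeXML(data []byte) (Results, error) {
	var res Results
	err := xml.Unmarshal(data, &res)
	return res, err
}

func main() {
	sample := []byte(`<results>
  <result>
    <url>http://example.com</url>
    <length>12345</length>
    <timestamp>20230101120000</timestamp>
    <date>2023-01-01T12:00:00Z</date>
  </result>
</results>`)
	res, err := decodeXML(sample)
	if err != nil {
		panic(err)
	}
	for _, r := range res.Items {
		fmt.Println(r.URL, r.Timestamp)
	}
}
```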
- Use -concurrent to adjust parallel processing based on your system capabilities
- Use -rate-limit to prevent API throttling
- For large datasets, consider using -max-results to limit output
- Enable -v for progress monitoring during long operations
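To show how -concurrent and -rate-limit interact, here is a minimal sketch of a worker pool whose job dispatch is paced by a ticker. It illustrates the general technique only; processAll and its parameters are hypothetical, and go-wayback's internal scheduling is not described in this README.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// processAll (hypothetical helper) applies fn to every target using `workers`
// goroutines while dispatching at most ratePerSec jobs per second. Results are
// returned in input order.
func processAll(targets []string, workers, ratePerSec int, fn func(string) string) []string {
	if workers < 1 {
		workers = 1
	}
	if ratePerSec < 1 {
		ratePerSec = 1
	}
	results := make([]string, len(targets))
	jobs := make(chan int)
	ticker := time.NewTicker(time.Second / time.Duration(ratePerSec))
	defer ticker.Stop()

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				results[i] = fn(targets[i]) // each index is written by one worker only
			}
		}()
	}
	for i := range targets {
		<-ticker.C // wait for the next rate-limit slot
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	targets := []string{"example.com", "example.org", "example.net"}
	out := processAll(targets, 2, 5, func(t string) string {
		return "processed " + t // stand-in for a real API request
	})
	for _, line := range out {
		fmt.Println(line)
	}
}
```

The rate limit caps how often requests start, while the worker count caps how many are in flight at once, so raising -concurrent alone does not increase the request rate beyond -rate-limit.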
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
- The Wayback Machine for providing access to archived web content
- Go community for excellent tooling and libraries