Get all URLs from HTML markup. It is based on the W3C link checker.
$ npm install html-urls --save
const got = require('got')
const htmlUrls = require('html-urls')
;(async () => {
  const url = process.argv[2]
  if (!url) throw new TypeError('Need to provide a URL as first argument.')
  const { body: html } = await got(url)
  const links = htmlUrls({ html, url })

  links.forEach(({ url }) => console.log(url))

  // => [
  //   '',
  //   '',
  //   '',
  //   '',
  //   '',
  //   '',
  //   ...
  // ]
})()
It returns the following structure for every value detected in the HTML markup:
Type: <string>
The original value.
Type: <string|undefined>
The normalized URL, if the value can be considered a URL.
Type: <string|undefined>
The value normalized as a URI.
See examples for more!
Type: string
Default: ''
The HTML markup.
Type: string
Default: ''
The URL associated with the HTML markup.
It is used to resolve relative links that may be present in the HTML markup.
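As a rough illustration of what resolving a relative link against a base URL means, the sketch below uses the WHATWG `URL` class built into Node.js. It is not the library's implementation; `resolveLink` and the example.com addresses are hypothetical names chosen for this example.

```javascript
// Hypothetical helper: resolve a link found in the markup against the page URL.
// Uses the WHATWG URL class that ships with Node.js and returns undefined
// when the input cannot be parsed as a URL.
const resolveLink = (value, baseUrl) => {
  try {
    return new URL(value, baseUrl).href
  } catch (err) {
    return undefined
  }
}

const pageUrl = 'https://example.com/blog/post.html'

// Root-relative path: resolved against the origin of the page.
console.log(resolveLink('/about', pageUrl))
// Document-relative path: resolved against the directory of the page.
console.log(resolveLink('img/logo.png', pageUrl))
// Absolute URL: the base is ignored.
console.log(resolveLink('https://other.example/feed', pageUrl))
```

Without a base URL, a relative value such as `/about` cannot be turned into an absolute address, which is why the page URL is passed alongside the markup.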
Type: array
Default: []
A list of links to be excluded from the final output. It supports regex patterns.
See matcher to learn more.
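To make the idea of excluding links concrete, here is a small sketch that filters a list of URLs with plain `RegExp` objects. It does not call html-urls or matcher, whose pattern syntax may differ; `excludeLinks` and the sample URLs are hypothetical.

```javascript
// Hypothetical helper: drop every URL that matches at least one pattern.
const excludeLinks = (urls, patterns = []) =>
  urls.filter(url => !patterns.some(pattern => pattern.test(url)))

const urls = [
  'https://example.com/',
  'https://example.com/app.js',
  'https://cdn.example.com/style.css'
]

// Exclude scripts and anything served from the CDN host.
const kept = excludeLinks(urls, [/\.js$/, /^https:\/\/cdn\./])

console.log(kept)
```

An empty pattern list keeps every URL.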
Type: boolean
Default: true
Remove duplicate links detected across all the HTML tags.
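Deduplication can be pictured as keeping the first entry seen for each normalized URL. The sketch below assumes each entry exposes the `url` field used in the usage example; `uniqueByUrl` and the sample entries are hypothetical, and this is not the library's own code.

```javascript
// Hypothetical helper: keep the first entry for each distinct `url`.
const uniqueByUrl = links => {
  const seen = new Set()
  return links.filter(({ url }) => {
    if (seen.has(url)) return false
    seen.add(url)
    return true
  })
}

// The same stylesheet referenced from two different tags.
const links = [
  { url: 'https://example.com/style.css' },
  { url: 'https://example.com/app.js' },
  { url: 'https://example.com/style.css' }
]

console.log(uniqueByUrl(links).map(({ url }) => url))
```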
- xml-urls – Get all URLs from Feed/Atom/RSS/Sitemap XML markup.
- css-urls – Get all URLs referenced from stylesheet files.
html-urls © Kiko Beats, released under the MIT License.
Authored and maintained by Kiko Beats with help from contributors. · GitHub @Kiko Beats · X @Kikobeats