Finds the shortest route to each inner link of a website, finds dead links, and creates the sitemap.xml.
Implements Dijkstra's algorithm to find the shortest paths. It is not professionally made, but it provides three utilities that can be easily extended. All results are saved to separate XML files.
An HTTP server for testing purposes and a script that creates randomly interlinked HTML pages are also provided.
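To illustrate the shortest-path idea mentioned above, here is a minimal sketch of Dijkstra's algorithm over a link graph. It is not the code from web_crawler.py: the function name shortest_paths, the SAMPLE_LINKS data, and the choice to give every link a cost of 1 (one click) are assumptions made for this example.

```python
import heapq

# Hypothetical link graph: each page maps to the pages it links to.
SAMPLE_LINKS = {
    "/": ["/a.html", "/b.html"],
    "/a.html": ["/c.html"],
    "/b.html": ["/c.html", "/d.html"],
    "/c.html": ["/d.html"],
    "/d.html": [],
}


def shortest_paths(links, start):
    """Return {page: [start, ..., page]} for every page reachable from start.

    Every link is assumed to cost 1, so the cost of a path is its click count.
    """
    dist = {start: 0}      # best known cost to reach each page
    prev = {}              # page -> previous page on the best known path
    heap = [(0, start)]    # priority queue of (cost, page)

    while heap:
        cost, page = heapq.heappop(heap)
        if cost > dist.get(page, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for target in links.get(page, []):
            new_cost = cost + 1
            if new_cost < dist.get(target, float("inf")):
                dist[target] = new_cost
                prev[target] = page
                heapq.heappush(heap, (new_cost, target))

    # Rebuild each path by walking the prev pointers back to the start page.
    paths = {}
    for page in dist:
        path = [page]
        while path[-1] != start:
            path.append(prev[path[-1]])
        paths[page] = path[::-1]
    return paths
```

Calling shortest_paths(SAMPLE_LINKS, "/") returns one path per reachable page. With a uniform cost per link the result matches what a breadth-first search would give; the priority queue only matters if links are given different weights.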
It requires the following libraries:
Runs on Python 3.x.
From a terminal:
python3 web_crawler.py --domain http://www.example.com --firstpage thefirstpage.html
-d, --domain is required
-f, --firstpage defaults to /
-so, --sitemapout defaults to sitemap.xml
-po, --pathsout defaults to paths.xml
-do, --deadout defaults to dead.xml
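The options listed above could be declared with the standard argparse module roughly as follows. This is only a sketch: whether web_crawler.py actually uses argparse is an assumption, and the function name build_parser and the help texts are made up for this example. The flags and defaults are the ones listed above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Crawl a website, find dead links and write a sitemap."
    )
    # The only required option.
    parser.add_argument("-d", "--domain", required=True,
                        help="domain to crawl, e.g. http://www.example.com")
    # Page where the crawl starts; defaults to the site root.
    parser.add_argument("-f", "--firstpage", default="/",
                        help="first page to visit")
    # Output files, one per utility.
    parser.add_argument("-so", "--sitemapout", default="sitemap.xml",
                        help="output file for the sitemap")
    parser.add_argument("-po", "--pathsout", default="paths.xml",
                        help="output file for the shortest paths")
    parser.add_argument("-do", "--deadout", default="dead.xml",
                        help="output file for the dead links")
    return parser
```

For example, build_parser().parse_args(["-d", "http://www.example.com"]) yields a namespace in which firstpage is "/" and the three output names keep their defaults.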