Login Spider

Spider website pages protected by a login.

Purpose

Log in to a website to reach a registered user's area, then spider the page links and process the page content.

Requirements

Python 2.6+
pycurl

Background

The spider was built on-site on Windows, where the available Python dependencies were somewhat restricted: pip would install only some packages, and Beautiful Soup was not one of them.
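Without Beautiful Soup, page links can still be collected with the standard library alone. The following is a minimal sketch of that idea, not necessarily how login_spider.py does it; the names LinkParser and extract_links are hypothetical.

```python
try:
    from html.parser import HTMLParser  # Python 3
    from urllib.parse import urljoin
except ImportError:
    from HTMLParser import HTMLParser   # Python 2
    from urlparse import urljoin


class LinkParser(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, resolved against base_url."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links
```

A spider would typically call extract_links on each downloaded page and queue any URLs it has not visited yet.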

Usage

Configure the website access details in the CONFIG section of login_spider.py.

(View the HTML source of the website's login form to configure the FORM_POST string, as the form differs from site to site.)
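As a rough illustration, the sketch below assumes FORM_POST is a URL-encoded POST body built from the login form's input names, and shows how such a string might be sent with pycurl. The field names, the helper names build_form_post and login, and the exact pycurl options are assumptions for illustration; the real values come from the target site's form and from the CONFIG section of login_spider.py.

```python
from io import BytesIO

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2


def build_form_post(fields):
    """URL-encode (name, value) pairs into a POST body string."""
    return urlencode(fields)


def login(login_url, form_post):
    """POST the form body with pycurl; return (status code, curl handle)."""
    import pycurl  # imported here so build_form_post works without pycurl

    body = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, login_url)
    curl.setopt(pycurl.POSTFIELDS, form_post)
    curl.setopt(pycurl.COOKIEFILE, "")        # enable in-memory cookie handling
    curl.setopt(pycurl.FOLLOWLOCATION, True)  # follow the post-login redirect
    curl.setopt(pycurl.WRITEFUNCTION, body.write)
    curl.perform()
    return curl.getinfo(pycurl.RESPONSE_CODE), curl


# Hypothetical field names; read the real ones from the login form's HTML.
FORM_POST = build_form_post([
    ("username", "alice"),
    ("password", "s3cret pass"),
    ("submit", "Login"),
])
```

Reusing the returned handle for later requests keeps the session cookie, which is what lets the spider reach the protected pages.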

Execute:

    python login_spider.py

Speed

Speed depends on CPU and OS: approximately 35 seconds to process a 200-page website over a localhost connection (zero network overhead).
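The Credits mention threading pools in Python. One common standard-library way to process pages concurrently is a pool of worker threads, sketched below. This is an assumption about the general technique, not a copy of login_spider.py; fetch is a placeholder and process_all is a hypothetical name.

```python
from multiprocessing.pool import ThreadPool


def fetch(url):
    """Placeholder for the real page download; returns (url, len(url))."""
    return url, len(url)


def process_all(urls, workers=4):
    """Run fetch over urls on a pool of threads, keeping input order."""
    pool = ThreadPool(workers)
    try:
        return pool.map(fetch, urls)
    finally:
        pool.close()
        pool.join()
```

Threads suit this workload because most of the time is spent waiting on the server's responses rather than on computation.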

Credits

jfs and philshem for threading pools in Python.

License

Login Spider is released under the GPL v3.
