CS 4250 Web Search and Recommender Systems Project 1

Assignment Details

Implement a web crawler and analyze three different crawls, each in a different language.
The analysis includes graphs that compare each crawl against Zipf's Law and Heaps' Law.
Crawling includes downloading the raw HTML content of each page and saving it to the repository folder.
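
As a rough illustration of the analysis step, the sketch below computes the two quantities the laws are checked against: the rank-frequency table (Zipf's Law predicts that rank times frequency stays roughly constant) and the vocabulary growth curve (Heaps' Law predicts that vocabulary size grows roughly as K * n^beta with the number of tokens n). The function names `tokenize`, `zipf_table`, and `heaps_curve` are hypothetical and are not taken from analyzer.py; the tokenizer is deliberately simple and the graphing itself is left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simple, assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def zipf_table(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent word first."""
    counts = Counter(tokens)
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


def heaps_curve(tokens):
    """Return (tokens_seen, vocabulary_size) pairs, one pair per token read."""
    seen = set()
    curve = []
    for tokens_seen, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((tokens_seen, len(seen)))
    return curve


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the log"
    tokens = tokenize(sample)
    for row in zipf_table(tokens)[:3]:
        print(row)
    print(heaps_curve(tokens)[-1])
```

Plotting rank against frequency, and tokens seen against vocabulary size, on log-log axes should give approximately straight lines if the crawl follows the two laws.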

Install

# Clone the repo 
$ git clone https://github.com/CS4250-Group6/Project-1.git

# Install the requirements
$ python3 -m pip install -r requirements.txt

Use

1. Set the seed URL and crawl language. The language must be one that langdetect supports.
2. Create a repository folder and a repository/<lang> folder inside it, where <lang> is the ISO 639-1 code of the selected language.
3. Run the crawler:
   $ python3 crawler.py
4. Update analyzer.py with the correct language name and code.
5. Run the analyzer:
   $ python3 analyzer.py
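
The folder layout described above can also be created from Python rather than by hand. The following is a minimal sketch, assuming pages are stored as numbered .html files; the helper name `save_page` and the file naming scheme are hypothetical and may differ from what crawler.py actually does.

```python
from pathlib import Path


def save_page(repo_root, lang, index, html):
    """Write raw HTML to <repo_root>/<lang>/<index>.html and return the path.

    The numbered file name is an assumption made for this sketch.
    """
    folder = Path(repo_root) / lang
    # Create repository/<lang> (and the repository folder itself) if missing.
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{index}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

For example, `save_page("repository", "en", 0, html)` would write the page to repository/en/0.html.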
