CS4250-Group6/Project-1

CS 4250 Web Search and Recommender Systems Project 1

Assignment Details

Implement a web crawler and analyze three crawls, each in a different language.
The analysis includes plots comparing each crawl against Zipf's Law and Heaps' Law.
The crawler downloads the raw HTML of each page and saves it to the repository folder.
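
As a rough illustration of the two quantities the analysis compares, the sketch below computes a Zipf rank-frequency table and a Heaps vocabulary-growth curve from plain text. It is not taken from analyzer.py: the function names (tokenize, zipf_table, heaps_curve) and the regex tokenizer are assumptions made for this example, and real HTML would first need its markup stripped.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplified, hypothetical tokenizer)."""
    return re.findall(r"\w+", text.lower())


def zipf_table(tokens):
    """Return (rank, word, frequency) rows, most frequent word first.

    Zipf's Law predicts frequency roughly proportional to 1 / rank.
    """
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


def heaps_curve(tokens):
    """Return (tokens_seen, vocabulary_size) after each token.

    Heaps' Law predicts vocabulary size growing roughly as K * n**beta.
    """
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve


sample_tokens = tokenize("the cat sat on the mat the end")
top_row = zipf_table(sample_tokens)[0]
last_point = heaps_curve(sample_tokens)[-1]
```

Plotting both outputs on log-log axes is the usual way to compare them with the two laws, since each appears as an approximately straight line there.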

Install

# Clone the repo 
$ git clone https://github.com/CS4250-Group6/Project-1.git

# Install the requirements
$ python3 -m pip install -r requirements.txt

Use

  1. Set the seed URL and the crawl language; the language must be one that langdetect supports
  2. Create a repository folder and, inside it, a repository/<lang> folder named with the ISO 639-1 code of the selected language
  3. Run python3 crawler.py
  4. Set the language name and code in analyzer.py to match the crawl
  5. Run python3 analyzer.py
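
Step 2 can also be scripted rather than done by hand. The following is a minimal sketch, assuming the folder layout described above; make_repository_dirs is a hypothetical helper and not part of crawler.py.

```python
from pathlib import Path


def make_repository_dirs(lang_codes, root="repository"):
    """Create root/<lang> for each ISO 639-1 code and return the created paths.

    Hypothetical helper: it mirrors the manual step above and is safe to
    re-run because existing folders are left untouched.
    """
    created = []
    for code in lang_codes:
        path = Path(root) / code
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

For example, make_repository_dirs(["en", "es", "de"]) would create repository/en, repository/es and repository/de in the current directory; those three codes are only illustrative.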
