Implement a web crawler and analyze three different crawls, each in a different language.
The analysis includes graphs that compare each crawl against Zipf's Law and Heaps' Law.
Crawling includes downloading the raw HTML content of each page and saving it to the repository
folder.
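As a rough illustration of the kind of data these comparisons need, the sketch below computes a rank-frequency table (for Zipf's Law) and a vocabulary growth curve (for Heaps' Law) from plain text. It is a minimal sketch using only the standard library, not the project's actual analyzer.py; the function names `tokenize`, `zipf_table`, and `heaps_curve` and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and split it into runs of word characters.
    # (Hypothetical tokenizer; the real analyzer may tokenize differently.)
    return re.findall(r"\w+", text.lower())


def zipf_table(tokens):
    # Return (rank, frequency) pairs, most frequent word first.
    # Zipf's Law predicts frequency roughly proportional to 1 / rank.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def heaps_curve(tokens):
    # Return (tokens_seen, vocabulary_size) after each token.
    # Heaps' Law predicts vocabulary size growing roughly as K * n**beta.
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve


sample = "the cat sat on the mat and the dog sat too"
tokens = tokenize(sample)
print(zipf_table(tokens)[:3])   # top three (rank, frequency) pairs
print(heaps_curve(tokens)[-1])  # (total tokens, distinct words)
```

Plotting both series on log-log axes is the usual way to compare them against the two laws.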
# Clone the repo
$ git clone https://github.com/CS4250-Group6/Project-1.git
# Install the requirements
$ python3 -m pip install -r requirements.txt
- Modify the seed URL and crawl language according to the accepted languages in langdetect
- Make a repository folder and repository/<lang> folders according to the selected language's ISO 639-1 code
$ python3 crawler.py
- Modify analyzer.py for the correct language name and code
$ python3 analyzer.py
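The folder layout from the steps above could also be created with a few lines of Python instead of by hand. This is a minimal sketch under stated assumptions: the helper name `make_repository_dirs` is hypothetical and is not part of crawler.py.

```python
from pathlib import Path


def make_repository_dirs(base, lang_codes):
    # Create base/<lang> for each ISO 639-1 code and return the paths.
    # (Hypothetical helper; crawler.py may expect a different layout.)
    created = []
    for code in lang_codes:
        path = Path(base) / code
        # parents=True also creates the base folder if it is missing;
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

For example, `make_repository_dirs("repository", ["en", "es", "de"])` would create repository/en, repository/es, and repository/de, where the three codes are example values only.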