
JCP-Stack

An efficient way to collect results from the ACM, Springer, and IEEE Xplore digital libraries
View Demo

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Status
  5. Issues
  6. Contributing
  7. License
  8. Contact

About The Project

This project helps researchers efficiently find and sort papers from the ACM, Springer, and IEEE Xplore online databases. I have compiled a list of 291 journals and conferences, along with their CCF, CORE, and Qualis rankings, in SelectedJournalsAndConferences.csv. The web scraper compares the similarity (Levenshtein ratio) between each search result's journal/conference title and the titles listed in SelectedJournalsAndConferences.csv. If the similarity is greater than or equal to a user-specified percentage, the result is written to a CSV file whose path and name are also chosen by the user. Once the scraper has traversed every page generated by the user's search term, analyzed the results on each page, and stored those that meet the criteria, it notifies the user of its status before restarting.
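For illustration, here is a minimal sketch of that matching idea in Python. It is not the project's actual code: the helper name keep_result, the 0.90 threshold, the lowercasing, and the sample venue title are assumptions made for this example; the "Name" column and the Levenshtein dependency come from this repository.

    import csv
    import Levenshtein  # provided by the levenshtein package in the dependency list

    def keep_result(venue_name, selected_names, threshold=0.90):
        # threshold stands in for the user-specified percentage (90% here)
        return any(
            Levenshtein.ratio(venue_name.lower(), name.lower()) >= threshold
            for name in selected_names
        )

    # Load the curated venue titles from the "Name" column
    with open("SelectedJournalsAndConferences.csv", newline="") as f:
        selected = [row["Name"] for row in csv.DictReader(f)]

    print(keep_result("IEEE Transactions on Software Engineering", selected))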

Onomatology

"JCP-Stack" is short for "Journal/Conference Paper Stack" given that executing this program (ideally) outputs a CSV file that contains information about a stack of journal/conference papers related to a given keyword.

Dependencies

  • appdirs==1.4.4
  • beautifulsoup4==4.9.3
  • black==21.6b0
  • certifi==2021.5.30
  • charset-normalizer==2.0.3
  • click==8.0.1
  • colorama==0.4.4
  • configparser==5.0.2
  • crayons==0.4.0
  • idna==3.2
  • levenshtein==0.12.0
  • mypy-extensions==0.4.3
  • pathspec==0.9.0
  • regex==2021.4.4
  • requests==2.26.0
  • selenium==3.141.0
  • soupsieve==2.2.1
  • toml==0.10.2
  • urllib3==1.26.5
  • webdriver-manager==3.4.2

Getting Started

To get this project running on your local machine, follow these simple steps:

Steps

  1. Clone the repo
    git clone https://github.com/ritwiktakkar/JCP-Stack.git
  2. Make sure you're running Python 3 (I wrote and tested this project with Python 3.9.6 64-bit)
    python -V
  3. Install the packages listed in requirements.txt
    pip install -r requirements.txt
  4. You will need the latest version of Google Chrome installed on your machine
  5. Create a file called config.py inside this repo and add the following:
    from common_functions import platform
    
    if platform == "win32":
        path_to_search_results = "C:/<PATH TO SEARCH RESULTS>"
    else:
        path_to_search_results = "/Users/<PATH TO SEARCH RESULTS>"
  6. View the "Name" column inside SelectedJournalsAndConferences.csv: this is the list of names whose similarity (Levenshtein ratio) will be checked against each search result's journal/conference name. Feel free to modify this column on your local machine to add or remove journal/conference names according to your interests.
  7. Execute get_all_results.py using Python (a concrete example follows this list)
    PATH_TO_PYTHON_INTERPRETER PATH_TO_get_all_results.py
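For example, assuming your Python 3 interpreter is on your PATH as python3 and the command is run from the repository root (both assumptions about your setup), the last step might look like:

    python3 get_all_results.py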

Usage

Here is a video demo.

Status

Because the layouts of online research databases change occasionally, the scraper may need to be updated accordingly to keep retrieving the necessary information. The table below shows the current status of the scraper's ability to retrieve information from each supported database. As of December 30, 2021...

Database Scraper Status
ACM
Springer
IEEE Xplore

Issues

On Windows only: Selenium's quit() method alone fails to kill chromedriver processes, leading to a sort of memory leak. To counter this, I added a batch file (kill_chromedriver.bat) that kills all chrome.exe processes. As a result, ANY Chrome process unrelated to this program will ALSO die at the hands of this rather brute-force approach.
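For reference, the same cleanup could be narrowed to chromedriver.exe from Python using Windows' taskkill, which would spare unrelated Chrome windows. This is a hedged sketch of an alternative, not the contents of kill_chromedriver.bat:

    import subprocess
    import sys

    def kill_chromedriver():
        # Force-kill (/F) every process whose image name (/IM) is chromedriver.exe,
        # along with its child processes (/T). Windows-only.
        if sys.platform == "win32":
            subprocess.run(
                ["taskkill", "/F", "/IM", "chromedriver.exe", "/T"],
                capture_output=True,  # swallow "process not found" output
            )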

Contributing

Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch
  3. Commit your Changes
  4. Push to the Branch
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

📧 rt398 [at] cornell [dot] edu

🏠 ritwiktakkar.com

Project Link: https://github.com/ritwiktakkar/JCP-Stack
