An efficient way to collect results from the ACM, Springer, and IEEE Xplore digital libraries
This project aims to help researchers efficiently find and sort papers from the ACM, Springer, and IEEE Xplore online databases. I have compiled a list of 291 journals and conferences with their CCF, Core, and Qualis rankings in `SelectedJournalsAndConferences.csv`. This web scraper compares the similarity (Levenshtein ratio) between each search result's journal/conference title and those listed in `SelectedJournalsAndConferences.csv`. If the similarity is greater than or equal to a user-specified percentage, the result is written to a CSV file whose path and name are also chosen by the user. Once the web scraper has finished traversing each page generated by the user's search term, analyzing the results therein, and storing the ones that fit the given criteria, it alerts the user of its status before restarting.
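To illustrate the matching step, here is a minimal sketch of the similarity check using the `levenshtein` package from the dependency list below; the helper names are hypothetical and this is not the repo's actual code:

```python
import csv

import Levenshtein  # provided by the "levenshtein" package listed below

def load_selected_names(csv_path: str) -> list[str]:
    """Read the "Name" column from SelectedJournalsAndConferences.csv."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row["Name"] for row in csv.DictReader(f)]

def venue_matches(venue: str, selected_names: list[str], threshold: float) -> bool:
    """Keep a search result if its venue title's Levenshtein ratio (0.0-1.0)
    against any selected name meets the threshold."""
    return any(Levenshtein.ratio(venue, name) >= threshold for name in selected_names)

# Example: a 90% similarity cutoff
# selected = load_selected_names("SelectedJournalsAndConferences.csv")
# venue_matches("IEEE Transactions on Software Engineering", selected, 0.90)
```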
"JCP-Stack" is short for "Journal/Conference Paper Stack" given that executing this program (ideally) outputs a CSV file that contains information about a stack of journal/conference papers related to a given keyword.
This project was built with the following Python packages:

- appdirs==1.4.4
- beautifulsoup4==4.9.3
- black==21.6b0
- certifi==2021.5.30
- charset-normalizer==2.0.3
- click==8.0.1
- colorama==0.4.4
- configparser==5.0.2
- crayons==0.4.0
- idna==3.2
- levenshtein==0.12.0
- mypy-extensions==0.4.3
- pathspec==0.9.0
- regex==2021.4.4
- requests==2.26.0
- selenium==3.141.0
- soupsieve==2.2.1
- toml==0.10.2
- urllib3==1.26.5
- webdriver-manager==3.4.2
To get this project running on your local machine, follow these simple steps:
- Clone the repo
  ```sh
  git clone https://github.com/ritwiktakkar/rdb-scraper.git
  ```
- Make sure you're running Python 3 (I wrote and tested this project with Python 3.9.6 64-bit)
  ```sh
  python -V
  ```
- Install all the packages listed in `requirements.txt`
  ```sh
  pip install -r requirements.txt
  ```
- You will need the latest version of Google Chrome installed on your machine (a sketch after these steps shows how it is driven)
- Create a file called `config.py` inside this repo and add the following:
  ```python
  from common_functions import platform

  if platform == "win32":
      path_to_search_results = "C:/<PATH TO SEARCH RESULTS>"
  else:
      path_to_search_results = "/Users/<PATH TO SEARCH RESULTS>"
  ```
- View the "Name" column inside `SelectedJournalsAndConferences.csv`: this is the list of names whose similarity (Levenshtein ratio) will be checked against each search result's journal/conference name. Feel free to modify this column on your local machine to add or remove journal names of interest to you.
- Execute `get_all_results.py` using Python
  ```sh
  PATH_TO_PYTHON_INTERPRETER PATH_TO_get_all_results.py
  ```
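The Chrome requirement exists because the scraper drives Chrome through Selenium, with webdriver-manager fetching a chromedriver that matches your installed browser. Below is a minimal sketch of that wiring under the Selenium 3 and webdriver-manager versions in requirements.txt; it is an illustration, not the repo's actual code:

```python
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# Download (or reuse) a chromedriver matching the installed Chrome,
# then start a browser session with it
driver = webdriver.Chrome(ChromeDriverManager().install())
try:
    driver.get("https://dl.acm.org/")  # one of the databases this scraper targets
    print(driver.title)
finally:
    driver.quit()  # see the known issue below about leftover processes on Windows
```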
Here is a video demo.
Given that the layouts of online research databases are updated occasionally, the scraper may also need to be updated accordingly to successfully retrieve the necessary information therein. The table below provides the current status of the scraper's ability to retrieve information from different online research databases. As of 12/30/21...
| Database | Scraper Status |
| --- | --- |
| ACM | ✅ |
| Springer | ✅ |
| IEEE Xplore | ✅ |
On Windows only: Selenium's `quit()` method alone fails to kill chromedriver processes, leaving orphaned processes behind (effectively a memory leak). To counter this, I added a batch file (`kill_chromedriver.bat`) that kills all `chrome.exe` processes. As a result, ANY Chrome process unrelated to this program will ALSO die at the hands of this rather brute-force approach.
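For comparison, a narrower cleanup could target only chromedriver processes by their executable name, sparing unrelated Chrome windows. This is a hedged sketch with a hypothetical helper name, not what `kill_chromedriver.bat` actually does:

```python
import subprocess
import sys

def kill_chromedriver() -> None:
    """Hypothetical helper: terminate leftover chromedriver processes on
    Windows while leaving unrelated chrome.exe browser windows untouched."""
    if sys.platform == "win32":
        # /F forces termination; /IM matches processes by image (executable) name
        subprocess.run(
            ["taskkill", "/F", "/IM", "chromedriver.exe"],
            capture_output=True,
        )
```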
Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch
- Commit your Changes
- Push to the Branch
- Open a Pull Request
Distributed under the MIT License. See `LICENSE` for more information.
📧 rt398 [at] cornell [dot] edu
Project Link: https://github.com/ritwiktakkar/JCP-Stack