This repo contains the scripts for PharmaDB's ETL pipeline. The results of the pipeline are:
- Updated drug labels, patent claims, and (label-to-patent) matching scores in a MongoDB database
- A compressed CSV export (in `.zip` format) of the DB
The script `src/main.py` contains all the logic for periodically retrieving data from the respective data sources and updating the MongoDB database. Follow these steps to set up and run the pipeline:
- Clone this repo, with its submodules:
  ```sh
  $ git clone --recurse-submodules https://github.com/pharmaDB/etl_pipeline.git
  ```
- Build the `node` project for the patent data collection:
  ```sh
  $ cd src/submodules/uspto_bulk_file_processor_v4
  $ npm install
  $ npm run build
  ```
- Install the Python dependencies in each submodule, then start the file server (which serves the CSV export download) in the background from the scoring submodule (more context on this can be found in the main README). The commands below start from the repo root:
  ```sh
  $ cd src/submodules/dailymed_data_processor
  $ pip3 install -r requirements.txt
  $ cd ../scoring_data_processor
  $ pip3 install -r requirements.txt
  $ sudo nohup python3 server.py &
  ```
- If using a local MongoDB server, the `docker-compose` setup provided in this repo can be used. Follow the steps in the next section to start the server.
- If NOT using a local MongoDB server, update the `.env` files in the root of this repo and in each of the submodules with the host/port (and optionally, the DB name); see the example `.env` sketch after these steps.
- For the first run, follow the steps in the main README on importing data into the database.
- Run the pipeline script from the `src/` folder using `python3 main.py`. The script can also be run monthly as a cron job, e.g. `0 0 1 * * python3 main.py`, saved to the cron file using `crontab -e`; a fuller crontab sketch follows these steps.
- Upon a successful pipeline run, the data in MongoDB should be updated. As a quick check, the `pipeline` collection should show the updated timestamp for the latest successful run. The CSV export of the DB should also be updated at `submodules/scoring_data_processor/resources/hosted_folder/db2csv.zip`.
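For the non-local MongoDB case above, the `.env` entries carry the connection details. The variable names below are assumptions for illustration only; mirror the keys that the repo's actual `.env` files define:

```ini
# Hypothetical .env sketch — variable names are assumptions, not the
# repo's actual keys. Use the keys already present in each .env file.
MONGO_HOST=my-mongo-host.example.com
MONGO_PORT=27017
DB_NAME=pharmadb
```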
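For the cron setup, an entry along these lines (the clone path is illustrative; adjust it to your checkout) runs the pipeline at midnight on the first of each month and logs its output:

```
# m h dom mon dow  command — the path and log file below are examples
0 0 1 * *  cd /home/ubuntu/etl_pipeline/src && python3 main.py >> /tmp/etl_pipeline.log 2>&1
```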
The pipeline runs a sequence of tasks that can be logically visualized using the following diagram.
Note, however, that the pipeline in its current form differs from this depiction in two respects:
- All steps run sequentially, as opposed to the parallel execution shown for some of the steps.
- Step 4.b saves all patents into MongoDB, not just the ones appearing in the Orange Book.
Both of these may be viewed as future optimizations; a sketch of the current sequential flow, and where parallelism could be introduced, follows.
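The sketch below is a minimal, hypothetical illustration, not the repo's actual `src/main.py`; the function names are stand-ins for the tasks in the diagram:

```python
# Illustrative sketch only — none of these names exist in the repo.
from concurrent.futures import ThreadPoolExecutor

def collect_labels():
    print("collect drug label data")

def collect_patents():
    print("collect patent data (USPTO bulk files)")

def score_matches():
    print("compute label-to-patent matching scores")

def export_csv():
    print("refresh the hosted db2csv.zip export")

def run_pipeline():
    # Current behaviour: every step runs sequentially, in order.
    for step in (collect_labels, collect_patents, score_matches, export_csv):
        step()

def run_pipeline_optimized():
    # Possible future optimization: the two independent collection steps
    # run concurrently, then scoring and export follow.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(collect_labels), pool.submit(collect_patents)]
        for f in futures:
            f.result()  # propagate any exceptions from the workers
    score_matches()
    export_csv()

if __name__ == "__main__":
    run_pipeline()
```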
- Build the docker image using `docker-compose build`
- Start the containers using `docker-compose up`
The Mongo database is exposed at `localhost:27017`. The mongo shell can be accessed, to view or manipulate the data, by exec-ing into the container. The local folder `mongo/data` is attached to the DB container as a bind mount, mapped to the DB's data folder; this retains the data even when the container is removed.
Access `localhost:8081` for the Mongo Express viewer, which provides a limited UI to explore the data in the Mongo DB.
Run `docker-compose down` to stop and remove all containers.
NOTE: The Mongo data should be retained at `./mongo/`.
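For orientation, here is a minimal sketch of the kind of compose file this section describes. The repo ships its own `docker-compose.yml`; the service names and image tags below are assumptions, while the ports and bind mount mirror the text above:

```yaml
# Illustrative sketch only — not the repo's actual docker-compose.yml.
version: "3"
services:
  mongo:
    image: mongo:4
    ports:
      - "27017:27017"
    volumes:
      - ./mongo/data:/data/db   # bind mount: data survives container removal
  mongo-express:
    image: mongo-express
    ports:
      - "8081:8081"
    depends_on:
      - mongo
```

To open the mongo shell mentioned above, something like `docker exec -it <mongo-container-name> mongo` works, where `docker ps` lists the running container names.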
Unit tests are written using Pytest and can be run from the repo root with the following command:
```sh
$ PYTHONPATH=src/ pytest tests/ -vv
```
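The `PYTHONPATH=src/` prefix makes modules under `src/` importable by their bare names (e.g. `src/main.py` resolves as `main`). A hypothetical test illustrating that convention, not one of the repo's actual tests:

```python
# tests/test_imports.py — illustrative only; the repo's real tests live
# in tests/. With PYTHONPATH=src/, code under src/ imports without a
# "src." prefix.
import importlib.util

def test_main_module_resolves():
    # find_spec locates the module without executing it.
    spec = importlib.util.find_spec("main")
    assert spec is not None
```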