Skip to content

djb2188/ingester

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

38 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HealthPro Data Ingester

This software parses data from a HealthPro Work Queue CSV and imports it into a database, and also performs ancillary tasks.

It was developed to facilitate All of Us Research Program efforts at Weill Cornell Medicine. (HealthPro is software for use by All of Us personnel.) All of Us is a national program designed to gather data from one million or more people living in the United States to accelerate research and improve health. More information here: https://allofus.nih.gov.

How It Works

The ingester runs as a server process that monitors an inbox folder for the appearance of new files; it goes to work when a new HealthPro Work Queue CSV appears (deposited by a researcher or other study personnel).

The name and contents of the CSV file should not be modified in any way prior to putting it into the inbox folder.

Once the file appears, these are the steps it takes (at a high level):

  • The ingester does some light validation on the file, sending a notification email if the file doesn't appear to be a valid Work Queue CSV (or if it's not a CSV at all).
  • If the file appears valid, it then parses the CSV file and imports it into a destination database table (truncating it first).
  • It saves the CSV in an archive folder and deletes it from the inbox folder.
  • It subsequently runs a SQL Server Agent job which reads data from a REDCap project, pivots and transforms the data, and inserts it into the destination database.
  • A metadata table is also updated noting the time that both the HealthPro and REDCap data were last refreshed. For the HealthPro data, it uses the date/time info from the CSV filename.
  • It sends an email to the team letting them know if the process succeeded or failed.

CSV Validation and Safeguards

There are a number of safeguards in place including:

  • Ensuring the columns in the CSV match that of the target database table (this likely means a new version of HealthPro has been deployed and the ingester needs to be updated).
  • Ensuring that the number of rows in the CSV are the same or greater than in the database (to prevent accidentally processing an old file).

Requirements

  • Targets Python 2.7 on Linux
  • pip
  • virtualenv
  • SQL Server
  • Package requirements are listed in requirements.txt (use installdeps.sh; see below for step-by-step details)

Configuration / First-Time Setup

Database

The ingester currently works with SQL Server.

Configuring Python-to-SQL-Server connectivity is outside the scope of this walk-through, but potentially helpful notes can be found here: https://github.com/seanpompea/pymssqlcheck

Create the database tables in SQL Server using the following files:

  • sql/create-healthpro-table.sql
  • sql/create-metadata-table.sql
  • sql/create-redcap-table.sql

Customize the database, schema, and table names to suit.

Application Folders

Create runtime and staging folders.

mkdir ingester-staging
mkdir ingester

In the ingester folder, create the following subfolders:

cd ingester
mkdir archive
mkdir enclave

The archive folder holds processed CSVs.

The enclave folder will hold the configuration file (see below).

Inbox Folder

The inbox folder (where a CSV can be deposited by a researcher) can be anything you want. You'll then put its location in the config file (see below).

Configuration File (enclave folder)

In the enclave folder, create a file named healthproimporter_config.json. It should contain JSON like so

{ "inbox_dir": "/path/to/inbox"
, "archive_dir": "/path/to/archive"
, "consortium_tag": "CONSORTIUM"
, "db_info" : { "host": "X"
              , "user": "X" 
              , "password": "X" }
, "healthpro_table_name": "[dm_aou].[dbo].[healthpro]"
, "metadata_table_name": "[dm_aou].[dbo].[metadata]"
, "redcap_table_name": "[dm_aou].[dbo].[redcap]"
, "redcap_job_name": "DM_AOU REDCap Refresh Decoupled"
, "agent_job_timeout": 900
, "from_email": "X"
, "to_email": "X"
}

Customize the configuration values to suit. The email addresses are for success, error, and notification emails to be sent to the team.

virtualenv

Create a virtualenv for the process to run in; from the ingester folder, run:

mkdir venv
virtualenv venv

Deploying code and dependencies

From the ingester-staging folder, run:

git clone https://github.com/wcmc-research-informatics/ingester .
cp main.py requirements.txt installdeps.sh ../ingester

Then run the install script from the ingester folder (activate the virtualenv first):

cd ../ingester
source venv/bin/activate
./installdeps.sh

Starting the process

Ensure you're inside the virtualenv, then start:

source venv/bin/activate
nohup python /home/ingester/main.py >> out.log 2>&1 &

You can also configure the process as a daemon; the exact method varies depending on the Linux distro you're using.

Updating your installation

To pull in the latest version, use the steps detailed in Deploying code anad dependencies above, but replace the git clone ... command with simply this (again, run this from the ingester-staging folder:

git pull

Logging

Logs are written to hpimporter.log. The logging process is self-cleaning over time; see the kickshaws library for implementation: https://github.com/wcmc-research-informatics/kickshaws.

Notes

Technical Details

Use Watchdog's PollingObserver (rather than the vanilla Observer) to observe files being delivered on a Samba share, NFS or similar.

Back Matter

Repository

https://github.com/wcmc-research-informatics/ingester

Watchdog library

We're using our forked version of the Watchdog library: https://github.com/wcmc-research-informatics/watchdog

Technical Notes and References

About

HealthPro Data Ingester

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published