The following script loads some raw home-health visit notes from a local or remote location, cleans the data, and extracts some valuable hand-crafted features from the notes.
Agencies that use AlayaCare have a wealth of unstructured visit notes stored on their AlayaCare Cloud instances. Caregivers leave these notes every time they visit a patient. Alayalabs wants to put this unstructured data to work for its customers by using it to predict the probability of adverse events, such as a patient falling or landing in the emergency room. This script obtains these notes from a database, parses the response, and extracts clean features that can then be used in model training and inference pipelines.
You must have Python and pip installed first. This project relies on several libraries:
BeautifulSoup for extracting text from HTML and smart-open with Boto3 to read S3 buckets. To pre-process the notes, nltk must be installed.
You can install all of these dependencies by using:
pip install bs4 boto3 smart-open nltk
You also need to download the nltk stop words. You can do this by running the following command:
python -m nltk.downloader stopwords
To run, use:
python clean_data.py --source <path or s3 url> --destination <path>
name | description | required |
---|---|---|
source | source file to read data from, either a local file or an S3 bucket URL. Data must be in NDJSON format (one JSON object per line) | true |
destination | destination file, either a local file or an S3 bucket URL | true |
You can get help by using:
python clean_data.py -h
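The NDJSON input format required for `--source` stores one JSON object per line. The sketch below shows one way such a file could be parsed with the standard library only. The function name `read_ndjson` and the `note` field are hypothetical and are not taken from `clean_data.py`.

```python
import io
import json


def read_ndjson(stream):
    """Yield one dict per non-empty line of an NDJSON text stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# In-memory sample standing in for a real file; the "note" key is an assumption.
sample = io.StringIO(
    '{"note": "Patient resting."}\n'
    "\n"
    '{"note": "Patient fell."}\n'
)
records = list(read_ndjson(sample))
print(len(records))
```

Because the function only iterates over lines, it should work equally well with a handle returned by the built-in `open()` or by `smart_open.open()` for an S3 URL.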
Two hand-crafted features are extracted from each note: the note text with punctuation removed, and the note with stopwords removed. Stopwords are removed because they usually contribute little to the information in a sentence. Producing the second feature takes some extra steps, including tokenization, which breaks the note up into an array of individual words while also removing punctuation. For each note, the array of words (without stopwords) is then written to the output CSV file, giving a more compact representation of the important information in the note.
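The tokenization and stopword-removal steps can be illustrated with a minimal sketch. It is not the code from `clean_data.py`: the function name `clean_note` is hypothetical, a regular expression stands in for the nltk tokenizer, and a small hard-coded stopword set stands in for the nltk stopword list so that the example runs without any downloads.

```python
import re

# Small illustrative stopword set (an assumption for this sketch). With nltk
# installed, nltk.corpus.stopwords.words("english") supplies the full list.
SAMPLE_STOPWORDS = {"the", "a", "an", "and", "was", "is", "to", "in", "of"}


def clean_note(note, stop_words=SAMPLE_STOPWORDS):
    """Lowercase a note, split it into word tokens, and drop stopwords."""
    # Keeping only runs of letters and digits removes punctuation as a side effect.
    tokens = re.findall(r"[a-z0-9]+", note.lower())
    return [token for token in tokens if token not in stop_words]


print(clean_note("The patient was dizzy, and fell in the kitchen."))
# prints ['patient', 'dizzy', 'fell', 'kitchen']
```

Passing the stopword collection as a parameter keeps the function easy to test and lets the full nltk list be swapped in without changing the logic.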