- Python 3
- GitHub
- Jupyter Notebook Interface (optional)
- Tableau (optional)
- datetime
- ffmpeg
- keras
- matplotlib
- more_itertools
- numpy
- pandas
- plotly
- random
- scipy
- seaborn
- sklearn
- tensorflow
- torch
- window_slider
- Set up Git on your system (Mac: https://www.macworld.co.uk/how-to/mac-software/how-use-git-github-on-your-mac-3639136/; Windows: download Git Bash from https://gitforwindows.org/)
- In Git Bash (Windows) or Terminal (Mac), cd to the directory you want to clone this repo into
- Fork or clone the GitHub repository; for the difference between the two, see https://www.toolsqa.com/git/difference-between-git-clone-and-git-fork/
cd to a preferred location on your local machine and clone this repository by entering the following in your terminal (the project lives in the DigitalBiomarkers-HumanActivityRecognition folder of the DBDP repository):
git clone https://github.com/Big-Ideas-Lab/DBDP.git
- Removing columns that have data that does not originate from the Empatica E4 sensors
- Compiling all individual .csv files of subjects by sensor type
- Arranging data by Subject_ID
You do not need to run anything in this section, but feel free to look at the functions we used to clean the source data before moving on to the analysis code. This folder also contains the source data: 280 .csv files (56 participants × 5 sensors).
This notebook removes columns whose data does not originate from the Empatica E4 sensors and outputs start and end times for each subject's respective activity periods.
This notebook takes properly formatted .csv files from the E4FileFormatter (DBDP preprocessing not shown) and compiles all .csv files of subjects by sensor type.
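The compile step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function name `compile_by_sensor`, the `HR` column, and the subject labels are hypothetical, and the frames are built in memory where the real pipeline would read each subject's .csv file.

```python
import pandas as pd

def compile_by_sensor(frames_by_subject):
    """Stack per-subject frames for one sensor and order the rows by Subject_ID."""
    tagged = [
        frame.assign(Subject_ID=subject_id)  # remember which subject each row came from
        for subject_id, frame in frames_by_subject.items()
    ]
    compiled = pd.concat(tagged, ignore_index=True)
    # A stable sort keeps each subject's rows in their original time order
    return compiled.sort_values("Subject_ID", kind="stable").reset_index(drop=True)

# Hypothetical heart-rate frames; in practice each would come from pd.read_csv(...)
hr_frames = {
    "S02": pd.DataFrame({"HR": [70.0, 71.0]}),
    "S01": pd.DataFrame({"HR": [60.0]}),
}
hr_all = compile_by_sensor(hr_frames)
print(hr_all)
```

Running the same function once per sensor type would give one compiled frame per sensor, each arranged by Subject_ID.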
Let's walk through the code pipeline step-by-step.
- Resample the combined sensor data at 4 Hz
- Clean outcomes dataset to select periods of time where activity occurred in the combined sensor dataset
- Add time segments to a new dataframe and output for each participant, and all participants
This notebook resamples the combined sensor data and cleans the outcomes dataset. It takes the combined sensor and outcomes time files as input and returns the datasets for individuals and the outcomes dataset with end times.
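As a rough sketch of the 4 Hz resampling step, assuming the readings carry a timestamp column: the column names `Time` and `ACC1` are hypothetical. The mean aggregation and linear gap filling are also assumptions, since the notebook may use a different method.

```python
import pandas as pd

def resample_to_4hz(df: pd.DataFrame, time_col: str = "Time") -> pd.DataFrame:
    """Resample a sensor frame onto a regular 4 Hz (250 ms) grid using the mean."""
    indexed = df.set_index(pd.to_datetime(df[time_col])).drop(columns=[time_col])
    # Average every 250 ms bin, then linearly fill any bins that had no samples
    return indexed.resample("250ms").mean().interpolate(method="linear")

# Hypothetical 8 Hz signal lasting one second
raw = pd.DataFrame({
    "Time": pd.date_range("2020-01-01", periods=8, freq="125ms"),
    "ACC1": [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0],
})
resampled = resample_to_4hz(raw)
print(resampled)
```

Each pair of 8 Hz samples falls into one 250 ms bin, so the one-second example collapses to four rows.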
- Explore Class Distribution and Compare Classes by Sensor Distribution
- Analyze Outliers by Time and by Activity
- Cluster on Sensor Summary Statistics
- Plot 3D Animated Sensor Plots
This notebook contains the exploratory data analysis for the Human Activity Recognition team's data. It uses the aggregated .csv file with activity labels. Check out our analysis of the data, including target class distribution, outlier detection, sensor distribution, activity comparison, and 3D animation.
- Remove rows of data where the Apple Watch sensors were used instead of the Empatica E4 sensor, to allow for a consistent sampling rate among the data without the need for extra interpolation.
- Remove Subject ID 19-028 from the data because the activity key stated that their rounds were flipped, but flipping them resulted in an odd number of rows of data
- Create a rolling analysis of our time series data, which is used to capture our sensor instability over time
- Compute parameter estimates over a rolling window of a fixed size through the sample
- Add 'super rows' to the data, so that each row also includes all sensor readings and the activity, Subject ID, and round labels for the next time point, in an attempt to bypass data augmentation
- Engineer features of our rolled data that will be used in the Random Forest models & Deep Learning Models.
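The 'super rows' step in the list above can be sketched with a grouped shift. This is only an illustration under assumptions: the function name `add_super_rows`, the `Round`, `ACC1`, and `Activity` column names, and the `_next` suffix are hypothetical. It also pairs rows only within the same subject and round, which may differ from the notebook.

```python
import pandas as pd

def add_super_rows(df, group_cols=("Subject_ID", "Round"), suffix="_next"):
    """Append the next time point's values to every row, within each subject and round."""
    group_cols = list(group_cols)
    value_cols = [c for c in df.columns if c not in group_cols]
    # shift(-1) pulls the following row up; grouping stops it crossing subjects or rounds
    nxt = df.groupby(group_cols)[value_cols].shift(-1).add_suffix(suffix)
    # The last row of each group has no successor, so it is dropped
    return pd.concat([df, nxt], axis=1).dropna(subset=list(nxt.columns))

# Hypothetical readings for two subjects
data = pd.DataFrame({
    "Subject_ID": ["A", "A", "A", "B", "B"],
    "Round": [1, 1, 1, 1, 1],
    "ACC1": [1.0, 2.0, 3.0, 10.0, 20.0],
    "Activity": ["Rest", "Rest", "Walk", "Rest", "Walk"],
})
super_df = add_super_rows(data)
print(super_df)
```

Only one row per group is lost this way, whereas windowing or aggregating shrinks the data much more.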
This notebook removes rows of data where the Apple Watch sensors were used instead of the Empatica E4 sensor, to allow for a consistent sampling rate among the data without the need for extra interpolation. Using the aggregated file, it outputs:
- .csv without Apple Watch Data: label rounds 1 & 2 separated
- .csv without Apple Watch Data: label rounds 1 & 2 combined
- 31_roll_timepoints.ipynb contains our procedure for creating a rolling analysis of our time series data, which is used to capture our sensor instability over time. A common technique to assess the constancy of a model’s parameters is to compute parameter estimates over a rolling window of a fixed size through the sample. Because the sensor parameters vary due to some variability in time sampling, the rolling estimates should capture this instability.
- 32_super.ipynb provides code to add 'super rows' to your data. Each row will also include all sensor readings and the activity, Subject ID, and round labels for the next time point. The goal is to include other time points as features for each row, without relying on other data augmentation methods that reduce the total amount of data, such as windowing, rolling, or aggregating.
- 33_feature_engineering.ipynb outlines the process for engineering features of our data that will be used in the Random Forest models. The five types of features we created are window mean, standard deviation, skew, minimum, and maximum.
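The five window features named above (mean, standard deviation, skew, minimum, and maximum) can be computed over a fixed-size rolling window with pandas. The sketch below is illustrative only: the function name `rolling_features`, the `HR` signal, and the window length of 4 samples are assumptions, not values taken from the notebooks.

```python
import pandas as pd

def rolling_features(signal: pd.Series, window: int) -> pd.DataFrame:
    """Compute mean, std, skew, min and max over a fixed-size rolling window."""
    roll = signal.rolling(window=window)
    name = signal.name
    feats = pd.DataFrame({
        f"{name}_mean": roll.mean(),
        f"{name}_std": roll.std(),
        f"{name}_skew": roll.skew(),
        f"{name}_min": roll.min(),
        f"{name}_max": roll.max(),
    })
    # Rows before the first full window contain NaN and are dropped
    return feats.dropna()

# Hypothetical signal; at 4 Hz a window of 4 samples would span one second
hr = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0], name="HR")
feats = rolling_features(hr, window=4)
print(feats)
```

Because the window advances one sample at a time, consecutive rows share most of their data. A larger step between windows would reduce that overlap at the cost of fewer rows.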
We have been working entirely with the Duke_Data so far. It comes from this study: https://www.nature.com/articles/s41746-020-0226-6. We also used the PAMAP2 Dataset to further test our models.
- ANN on raw sensor values, LOOCV validation
- ANN with engineered window features (e.g., summary stats on 10s windows no overlap), LOOCV validation
- ANN on engineered window features with multinomial voting classification mechanism, LOOCV validation
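One way to read the voting mechanism in the last item is as a majority vote over window-level predictions. That reading is an assumption, as are the function name `majority_vote` and the example labels below. The sketch shows only the vote itself, not the network that produces the predictions.

```python
from collections import Counter

def majority_vote(window_predictions):
    """Return the most frequent label; ties go to the label seen first."""
    counts = Counter(window_predictions)
    return counts.most_common(1)[0][0]

# Hypothetical per-window predictions for two activity segments
segments = {
    "segment_1": ["Walk", "Rest", "Walk", "Walk"],
    "segment_2": ["Rest", "Rest", "Walk"],
}
voted = {name: majority_vote(preds) for name, preds in segments.items()}
print(voted)
```

Voting smooths out isolated misclassified windows, provided most windows in a segment are predicted correctly.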
Provides just the data preparation needed to input data into any neural network, split by data source
Contains ANN models for both the STEP and PAMAP2 data.
Contains a multitude of deep learning models, with a specific notebook for each data source. Also has model files for each model run, including models with only mechanical, only physiological, and all sensors. These models use the STEP dataset, which is also referred to as the Duke dataset.
This artificial neural network has 2 hidden fully connected layers with a dropout layer between them to prevent overfitting. The final fully connected layer classifies each timepoint fed into the model into 4 classes. This model uses leave-one-person-out validation. With this model we are able to compare the difference between including only mechanical sensors, only physiological sensors, or both.
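A minimal Keras sketch of the architecture described above is shown below. The layer widths, dropout rate, activation functions, optimizer, and input feature count are all assumptions chosen for illustration; only the overall shape (two hidden dense layers, a dropout layer between them, and a 4-class output) comes from the description.

```python
from tensorflow import keras

def build_ann(n_features: int, n_classes: int = 4) -> keras.Model:
    """Two hidden dense layers with dropout in between, then a softmax classifier."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),   # hidden layer 1 (width assumed)
        keras.layers.Dropout(0.5),                   # dropout rate assumed
        keras.layers.Dense(32, activation="relu"),   # hidden layer 2 (width assumed)
        keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # expects integer class labels
        metrics=["accuracy"],
    )
    return model

# Hypothetical input size: one value per sensor channel at a single timepoint
model = build_ann(n_features=6)
model.summary()
```

Comparing mechanical-only, physiological-only, and combined inputs would then amount to changing which columns are fed in, and `n_features` with them.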
This model uses engineered features from 20-second windows with a 10-second overlap. The engineered features are the min, max, mean, and standard deviation of our sensor values during these windows of time. The model consists of 6 hidden fully connected layers and a dropout layer, and uses leave-one-person-out validation. With this model we are able to compare the difference between including only mechanical sensors, only physiological sensors, or both.
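Leave-one-person-out validation holds out every row from one participant per fold and trains on the rest. The generator below is a small sketch of that splitting logic; the function name and the subject labels are hypothetical.

```python
import numpy as np

def leave_one_person_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    subject_ids = np.asarray(subject_ids)
    for held_out in np.unique(subject_ids):
        test_mask = subject_ids == held_out
        # Train on every row that does not belong to the held-out subject
        yield held_out, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical subject label for each row of a dataset
ids = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(leave_one_person_out(ids))
for subject, train_idx, test_idx in folds:
    print(subject, train_idx, test_idx)
```

Splitting by person rather than by row keeps a participant's data from appearing in both the training and test sets, which would otherwise inflate the reported accuracy.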
- Run Functions for Random Forest feature importances on feature engineered data
- Develop Random Forest on Individual Data
- Develop Random Forest on Feature Engineered Data
- Compare results w/ ACC-Only, PHYS-Only, ACC+PHYS
This notebook contains functions for the Random Forest w/ LOOCV. It also contains code for evaluating feature importances from the random forest. These functions are modular and can be adapted to your own classification needs. The model is built on the rolling average data with engineered features and helped us calculate feature importances, compare ACC+PHYS vs ACC-Only models, and incorporate LOOCV.
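A Random Forest with leave-one-subject-out validation and averaged feature importances might be put together as in the sketch below, using scikit-learn. It is not the notebook's implementation: the function name `random_forest_lopo`, the hyperparameters, the weighted F1 averaging, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import LeaveOneGroupOut

def random_forest_lopo(X, y, groups, n_estimators=50, seed=0):
    """Fit one forest per held-out subject; report fold metrics and mean importances."""
    accs, f1s, importances = [], [], []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], pred))
        f1s.append(f1_score(y[test_idx], pred, average="weighted", zero_division=0))
        importances.append(model.feature_importances_)
    return {
        "n_folds": len(accs),
        "accuracy_mean": float(np.mean(accs)),
        "accuracy_sd": float(np.std(accs)),
        "f1_mean": float(np.mean(f1s)),
        "f1_sd": float(np.std(f1s)),
        "feature_importances": np.mean(importances, axis=0),
    }

# Synthetic stand-in data: 6 subjects x 20 rows, only feature 0 carries the label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
groups_demo = np.repeat(np.arange(6), 20)
results = random_forest_lopo(X_demo, y_demo, groups_demo)
print(results)
```

Restricting the columns of `X` to accelerometry or physiological features and re-running the function is one way to compare the ACC-Only, PHYS-Only, and ACC+PHYS settings.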
This notebook is composed of a random forest classification model to evaluate a general accuracy level of traditional ML methods in classifying our HAR data based on activity. The model is built on the individual raw signal data.
This notebook is composed of a random forest classification model to evaluate a general accuracy level of traditional ML methods in classifying our HAR data based on activity. The model is built on sliding window average data with engineered features.
These are the overall best results for our models using balanced classes. Higher accuracy and F1 scores were actually achieved for ANN_WFE when using our imbalanced dataset. This suggests that our models' metrics could be improved with more data, as currently only 505 rows of data per class are fed into ANN_WFE during training.
Model Name | Data Input Type | Data Source | Accuracy | F1 Score |
---|---|---|---|---|
ANN_WFE | Feature Engineered Windows | STEP | 0.81 | 0.80 |
Random Forest | Feature Engineered Windows | STEP | 0.84 | 0.81 |
ANN | Individual Timepoints | STEP | 0.64 | 0.61 |
ANN | Individual Timepoints | PAMAP2 | 0.96 | 0.96 |
Accuracy Comparisons
The figure above displays the overall F1 scores for our Random Forest, ANN, and ANN with feature engineered windows. The highest F1 score is achieved by the ANN_WFE model using only accelerometry data.
Confusion Matrix Results: ANN_WFE
Accuracy Results: ANN_WFE
Sensors | Accuracy | Accuracy SD | F1 | F1 SD |
---|---|---|---|---|
All Sensors | 0.81 | 0.12 | 0.79 | 0.14 |
Accelerometry Only | 0.84 | 0.11 | 0.83 | 0.12 |
Physiological Only | 0.60 | 0.16 | 0.57 | 0.18 |
Based on our ANN_WFE model, adding physiological data does not appear to improve the model. The results shown in the table above are based on our leave-one-person-out validation. Interestingly, physiological sensors alone are unable to classify the class "activity" well, whereas classification for the class "Type" is strong regardless of the kind of sensors used.
Confusion Matrix Results: Random Forest