Kurtosis-based Projection Pursuit
======

Kurtosis-based projection pursuit analysis (PPA) is an exploratory data analysis algorithm originally developed by Siyuan Hou and Peter Wentzell in 2011, and it remains an active research area for the [Wentzell Research Group](http://groupwentzell.chemistry.dal.ca/) at Dalhousie University. Instead of using variance and distance metrics to explore high-dimensional data (PCA, HCA, etc.), PPA searches for interesting projections by optimizing kurtosis. This repository contains MATLAB and Python code to perform PPA, published literature involving the ongoing development of PPA, as well as some examples of how to apply PPA to uncover interesting projections in high-dimensional data. Below is a figure from our recent paper that I think demonstrates the value of searching for distributions with low kurtosis values.

<h1 align="center">
<img src="https://S-Driscoll.github.io/img/dist.png" alt="kurtosis" width="400"/>
</h1>


MATLAB and Python Functions
----------

* `projpursuit.m` is a MATLAB function to perform kurtosis-based projection pursuit. It has the following MATLAB function call format:
```matlab
[T, V, ppout] = projpursuit(X,varargin)
```
where X is the m (samples) x n (response variables) data matrix, T is the m x p matrix of scores for the samples in each of the p dimensions (default p = 2), V is the corresponding n x p matrix of projection vectors, and ppout (1 x p) contains the final kurtosis value for each dimension. A brief usage sketch is shown after this list.
* `projpursuit.py` is a Python function that is more or less a line-by-line port of the MATLAB function. `kurtosis.py` is a Python implementation of the MATLAB function `kurtosis.m`. A list of dependencies needed to run `projpursuit.py` is found in `dependencies.txt`. The Python PPA function has the following call format:
```python
projpursuit(X, **kwargs)
```
that returns T, V, and ppout.
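
As a minimal usage sketch (assuming a data matrix X is already in the workspace and leaving the optional arguments at their defaults), the two-dimensional projection can be computed and plotted in MATLAB as follows:

```matlab
% Minimal sketch: run PPA with default settings (p = 2) and plot the scores.
% Assumes X (m samples x n response variables) is already loaded.
[T, V, ppout] = projpursuit(X);
figure
scatter(T(:,1), T(:,2), 50, 'filled')
xlabel('PP Score 1')
ylabel('PP Score 2')
```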

Literature
----------
Examples
----------

### Wood Identification using Near-infrared (NIR) Spectroscopy and Univariate PPA
PPA was originally developed to search high-dimensional chemical data for informative projections. As such, this example employs a data set designed for the identification of different Brazilian wood species using NIR spectroscopy. The original paper and the data for this example can be found here: [Implications of measurement error structure on the visualization of multivariate chemical data: hazards and alternatives (2018)](https://www.nrcresearchpress.com/doi/abs/10.1139/cjc-2017-0730#.XkHstSMpCCo).

The NIR wood data set contains 4 replicate scans of the following wood samples: 26 of crabwood, 28 of cedar, 29 of curupixa, and 25 of mahogany. This results in 432 samples across 100 NIR channels. Let's apply PCA and PPA and plot the corresponding scores:

```matlab
Xm = X - mean(X);          % column mean-center the data
% PCA of the mean-centered data via singular value decomposition
[U, S, V] = svds(Xm, 2);   % rank-2 truncated SVD
T_PCA = U*S;               % PCA scores for the first two components
```
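
The script used in the paper is not reproduced here; as a rough sketch (assuming a hypothetical 432 x 1 class-label vector `species`, which is not defined above, and `gscatter` from the Statistics and Machine Learning Toolbox), the PPA scores could be obtained with `projpursuit.m` and plotted next to the PCA scores like this:

```matlab
% Hedged sketch, not the original script from the paper.
% Assumes species is a 432 x 1 vector of class labels
% (1 = crabwood, 2 = cedar, 3 = curupixa, 4 = mahogany).
[T_PPA, V_PPA, ppout] = projpursuit(X);     % default p = 2 dimensions

figure
subplot(1,2,1)
gscatter(T_PCA(:,1), T_PCA(:,2), species)   % PCA scores by wood species
xlabel('PCA Score 1')
ylabel('PCA Score 2')
subplot(1,2,2)
gscatter(T_PPA(:,1), T_PPA(:,2), species)   % PPA scores by wood species
xlabel('PP Score 1')
ylabel('PP Score 2')
```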


### Unsupervised Facial Recognition using Univariate PPA
Of course, the data being explored does not have to be chemical in nature; the PPA framework can be applied to any multivariate data set. In this example, we will apply it to a subset of [The AT&T face data set](https://git-disl.github.io/GTDLBench/datasets/att_face_dataset/). This subset consists of 4 classes (people), each with 10 different grayscale images of their face (112 x 92 pixels). All images were vectorized along the row direction (112 x 92 --> 1 x 10304), producing a 40 x 10304 data set X, which was then column mean-centered. Let's apply PCA and PPA and plot the first two score vectors:

```matlab
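% Hedged sketch (assumed, not the original script): compute the PCA and PPA scores of X and plot the PCA scores; class, T_PCA and T_PPA are assumed variable names.
Xm = X - mean(X);                      % column mean-center the 40 x 10304 data
[U, S, ~] = svds(Xm, 2);               % rank-2 SVD for PCA
T_PCA = U*S;                           % PCA scores
[T_PPA, V, ppout] = projpursuit(Xm);   % PPA scores (default p = 2)
class = repelem(1:4, 10);              % 4 people x 10 images each (assumed order)
figure
hold on
for i = 1:4
    scatter(T_PCA(class==i,1), T_PCA(class==i,2), 60, 'filled')
end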
set(gca,'linewidth',2,'FontSize',14)
xlabel('PCA Score 1')
ylabel('PCA Score 2')
% Plot the PPA scores
figure
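% Hedged sketch of the PPA score scatter plot (assumed, not verbatim)
hold on
for i = 1:4
    scatter(T_PPA(class==i,1), T_PPA(class==i,2), 60, 'filled')
end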
set(gca,'linewidth',2,'FontSize',14)
xlabel('PP Score 1')
ylabel('PP Score 2')
```
![PCA vs PPA](https://github.com/S-Driscoll/Projection-pursuit/blob/master/common/images/PCA_PPA.PNG)

While PCA reveals 3 clusters, corresponding to 2 distinct classes and 2 overlapping classes, PPA reveals 4 distinct clusters corresponding to the 4 different classes.

PPA can also be used to optimize the multivariate kurtosis and the recentered kurtosis. For more information on these options, the reader is encouraged to explore the literature linked previously in this repository.
