This page aims to illustrate the performance of our sleep scoring algorithm compared to professional manual scoring, which is considered the gold standard in sleep stage scoring. If you want to learn more about how we've defined our sleep scoring algorithm, please refer to the presentation video on our home page, or to our wiki.
In summary, we will look at our classifier's performance from three different points of view, as described below.

Finally, we will cover the limitations of our current sleep staging model and the further work needed to improve our results.
To train our model to predict sleep stages from EEG data, we used an open-source dataset, Physionet SleepEDF expanded, which is composed of 153 nights of sleep from 82 subjects of different ages and sexes. The labelled data was produced by well-trained technicians according to the 1968 Rechtschaffen and Kales manual, adapted to the EEG montage used. To compare and validate the prototyped models, we divided the data into a training set and a testing set, making sure that the two sets contained different subjects. Otherwise, our test metrics would have been biased, since we would already have trained on data coming from the same subject.
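The subject-wise split described above can be sketched as follows. This is a minimal, hypothetical illustration in pure Python (the subject IDs and the helper name are made up, not our actual pipeline): recordings are grouped by subject before splitting, so no subject ever appears in both sets.

```python
import random

def split_by_subject(recordings, test_ratio=0.2, seed=42):
    """Split (subject_id, recording) pairs so train/test share no subject."""
    subjects = sorted({subject for subject, _ in recordings})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    # Reserve at least one subject for the test set.
    n_test = max(1, int(len(subjects) * test_ratio))
    test_subjects = set(subjects[:n_test])
    train = [r for r in recordings if r[0] not in test_subjects]
    test = [r for r in recordings if r[0] in test_subjects]
    return train, test

# Toy example: 4 subjects, each with two recorded nights (as in SleepEDF).
recs = [(s, f"night_{s}_{n}") for s in ("S1", "S2", "S3", "S4") for n in (1, 2)]
train, test = split_by_subject(recs)
train_subjects = {s for s, _ in train}
test_subjects = {s for s, _ in test}
assert train_subjects.isdisjoint(test_subjects)  # no subject leakage
```

Splitting on subjects rather than on individual epochs is what prevents the leakage mentioned above: epochs from the same night are highly correlated, so a random epoch-level split would inflate the test metrics.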
To select the most accurate model, we looked at the Cohen's Kappa agreement score, as it is often used in the automatic sleep scoring literature. We then chose the model with the highest agreement score: a voting classifier composed of a K-Nearest-Neighbours and a Random Forest classifier. Unfortunately, that model couldn't be exported for use on the server, because we had included, inside the voting classifier, a pipeline that reduces the dimensionality of the input features. We therefore fell back on our second-best model, a Random Forest. It is the model whose performance we evaluate here, as it is the one currently used to classify sleep stages on our server.
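The model-selection setup can be sketched as below, assuming scikit-learn. This is illustrative only: the toy features, hyperparameters, and PCA component count are placeholders, not the values we actually used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy features/labels standing in for the real per-epoch EEG features.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 5, size=200)
X_test, y_test = rng.normal(size=(50, 10)), rng.integers(0, 5, size=50)

# The best model: a voting classifier combining KNN (behind a
# dimensionality-reduction pipeline, which is what blocked the export)
# and a Random Forest.
voting = VotingClassifier(
    estimators=[
        ("knn", Pipeline([("pca", PCA(n_components=5)),
                          ("knn", KNeighborsClassifier())])),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
voting.fit(X_train, y_train)

# The fallback actually deployed on the server: a plain Random Forest.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Candidate models are compared on Cohen's kappa, as in the literature.
kappa = cohen_kappa_score(y_test, rf.predict(X_test))
assert -1.0 <= kappa <= 1.0
```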
Let's first look at the classifier's performance on the selected test set, composed of subjects from the SleepEDF dataset. The resulting Cohen's Kappa agreement score is 0.741. Please note that the classes are imbalanced, which impacts the metrics displayed.
The test set, on which these metrics were calculated, is composed of randomly chosen subjects from different age groups (a 33-year-old female, a 54-year-old female, a 67-year-old female and an 88-year-old male), so that the obtained score is as representative as possible of our ability to classify sleep, regardless of age.
Although we obtained good results, this didn't quite validate that our classifier could accurately score OpenBCI data into sleep stages. After all, we had only validated on data coming from the same acquisition hardware as the data we trained on, which is not the case for the data submitted to the application. We therefore had to build our own polysomnographic dataset based on the hardware we use, namely an OpenBCI board.
As we had limited resources, we planned to create a small dataset of two manually scored nights of sleep, based on biosignals acquired with OpenBCI Cyton boards. Due to a technical problem that occurred while recording one of them, we ended up with only one scored night of sleep. The subject is one of our team members, William, who turned exactly 23 years old on the night of the recording 🥳. Afterwards, Alexandra, the electrophysiologist with whom this part of the project was carried out, manually scored the night of sleep based on the signals from the EEG channels, namely Fpz-Cz and Pz-Oz, the EOG channel and the EMG channel. We finally compared the scoring made by our classifier, which we recall was trained on the SleepEDF dataset, to the scoring made by Alexandra. We obtained a Cohen's Kappa agreement score of 0.8310! Let's see below how the scorings compare in a hypnogram.
The results are quite similar! Below are the other metrics describing the differences in the classification of William's night.
So we have been able to verify that our automatic sleep stage classifier can accurately score EEG data acquired from an OpenBCI Cyton. Of course, we only verified this on one night of sleep and a single subject. In the future, it would be interesting to test our classifier on subjects of different ages and sexes. Also, we did not test on OpenBCI Ganglion boards, and it would be really helpful to certify, in the same manner as we did for the Cyton, that the classification is also accurate on that board.
As stated in The American Academy of Sleep Medicine Inter-scorer Reliability Program: Sleep Stage Scoring, written by R. Rosenberg, manual sleep stage agreement averaged 82.6% when scorers based their analysis on the AASM scoring manual [R. Rosenberg, 2013]. Based on that statistic, we were curious to see how the classification made by Alexandra compares to the classification found in our dataset. As we've already mentioned, we were expecting some differences, as the two scorings are based on different manuals (see our further development section below).
We then randomly selected a night of sleep from our dataset and asked Alexandra to score it. The selected subject is a 74-year-old woman. You can see below the differences between the two classifications. The Cohen's Kappa agreement score is 0.6315.
The main differences can be seen at the N3 sleep stage level, as no epochs were tagged as N3 by our electrophysiologist. She explains, in an interview you can view here (in French only), that no epochs met the N3 sleep stage conditions. This may be explained by the fact that a different scoring manual was used.
And what if we looked at the automatic sleep classification of the same subject? We reused the same model description, trained on all of the dataset's recordings except the randomly selected one, and looked at the results. The Cohen's Kappa agreement score is 0.6709.
We can see that some of the differences between the automatic classification and SleepEDF's labels match Alexandra's. For instance, near the end of the night, both Alexandra and the automatic scoring model classified N2 instead of N1. On another note, the obtained Cohen's Kappa agreement score is lower than the one obtained for our test set above, which was 0.741. We can therefore reasonably assume that this night of sleep may simply be hard to score.
If we take a step back and look at the main differences between the automatic scoring and Alexandra's manual scoring, there is notably the manual used for classification. Indeed, the dataset we used to train our model contains sleep stage classifications based on the 1968 Rechtschaffen and Kales manual, whereas Alexandra, of course, used the most recent scoring guide, the American Academy of Sleep Medicine Manual for the Scoring of Sleep and Associated Events. In order to output AASM sleep stages instead of R&K sleep stages, we merged Sleep Stages 3 and 4 together. Further work could review the literature to see whether there is a better way to translate R&K sleep stages into AASM sleep stages. Even better would be to train on labels scored according to the latest AASM guidelines. We have considered more recent datasets, such as the Montreal Archive of Sleep Studies (MASS), but that would have required full accreditation from an ethics board.
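The R&K-to-AASM translation mentioned above amounts to a simple label mapping. A minimal sketch, assuming the annotation strings follow SleepEDF's convention (the exact strings in our pipeline may differ):

```python
# Map the 1968 R&K stages found in SleepEDF annotations onto the five
# AASM stages. R&K Stage 3 and Stage 4 are merged into AASM N3.
RK_TO_AASM = {
    "Sleep stage W": "W",
    "Sleep stage 1": "N1",
    "Sleep stage 2": "N2",
    "Sleep stage 3": "N3",  # merged...
    "Sleep stage 4": "N3",  # ...into a single AASM N3 stage
    "Sleep stage R": "REM",
}

def to_aasm(rk_labels):
    """Translate a sequence of R&K epoch labels into AASM labels."""
    return [RK_TO_AASM[label] for label in rk_labels]

assert to_aasm(["Sleep stage 3", "Sleep stage 4"]) == ["N3", "N3"]
```

This merge is the standard coarse translation; the literature review suggested above would check whether a finer-grained rule (e.g. re-scoring borderline Stage 3 epochs) better matches AASM N3.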
Furthermore, as we've already mentioned, we would also like to test our automatic sleep stage scoring on OpenBCI Ganglion board data, by comparing it to manual scoring. It could also be interesting to test on subjects of different age groups and sexes.
Finally, in terms of the machine learning models explored, we mostly looked at classic statistical models and did not exhaustively explore deep learning algorithms. We did look at the differences between manual feature extraction and representation learning through a CNN (we've written an article about the results we obtained, currently in French only). Since we were limited in both time and hardware, we only trained on a few subjects. Also, considering that the dependency of sleep stages over time is quite important, we could greatly improve our model by exploring recurrent neural networks (RNN) or long short-term memory (LSTM) networks.