EOG-based auto-staging: less is more
This issue of Sleep and Breathing presents the validation results of a new automated wake/sleep staging method based on electrooculography (EOG) activity, developed by Jussi Virkkala from the Finnish Institute of Occupational Health. Classically, the automated method is compared to visual analysis on an epoch-by-epoch basis. It reaches a level of global concordance of 88% with a kappa of 0.57. In other words, of the 248,696 epochs of the validation dataset, 212,138 were scored as wake or sleep in the same way as by the human expert, and on 36,558 epochs the two scorings differ.
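As an illustration of what these two figures measure, here is a minimal sketch of epoch-by-epoch agreement and Cohen's kappa for two wake/sleep scorings. The function name `agreement_and_kappa` and the toy hypnograms are hypothetical and are not taken from the validated method or its dataset.

```python
from collections import Counter

def agreement_and_kappa(auto, visual):
    """Return (observed agreement, Cohen's kappa) for two equal-length label sequences."""
    if len(auto) != len(visual) or not auto:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(auto)
    # Observed agreement: share of epochs that received the same label.
    p_o = sum(a == v for a, v in zip(auto, visual)) / n
    # Chance agreement, from each scorer's marginal label frequencies.
    freq_a, freq_v = Counter(auto), Counter(visual)
    p_e = sum(freq_a[k] * freq_v[k] for k in freq_a.keys() | freq_v.keys()) / n**2
    # Kappa rescales agreement so that 0 means chance level and 1 means perfect.
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return p_o, kappa

# Hypothetical toy hypnograms (W = wake, S = sleep), not data from the study.
visual = list("WWSSSSSSWS")
auto = list("WSSSSSSSWW")
p_o, kappa = agreement_and_kappa(auto, visual)
print(f"agreement={p_o:.2f} kappa={kappa:.2f}")
```

On this toy example the raw agreement is 0.80 while kappa is about 0.52: when one class (here, sleep) dominates, a large share of the agreement is expected by chance, which is why a high concordance can coexist with a more moderate kappa.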
Automated analysis methods are continuously developing.
In terms of performance, two trends coexist in the literature. One evaluates inter-expert agreement (the percentage of epochs of a recording or a set of recordings for which two human scorers give exactly the same score), or intra-expert agreement (the percentage of epochs of a recording or a set of recordings for which a human scorer gives the same score when scoring the data twice within a given period of time) [2, 3, 4, 5, 6, 7, 8, 9]. The other evaluates the performance of automated methods [10, 11, 12] compared to visual analysis. A recent publication demonstrated that, on a dataset of 70 recordings, an automated method did not differ more than visual analysis from a reference scoring. In other words, automated analysis can reach an accuracy comparable to that of visual analysis. These levels of performance are new. Let us remember what automated analysis looked like only a few years ago. There was a vicious circle: automated analysis was disregarded, thus attracted little attention and effort, and was therefore doomed to be unsatisfactory, as it obviously takes talent and time to teach a machine to mimic the extremely complex operations that an experienced scorer performs when scoring sleep. The vicious circle seems to be turning virtuous as automated analysis becomes a topic of interest in which high-profile research teams get involved.
Now that the accuracy of this method is established, let us consider how it works. Whereas visual analysis is standardized, automated methods are very diverse: many alternative approaches to polysomnography (PSG) are being explored.
As stated in the AASM manual, conventional PSG, which is necessary even for the not-so-simple discrimination between wake and sleep, requires a minimum of seven channels. Here, the proposed montage is respiratory polygraphy + 3 sensors (2 EOG + ref). The EOG-based method validated in this paper belongs to a set of methods that all tend to reduce the number of sensors on the patient: actimetry, peripheral arterial tone and pulse transit time, motion analysis, EOG, and EEG only [18, 19, 20, 21, 22, 23, 24]. One question immediately arises: could the discrepancies observed between visual staging and the validated method be explained by this reduction in the number of signals? Probably not, as in this study the automated-visual agreement is nearing the upper bound of the visual-visual (inter-scorer) agreement reported in the literature cited above. Discrepancies with visual analysis, when they reach this level, can be considered an effect of the irreducible imprecision of scoring sleep, due both to the content itself (difficult patterns, transitional epochs) and to the intellectual process (interpretation of rules, limits of human perception, and fatigue).
Good validation results of new scoring methods are not an endpoint, but an invitation to imagine new applications. And in that perspective, as far as sensors are concerned, less is more.
Indeed, these alternative methods, which reduce the number of sensors, pave the way for new diagnostic approaches that are particularly relevant to the diagnosis of obstructive sleep apnea (OSA). OSA is highly and increasingly prevalent, has consequences, and can be treated with real results, even if the difficulties should not be underestimated. But when it comes to diagnosis, clinicians face a frustrating choice between home sleep testing (HST) with a relatively simple, cheap, and comfortable respiratory polygraphy, which also comes with questionable reliability, and a full PSG, which is very reliable but expensive, complex, and perceived as invasive by patients. This choice leads to pitfalls and suboptimal diagnostic schemes: a non-conclusive HST ends up either with an additional PSG, at high cost, or with an untreated patient when PSG cannot be performed for technical, operational, or financial reasons.
At the risk of stating the obvious, why does respiratory polygraphy (PV) lack reliability? Because it misses a crucial piece of information when it comes to diagnosing sleep-disordered breathing: is the patient awake or asleep? This allows false negative results, since long periods free of respiratory events can reflect either sound sleep or long periods of wake after sleep onset (WASO), which indicate disordered sleep. With methods able to identify wake/sleep periods, new enhanced PV protocols become possible, as a third way between HST and PSG. Automatically identifying wake periods could ease the tedious and imprecise operation of excluding wake portions of the traces based on body position, light, and sound.
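The effect of unidentified wake time on a respiratory event index can be shown with simple arithmetic. The sketch below uses hypothetical numbers and a hypothetical helper, `events_per_hour`; it assumes that all scored events occur during sleep and is not part of the validated method.

```python
def events_per_hour(n_events, minutes):
    """Respiratory events per hour over a given duration in minutes."""
    if minutes <= 0:
        raise ValueError("duration must be positive")
    return n_events * 60.0 / minutes

# Hypothetical night (illustrative numbers only, not from the study).
recording_min = 480  # total recording time
wake_min = 120       # wake time identified by a wake/sleep classifier
n_events = 90        # respiratory events scored on the traces

# Index over the whole recording, as when wake cannot be identified.
index_recording = events_per_hour(n_events, recording_min)
# Index over estimated sleep time only, once wake periods are excluded.
index_sleep = events_per_hour(n_events, recording_min - wake_min)
print(index_recording, index_sleep)
```

With these numbers the index rises from 11.25 to 15.0 events per hour once the two hours of wake are excluded from the denominator, which illustrates how long wake periods can dilute the index and contribute to false negatives.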
For these enhanced protocols to be viable, automated wake/sleep methods must:
- Demonstrate their ability to cope with real-life data, not only data from studies where conditions are carefully controlled: patients, variable montage, movement artifacts, and environmental artifacts as they can occur when PV is performed in ambulatory mode.
- Keep their rejection level low enough to remain a sensible option in everyday practice.
If they fulfill these conditions, innovative methods allowing higher reliability with a reduced number of sensors could help address the challenges of contemporary health systems: providing care at a larger scale to address the increasing prevalence of OSA and its consequences, while keeping resources at a reasonable and sustainable level.
Seen as a succession of performance records, the topic of validating automated diagnosis methods is repetitive, arid, and technical. But it is not only technical. Technique is a bad master but a good servant. It opens perspectives. Behind a good agreement level reached with fewer sensors, it is possible to see more patients provided with better care, better-performing and more efficient sleep labs, and more satisfaction for health professionals thanks to higher-quality tools.
Conflicts of interest
Both authors have ownership and directorship in Physip Company.
References
- 3. Danker-Hopfe H, Kunz D, Gruber G, Klösch G, Lorenzo JL, Himanen SL, Kemp B, Penzel T, Röschke J, Dorn H, Schlögl A, Trenker E, Dorffner G (2004) Interrater reliability between scorers from eight European sleep laboratories in subjects with different sleep disorders. J Sleep Res 13:63–69
- 4. Moser D, Anderer P, Gruber G, Parapatics S, Loretz E, Boeck M, Kloesch G, Heller E, Schmidt A, Danker-Hopfe H, Saletu B, Zeitlhofer J, Dorffner G (2009) Sleep classification according to AASM and Rechtschaffen & Kales: effects on sleep scoring parameters. Sleep 32(2):139–149
- 6. Penzel T, Zhang X, Fietze I (2013) Inter-scorer reliability between sleep centers can teach us what to improve in the scoring rules. J Clin Sleep Med 9(1):89–91
- 11. Anderer P, Gruber G, Parapatics S, Woertz M, Miazhynskaia T, Klosch G, Saletu B, Zeitlhofer J, Barbanoj MJ, Danker-Hopfe H, Himanen SL, Kemp B, Penzel T, Grozinger M, Kunz D, Rappelsberger P, Schlogl A, Dorffner G (2005) An E-health solution for automatic sleep classification according to Rechtschaffen and Kales: validation study of the Somnolyzer 24×7 utilizing the Siesta database. Neuropsychobiology 51(3):115–133