Current status and prospects of automatic sleep stages scoring: Review

The scoring of sleep stages is one of the essential tasks in sleep analysis. Since a manual procedure requires considerable human and financial resources, and incorporates some subjectivity, an automated approach could result in several advantages. There have been many developments in this area, and in order to provide a comprehensive overview, it is essential to review relevant recent works and summarise the characteristics of the approaches, which is the main aim of this article. To achieve it, we examined articles published between 2018 and 2022 that dealt with the automated scoring of sleep stages. In the final selection for in-depth analysis, 125 articles were included after reviewing a total of 515 publications. The results revealed that automatic scoring demonstrates good quality (with Cohen's kappa up to over 0.80 and accuracy up to over 90%) in analysing EEG/EEG + EOG + EMG signals. At the same time, it should be noted that there has been no breakthrough in the quality of results using these signals in recent years. Systems involving other signals that could potentially be acquired more conveniently for the user (e.g. respiratory, cardiac or movement signals) remain more challenging in the implementation with a high level of reliability but have considerable innovation capability. In general, automatic sleep stage scoring has excellent potential to assist medical professionals while providing an objective assessment.


Introduction
Sleep plays a vital role in human life. Its influence on physical and psychological health and human well-being is enormous, which has already been demonstrated and underlined in numerous publications [1][2][3][4]. Accordingly, it is crucial to be able to analyse sleep to have the possibility to intervene in time in case of disorders, to eliminate them if possible, or at least to reduce their negative influence on health.
As early as the 1950s, William Dement observed that the signal of the electroencephalogram (EEG), the frequency of eye movements and the frequency of body movements were subject to regular cyclical fluctuations during the night [5].
Further research into this phenomenon led over time to the recognition of typical sleep patterns and ultimately to the creation of the concept of sleep stages.
In 1968, the first standardised categorisation of sleep into sleep stages was conducted using the method of Rechtschaffen and Kales (R&K) [6]. It divided sleep into 30-s intervals-"epochs"-and assigned one of the following sleep stages to each epoch: • Stage W-Wakefulness/Wakefulness • Stage 1 (S1) • Stage 2 (S2) • Stage 3 (S3) • Stage 4 (S4) • Stage REM-Rapid eye movement Furthermore, some epochs could be labelled "movement time" if movement prevents accurate identification of sleep stages.
Over time, as new evidence about sleep emerged, there was a need to establish a new guide to terminology, recording methodology and scoring rules for sleep. This was done 1 3 in 2007 with the release of a new guideline by the American Academy of Sleep Medicine (AASM) [7], which has been continuously updated since its release. The most recent version is 2.6, published in 2020 [8]. It proposed, among other things, a new classification of sleep stages: • Stage W-wakefulness • Stage N1/NREM1 (formerly S1) • Stage N2/NREM2 (formerly S2) • Stage N3/NREM3 (formerly S3 + S4) • Stage R-Rapid eye movement (REM) In fact, inter-standard comparison of sleep stage scoring has been the focus of several scientific publications, among them [9,10].
In sleep analysis practice, sleep stages N1-N3 are, in some cases, combined into one NREM stage. Another subdivision found in scientific papers is Wake/Light Sleep (N1 + N2)/ Deep Sleep (N3)/REM sleep (R). Table 1 summarises these sleep stage classification manners.
The most widely utilised method for assessing sleep behaviour, which has been in use for many years, is polysomnography (PSG) [11]. This approach involves measuring several signals, which are then evaluated: • Electroencephalography (EEG) records brain activity. • Electrocardiography (ECG) is a method that reflects the electrical activity of the heart over time. • Electromyography (EMG) is used to record muscle activity. • Electrooculography (EOG), on the other hand, uses electrodes to detect and measure the potential between the human eye's back and front to record eye movements. • The oxygen saturation of the blood is measured with pulse oximetry. • Often, both respiratory flow and respiratory effort are measured. • Moreover, other signals can be recorded, such as the person's position or a video recording for a detailed analysis.
The recorded signals are then stored and finally analysed manually by trained sleep experts, scoring the stages of sleep. Mainly EEG, EOG, and EMG signals provide the required input.
This manual evaluation of the recordings enables accurate scoring of the sleep stages. However, it also entails a high expenditure of time and financial resources to complete this task [12]. It should also be noted that despite following the established scoring guidelines, manual analysis introduces a certain level of subjectivity, also known as interscorer/ interrater reliability/variability, which has been described in several scientific publications [13][14][15]. Thus, the results of the evaluation of the same sleep recording by different experts may vary. As the meta-analysis of interrater reliability conducted in [16] has demonstrated, Cohen's kappa for manual, overall sleep scoring reaches the value of 0.76.
To summarise the above situation, automatic sleep scoring could result in several benefits. Among other things, it could reduce the financial and human resources needed, thereby supporting the work of sleep physicians and making them available to provide treatment to a larger number of individuals requiring it. A high number of scientific works in the field of automatic sleep stage detection indicate that new developments are constantly being made in this area, and only a thorough analysis can provide a comprehensive overview of the status quo.
The conducted analysis of current research has indicated that the subject of a state-of-the-art review in automatic sleep stage identification has been addressed several times, highlighting the importance of this topic. However, studies in recent years have focused on specific aspects of the problem. For example, in [17], only the algorithms that work with the EEG signal were selected. In [18], consumer sleep technologies (CSTs) were investigated to analyse sleep in combination with artificial intelligence. [19] examined a more extended period of 19 years but with a relatively small number of selected articles (55 in total) for this larger time frame. EEG signals in combination with deep learning algorithms applied directly to raw signals or spectrograms, were the subject of investigation in [20], resulting in a more refined but smaller sample of 14 studies. Deep learning techniques for sleep phase detection were investigated in [21], with 36 studies selected from the period between 2010 and 2020. The number of studies reviewed in [22] is also relatively small, below 30, which is in part caused by the research question in which EEG signal-based algorithms were investigated. Also, if one looks further into the past, it can be seen that the matter of automatic detection of sleep phases was addressed considerably earlier, and was, for example, already investigated in a review article in 2012 [23]. In general, the principles of automatic sleep scoring were even reported in 2000 [24], indicating the topic's importance and demonstrating a long history of research in this scientific area. Due to the major importance of the topic, knowing the enormous number of recent developments in the field of automatic sleep scoring and considering the gaps in existing review articles in recent years, the decision was made to conduct a detailed investigation of state-of-the-art and to report the results in a review article. The aim is to prepare a comprehensive overview of the state-of-the-art in the field of automatic sleep stage identification and provide researchers with a consolidated summary of the current developments, which should increase the efficiency of their scientific research.
Several approaches to literature reviews exist. One of the classifications proposes the following groups: systematic, semi-systematic (or narrative) and integrative, according to [25]. In our case, it seems appropriate to use the semi-systematic approach, as it facilitates more flexibility and is commonly used for overview publications. In addition to aiming to provide an overview of a topic, a semi-systematic review often examines how research in a particular area has developed over time [25]. This type of review can be a highly effective way to cover more areas and broader topics than a systematic review can address, especially when multiple subject domains (such as computer science and medicine) are in the field of interest [26].
Three main research questions are being addressed in our article: RQ1. Which level of quality is achieved at the current state of development of automatic sleep scoring approaches? RQ2. How has the focus on particular signal analysis methods evolved over time? RQ3. How has the choice of signals used for analysis varied over time? When a research group has published several consecutive articles that represent a further development of particular methods, only the most recent article has been included in this review. If the reported approaches represent significantly different procedures or use diverse input signals, all of them are included in the selection.

Eligibility criteria
A more detailed explanation of the inclusion criterion "Automatic approach for sleep scoring is described in a manuscript" appears to be reasonable. As one of the important aspects for the developed sleep stage identification system is its applicability in practice and usability, only those publications were included in the review that allow fully automatic scoring, i.e. the systems that require manual intervention in use, such as partial manual scoring, were not considered in this work.
In 2007, the AASM guidelines [7] replaced the Rechtschaffen & Kales rules [6] as the new standard for sleep assessment. Therefore, in the selection of articles, the publications were included in which the sleep phase detection of the reference measurement is done according to AASM guidelines, as only the manuscripts since 2018 were considered. However, in order to be able to evaluate the methods that were using the older databases already available, the decision was made also to consider the articles that were using the existing recordings according to R&K rules, but adapting them to the sleep stages defined in AASM.

Information collection strategy
After analysing the requirements for articles to be included in the review and considering the interdisciplinary nature of the topic, the following selection of databases was made to ensure the comprehensive collection of information: The following query (with necessary adaptions for specific databases) was determined to search for articles that match the defined criteria: (((automatic OR automated OR automatically) AND ("sleep stage" OR "sleep stages" OR "sleep phase" OR "sleep phases")) AND (scoring OR classification OR determination OR identification OR recognition OR detection)). The search using this defined query took place on the articles' titles, abstracts and keywords in databases. In addition, a restriction of years (2018-2022) was used, either directly in the query, if the database provided this option, or through the later application of an appropriate filter.

Data refining procedure
The next step after applying the designed query in the selected databases was to refine the collected data. This involved several steps. First, duplicates were removed, and all non-journal articles were excluded from the list of articles. This was followed by three phases of initial screening: title, abstract and diagonal reading. During this process, the eligibility of the manuscripts was checked according to the inclusion/exclusion criteria, and if they did not meet them, they were eliminated. At last, the final screening was applied, where the full reading of the papers was performed. In this step, the last articles that did not meet the established criteria were removed to generate the final list of manuscripts for further analysis.

Synthesis methods
A set of Python packages (Pandas, NumPy, Seaborn, Matplotlib and SciPy) were used for the statistical analysis carried out in this scientific work. Statistical data visualisations were presented to aid the understanding of the results. Where possible, statistics were proposed to explain relevant information gathered during the review of the articles.

Results
The conducted in-depth analysis of the current research outcomes over the last five years has led us to create a comprehensive overview, which is presented in the following. It is important to point out that not all descriptions of the implementation and evaluation approaches provide sufficient information to estimate the results' quality and correctness. Nevertheless, this review has included these articles according to inclusion/exclusion criteria.

Study selection
The article selection process carried out throughout this review of scientific papers dealing with the classification of sleep stages is shown in Fig. 1. The number of publications included and excluded during each review stage is also shown.

Which level of quality is achieved at the current state of development of automatic sleep scoring approaches?
A summary table was prepared to provide a comprehensive overview of the available approaches for automatic sleep stage scoring, including all articles selected according to inclusion criteria and after refining the set according to exclusion criteria. The results can be seen in Table 2, attached to this manuscript. Due to its comprehensive content, inserting it directly into the text would significantly decrease the manuscript's readability, which should be avoided.
The best accuracies reported in the reviewed manuscripts achieve over 90% when analysing EEG signal, being an excellent result. As reported in several research papers, the best conducted Cohen's kappa values are over 0.80, and F1 values achieve up to 85-90%.
The differences in performance of some algorithms can vary significantly depending on the composition of the test group/dataset. In particular, if the training was done on healthy subjects only, testing on a group with sleep disorders (e.g., sleep apnoea) would be associated with a high probability of a reduction in sleep stage detection accuracy, as indicated in, e.g. [27]. Therefore, the provided table includes information on a targeted population of the developed algorithms.
In the column "Algorithm/Method" in the table, only the primary used approach for sleep scoring is provided. Feature extraction/filtering procedures, being another relevant characteristic of the method, are not described due to the large variability and individuality of utilised approaches which complicates their classification and would excessively enlarge the table, decreasing its clarity.

How has the focus on particular signal analysis methods evolved over time?
Trends in research are changing over time, among other things, as new knowledge is gained and new priorities may be set. This raises the question of how the choice of methods for sleep stage estimation has evolved over the last few years. Figure 2 depicts the list of publications that met the inclusion/exclusion criteria and were selected for statistical analysis separately for the years 2018-2022, as well as

How has the choice of signals used for analysis varied over time?
The selection of signals for analysis with the subsequent scoring of sleep stages and their variation over time also provides interest for the research. Therefore, we have graphically presented the selection of signals by year in Fig. 3 to facilitate an overview and to illustrate possible trends.
With the aim of providing more insight into the selection of signals used to score sleep stages in the articles included in the review, the detailed representation per input signal was generated and can be seen in Fig. 4.
As shown in Fig. 4, the EEG signal is the most widely used as opposed to other signal sets, with 69 scientific publications using standalone EEG and being a part of numerous other combinations with other signals.

Discussion
The findings of the conducted review represent, to the best of our knowledge, the most comprehensive research overview in the field of automatic sleep stage scoring in recent years. This allows a thorough analysis of current developments in this domain and can serve as a basis for further research.
Most of the information for the analysis can be found in Table 2. For example, it can be seen that the majority of publications have considered the detection of five sleep stages, which corresponds to the AASM standard. Only occasionally did the methods target fewer sleep stages.
The most common method of validation, as indicated in Table 2, is cross-validation. More specifically, tenfold and 20-fold cross-validation are the most commonly used techniques in the reviewed articles. In addition, other x-fold cross-validation and leave-one-subject-out approaches can be found in publications. Direct strict separation of training and validation/test datasets is also reported in several articles. It is important to note that using different epochs of the same recording for both training and evaluation would affect the algorithm's performance in terms of increasing accuracy. However, using epochs from the same recording for both training and evaluation does not allow us to assess whether the algorithm used would perform similarly if the subjects for the training and test datasets were strictly separated. Therefore, we have tried to extract this relevant information from the reviewed articles. Unfortunately, far from all articles provided this important detail. For those articles where this information was provided, or where the method of validation (e.g. leave-one-subject-out) allowed us to obtain the required data, we can say that in the majority of cases there was a strict separation of subjects into training and test datasets, and therefore the epochs of the same recording were mostly not used for both training and evaluation.
We can highlight some important points by looking at the quality parameters reported in the research papers analysed. By far, the most widely used signal for sleep stage scoring is the EEG, and the combination of EEG, EOG and EMG is the second most frequently used, as can be recognised from Table 2 and Fig. 3. Together, these two sets account for more than two-thirds of all publications. They also yield the best results in scoring-Cohen's kappa up to over 0.80 (substantial to almost perfect according to [28]) and accuracy up to over 90%, which is a very good performance, especially considering that even when evaluating the same recording by different scorers, a kappa of 0.71-0.81 is obtained, as studied in [16]. In general, the use of automated sleep phase detection methods could address the problem of interrater reliability when the same approach is used for analysis in different sleep laboratories (or even in one sleep laboratory instead of multiple experts). This could help to free up clinicians' time for other clinical tasks.
At the same time, it should be noted that there has been no breakthrough in the quality of the results when analysing these signals (EEG, EOG, EMG) over the last five years in terms of performance. This can be explained, among other things, by the high scores already achieved, as mentioned above. If one is looking for research topics with a high innovation potential, other alternatives should perhaps be considered. An example of this could be systems that work with other signals. These could be, for example, cardiac, respiratory and movement signals, as they have the potential to be recorded with more comfort for the user and possibly without contact [29,30]. The number of works using these alternative signals for sleep scoring is significantly lower, according to the research conducted, and there is noticeable room for improvement. Nevertheless, looking at Fig. 3, we can see that the number of publications using other than classical signal sets as input to the algorithm has increased over the years.
Another important issue in the development of sleep scoring systems is usability and suitability for practical use. Unfortunately, these aspects are not always explicitly considered in publications, although they are of great importance for the transfer of research into practical implementation. For example, it may be advantageous in terms of usability if the algorithms work fully automatically, i.e. do not require manual pre-classification or pre-processing. Another point that could improve the usability of the systems in a practical application is the determination of the measurement uncertainty during sleep stage detection. This would make it possible to see where the results of the automatic evaluation might need to be re-examined because the classification is not unambiguous, which has been recognised by the software, also known as confidence estimation or prediction certainities. Some work has already been done towards this functionality and usability improvement, e.g. [31][32][33][34]. In summary, further research aimed at improving the usability of systems in real-world environments can only be welcomed, as the potential is not yet exhausted, as this review has shown.
The explainability/interpretability of algorithms and their results is also a topic with great potential for further research, as it could at least partially help to solve the "black box" problem, especially in deep learning applications. This issue has already been addressed in some of the reviewed articles [32,[35][36][37][38], but there is still much room for further investigation to overcome the current challenges. In general, due to the fact that there is an impressive number of articles on sleep scoring, but at the same time a significantly smaller number of practical applications, the question of usability and practicality seems to be one of the relevant ones and should be further investigated.
Another area of research we observed in some of the manuscripts was the use of algorithms that adapt to the signal or recalibrate the features used [39], or incorporate the probabilities of transitions between sleep stages [40][41][42]. The inclusion of these additional steps in the algorithms to analyse the signal in an individualised way, while taking into account the whole sleep structure, may also lead to more accurate and higher quality sleep stage detection, and we encourage further research in this direction.
It is pretty common that more than one dataset is used for evaluation in the published articles, as can be seen in Table 2. In this case, however, there are often only two datasets used, which are additionally not always large. Though, the essential point to consider is that the algorithms developed should be evaluated on several, preferable large datasets to ensure a high-quality general scoring that is not only dataset specific, as there may be peculiarities in the signals (due to different equipment, pre-processing, etc.) in other series of recordings. This has been addressed in a number of the papers included in the selection but not in all cases. In order to provide a reliable solution that can be applied to a wide range of situations, it is crucial to consider the universality mentioned above and evaluate with multiple data sets from diverse sleep laboratories and recorded with different devices.
An analysis of Fig. 2 reveals that there is a tendency for the number of developments in the area of sleep stage classification to increase steadily, although there are minor fluctuations in this trend from year to year. Proportionally, deep learning methods stand out from other machine learning techniques in this topic area from year to year, taking an increasingly larger share. This may be partly caused by a general increase in the popularity of deep learning methods in research and partly by progress in developing new, more effective techniques that may produce higher-quality results.
While Fig. 3 gives a rather general picture regarding the signals used in literature, Fig. 4 provides more detail. Here it can be seen that, in addition to the two most frequently used signal sets already mentioned (EEG and EEG + EOG + EMG), a third combination occurs significantly more frequently than other signals-EEG + EOG, although significantly less frequently than the first two. In general, it can be said that a combination with the EEG signal occurs more often than possible combinations with other signals in the research. This can be explained by the fact that the EEG traditionally plays a key role in the detection of sleep stages and also provides crucial information. Apart from these three signals, only the ECG is used significantly more often than the remaining signals. Besides that, other signals were only sporadically included in the approaches reported in the reviewed articles. Another interesting finding from the information presented in Table 2 and Fig. 4 is that the majority of the required signals are recorded with electrodes (EEG, EOG, EMG, ECG), which requires direct contact with the body and is not necessarily the most comfortable for users. This shows that methods that allow nonobtrusive measurement of physiological signals are not yet widely used in sleep stage assessment.
Another important point is that while automated analysis can free up physicians' time for other clinical tasks, even with fully automated sleep evaluation there is still a need for the clinician to review the signals to avoid missing potentially important points. It is known that some specific pathologies can only be detected by looking for specific patterns in the signals that may currently be missed by automated scoring systems.

Conclusions
Analysing the review findings, in the last years, no breakthrough progress in the quality of sleep scoring approaches analysing EEG, EOG and EMG signals could be observed in terms of perfomance, although the most advanced of the current studies demonstrate a good level of quality in the detection of sleep stages. Nonetheless, a number of topics still remain with some gaps and have excellent innovation potential. These include, among others, the application of algorithms to categorise sleep stages that utilise less traditional signals, such as, for instance, breathing, body movement or heart signals. These signals could possibly be recorded in a contactless and more accessible way than the classical PSG approach and with more comfort for users/ patients. Therefore, if an acceptable level of measurement quality could be achieved, it could simplify the measurement process and promote the widespread use of sleep scoring systems. This, in turn, would favour the detection of sleep disorders at an early stage.
Another critical issue that still needs to be further addressed is the matter of interoperability of the developed algorithms with different datasets in order to have the capability to apply the designed sleep stage identification systems universally and not only to one specific dataset. This point was addressed in several of reviewed articles (e.g. [43,44]), where a significant number of datasets was used for the evaluation. Nevertheless, this topic requires further investigation and assessment.
The issues of explainability, usability, practicality and the use of adaptive algorithms are other areas of research that have been addressed in recent years, but are still not sufficiently explored and have significant potential.
In general, automatic sleep scoring has the potential to create an objective approach devoid of some level of subjectivity and, consequently, variance in scoring, present by manual scorers, considered in the literature as interrater reliability. This point was addressed in [45]. The analysis of current research in the area of sleep scoring presented in our article leads us to the conclusion that automated scoring of sleep stages could become a powerful tool supporting physicians in their work and helping to reduce sleep scoring ambiguity by decreasing the level of subjectivity in the analysis process. It is also noteworthy that the use of automated sleep scoring systems could lead to a saving of resources (both human and financial) that could be allocated to provide a more comprehensive medical service to the general public by medical professionals or to treat a larger number of patients with increased time capacity.

Appendix 1
See Table 2. Funding Open Access funding enabled and organized by Projekt DEAL. This research was partially funded by Carl Zeiss Foundation: Project "Non-invasive system for measuring parameters relevant to sleep quality" (Project Number: P2019-03-003).

Competing interests
The authors have no relevant financial or nonfinancial interests to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.