
Process and outcome for international reliability in sleep scoring



The aim was to evaluate inter-rater reliability in sleep stage scoring between two sleep laboratories, one in Berlin, Germany, and one in Beijing, China.


The data consisted of polysomnograms (PSGs) from 15 subjects recorded in a German sleep laboratory (7 patients with mild to moderate sleep apnea hypopnea syndrome (SAHS) and 8 healthy controls) and PSGs from 15 narcolepsy patients recorded in a Chinese sleep laboratory. Five experienced technologists (two Chinese and three German) with no common training scored the PSGs according to the 2007 AASM manual, with one exception: the EEG montage included only two leads (C3/A2 and C4/A1). Inter-scorer agreement was analyzed epoch by epoch using Cohen's κ, and agreement on quantitative sleep parameters was analyzed using intra-class correlation coefficients.
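The epoch-by-epoch comparison can be illustrated with a minimal Python sketch of Cohen's κ for two scorers' hypnograms. It assumes each hypnogram is a list of stage labels for consecutive 30-s epochs, encoded as "W", "N1", "N2", "N3", "R" (an encoding chosen here for illustration). Stage-specific κ is computed by collapsing both hypnograms to "this stage versus any other stage", a common convention that may differ from the authors' exact procedure, and `mean_cross_lab_kappa` is a hypothetical helper that averages κ over all Chinese-German scorer pairs. The verbal labels follow the Landis and Koch (1977) scale, on which the reported means of 0.54 to 0.58 fall in the "moderate" band.

```python
import math
from collections import Counter
from collections.abc import Hashable, Sequence
from itertools import product
from statistics import mean


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two epoch-by-epoch hypnograms of equal length."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("hypnograms must be non-empty and equally long")
    n = len(a)
    # Observed agreement: share of epochs both scorers labelled identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each scorer's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[s] * cb[s] for s in set(ca) | set(cb)) / (n * n)
    if p_e == 1.0:
        # Both scorers used one identical label throughout: kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1.0 - p_e)


def stage_kappa(a: Sequence[str], b: Sequence[str], stage: str) -> float:
    """Stage-specific kappa: `stage` versus any other stage (assumed convention)."""
    return cohen_kappa([x == stage for x in a], [y == stage for y in b])


def mean_cross_lab_kappa(lab_a, lab_b) -> float:
    """Hypothetical helper: mean kappa over every pair with one scorer per lab."""
    return mean(cohen_kappa(x, y) for x, y in product(lab_a, lab_b))


def landis_koch(k: float) -> str:
    """Verbal label for a kappa value on the Landis and Koch (1977) scale."""
    if k < 0.0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if k <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    # Toy hypnograms (8 epochs), not data from the study.
    a = ["W", "W", "N1", "N2", "N2", "N3", "R", "R"]
    b = ["W", "N1", "N1", "N2", "N3", "N3", "R", "W"]
    k = cohen_kappa(a, b)
    print(round(k, 3), landis_koch(k))  # 0.538 moderate
```

In the toy example the scorers agree on 5 of 8 epochs (observed agreement 0.625), but chance agreement of 0.1875 pulls κ down to 7/13, which is why κ rather than raw percent agreement is used for comparisons across subject groups with different stage distributions.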


Inter-laboratory epoch-by-epoch comparison between scorers from the two countries yielded moderate agreement, with mean κ values of 0.57 for controls, 0.58 for SAHS, and 0.54 for narcolepsy. Compared with controls, inter-scorer agreement was higher for wake and N3 in SAHS and for N1 and N3 in narcolepsy (p < 0.05). REM was the only stage with lower scoring agreement in both SAHS (κ 0.69 vs. 0.79, p = 0.034) and narcolepsy (κ 0.66 vs. 0.79, p = 0.022). In the inter-laboratory comparisons, the most common pairs of deviating scores were N1 and N2, N2 and N3, and N1 and wake. In narcolepsy, deviating scoring rates of 6.5 % for wake versus REM and 13.4 % for N1 versus REM indicated that inter-laboratory scoring confusion was about twice that in SAHS and controls. This was further confirmed by intra-class correlation coefficients, ICC(2,1), for quantitative sleep parameters, which showed lower REM sleep scoring agreement in narcolepsy than in controls (p < 0.05).
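The two remaining analyses, counting which stage pairs the scorers confused and computing ICC(2,1) for quantitative sleep parameters, can be sketched in Python as follows. `deviating_pairs` counts disagreeing epochs per unordered stage pair; the abstract does not say whether the reported percentages are relative to all epochs or only to disagreeing epochs, so the sketch returns raw counts and leaves normalization to the caller. `icc_2_1` implements the Shrout and Fleiss ICC(2,1) (two-way random effects, absolute agreement, single rater) for a complete subjects × raters matrix, for example minutes of REM sleep per subject as scored by each technologist. The function names, label encoding, and matrix layout are assumptions made for this illustration.

```python
from collections import Counter
from collections.abc import Sequence


def deviating_pairs(a: Sequence[str], b: Sequence[str]) -> Counter:
    """Count disagreeing epochs by unordered stage pair, e.g. ('N1', 'R')."""
    if len(a) != len(b):
        raise ValueError("hypnograms must be equally long")
    # Sorting makes (N1, R) and (R, N1) count as the same confusion.
    return Counter(tuple(sorted((x, y))) for x, y in zip(a, b) if x != y)


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1) after Shrout and Fleiss: two-way random effects,
    absolute agreement, single rater. Rows are subjects, columns raters."""
    n, k = len(ratings), len(ratings[0])
    if n < 2 or k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("need a complete matrix with >= 2 subjects and raters")
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ss_err = ss_total - ss_rows - ss_cols
    # Mean squares for subjects (rows), raters (columns), and residual.
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because ICC(2,1) measures absolute agreement, a systematic offset between raters lowers it: in the toy matrix `[[1, 2], [3, 4], [5, 6]]` the two raters rank the subjects identically, yet the function returns 8/9 rather than 1.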


Agreement in REM stage scoring is low for patients with narcolepsy and SAHS, indicating the need to study sleep stage scoring agreement for each specific sleep disorder. Intensive training is needed to improve scoring agreement in international multicenter studies.


Fig. 1




This work was supported by research grants of the International Science and Technology Cooperation Program of China (2014DFA31500), Beijing Municipal Science and Technology Commission (Z131107000413113), and the Sino-German Center for Research Promotion (GZ538), which also supported a research visit of XZ to Germany. JK acknowledges funding from the German Research Society (DFG, grant KA 1676/4).

Conflict of interest

The authors have indicated no financial conflicts of interest.

Author information


Corresponding authors

Correspondence to Thomas Penzel or Fang Han.

Additional information

Xiaozhe Zhang and Xiaosong Dong contributed equally to this work.


About this article


Cite this article

Zhang, X., Dong, X., Kantelhardt, J.W. et al. Process and outcome for international reliability in sleep scoring. Sleep Breath 19, 191–195 (2015).




Keywords

  • Sleep stage
  • Scoring
  • Narcolepsy