Annals of Biomedical Engineering, Volume 42, Issue 4, pp 871–884

Crowd-Sourced Annotation of ECG Signals Using Contextual Information

  • Tingting Zhu
  • Alistair E. W. Johnson
  • Joachim Behar
  • Gari D. Clifford


For medical applications, the ground truth is ascertained through manual labels by clinical experts. However, significant inter-observer variability and various human biases limit accuracy. A probabilistic framework addresses these issues by comparing aggregated human and automated labels to provide a reliable ground truth, with no prior knowledge of the individual performance. As an alternative to median or mean voting strategies, novel contextual features (signal quality and physiology) were introduced to allow the Probabilistic Label Aggregator (PLA) to weight an algorithm or human based on its performance. As a proof of concept, the PLA was applied to QT interval (pro-arrhythmic indicator) estimation from the electrocardiogram using labels from 20 humans and 48 algorithms crowd-sourced from the 2006 PhysioNet/Computing in Cardiology Challenge database. For automatic annotations, the root mean square error of the PLA was 13.97 ± 0.46 ms, significantly outperforming the best Challenge entry (16.36 ms) as well as mean and median voting strategies (17.67 ± 0.56 ms and 14.44 ± 0.52 ms respectively with p < 0.05). When selecting three annotators, the PLA improved the annotation accuracy over median aggregation by 10.7% for human annotators and 14.4% for automated algorithms. The PLA could therefore provide an improved “gold standard” for medical annotation tasks even when ground truth is not available.
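The mean and median voting baselines and the root mean square error (RMSE) metric mentioned above can be illustrated with a minimal sketch. This is not the PLA and not the authors' code: the data are synthetic, and the annotator biases, noise levels, and function names (`rmse`, `aggregate`) are hypothetical choices made only for illustration.

```python
import math
import random
import statistics


def rmse(estimates, reference):
    """Root mean square error between paired estimates and reference values."""
    squared = [(e - r) ** 2 for e, r in zip(estimates, reference)]
    return math.sqrt(sum(squared) / len(squared))


def aggregate(annotations, how="median"):
    """Combine annotations record by record.

    `annotations` holds one list of annotator values per record.
    """
    combine = statistics.median if how == "median" else statistics.fmean
    return [combine(record) for record in annotations]


# Synthetic illustration only: assumed "true" QT intervals in milliseconds.
rng = random.Random(0)
true_qt = [rng.uniform(350.0, 450.0) for _ in range(200)]

# Hypothetical annotators as (bias in ms, noise standard deviation in ms).
# The last one is deliberately poor, to show how each baseline copes.
annotators = [(0.0, 5.0), (2.0, 8.0), (-3.0, 10.0), (0.0, 6.0), (60.0, 40.0)]
annotations = [
    [qt + bias + rng.gauss(0.0, sd) for bias, sd in annotators]
    for qt in true_qt
]

rmse_mean = rmse(aggregate(annotations, "mean"), true_qt)
rmse_median = rmse(aggregate(annotations, "median"), true_qt)
print(f"mean voting RMSE:   {rmse_mean:.2f} ms")
print(f"median voting RMSE: {rmse_median:.2f} ms")
```

In this toy setting the median is less affected by the single poor annotator than the mean. Neither baseline weights annotators by their estimated performance, which is the gap the PLA is described as addressing.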


Probabilistic analysis, Crowd-sourcing, Unsupervised learning, ECG, QT estimation, Signal quality



TZ and AJ acknowledge the support of the RCUK Digital Economy Programme grant number EP/G036861/1 (Oxford Centre for Doctoral Training in Healthcare Innovation). TZ also acknowledges the support of China Mobile Research Institute. JB is supported by the UK EPSRC, the Balliol French Anderson Scholarship Fund, and MindChild Medical Inc. (North Andover, MA).



Copyright information

© Biomedical Engineering Society 2013

Authors and Affiliations

  • Tingting Zhu (1)
  • Alistair E. W. Johnson (1)
  • Joachim Behar (1)
  • Gari D. Clifford (1)

  1. Intelligent Patient Monitoring Group, Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, UK
