Data Mining and Knowledge Discovery, Volume 29, Issue 4, pp 950–975

Probabilistic change detection and visualization methods for the assessment of temporal stability in biomedical data quality

  • Carlos Sáez
  • Pedro Pereira Rodrigues
  • João Gama
  • Montserrat Robles
  • Juan M. García-Gómez


Knowledge discovery on biomedical data can be based on on-line data-stream analyses or on retrospective, timestamped, off-line datasets. In both cases, changes over time in the processes that generate the data, or in their quality features, may hinder either the knowledge discovery process or the generalization of past knowledge. These problems can be seen as a lack of temporal stability in the data. This work establishes temporal stability as a data quality dimension and proposes new methods for its assessment based on a probabilistic framework. Concretely, methods are proposed for (1) monitoring changes, and (2) characterizing changes and trends and detecting temporal subgroups. First, a probabilistic change detection algorithm is proposed based on the Statistical Process Control of the posterior Beta distribution of the Jensen–Shannon distance, with a memoryless forgetting mechanism. This algorithm (PDF-SPC) classifies the degree of current change into three states: In-Control, Warning, and Out-of-Control. Second, a novel method is proposed to visualize and characterize the temporal changes of data based on the projection of a non-parametric information-geometric statistical manifold of time windows. This projection facilitates the exploration of temporal trends using the proposed IGT-plot and, by means of unsupervised learning methods, the discovery of conceptually related temporal subgroups. The methods are evaluated using real and simulated data based on the National Hospital Discharge Survey (NHDS) dataset.
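As an illustration of the quantity being monitored, the following minimal Python sketch computes the Jensen–Shannon distance between two discrete probability distributions (for example, normalized histograms of a reference window and a current window) and maps it to one of the three states named above. It is not the PDF-SPC algorithm: the abstract describes control limits based on the posterior Beta distribution of the distance, whereas this sketch substitutes fixed thresholds. The function names and the threshold values `warning` and `out_of_control` are hypothetical and chosen only for the example.

```python
import math


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (square root of the JS divergence).

    p and q are sequences of probabilities over the same bins, each
    summing to 1. Base-2 logarithms bound the result to [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")

    # Mixture distribution M = (P + Q) / 2.
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute 0.
        # b_i > 0 whenever a_i > 0 because b is the mixture.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(divergence, 0.0))


def classify_state(distance, warning=0.1, out_of_control=0.2):
    """Map a distance to a monitoring state.

    ASSUMPTION: the fixed thresholds are placeholders for illustration;
    they stand in for the Beta-posterior control limits of the paper.
    """
    if distance >= out_of_control:
        return "Out-of-Control"
    if distance >= warning:
        return "Warning"
    return "In-Control"


if __name__ == "__main__":
    reference = [0.25, 0.25, 0.25, 0.25]
    current = [0.10, 0.20, 0.30, 0.40]
    d = jensen_shannon_distance(reference, current)
    print(f"JS distance = {d:.4f}, state = {classify_state(d)}")
```

With base-2 logarithms the distance is 0 for identical distributions and 1 for distributions with disjoint support, which makes thresholds on it comparable across variables.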


Keywords: Data quality; Change detection; Information theory; Information geometry; Visual analytics; Biomedical data



The work by C. Sáez has been supported by an Erasmus Lifelong Learning Programme 2013 Grant. This work has been supported by IBIME's own funds. The authors thank Dr. Gregor Stiglic, from the University of Maribor, Slovenia, for his support on the NHDS data.

Supplementary material

Supplementary material 1: 10618_2014_378_MOESM1_ESM.pdf (PDF, 91 KB)



Copyright information

© The Author(s) 2014

Authors and Affiliations

  • Carlos Sáez (1, 2)
  • Pedro Pereira Rodrigues (2, 3)
  • João Gama (3)
  • Montserrat Robles (1)
  • Juan M. García-Gómez (1)

  1. Grupo de Informática Biomédica (IBIME), Instituto de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas (ITACA), Universitat Politècnica de València, Valencia, Spain
  2. Center for Health Technology and Services Research (CINTESIS), Faculdade de Medicina da Universidade do Porto, Porto, Portugal
  3. Laboratório de Inteligência Artificial e Apoio à Decisão (LIAAD)-INESC, Universidade do Porto, Porto, Portugal
