A Classification Algorithm Utilizing the Lempel-Ziv Complexity Score for Missing Data

  • Conference paper
Proceedings of the Second International Conference on Innovations in Computing Research (ICR’23)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 721)

Abstract

Informative data analysis relies heavily on the quality of the underlying data. Unfortunately, the data to be analyzed in our research often contain many missing values. While we have methods to mitigate missing data (listwise deletion, multiple imputation, etc.), these methods are appropriate only when data are missing at random. When data are missing not at random, using these methods leads to erroneous analyses. Determining whether a data set contains random or non-random missing data is an open challenge in our field. The authors propose an algorithm that categorizes missing data using the Lempel-Ziv (LZ) complexity score and analyze initial results from its use on both generated and publicly available data. The algorithm has several positive features: it is useful with data sets of all compositions (string, numerical, graphics, mixed), yields easily interpreted results, and can be used autonomously to determine the type of missingness (random versus non-random). The authors review related literature, explain the algorithm, and interpret initial results of its use with data from canonical Bayesian networks, United States census data, and data sets from the University of California, Irvine machine learning repository. Further uses in bioinformatics and pathways for future research are discussed.
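
To make the idea of an LZ complexity score over missing data concrete, the following is a minimal sketch, not the authors' algorithm, whose details are not given in this abstract. It assumes that missingness in a single column is encoded as a binary mask (1 = missing, 0 = present) and that the score is the phrase count of the Lempel-Ziv (1976) exhaustive parsing. The function names `missing_mask`, `lz76_complexity`, and `normalized_lz`, and the normalization by n / log2(n), are illustrative assumptions.

```python
import math
import random


def missing_mask(values):
    """Encode a column as a binary string: '1' where a value is missing (None)."""
    return "".join("1" if v is None else "0" for v in values)


def lz76_complexity(s):
    """Count the phrases in the Lempel-Ziv (1976) exhaustive parsing of s."""
    n = len(s)
    i = 0
    phrases = 0
    while i < n:
        length = 1
        # Grow the phrase while it can still be copied from earlier text
        # (a copy may start anywhere before position i and may overlap).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def normalized_lz(s):
    """Scale the phrase count by n / log2(n) (an assumed normalization)."""
    n = len(s)
    if n < 2:
        return float(phrases := lz76_complexity(s))
    return lz76_complexity(s) * math.log2(n) / n


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical columns: one with scattered gaps, one with a regular gap pattern.
    scattered = [None if rng.random() < 0.5 else 1.0 for _ in range(200)]
    patterned = [None if k % 2 else 1.0 for k in range(200)]
    for name, column in [("scattered", scattered), ("patterned", patterned)]:
        mask = missing_mask(column)
        print(name, lz76_complexity(mask), round(normalized_lz(mask), 3))
```

In this sketch a strictly alternating mask parses into very few phrases, while a mask with irregularly scattered gaps parses into many more. How such scores map to "random" versus "non-random" missingness, and any threshold used, is defined by the paper itself and is not reproduced here.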



Author information

Corresponding author

Correspondence to Valerie Sessions.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sessions, V., Grieves, J., Perrine, S. (2023). A Classification Algorithm Utilizing the Lempel-Ziv Complexity Score for Missing Data. In: Daimi, K., Al Sadoon, A. (eds) Proceedings of the Second International Conference on Innovations in Computing Research (ICR’23). Lecture Notes in Networks and Systems, vol 721. Springer, Cham. https://doi.org/10.1007/978-3-031-35308-6_1
