Abstract
Informative data analysis relies heavily on the quality of the underlying data. Unfortunately, the data to be analyzed in our research often contain many missing values. While we have methods to mitigate missing data (listwise deletion, multiple imputation, etc.), these methods are only appropriate when data are missing at random. When data are missing not at random, use of these methods leads to erroneous analyses. Determining whether a data set contains random or non-random missing data is an open challenge in our field. The authors propose an algorithm that categorizes missing data using the Lempel-Ziv (LZ) complexity score, and analyze initial results from its use on both generated and publicly available data. The algorithm has several positive features: it is useful with data sets of all compositions (string, numerical, graphics, mixed), yields easily interpreted results, and can be used autonomously to determine the type of missingness (random versus non-random). The authors review related literature, explain the algorithm, and interpret initial results of its use with data from canonical Bayesian networks, United States census data, and data sets from the University of California, Irvine machine learning repository. Further uses in the field of bioinformatics and pathways for future research are discussed.
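The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes missing entries are marked with `None`, encodes a column as a binary missingness mask, computes a Lempel-Ziv (1976) phrase count on that mask, and compares it with the average phrase count of randomly shuffled copies of the same mask. The function names (`lz76_complexity`, `missingness_mask`, `shuffled_baseline`) and the shuffle-based comparison are illustrative assumptions.

```python
import random


def lz76_complexity(s: str) -> int:
    """Count phrases in a Lempel-Ziv (1976) style parsing of s.

    Each new phrase is the shortest substring starting at the current
    position that has not already appeared earlier in the sequence.
    """
    n = len(s)
    i = 0
    phrases = 0
    while i < n:
        length = 1
        # Extend the candidate while it can be copied from earlier text.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def missingness_mask(values) -> str:
    """Encode a column as '1' where a value is missing (None), else '0'."""
    return "".join("1" if v is None else "0" for v in values)


def shuffled_baseline(mask: str, trials: int = 200, seed: int = 0) -> float:
    """Mean complexity of random shuffles of the mask (hypothetical baseline)."""
    rng = random.Random(seed)
    chars = list(mask)
    total = 0
    for _ in range(trials):
        rng.shuffle(chars)
        total += lz76_complexity("".join(chars))
    return total / trials


# Illustrative column: the second half is entirely missing (a block pattern).
column = list(range(50)) + [None] * 50
mask = missingness_mask(column)
observed = lz76_complexity(mask)
baseline = shuffled_baseline(mask)
print(observed, baseline)
```

Under these assumptions, a mask whose complexity falls well below its shuffled baseline has visible structure in where values are missing, which is one hint that the missingness may not be random. Any concrete decision threshold would have to come from the paper itself.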
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sessions, V., Grieves, J., Perrine, S. (2023). A Classification Algorithm Utilizing the Lempel-Ziv Complexity Score for Missing Data. In: Daimi, K., Al Sadoon, A. (eds) Proceedings of the Second International Conference on Innovations in Computing Research (ICR’23). Lecture Notes in Networks and Systems, vol 721. Springer, Cham. https://doi.org/10.1007/978-3-031-35308-6_1
DOI: https://doi.org/10.1007/978-3-031-35308-6_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35307-9
Online ISBN: 978-3-031-35308-6
eBook Packages: Intelligent Technologies and Robotics (R0)