A Classification Algorithm Utilizing the Lempel-Ziv Complexity Score for Missing Data

Sessions, Valerie; Grieves, Justin; Perrine, Stanley

doi:10.1007/978-3-031-35308-6_1

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 721)

Abstract

Informative data analysis relies heavily on the quality of the underlying data. Unfortunately, the data to be analyzed in our research often contain many missing values. While methods exist to mitigate missing data (listwise deletion, multiple imputation, etc.), they are appropriate only when data are missing at random. When data are missing not at random, these methods lead to erroneous analyses. Determining whether a data set contains random or non-random missing data is an open challenge in our field. The authors propose an algorithm that categorizes missing data using the Lempel-Ziv (LZ) complexity score, and analyze initial results from its use on both generated and publicly available data. The algorithm has several positive features: it works with data sets of all compositions (string, numerical, graphics, mixed), yields easily interpreted results, and can be used autonomously to determine the type of missingness (random versus non-random). The authors review related literature, explain the algorithm, and interpret initial results of its use with data from canonical Bayesian networks, United States census data, and data sets from the University of California, Irvine machine learning repository. Further uses in the field of bioinformatics and pathways for future research are discussed.
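
The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that missingness in a single column is encoded as a binary string ('1' for missing, '0' for present), computes the Lempel-Ziv (1976) phrase count of that string, and compares it with the mean score of random shuffles of the same string. The function names, the use of `None` as the missing marker, and the shuffle baseline are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def missingness_mask(values):
    """Encode a column as a binary string: '1' where a value is None, else '0'.

    Assumption: None marks a missing value; real data may use NaN or sentinels.
    """
    return "".join("1" if v is None else "0" for v in values)


def lz76_complexity(s):
    """Count the phrases in the Lempel-Ziv (1976) parsing of the string s."""
    n = len(s)
    i = 0
    phrases = 0
    while i < n:
        length = 1
        # Grow the phrase while it can still be copied from earlier text.
        # The copy must start before position i but may overlap the phrase.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def shuffled_baseline(mask, trials=200, seed=0):
    """Mean LZ complexity over random permutations of the mask.

    Hypothetical baseline: shuffling keeps the fraction of missing values
    but destroys any positional structure.
    """
    rng = random.Random(seed)
    symbols = list(mask)
    scores = []
    for _ in range(trials):
        rng.shuffle(symbols)
        scores.append(lz76_complexity("".join(symbols)))
    return mean(scores)


# Illustrative data only: the last 50 of 100 records are missing in one block.
column = [1.0] * 50 + [None] * 50
mask = missingness_mask(column)
observed = lz76_complexity(mask)
baseline = shuffled_baseline(mask)
print(f"observed={observed}, shuffled baseline={baseline:.1f}")
```

Under these assumptions, an observed score far below the shuffled baseline indicates that the positions of the missing values are highly structured. Whether that structure corresponds to non-random missingness in the statistical sense depends on how the records are ordered and on the decision rule used, neither of which is given in this preview.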

Author information

Authors and Affiliations

Charleston Southern University, North Charleston, USA
Valerie Sessions & Justin Grieves
Georgia Gwinnett College, Lawrenceville, USA
Stanley Perrine

Corresponding author

Correspondence to Valerie Sessions.

Editor information

Editors and Affiliations

University of Detroit Mercy, Farmington Hills, MI, USA
Kevin Daimi
Asia Pacific International College, Sydney, NSW, Australia
Abeer Al Sadoon

About this paper

Cite this paper

Sessions, V., Grieves, J., Perrine, S. (2023). A Classification Algorithm Utilizing the Lempel-Ziv Complexity Score for Missing Data. In: Daimi, K., Al Sadoon, A. (eds) Proceedings of the Second International Conference on Innovations in Computing Research (ICR’23). Lecture Notes in Networks and Systems, vol 721. Springer, Cham. https://doi.org/10.1007/978-3-031-35308-6_1

DOI: https://doi.org/10.1007/978-3-031-35308-6_1
Published: 17 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35307-9
Online ISBN: 978-3-031-35308-6
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
