Abstract
In data mining, preprocessing is an essential step that involves data normalization, noise removal, handling missing values, and related tasks. This paper focuses on handling missing values using unsupervised machine learning techniques. Soft computing approaches are combined with clustering techniques to form a novel method for handling missing values, which helps to overcome the problem of inconsistency. A rough K-means centroid-based imputation method is proposed and compared with the K-means centroid-based, fuzzy C-means centroid-based, K-means parameter-based, fuzzy C-means parameter-based, and rough K-means parameter-based imputation methods. The experimental analysis is carried out on four benchmark datasets, viz. Dermatology, Pima, Wisconsin, and Yeast, taken from the UCI data repository. The proposed method proves effective across the different datasets, and the results are promising.
Acknowledgements
The authors would like to thank UGC, New Delhi, for the financial support received under the UGC Rajiv Gandhi National Fellowship (F1-17.1/2016-17/RGNF-2015-17-SC-TAM-28324) and the UGC Major Research Project (43-274/2014). The authors also extend their sincere thanks to the anonymous referees for their suggestions to improve the paper.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by V. Loia.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Tables 10 to 21 show the information needed to impute the missing values of an object with the centroid-based methods. In the centroid-based missing value imputation method, the missing values are imputed from the centroid of the closest cluster. For each object with missing values, these tables list the distance to each of the K cluster centroids, the cluster with the minimum distance, and the minimum distance value.
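The centroid-based step can be illustrated with a minimal sketch. It assumes that the distance is Euclidean and computed over the observed attributes only, and that missing entries are marked with `None`; the function names `nearest_centroid` and `impute_from_centroid` and the toy centroids are hypothetical and are not taken from the paper.

```python
import math

def nearest_centroid(obj, centroids):
    """Return (cluster index, distance) of the closest centroid.

    Assumption: distance is Euclidean over the observed attributes only.
    """
    observed = [j for j, v in enumerate(obj) if v is not None]
    best_idx, best_dist = -1, math.inf
    for k, c in enumerate(centroids):
        d = math.sqrt(sum((obj[j] - c[j]) ** 2 for j in observed))
        if d < best_dist:
            best_idx, best_dist = k, d
    return best_idx, best_dist

def impute_from_centroid(obj, centroids):
    """Fill each missing attribute with the value held by the closest centroid."""
    k, _ = nearest_centroid(obj, centroids)
    return [centroids[k][j] if v is None else v for j, v in enumerate(obj)]

# Toy data (hypothetical): two centroids, one object missing its second attribute.
centroids = [[1.0, 1.0, 1.0], [5.0, 5.0, 5.0]]
incomplete = [4.0, None, 6.0]
cluster, distance = nearest_centroid(incomplete, centroids)
filled = impute_from_centroid(incomplete, centroids)
print(cluster, round(distance, 3), filled)
```

The pair returned by `nearest_centroid` corresponds to the minimum distance cluster and minimum distance value that the tables report for each object.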
Tables 22 to 33 show the information needed to impute the missing values of an object under the various parameter-based methods. In the K-means parameter-based imputation method, the missing values are imputed from the nearest object within the closest cluster.
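A minimal sketch of the nearest-object idea follows. It assumes that the closest cluster is chosen by distance to the centroid, that the donor is the complete cluster member closest to the incomplete object, and that both distances are Euclidean over the observed attributes; the names `partial_distance` and `impute_from_nearest_member` and the toy data are hypothetical.

```python
import math

def partial_distance(a, b):
    """Euclidean distance over the attributes that are observed in `a`."""
    observed = [j for j, v in enumerate(a) if v is not None]
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in observed))

def impute_from_nearest_member(obj, centroids, members):
    """Pick the closest cluster, then copy missing values from its nearest member.

    `members[k]` is assumed to hold the complete objects assigned to cluster k.
    """
    closest = min(range(len(centroids)),
                  key=lambda k: partial_distance(obj, centroids[k]))
    donor = min(members[closest], key=lambda m: partial_distance(obj, m))
    return [donor[j] if v is None else v for j, v in enumerate(obj)]

# Toy data (hypothetical): two clusters with two complete members each.
centroids = [[1.0, 1.0], [6.0, 6.0]]
members = [
    [[0.5, 1.5], [1.5, 0.5]],
    [[5.0, 7.0], [7.0, 5.0]],
]
result = impute_from_nearest_member([6.8, None], centroids, members)
print(result)
```

Compared with copying the centroid value, this variant imputes a value that actually occurs in the data, at the cost of being more sensitive to a single noisy neighbour.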
In the fuzzy C-means parameter-based imputation method, the missing values are imputed from the product of the membership value and the centroid value. Tables 26, 27, 28 and 29 show the membership degree of each record for the K clusters.
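One way to read the membership-and-centroid product is as a membership-weighted sum over the clusters. The sketch below takes that reading as an assumption: the memberships of the record are given as input and sum to 1, and the products are summed over all K clusters. The function name `impute_fuzzy` and the toy values are hypothetical.

```python
def impute_fuzzy(obj, centroids, memberships):
    """Fill each missing attribute with sum_k u_k * c_kj.

    Assumption: `memberships` holds the record's membership degree u_k in each
    cluster (summing to 1), and the products are summed over all clusters.
    """
    return [
        sum(u * c[j] for u, c in zip(memberships, centroids)) if v is None else v
        for j, v in enumerate(obj)
    ]

# Toy data (hypothetical): the record belongs mostly to the second cluster.
centroids = [[2.0, 10.0], [4.0, 20.0]]
memberships = [0.25, 0.75]
filled = impute_fuzzy([3.5, None], centroids, memberships)
print(filled)
```

Here the imputed value is 0.25 × 10 + 0.75 × 20 = 17.5, so it lies between the two centroid values and closer to the cluster with the higher membership.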
In the rough K-means parameter-based imputation method, the missing values of an object are imputed from the information available in the respective approximation. Tables 30, 31, 32 and 33 show the distance between the object with missing values and the lower and upper approximations of the K clusters, together with the minimum distance values.
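The approximation-based step can be sketched as follows. The appendix does not spell out how an approximation is summarised, so this sketch assumes that each lower and upper approximation is represented by the mean of its members, that distances are Euclidean over the observed attributes, and that missing values are copied from the approximation at minimum distance. The names `mean_vector`, `partial_distance` and `impute_rough` and the toy data are hypothetical.

```python
import math

def mean_vector(objects):
    """Attribute-wise mean of a non-empty list of complete objects."""
    n = len(objects)
    return [sum(col) / n for col in zip(*objects)]

def partial_distance(a, b):
    """Euclidean distance over the attributes that are observed in `a`."""
    observed = [j for j, v in enumerate(a) if v is not None]
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in observed))

def impute_rough(obj, lower, upper):
    """Impute from the lower or upper approximation at minimum distance.

    Assumption: each approximation is summarised by the mean of its members.
    Returns (filled object, cluster index, "lower" or "upper", distance).
    """
    candidates = []
    for k in range(len(lower)):
        for name, region in (("lower", lower[k]), ("upper", upper[k])):
            if region:  # skip empty approximations
                ref = mean_vector(region)
                candidates.append((partial_distance(obj, ref), k, name, ref))
    dist, k, name, ref = min(candidates, key=lambda c: c[0])
    filled = [ref[j] if v is None else v for j, v in enumerate(obj)]
    return filled, k, name, dist

# Toy data (hypothetical): [5.0, 6.0] is a boundary object, so it appears in
# both upper approximations but in neither lower approximation.
lower = [[[1.0, 2.0], [3.0, 4.0]], [[9.0, 10.0]]]
upper = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [[9.0, 10.0], [5.0, 6.0]]]
filled, k, name, dist = impute_rough([3.0, None], lower, upper)
print(filled, k, name, dist)
```

Each upper approximation contains the corresponding lower approximation, which is why the boundary object shifts the upper means toward each other in this example.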
Cite this article
Raja, P.S., Thangavel, K. Missing value imputation using unsupervised machine learning techniques. Soft Comput 24, 4361–4392 (2020). https://doi.org/10.1007/s00500-019-04199-6
DOI: https://doi.org/10.1007/s00500-019-04199-6