Missing value imputation using unsupervised machine learning techniques

  • Methodologies and Application
  • Soft Computing

Abstract

In data mining, preprocessing is an essential step that includes data normalization, noise removal, and the handling of missing values. This paper focuses on handling missing values using unsupervised machine learning techniques. Soft computing approaches are combined with clustering techniques to form a novel method for handling missing values, which helps to overcome problems of inconsistency. A rough K-means centroid-based imputation method is proposed and compared with the K-means centroid-based, fuzzy C-means centroid-based, K-means parameter-based, fuzzy C-means parameter-based, and rough K-means parameter-based imputation methods. The experimental analysis is carried out on four benchmark datasets, viz. the Dermatology, Pima, Wisconsin, and Yeast datasets, which were taken from the UCI data repository. The proposed method proves effective across the different datasets, and the results are promising.

Acknowledgements

The authors thank UGC, New Delhi, for the financial support received under the UGC Rajiv Gandhi National Fellowship (F1-17.1/2016-17/RGNF-2015-17-SC-TAM-28324) and the UGC Major Research Project (43-274/2014). The authors also extend their sincere thanks to the anonymous referees for their suggestions to improve the paper.

Author information

Corresponding author

Correspondence to P. S. Raja.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Tables 10–21 give the information needed to impute the missing values of an object with the centroid-based methods. In the centroid-based imputation method, missing values are imputed from the centroid of the closest cluster. For each object with missing values, the tables list the distance to each of the K cluster centroids, together with the minimum-distance cluster and the minimum distance value.

Table 10 K-means centroid-based imputation method for Dermatology
Table 11 K-means centroid-based imputation method for Pima
Table 12 K-means centroid-based imputation method for Wisconsin
Table 13 K-means centroid-based imputation method for Yeast
Table 14 Fuzzy C-means centroid-based imputation method for Dermatology
Table 15 Fuzzy C-means centroid-based imputation method for Pima
Table 16 Fuzzy C-means centroid-based imputation method for Wisconsin
Table 17 Fuzzy C-means centroid-based imputation method for Yeast
Table 18 Rough K-means centroid-based imputation method for Dermatology
Table 19 Rough K-means centroid-based imputation method for Pima
Table 20 Rough K-means centroid-based imputation method for Wisconsin
Table 21 Rough K-means centroid-based imputation method for Yeast
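
The centroid-based step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the centroids are assumed to have been computed already by a clustering run, and measuring the distance over the observed attributes only is an assumption made here so that incomplete records can be compared with centroids.

```python
import math

def nearest_centroid(record, centroids):
    """Return (index, distance) of the centroid closest to `record`.

    Assumption: Euclidean distance is taken over the observed
    (non-None) attributes of the record only.
    """
    observed = [j for j, v in enumerate(record) if v is not None]
    distances = [
        math.sqrt(sum((record[j] - c[j]) ** 2 for j in observed))
        for c in centroids
    ]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]

def impute_from_centroid(record, centroids):
    """Fill each missing attribute with the value of the closest centroid."""
    best, _ = nearest_centroid(record, centroids)
    return [centroids[best][j] if v is None else v
            for j, v in enumerate(record)]

# Hypothetical toy data: two centroids and one record with a missing value.
centroids = [[1.0, 2.0, 3.0], [8.0, 9.0, 10.0]]
record = [7.5, None, 9.0]
completed = impute_from_centroid(record, centroids)
```

In this toy example the second centroid is the closer one, so the missing attribute is taken from it. The pair returned by `nearest_centroid` corresponds to the minimum-distance cluster and minimum distance value reported in the tables.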

Tables 22–33 give the information needed to impute the missing values of an object under the parameter-based methods. In the K-means parameter-based imputation method, missing values are imputed from the nearest object within the closest cluster.

Table 22 K-means parameter-based imputation method for Dermatology
Table 23 K-means parameter-based imputation method for Pima
Table 24 K-means parameter-based imputation method for Wisconsin
Table 25 K-means parameter-based imputation method for Yeast
Table 26 Fuzzy C-means parameter-based imputation method for Dermatology
Table 27 Fuzzy C-means parameter-based imputation method for Pima
Table 28 Fuzzy C-means parameter-based imputation method for Wisconsin
Table 29 Fuzzy C-means parameter-based imputation method for Yeast
Table 30 Rough K-means parameter-based imputation method for Dermatology
Table 31 Rough K-means parameter-based imputation method for Pima
Table 32 Rough K-means parameter-based imputation method for Wisconsin
Table 33 Rough K-means parameter-based imputation method for Yeast
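
A minimal sketch of the K-means parameter-based idea (nearest object within the closest cluster) is given below. The names `partial_distance` and `impute_from_nearest_object` are hypothetical, the clusters are assumed to hold complete records from an earlier K-means run, and the distance over observed attributes only is an assumption of this sketch rather than a detail confirmed by the paper.

```python
import math

def partial_distance(a, b):
    """Euclidean distance over the attributes observed (non-None) in `a`."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b) if x is not None))

def impute_from_nearest_object(record, centroids, clusters):
    """Impute from the nearest complete object inside the closest cluster."""
    # Step 1: pick the cluster whose centroid is closest to the record.
    closest = min(range(len(centroids)),
                  key=lambda k: partial_distance(record, centroids[k]))
    # Step 2: within that cluster, pick the nearest complete object (the donor).
    donor = min(clusters[closest],
                key=lambda obj: partial_distance(record, obj))
    # Step 3: copy the donor's values into the missing positions.
    return [d if v is None else v for v, d in zip(record, donor)]

# Hypothetical toy data: two clusters of complete records and their centroids.
centroids = [[1.0, 1.5], [8.5, 8.5]]
clusters = [[[1.0, 1.0], [1.0, 2.0]], [[8.0, 9.0], [9.0, 8.0]]]
record = [8.2, None]
completed = impute_from_nearest_object(record, centroids, clusters)
```

Compared with the centroid-based variant, the donor here is an actual record rather than a cluster mean, so the imputed value is always one that occurs in the data.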

In the fuzzy C-means parameter-based imputation method, missing values are imputed from the product of the membership value and the centroid value. Tables 26, 27, 28 and 29 show the membership degree of each record in each of the K clusters.
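
One way to read the membership-times-centroid rule is as a membership-weighted sum of centroid values, sketched below. This reading is an assumption, as are the function names, the fuzzifier `m = 2.0`, and the use of distances over observed attributes; the membership formula itself is the standard fuzzy C-means one.

```python
import math

def partial_distance(a, b):
    """Euclidean distance over the attributes observed (non-None) in `a`."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b) if x is not None))

def fuzzy_memberships(record, centroids, m=2.0):
    """Standard fuzzy C-means membership degrees of `record` in each cluster."""
    d = [partial_distance(record, c) for c in centroids]
    zeros = [k for k, dk in enumerate(d) if dk == 0.0]
    if zeros:
        # The record coincides with one or more centroids: share membership.
        return [1.0 / len(zeros) if k in zeros else 0.0 for k in range(len(d))]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((dk / dl) ** power for dl in d) for dk in d]

def impute_fuzzy(record, centroids, m=2.0):
    """Fill missing attributes with the membership-weighted centroid values."""
    u = fuzzy_memberships(record, centroids, m)
    return [
        sum(uk * c[j] for uk, c in zip(u, centroids)) if v is None else v
        for j, v in enumerate(record)
    ]

# Hypothetical toy data: two centroids and one record with a missing value.
centroids = [[0.0, 10.0], [4.0, 20.0]]
record = [1.0, None]
u = fuzzy_memberships(record, centroids)
completed = impute_fuzzy(record, centroids)
```

Here the record lies at distance 1 from the first centroid and 3 from the second, giving memberships of about 0.9 and 0.1, so the imputed value is about 0.9 × 10 + 0.1 × 20 = 11.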

In the rough K-means parameter-based imputation method, the missing values of an object are imputed from the information available in the respective approximation. Tables 30–33 show the distance between the object with missing values and the lower and upper approximations of each of the K clusters, together with the minimum distance values.
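
The selection of the "respective approximation" can be illustrated with the rough sketch below. It assumes each cluster's lower and upper approximation is summarised by a mean vector (hypothetical inputs `lower_means` and `upper_means`) and that the approximation at minimum distance supplies the imputed values; the paper's exact rule is not visible in this excerpt, so this is only one plausible reading.

```python
import math

def partial_distance(a, b):
    """Euclidean distance over the attributes observed (non-None) in `a`."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b) if x is not None))

def impute_from_approximation(record, lower_means, upper_means):
    """Impute from the closest lower or upper approximation.

    Returns (completed_record, kind, cluster_index), where kind is
    "lower" or "upper".
    """
    candidates = (
        [("lower", k, mean) for k, mean in enumerate(lower_means)]
        + [("upper", k, mean) for k, mean in enumerate(upper_means)]
    )
    kind, k, mean = min(candidates,
                        key=lambda c: partial_distance(record, c[2]))
    completed = [mv if v is None else v for v, mv in zip(record, mean)]
    return completed, kind, k

# Hypothetical toy data: mean vectors of the approximations of two clusters.
lower_means = [[1.0, 2.0], [9.0, 10.0]]
upper_means = [[2.0, 3.0], [8.0, 9.0]]
record = [7.8, None]
completed, kind, cluster = impute_from_approximation(
    record, lower_means, upper_means)
```

In this toy example the upper approximation of the second cluster is closest, so the missing attribute is filled from its mean.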

About this article

Cite this article

Raja, P.S., Thangavel, K. Missing value imputation using unsupervised machine learning techniques. Soft Comput 24, 4361–4392 (2020). https://doi.org/10.1007/s00500-019-04199-6

  • DOI: https://doi.org/10.1007/s00500-019-04199-6
