
A nifty collaborative analysis using a novel tool (DRFLLS) for missing value estimation

  • Samaher Al-Janabi
  • Ayad F. Alkaim
Methodologies and Application

Abstract

One of the important trends in intelligent data analysis is the growing importance of data processing. This area faces problems similar to those of data mining (i.e., high-dimensional data, missing value imputation, and data integration); one of the challenges for missing value estimation methods is how to select the optimal number of nearest neighbors of those values. This paper explores the feasibility of building a novel tool, called developed random forest and local least squares (DRFLLS), to estimate missing values in various datasets. By developing the random forest algorithm, seven categories of similarity measures were defined: the Pearson similarity coefficient, simple similarity, and five fuzzy similarity measures (M1, M2, M3, M4 and M5). These are sufficient to estimate the optimal number of neighbors of the missing values in this application. Thereafter, local least squares (LLS) is used to estimate the missing values. Imputation accuracy is measured in two ways: Pearson correlation (PC) and normalized root mean square error (NRMSE); the optimal number of neighbors is the one associated with the highest PC and the smallest NRMSE. Experiments were carried out on six datasets from different disciplines. DRFLLS shows that, for a dataset with a small rate of missing values, the best estimate of the number of nearest neighbors is given by DRFPC and, in the second degree, by DRFFSM1 when r = 4, while for a dataset with a high rate of missing values the best estimate is given by DRFFSM5 and, in the second degree, by DRFFSM3. The missing values were then estimated by LLS, and the accuracy of the results was measured by NRMSE and PC. For a given dataset, the DRF correlation function yielding the smallest NRMSE, and likewise the one yielding the highest PC, is the better function for that dataset.
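The pipeline the abstract describes (pick the k most similar rows, fill each gap with a local least-squares fit over those neighbors, then score with NRMSE and PC) can be illustrated with a minimal NumPy sketch. This is not the authors' DRFLLS implementation: it substitutes ordinary Pearson correlation for the developed-random-forest similarity measures, and the names lls_impute and nrmse are illustrative only.

    import numpy as np

    def lls_impute(X, k=4):
        # Local least squares imputation in the style of Kim, Golub and Park (2005):
        # for each row with gaps, regress its observed entries on the k most
        # similar complete rows, then predict the missing entries.
        X = X.astype(float)
        complete = X[~np.isnan(X).any(axis=1)]  # candidate neighbor rows
        for i in range(len(X)):
            row = X[i]
            miss = np.isnan(row)
            if not miss.any():
                continue
            obs = ~miss
            # Rank complete rows by |Pearson correlation| on the observed columns
            # (DRFLLS would use a developed-random-forest similarity here).
            sims = np.array([abs(np.corrcoef(row[obs], c[obs])[0, 1]) for c in complete])
            sims = np.nan_to_num(sims)  # guard against constant rows
            nbrs = complete[np.argsort(sims)[::-1][:k]]
            # Solve min ||A^T w - b|| with A = neighbors on observed columns,
            # b = the row's observed values; impute gaps as w applied to the
            # neighbors' values at the missing columns.
            w, *_ = np.linalg.lstsq(nbrs[:, obs].T, row[obs], rcond=None)
            X[i, miss] = w @ nbrs[:, miss]
        return X

    def nrmse(truth, imputed, mask):
        # Normalized RMSE over the artificially removed entries.
        err = truth[mask] - imputed[mask]
        return np.sqrt(np.mean(err ** 2)) / np.std(truth[mask])

A typical evaluation, as in the paper's experiments, removes a known fraction of entries at random, imputes them, and compares NRMSE (smaller is better) and PC (larger is better) across the candidate similarity measures.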

Keywords

Intelligent data analysis · Missing values · Imputation methods · Random forest · Local least squares

Notes

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.


Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Computer Science, Faculty of Science for Women (SCIW), University of Babylon, Babylon, Iraq
  2. Department of Chemistry Science, Faculty of Science for Women (SCIW), University of Babylon, Babylon, Iraq
