Soft Computing, Volume 22, Issue 10, pp 3313–3330

A fuzzy K-nearest neighbor classifier to deal with imperfect data

  • Jose M. Cadenas
  • M. Carmen Garrido
  • Raquel Martínez
  • Enrique Muñoz
  • Piero P. Bonissone
Methodologies and Application

Abstract

The k-nearest neighbors method (kNN) is a nonparametric, instance-based method used for regression and classification. To classify a new instance, the kNN method finds its k nearest neighbors and generates a class value from them. Usually, this method requires that the information available in the datasets be precise and accurate, except for the existence of missing values. However, data imperfection is inevitable when dealing with real-world scenarios. In this paper, we present the kNN\(_{imp}\) classifier, a k-nearest neighbors method that performs classification from datasets with imperfect values. The importance of each neighbor in the output decision is based on its relative distance and its degree of imperfection. Furthermore, by using external parameters, the classifier enables us to define the maximum allowed imperfection, and to decide whether the final output is derived solely from the class with the greatest weight (the best class) or from the best class and a weighted combination of the classes closest to the best one. To test the proposed method, we performed several experiments with both synthetic and real-world datasets with imperfect data. The results, validated through statistical tests, show that the kNN\(_{imp}\) classifier is robust when working with imperfect data and maintains good performance when compared with other methods in the literature on datasets with or without imperfection.
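
To make the weighting idea concrete, the following is a minimal sketch of a distance-weighted kNN vote in which each neighbor is also discounted by a degree of imperfection. It is not the kNN\(_{imp}\) algorithm itself, whose details are not given in the abstract. The function name `weighted_knn_predict`, the `max_imperfection` parameter, the representation of imperfection as a single number in [0, 1], and the weight formula are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train, query, k=3, max_imperfection=1.0):
    """Illustrative weighted kNN vote (hypothetical, not the paper's method).

    train: list of (features, label, imperfection) tuples, where
           imperfection is an assumed score in [0, 1] (0 = fully precise).
    Returns the winning label and the per-class accumulated weights.
    """
    # Assumed filter: ignore instances above the allowed imperfection.
    usable = [t for t in train if t[2] <= max_imperfection]
    if not usable:
        raise ValueError("no training instance satisfies max_imperfection")

    # Keep the k instances closest to the query (Euclidean distance).
    neighbors = sorted(usable, key=lambda t: math.dist(t[0], query))[:k]

    votes = defaultdict(float)
    for features, label, imperfection in neighbors:
        distance = math.dist(features, query)
        # Assumed weight: closer and less imperfect neighbors count more.
        votes[label] += (1.0 - imperfection) / (1.0 + distance)

    best = max(votes, key=votes.get)
    return best, dict(votes)


# Toy data: two classes, one neighbor of class "b" is highly imperfect.
train = [
    ((0.0, 0.0), "a", 0.0),
    ((0.1, 0.0), "a", 0.2),
    ((1.0, 1.0), "b", 0.0),
    ((0.9, 1.0), "b", 0.9),
]
label, votes = weighted_knn_predict(train, (0.2, 0.1), k=3)
```

Returning the per-class weights alongside the winning label keeps open the option, mentioned in the abstract, of combining the best class with the classes closest to it instead of reporting the best class alone.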

Keywords

  • k-nearest neighbors
  • Classification
  • Imperfect data
  • Distance/dissimilarity measures
  • Combination methods

Notes

Acknowledgements

Supported by the project TIN2014-52099-R (EDISON) granted by the Ministry of Economy and Competitiveness of Spain (including ERDF support).

Compliance with Ethical Standards

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  • Jose M. Cadenas (1)
  • M. Carmen Garrido (1)
  • Raquel Martínez (2)
  • Enrique Muñoz (3)
  • Piero P. Bonissone (4)

  1. Department of Information and Communications Engineering, University of Murcia, Murcia, Spain
  2. Department of Computer Engineering, Catholic University of Murcia, Murcia, Spain
  3. Department of Computer Science, Università Degli Studi di Milano, Crema, Italy
  4. Piero P Bonissone Analytics, LLC, San Diego, USA
