Improved nearest neighbor classifiers by weighting and selection of predictors

Abstract

Nearest neighborhood classification is a flexible classification method that works under weak assumptions. The basic concept is to use weighted or unweighted sums over the class indicators of observations in the neighborhood of the target observation. Two modifications that improve performance are considered here. First, instead of using weights that are determined solely by the distances, we estimate the weights by means of a logit model; by using a selection procedure such as the lasso or boosting, the relevant nearest neighbors are selected automatically. Building on this concept of estimation and selection, in a second step we extend the predictor space: we include nearest neighborhood counts, but also the original predictors themselves and nearest neighborhood counts that use distances in subdimensions of the predictor space. The resulting classifiers combine the strengths of nearest neighbor methods with parametric approaches and, by use of subdimensions, are able to select the relevant features. Simulations and real data sets demonstrate that the method yields better misclassification rates than currently available nearest neighborhood methods and is a strong and flexible competitor in classification problems.
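
To make the extended-predictor idea concrete, the following is a minimal sketch in R, not the authors' implementation: nearest neighborhood counts are computed for several neighborhood sizes, combined with the original predictors, and a lasso-penalized logit model selects the relevant columns. The helper nn_counts, the FNN and glmnet packages, the simulated data, and the chosen neighborhood sizes are illustrative assumptions.

# A minimal sketch, not the authors' method: neighborhood counts for several
# neighborhood sizes plus the original predictors, with a lasso-penalized
# logit model selecting the relevant columns.
library(glmnet)
library(FNN)   # used here only for fast nearest-neighbor search

nn_counts <- function(X, y, ks) {
  # Leave-one-out proportion of class-1 labels among the k nearest neighbors,
  # for each k in ks (get.knn excludes the observation itself).
  nn <- FNN::get.knn(X, k = max(ks))$nn.index
  sapply(ks, function(k) rowMeans(matrix(y[nn[, 1:k, drop = FALSE]], ncol = k)))
}

set.seed(1)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))   # only two informative predictors

ks <- c(3, 5, 10, 25)
Z  <- cbind(X, nn_counts(X, y, ks))          # original predictors + neighborhood counts
colnames(Z) <- c(paste0("x", 1:p), paste0("knn", ks))

fit <- cv.glmnet(Z, y, family = "binomial")  # lasso selects the relevant columns
coef(fit, s = "lambda.min")

The full approach described in the abstract additionally estimates weights for individual nearest neighbors via the logit model and adds counts based on distances in subdimensions of the predictor space as further columns; the sketch keeps only the simplest ingredients.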

Author information


Correspondence to Gerhard Tutz.

About this article

Cite this article

Tutz, G., Koch, D. Improved nearest neighbor classifiers by weighting and selection of predictors. Stat Comput 26, 1039–1057 (2016). https://doi.org/10.1007/s11222-015-9588-z

Keywords

  • Nearest neighborhood methods
  • Classification
  • Lasso
  • Boosting
  • Logit model
  • Random forests
  • Support vector machine