Efficient Model Selection for Large-Scale Nearest-Neighbor Data Mining

  • Greg Hamerly
  • Greg Speegle
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6121)

Abstract

One of the most widely used models for large-scale data mining is the k-nearest neighbor (k-nn) algorithm. It can be used for classification, regression, density estimation, and information retrieval. To use k-nn, a practitioner must first choose k, usually selecting the k with the minimal loss estimated by cross-validation. In this work, we begin with an existing but little-studied method that greatly accelerates the cross-validation process for selecting k from a range of user-provided possibilities. The result is that a much larger range of k values may be examined more quickly. Next, we extend this algorithm with an additional optimization to provide improved performance for locally linear regression problems. We also show how this method can be applied to automatically select the range of k values when the user has no a priori knowledge of appropriate bounds. Furthermore, we apply statistical methods to reduce the number of examples examined while still finding a likely best k, greatly improving performance for large data sets. Finally, we present both analytical and experimental results that demonstrate these benefits.
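The model-selection procedure the abstract describes, choosing the k with minimal cross-validated loss, can be sketched briefly. One way to accelerate it, consistent with the abstract's description of scanning a range of candidate k values, is to find each held-out point's k_max nearest neighbors once and then score every candidate k ≤ k_max from that single sorted neighbor list, rather than rerunning the neighbor search for each k. The sketch below is our own illustration under those assumptions (leave-one-out k-nn classification with brute-force NumPy distances); the names loo_knn_errors and k_max are ours, not the paper's.

```python
# Hedged sketch (not the authors' code): leave-one-out cross-validation
# for k-nn classification, scoring every k in 1..k_max from a single
# neighbor computation instead of one search per candidate k.
import numpy as np

def loo_knn_errors(X, y, k_max):
    """Return the leave-one-out error rate for each k = 1..k_max."""
    n = len(X)
    # Brute-force pairwise squared distances: O(n^2 d), illustration only.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a held-out point never votes for itself
    # Each point's k_max nearest neighbors, found once, nearest first.
    order = np.argsort(d2, axis=1)[:, :k_max]
    errors = np.zeros(k_max)
    for i in range(n):
        votes = {}
        for k, j in enumerate(order[i], start=1):
            votes[y[j]] = votes.get(y[j], 0) + 1
            # Running majority vote: the prediction at each k reuses
            # the class counts already accumulated for k - 1.
            pred = max(votes, key=votes.get)
            if pred != y[i]:
                errors[k - 1] += 1
    return errors / n

# Usage: select the k with the minimal estimated loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(int)
best_k = 1 + int(np.argmin(loo_knn_errors(X, y, k_max=25)))
```

Taking the argmin of the returned error curve selects k; the paper's contributions, as the abstract indicates, go further by replacing such brute-force pieces with faster machinery, extending the idea to locally linear regression, and sampling examples statistically for large data sets.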

Keywords

data mining, k-nearest neighbor, optimal parameter selection

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Greg Hamerly (1)
  • Greg Speegle (1)

  1. Baylor University, Waco, USA