Efficient Model Selection for Large-Scale Nearest-Neighbor Data Mining

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 6121)

Abstract

One of the most widely used models for large-scale data mining is the k-nearest neighbor (k-nn) algorithm. It can be used for classification, regression, density estimation, and information retrieval. To use k-nn, a practitioner must first choose k, usually selecting the k with the minimal loss estimated by cross-validation. In this work, we begin with an existing but little-studied method that greatly accelerates the cross-validation process for selecting k from a range of user-provided possibilities. The result is that a much larger range of k values may be examined more quickly. Next, we extend this algorithm with an additional optimization to provide improved performance for locally linear regression problems. We also show how this method can be applied to automatically select the range of k values when the user has no a priori knowledge of appropriate bounds. Furthermore, we apply statistical methods to reduce the number of examples examined while still finding a likely best k, greatly improving performance for large data sets. Finally, we present both analytical and experimental results that demonstrate these benefits.
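The abstract's core procedure — leave-one-out cross-validation over a range of candidate k values — can be amortized by computing each point's k_max nearest neighbors once and reusing the sorted neighbor lists for every k ≤ k_max. The sketch below is a minimal NumPy illustration of that general idea, not the paper's specific algorithm; the function name and brute-force distance computation are illustrative assumptions.

```python
import numpy as np

def select_k_loocv(X, y, k_max):
    """Leave-one-out CV for k-nn classification. The k_max nearest
    neighbors of each point are computed once; every k <= k_max is
    then scored by truncating the same sorted neighbor lists."""
    # pairwise squared distances; a point is never its own neighbor
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k_max]  # n x k_max neighbor indices
    nbr_labels = y[nbrs]                      # labels of those neighbors
    errors = []
    for k in range(1, k_max + 1):
        # majority vote among each point's first k neighbors
        preds = np.array([np.bincount(row[:k]).argmax() for row in nbr_labels])
        errors.append(float((preds != y).mean()))
    best_k = int(np.argmin(errors)) + 1       # k with minimal LOO error
    return best_k, errors
```

Because the expensive neighbor search is done once rather than once per candidate k, evaluating a large range of k values adds only the cheap voting step per k — the kind of speedup the abstract refers to.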




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hamerly, G., Speegle, G. (2012). Efficient Model Selection for Large-Scale Nearest-Neighbor Data Mining. In: MacKinnon, L.M. (ed.) Data Security and Security Data. BNCOD 2010. Lecture Notes in Computer Science, vol. 6121. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25704-9_6


  • DOI: https://doi.org/10.1007/978-3-642-25704-9_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25703-2

  • Online ISBN: 978-3-642-25704-9

  • eBook Packages: Computer Science (R0)
