Abstract
One of the most widely used models for large-scale data mining is the k-nearest neighbor (k-nn) algorithm. It can be used for classification, regression, density estimation, and information retrieval. To use k-nn, a practitioner must first choose k, usually selecting the k with the minimal loss estimated by cross-validation. In this work, we begin with an existing but little-studied method that greatly accelerates the cross-validation process for selecting k from a range of user-provided possibilities. The result is that a much larger range of k values may be examined more quickly. Next, we extend this algorithm with an additional optimization to provide improved performance for locally linear regression problems. We also show how this method can be applied to automatically select the range of k values when the user has no a priori knowledge of appropriate bounds. Furthermore, we apply statistical methods to reduce the number of examples examined while still finding a likely best k, greatly improving performance for large data sets. Finally, we present both analytical and experimental results that demonstrate these benefits.
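The baseline procedure the abstract starts from — pick the k whose cross-validation loss is minimal over a user-provided range — can be sketched as follows. This is a minimal brute-force leave-one-out illustration of that standard baseline, not the authors' accelerated algorithm; the function names (`loocv_error`, `select_k`) are hypothetical.

```python
import numpy as np

def loocv_error(X, y, k):
    """Leave-one-out CV error of a k-nn classifier (brute force)."""
    n = len(X)
    # Pairwise squared Euclidean distances between all points.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Exclude each point from its own neighbor list.
    np.fill_diagonal(d, np.inf)
    errors = 0
    for i in range(n):
        nbrs = np.argsort(d[i])[:k]          # indices of the k nearest points
        votes = np.bincount(y[nbrs])         # majority vote among neighbors
        if votes.argmax() != y[i]:
            errors += 1
    return errors / n

def select_k(X, y, k_values):
    """Return the k in k_values with minimal estimated loss."""
    return min(k_values, key=lambda k: loocv_error(X, y, k))
```

The naive cost is one full cross-validation pass per candidate k, which is exactly what the accelerated method in the paper avoids by sharing work across candidate values of k.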
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Hamerly, G., Speegle, G. (2012). Efficient Model Selection for Large-Scale Nearest-Neighbor Data Mining. In: MacKinnon, L.M. (eds) Data Security and Security Data. BNCOD 2010. Lecture Notes in Computer Science, vol 6121. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25704-9_6
Print ISBN: 978-3-642-25703-2
Online ISBN: 978-3-642-25704-9