Abstract
One of the most widely used models for large-scale data mining is the k-nearest neighbor (k-nn) algorithm. It can be used for classification, regression, density estimation, and information retrieval. To use k-nn, a practitioner must first choose k, usually selecting the k with the minimal loss estimated by cross-validation. In this work, we begin with an existing but little-studied method that greatly accelerates the cross-validation process for selecting k from a range of user-provided possibilities. The result is that a much larger range of k values may be examined more quickly. Next, we extend this algorithm with an additional optimization to provide improved performance for locally linear regression problems. We also show how this method can be applied to automatically select the range of k values when the user has no a priori knowledge of appropriate bounds. Furthermore, we apply statistical methods to reduce the number of examples examined while still finding a likely best k, greatly improving performance for large data sets. Finally, we present both analytical and experimental results that demonstrate these benefits.
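The baseline procedure the abstract starts from — pick the k whose cross-validation loss is minimal over a user-provided range — can be sketched as follows. This is a minimal brute-force leave-one-out illustration of that standard baseline, not the authors' accelerated algorithm; the function names (`loocv_error`, `select_k`) are hypothetical.

```python
import numpy as np

def loocv_error(X, y, k):
    """Leave-one-out CV error of a k-nn classifier (brute force)."""
    n = len(X)
    # Pairwise squared Euclidean distances between all points.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Exclude each point from its own neighbor list.
    np.fill_diagonal(d, np.inf)
    errors = 0
    for i in range(n):
        nbrs = np.argsort(d[i])[:k]          # indices of the k nearest points
        votes = np.bincount(y[nbrs])         # majority vote among neighbors
        if votes.argmax() != y[i]:
            errors += 1
    return errors / n

def select_k(X, y, k_values):
    """Return the k in k_values with minimal estimated loss."""
    return min(k_values, key=lambda k: loocv_error(X, y, k))
```

The naive cost is one full cross-validation pass per candidate k, which is exactly what the accelerated method in the paper avoids by sharing work across candidate values of k.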
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Hamerly, G., Speegle, G. (2012). Efficient Model Selection for Large-Scale Nearest-Neighbor Data Mining. In: MacKinnon, L.M. (eds) Data Security and Security Data. BNCOD 2010. Lecture Notes in Computer Science, vol 6121. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25704-9_6
Print ISBN: 978-3-642-25703-2
Online ISBN: 978-3-642-25704-9