BNCOD 2010: Data Security and Security Data, pp. 37–54
Efficient Model Selection for Large-Scale Nearest-Neighbor Data Mining
Abstract
One of the most widely used models for large-scale data mining is the k-nearest neighbor (k-nn) algorithm. It can be used for classification, regression, density estimation, and information retrieval. To use k-nn, a practitioner must first choose k, usually selecting the k with the minimal loss estimated by cross-validation. In this work, we begin with an existing but little-studied method that greatly accelerates the cross-validation process for selecting k from a range of user-provided possibilities. The result is that a much larger range of k values may be examined more quickly. Next, we extend this algorithm with an additional optimization to provide improved performance for locally linear regression problems. We also show how this method can be applied to automatically select the range of k values when the user has no a priori knowledge of appropriate bounds. Furthermore, we apply statistical methods to reduce the number of examples examined while still finding a likely best k, greatly improving performance for large data sets. Finally, we present both analytical and experimental results that demonstrate these benefits.
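The core procedure the abstract describes can be illustrated with a small sketch: leave-one-out cross-validation for k-nn classification in which pairwise distances are computed and neighbors are sorted only once, so that every candidate k is scored in a single pass over the precomputed neighbor ordering. This is a hedged illustration of the general idea, not the paper's algorithm; the function name `loo_best_k` and the brute-force distance computation are assumptions for the example.

```python
import numpy as np

def loo_best_k(X, y, k_values):
    """Pick the k with minimal leave-one-out error for k-nn classification.

    Distances and the neighbor ordering are computed once, so each
    candidate k costs only one extra voting pass -- a sketch of how
    cross-validation over a range of k values can be accelerated.
    X: (n, d) float array; y: (n,) nonnegative integer labels.
    """
    # Pairwise squared Euclidean distances; the diagonal is set to
    # infinity so a point is never its own neighbor (leave-one-out).
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    order = np.argsort(d, axis=1)      # each row: neighbors, nearest first
    neighbor_labels = y[order]         # labels in neighbor order

    errors = {}
    for k in k_values:
        # Majority vote among each held-out point's k nearest neighbors.
        votes = neighbor_labels[:, :k]
        pred = np.array([np.bincount(v).argmax() for v in votes])
        errors[k] = float(np.mean(pred != y))
    return min(errors, key=errors.get), errors

# Example: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(3.0, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
best_k, errs = loo_best_k(X, y, [1, 3, 5, 7])
```

Sorting each point's neighbors once is what makes scanning a large range of k values cheap: the voting step for each k is linear in n·k, while the O(n² log n) distance-and-sort work is paid only once.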
Keywords
data mining, k-nearest neighbor, optimal parameter selection