Nearest Neighbor Queries on Big Data
Abstract
k Nearest Neighbor (kNN) search is one of the simplest non-parametric learning approaches, mainly used for classification and regression. Given a distance metric, kNN identifies the k nearest neighbors of a given node. A newer and more challenging task is to identify the k nearest neighbors of all nodes simultaneously, known as All kNN (AkNN) search. Similarly, Continuous All kNN (CAkNN) search answers an AkNN query in real time on streaming data. Although such techniques find immediate application in computational intelligence tasks, among others, they have not been efficiently optimized to date. We study specialized scalable solutions for AkNN and CAkNN processing, as demanded by the volume, velocity, and variety of data in the Big Data era. We present an algorithm, coined Proximity, which requires no additional infrastructure or specialized hardware and whose efficiency is mainly attributed to our smart search space sharing technique. Its implementation is based on a novel data structure, coined the k+-heap. Proximity, being parameter-free, performs efficiently in the face of high-velocity and skewed data. In our analytical studies, we found that Proximity provides better time complexity than existing approaches and is very well suited to large-scale scenarios.
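To make the AkNN definition concrete, the following is a minimal brute-force sketch in Python. It is not the Proximity algorithm and does not use the k+-heap; it is only a naive quadratic baseline, assuming points in Euclidean space and a hypothetical helper name `all_knn`, that shows what an AkNN query returns.

```python
import heapq
import math


def all_knn(points, k):
    """Naive AkNN baseline: for every point, return the indices of its
    k nearest other points under Euclidean distance.

    This compares every pair of points, so it takes O(n^2) distance
    computations. It is an illustrative assumption, not the method
    described in the abstract.
    """
    result = {}
    for i, p in enumerate(points):
        # Distance from point i to every other point j.
        candidates = (
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        # Keep only the k closest candidates, nearest first.
        result[i] = [j for _, j in heapq.nsmallest(k, candidates)]
    return result


# Hypothetical sample data: four nodes in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
neighbors = all_knn(points, k=2)
print(neighbors)
```

A continuous (CAkNN) variant would re-evaluate this answer as node positions change, which is where a quadratic baseline of this kind becomes impractical at scale.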
Keywords
k Nearest neighbors, Big data, Computational intelligence, Smartphones