Nearest Neighbor Queries on Big Data

  • Georgios Chatzimilioudis
  • Andreas Konstantinidis
  • Demetrios Zeinalipour-Yazti
Chapter
Part of the Studies in Big Data book series (SBD, volume 8)

Abstract

k Nearest Neighbor (kNN) search is one of the simplest non-parametric learning approaches, used mainly for classification and regression. Given a distance metric, kNN identifies the k nearest neighbors of a given node. A newer, more challenging task is to identify the k nearest neighbors of all nodes simultaneously, known as All kNN (AkNN) search; the Continuous All kNN (CAkNN) variant answers an AkNN query in real time over streaming data. Although such techniques find immediate application in computational intelligence tasks, among others, they have not been efficiently optimized to date. We study specialized, scalable solutions for AkNN and CAkNN processing as demanded by the volume, velocity, and variety of data in the Big Data era. We present an algorithm, coined Proximity, which requires no additional infrastructure or specialized hardware; its efficiency stems mainly from a search-space-sharing technique built on a novel data structure, coined the k+-heap. Being parameter-free, Proximity performs efficiently in the face of high-velocity and skewed data. Our analytical study shows that Proximity provides better time complexity than existing approaches and is well suited to large-scale scenarios.
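To make the AkNN task concrete, the sketch below computes the k nearest neighbors of every node by brute force, keeping one bounded max-heap per node. This is only an illustrative baseline under assumed inputs (2-D points, Euclidean distance); it is not the chapter's Proximity algorithm, and the per-node heap is a simplified stand-in rather than the proposed k+-heap, which additionally shares the search space across nodes.

```python
# Illustrative brute-force All-kNN (AkNN) sketch: for every node, find its
# k nearest neighbors under a distance metric. One bounded max-heap per node
# (a simplified stand-in for the chapter's k+-heap; no search-space sharing).
import heapq
import math


def aknn(points, k):
    """Return {node_index: [(distance, neighbor_index), ...]} for all nodes."""
    result = {i: [] for i in range(len(points))}
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)  # Euclidean distance metric (assumption)
            heap = result[i]
            # Keep only the k closest neighbors seen so far: a max-heap of
            # size k, emulated by negating distances on Python's min-heap.
            if len(heap) < k:
                heapq.heappush(heap, (-d, j))
            elif -d > heap[0][0]:  # d is smaller than the current k-th distance
                heapq.heapreplace(heap, (-d, j))
    # Convert each heap into a list sorted by increasing distance.
    return {i: sorted((-nd, j) for nd, j in h) for i, h in result.items()}


if __name__ == "__main__":
    nodes = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
    for node, neighbors in aknn(nodes, k=2).items():
        print(node, neighbors)
```

This baseline costs O(n^2 log k) time, which is exactly the kind of cost that specialized AkNN/CAkNN techniques such as Proximity aim to avoid at scale.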

Keywords

k Nearest neighbors · Big data · Computational intelligence · Smartphones


Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Georgios Chatzimilioudis (1)
  • Andreas Konstantinidis (1)
  • Demetrios Zeinalipour-Yazti (1)
  1. Department of Computer Science, University of Cyprus, Nicosia, Cyprus