Algorithms for processing the group K nearest-neighbor query on distributed frameworks

Abstract

Given two datasets of points (called Query and Training), the Group \(K\) Nearest-Neighbor (GKNN) query retrieves the \(K\) points of the Training dataset with the smallest sum of distances to every point of the Query dataset. This spatial query has been studied in recent years, and several performance-improving techniques and pruning heuristics have been proposed. In previous work, we presented the first MapReduce algorithm for this query, consisting of alternating local and parallel phases, which can effectively process the GKNN query when the Query dataset fits in memory while the Training dataset belongs to the Big Data category. In this paper, we present a significantly improved algorithm that incorporates a new high-performance refining method, a fast way to calculate distance sums for pruning purposes, and several other minor coding and algorithmic improvements. Moreover, we transform this algorithm (which has been implemented in the Hadoop framework) to SpatialHadoop (a popular distributed framework dedicated to spatial processing), using a novel two-level partitioning method. Using real-world and synthetic datasets, we also present a thorough experimental study of the Hadoop and SpatialHadoop versions of the algorithm, including a behind-the-scenes analysis of the algorithm's performance using metrics that highlight its internal functioning. Finally, we present an experimental comparison of the Hadoop version, the SpatialHadoop version, and the version from our previous work, showing that the improved versions clearly outperform the earlier one, with the SpatialHadoop version being faster than its Hadoop counterpart.
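To make the query definition concrete, the following is a minimal single-machine sketch of GKNN by exhaustive search. It is not the MapReduce algorithm of the paper: the function name `gknn_brute_force`, the use of Euclidean distance, and the sample points are assumptions chosen only for illustration.

```python
import heapq
import math


def gknn_brute_force(query, training, k):
    """Return the k training points with the smallest sum of
    Euclidean distances to all query points (exhaustive search)."""

    def sum_dist(p):
        # Sum of distances from training point p to every query point.
        return sum(math.dist(p, q) for q in query)

    # nsmallest keeps only the k best candidates, ordered by sum_dist.
    return heapq.nsmallest(k, training, key=sum_dist)


# Hypothetical sample data: a small Query set and a small Training set.
query = [(0.0, 0.0), (2.0, 0.0)]
training = [(1.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.0, 3.0)]

result = gknn_brute_force(query, training, 2)
print(result)
```

This baseline evaluates every Training point against every Query point, which is the cost that pruning heuristics and distributed processing aim to avoid on large Training datasets.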

Notes

  1. A “local” phase means that it is executed locally on the Master Node machine (Name Node) only.

  2. Phase 1 Reducer only performs a simple summation.

  3. grad f(x, y) or \(\nabla f(x,y)=\left(\frac{\partial f}{\partial x},\frac{\partial f}{\partial y}\right)\).
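
As a worked instance of the gradient notation in note 3, assume (for illustration only) that \(f\) is the sum of Euclidean distances from a point \((x,y)\) to query points \(q_i=(a_i,b_i)\), \(i=1,\dots,n\). The symbols \(a_i\), \(b_i\) and \(n\) are introduced here and do not come from the text above, and this page does not show which function the gradient is applied to in the paper.

```latex
\[
f(x,y)=\sum_{i=1}^{n}\sqrt{(x-a_i)^2+(y-b_i)^2}
\]
\[
\nabla f(x,y)=\left(
\sum_{i=1}^{n}\frac{x-a_i}{\sqrt{(x-a_i)^2+(y-b_i)^2}},\;
\sum_{i=1}^{n}\frac{y-b_i}{\sqrt{(x-a_i)^2+(y-b_i)^2}}
\right),
\qquad (x,y)\neq q_i \ \text{for all } i .
\]
```

Each term of the gradient is a unit vector pointing from \(q_i\) to \((x,y)\), so the gradient vanishes where these unit vectors cancel out.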

Acknowledgements

The work of Francisco García-García, Michael Vassilakopoulos, Antonio Corral and Luis Iribarne was funded by the MINECO research project [TIN2017-83964-R].

Author information

Corresponding author

Correspondence to Michael Vassilakopoulos.

About this article

Cite this article

Moutafis, P., García-García, F., Mavrommatis, G. et al. Algorithms for processing the group K nearest-neighbor query on distributed frameworks. Distrib Parallel Databases (2020). https://doi.org/10.1007/s10619-020-07317-8

Keywords

  • Spatial query processing
  • Group nearest-neighbor query
  • MapReduce algorithms
  • Hadoop
  • SpatialHadoop