ADBIS 2016: Advances in Databases and Information Systems, pp. 212–225
Enhancing SpatialHadoop with Closest Pair Queries
Abstract
Given two datasets P and Q, the K Closest Pair Query (KCPQ) finds the K closest pairs of objects from \(P \times Q\). It is an operation widely adopted by many spatial and GIS applications. Because it combines the K Nearest Neighbor (KNN) query with the spatial join, KCPQ is an expensive operation, and given the increasing volume of spatial data, it is difficult to perform efficiently on a single centralized machine. This paper therefore addresses the problem of computing the KCPQ on big spatial datasets in SpatialHadoop, an extension of Hadoop that supports spatial operations efficiently, and proposes a novel algorithm for performing efficient parallel KCPQ on large-scale spatial datasets. We have evaluated the performance of the algorithm in several settings with big synthetic and real-world datasets. The experiments demonstrate the efficiency and scalability of our proposal.
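To make the query definition concrete, the following minimal sketch computes a KCPQ by brute force over \(P \times Q\) for small in-memory point sets. It only illustrates the definition and is not the parallel SpatialHadoop algorithm proposed in the paper; the function name `kcpq_brute_force`, the use of 2D points as tuples, and the Euclidean distance are assumptions made for this example.

```python
import heapq
import math
from itertools import product


def kcpq_brute_force(P, Q, k):
    """Return the k closest (distance, p, q) triples from P x Q.

    Hypothetical reference implementation: it enumerates every pair,
    so it costs O(|P| * |Q|) distance computations and is only
    suitable for small inputs.
    """
    # Lazily generate one (distance, p, q) triple per pair in P x Q.
    pairs = ((math.dist(p, q), p, q) for p, q in product(P, Q))
    # Keep the k triples with the smallest distance, in ascending order.
    return heapq.nsmallest(k, pairs)


# Illustrative data (not taken from the paper).
P = [(0.0, 0.0), (5.0, 5.0), (9.0, 1.0)]
Q = [(1.0, 0.0), (5.0, 7.0), (8.0, 1.0)]

result = kcpq_brute_force(P, Q, 2)
for distance, p, q in result:
    print(distance, p, q)
```

The quadratic cost of this enumeration is what makes the query expensive on large inputs and motivates distributing the work across a cluster.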
Keywords
Closest pair queries, Spatial data processing, SpatialHadoop, MapReduce