1 Introduction

Integrating cloud computing with GIS represents a new era for geospatial big data and its application in geographic information systems (GISs) [1, 2]. Cloud computing is a computing model based on ‘pay for what you use’ [3]. In essence, it is a transition from clients owning personal computers (PCs) to clients only having access to computing resources maintained by service providers.

Hadoop [4], which was released in 2007, is currently the most well-known open-source cloud-computing platform. Hadoop has a distributed file system that enables it to sustain a massive number of operations in a parallel and distributed processing mode. Moreover, Hadoop with the MapReduce programming paradigm is more robust and scalable than traditional programming. It can divert work to another location and keep processing even if a node fails [5]. In addition, new nodes can be added easily without changing the data formats, the loading mechanism, or the infrastructure.

Previous research on using Hadoop to manipulate big geospatial data follows two main trends:

1.1 Spatial-Operation Oriented

In this trend, researchers develop a function for a particular spatial operation, which is then executed on a traditional Hadoop cluster. Examples include: (1) R-tree construction [6], in which records are allocated to partitions according to their Z-values, an R-tree is built for each partition, and those R-trees are combined under the same root [7]; (2) range query (RQ) [8,9,10], in which the records are scanned against the query range; (3) k-nearest neighbour (kNN) query [11, 12], which determines the k nearest points to a given location, using a distance metric, with a brute-force approach [9, 11]; (4) all nearest neighbour (ANN) query [13], in which the points are partitioned by their Z-values in order to find a result similar to that of kNN queries; (5) reverse nearest neighbour (RNN) query [11], which finds all the objects that have the query location as their nearest neighbour; (6) spatial join [9, 14,15,16,17], in which the map function assigns data to cells and the reduce function joins the data in each cell; and (7) kNN join [18,19,20].

1.2 Full-System Oriented

Five main systems were proposed: (1) Parallel-Secondo [21], a parallel spatial DBMS that uses Hadoop as a distributed task scheduler; (2) MD-HBase [22], which extends HBase [23]; (3) Hadoop-GIS [24]; and (4) GeoSpark [25, 26].

All these systems are built as an upper layer over classic Hadoop and hence inherit all its limitations [27, 28]. Hadoop has constraints and execution bottlenecks, and it does not support spatial data. Furthermore, the uniform grid index is the only index available, so these systems only handle uniformly distributed data well. In addition, spatial operations developed on these systems cannot access the constructed index, and the systems cannot easily be extended with new spatial operations.

In contrast to these systems, (5) SpatialHadoop [29, 30] was developed to support spatial data operations within Hadoop [29, 31].

Until now, to the best of our knowledge, the original PR-Tree partitioning algorithm has not been made applicable on the SpatialHadoop platform. State-of-the-art partitioning algorithms are confined to eight main algorithms, namely Grid, Quadtree, Z-curve, Hilbert curve, Sort-Tile-Recursive (STR), STR+, KD-Tree, and 2DPR-Tree [27, 32, 33]. Accordingly, this paper makes the following contributions:

  1. State-of-the-art studies of partitioning algorithms [27] base their measurements only on query execution time, file size, number of partitions, and indexing time. None of them considers the accuracy of the objects retrieved. Hence, a new metric (accuracy of query results) is presented for assessing different partitioning techniques.

  2. All existing SpatialHadoop partitioning techniques are based on a two-tier (2-tier) process that approximates all objects in the input spatial dataset by a set of two-dimensional (2D) points, which may lead to inaccurate results. The proposed technique, the PR-Tree that is aware of minimum bounding rectangles (MBR-aware PR-Tree), unlike other techniques [27], guarantees the desired number of partitions while preserving a higher accuracy. The accuracy is explored by depicting the results graphically and by studying the number of objects retrieved.

  3. The proposed MBR-aware PR-Tree approach outperforms the other state-of-the-art techniques in terms of performance, functionality, and accuracy for both RQs and kNN queries.

The rest of this paper is organized as follows: Sect. 2 gives an overview of the SpatialHadoop system architecture and of related work on different partitioning techniques. Section 3 introduces the proposed partitioning algorithm. Section 4 presents the results of the experiments performed. Finally, Sect. 5 presents the conclusion and future work.

2 Related Works

In SpatialHadoop, spatial data are partitioned and distributed to the cluster nodes. Data are grouped into a partition according to their spatial closeness, and each partition is indexed later. A set of spatial index structures has been developed based on a number of partitioning techniques, such as grid [27], R-tree [34], R+-tree [35], Z-curve [36], Hilbert curve [37], Quadtree [38], KD-Tree [39, 40] and 2DPR-Tree [33].

The core of SpatialHadoop consists of four layers, as shown in Fig. 1.

(1) The Language layer, called Pigeon [41, 42], conceals the complexities of the framework by providing a simple high-level language. Pigeon is compatible with the Open Geospatial Consortium (OGC) standard [43]. PostGIS [44] and Oracle Spatial [45] both conform to the OGC standard. Therefore, conformance with the OGC standard makes it possible to integrate with these existing systems by exporting and importing data.

(2) The Query Processing layer (operations layer) includes the spatial operations (range query, kNN and spatial join) supported by SpatialHadoop. A set of basic computational geometry operations was developed in CG_Hadoop [46]. In addition, new spatial operations that use the different built indexes can be developed on top of the SpatialHadoop query processing engine.

(3) The MapReduce layer provides the MapReduce query engine with the access methods needed to use the constructed spatial indexes while processing a spatial query. If either the SpatialFileSplitter or the SpatialRecordReader is absent, the query processing layer scans the whole file and cannot access the constructed indexes. The SpatialFileSplitter accesses the global index to eliminate file partitions that do not contribute to the answer, according to a user-defined filter function. Using the local index stored in each partition, the SpatialRecordReader can process the remaining partitions efficiently and reuse the indexes.

(4) The Storage/Spatial Indexing layer employs a two-layer index structure consisting of one global index and many local indexes. The global index partitions data across the cluster nodes, and each node then indexes its partition separately using a local index. The separation of global and local indexes fits the MapReduce programming paradigm: the MapReduce job is configured using the global index, and the map tasks are processed using the local indexes. Dividing the file into 64 MB partitions ensures load balancing, allows each partition to be indexed separately in memory, and allows each partition to be written to a single Hadoop Distributed File System (HDFS) block in an append-only manner.
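The interplay between the global and local indexes can be illustrated with a minimal Python sketch. It is not SpatialHadoop code: the names `Partition`, `split_filter`, `read_records`, and `range_query` are hypothetical, the global index is modelled as a plain list of partition MBRs, and a linear scan stands in for the local index.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def intersects(a: Rect, b: Rect) -> bool:
    """True if two axis-aligned rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Partition:
    mbr: Rect            # entry kept in the (assumed) global index
    records: List[Rect]  # partition contents; stand-in for a local index


def split_filter(global_index: List[Partition], query: Rect) -> List[Partition]:
    """Plays the role of the SpatialFileSplitter: prune partitions
    whose MBR cannot contribute to the answer."""
    return [p for p in global_index if intersects(p.mbr, query)]


def read_records(partition: Partition, query: Rect) -> List[Rect]:
    """Plays the role of the SpatialRecordReader: search one partition.
    A real local index would avoid this linear scan."""
    return [r for r in partition.records if intersects(r, query)]


def range_query(global_index: List[Partition], query: Rect) -> List[Rect]:
    results: List[Rect] = []
    for partition in split_filter(global_index, query):
        results.extend(read_records(partition, query))
    return results


parts = [
    Partition((0, 0, 10, 10), [(1, 1, 2, 2), (8, 8, 9, 9)]),
    Partition((10, 0, 20, 10), [(11, 1, 12, 2)]),
]
hits = range_query(parts, (0, 0, 3, 3))
```

In this toy example only the first partition survives the global filter, so the second partition is never read.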

Fig. 1 SpatialHadoop system architecture [33]

The process of creating a spatial index comprises three main stages: partitioning, local indexing, and global indexing [29]. In the partitioning stage, the input file is spatially divided into n partitions such that spatially nearby objects are stored in the same partition. The size of each partition equals the size of an HDFS block (64 MB). The partition boundaries are then calculated differently depending on the index algorithm being used. Finally, for each record r allocated to a partition p, the map function emits an intermediate pair <p, r>. These pairs are grouped by p and sent to the reduce function for the next stage. In the local indexing stage, a local index structure is built on the data contents of each partition. A reduce function stores the records assigned to each partition in a local index file. In the global indexing stage, a global index that indexes all partitions is built. All local index files are concatenated into one file that represents the final indexed file. However, in the SpatialHadoop indexing phase, all input shapes are approximated by 2D points [27]. Therefore, all indexing techniques are designed to build the requested index structure on an abstracted layer of 2D points. Consequently, all partitioning techniques mentioned in the literature lack accuracy. The MBR-aware PR-Tree is proposed as a new indexing and partitioning technique based on the Priority R-tree [47]; it is presented in detail in Sect. 3.
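The three stages can be sketched in a few lines of single-process Python. This is only an illustration under stated assumptions: `map_phase` and `build_index` are hypothetical names, the partition boundaries are given up front, a record is assigned by its lower-left corner (a placeholder rule), and a sorted list stands in for the local index structure.

```python
from collections import defaultdict

# A record is an MBR tuple: (x_min, y_min, x_max, y_max).


def map_phase(records, boundaries):
    """Partitioning stage: emit one (partition_id, record) pair per record.
    Assignment by lower-left corner is a placeholder rule."""
    for r in records:
        for pid, b in enumerate(boundaries):
            if b[0] <= r[0] < b[2] and b[1] <= r[1] < b[3]:
                yield pid, r
                break


def build_index(records, boundaries):
    # Shuffle: group the intermediate pairs by partition id.
    groups = defaultdict(list)
    for pid, r in map_phase(records, boundaries):
        groups[pid].append(r)

    # Local indexing stage: one "index" per partition (here a sorted list).
    local_indexes = {pid: sorted(rs) for pid, rs in groups.items()}

    # Global indexing stage: one entry per partition, holding the MBR
    # of everything stored in that partition.
    global_index = {
        pid: (
            min(r[0] for r in rs),
            min(r[1] for r in rs),
            max(r[2] for r in rs),
            max(r[3] for r in rs),
        )
        for pid, rs in local_indexes.items()
    }
    return global_index, local_indexes


records = [(1, 1, 2, 2), (6, 1, 7, 3), (3, 3, 4, 4)]
boundaries = [(0, 0, 5, 10), (5, 0, 10, 10)]
global_index, local_indexes = build_index(records, boundaries)
```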

3 The Proposed PR-Tree Partitioning Approach

Objects stored in spatial datasets can be fairly complicated. Therefore, they are approximated by simpler objects, and spatial indexes are then built on these approximations. All previous SpatialHadoop partitioning techniques are based on a 2-tier process that approximates all objects in the input dataset by a set of 2D points, as shown in Fig. 2. In addition, all spatial operations are performed on these 2D representations.

Fig. 2 Two-tier process for approximating a virtual spatial dataset by a set of 2D points

In this way, the input spatial geometries are first mapped to a set of minimum bounding rectangles (MBRs). Then, the centroid of each rectangle is calculated and treated as the final geometry representation. Much of the spatial information in the original geometries is lost in this 2-tier approximation process, which causes inaccurate results for the spatial operations. For instance, assume we have the shapes shown in Fig. 2. Figure 3 shows an RQ operation on a randomly generated spatial dataset, performed according to Algorithm 2. Assume a query point (P) and a distance threshold (d). A spatial RQ should return objects A and D, which are located fully inside the search area. However, because all shapes are mapped to 2D points, the 2-tier process returns objects A, B, C, D, E, and F, as depicted in Fig. 4. Clearly, this approach lacks accuracy in spatial data retrieval.

Fig. 3 Range query operation over a virtual spatial input dataset

Fig. 4 Range query operation over the approximated point layer

figure a
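The effect of the 2-tier approximation on a range query can be reproduced with a small Python sketch. It is an illustration only, not the algorithm referred to above: the function names and the two toy shapes are hypothetical, and "fully inside the search area" is tested on the MBR by checking that all four corners lie within distance d of P.

```python
import math


def mbr(points):
    """Tier 1: minimum bounding rectangle of a shape's vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def centroid(rect):
    """Tier 2: centre point of an MBR."""
    return ((rect[0] + rect[2]) / 2, (rect[1] + rect[3]) / 2)


def rq_point_layer(shapes, p, d):
    """Range query over the approximated 2D point layer."""
    return sorted(
        name for name, pts in shapes.items()
        if math.dist(centroid(mbr(pts)), p) <= d
    )


def rq_fully_inside(shapes, p, d):
    """Range query that keeps only shapes whose MBR lies fully inside
    the disc of radius d around p (all four corners within d)."""
    def inside(rect):
        corners = [(rect[0], rect[1]), (rect[0], rect[3]),
                   (rect[2], rect[1]), (rect[2], rect[3])]
        return all(math.dist(c, p) <= d for c in corners)
    return sorted(name for name, pts in shapes.items() if inside(mbr(pts)))


shapes = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],     # small square near P
    "B": [(2, -4), (4, -4), (4, 4), (2, 4)],   # tall shape, centre near P
}
P, d = (0.5, 0.5), 3.0
by_points = rq_point_layer(shapes, P, d)
by_mbrs = rq_fully_inside(shapes, P, d)
```

Here the point layer reports both shapes, although the tall shape B extends well outside the search area; the MBR-based test keeps only A.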

A kNN query, as mentioned in Sect. 1.1, determines the k nearest points to a given location, using a distance metric, with a brute-force approach. For instance, assume we search for the three nearest neighbours of a query point (P), as illustrated in Fig. 5. It is obvious that objects B, C, and D are the nearest neighbours of the specified query point. However, on the approximated 2D point layer, the query returns objects A, C, and D as the nearest neighbours of the query point. Figure 6 illustrates these details.

Fig. 5 kNN query operation over a virtual spatial input dataset

Fig. 6 kNN query operation over the approximated 2D point layer

figure b

In contrast to other SpatialHadoop partitioning techniques, the proposed MBR-aware PR-Tree partitioning technique is based on a 1-tier process for approximating all objects in the input spatial dataset, as shown in Fig. 7. This 1-tier process calculates the MBR of each object in the input dataset. Thereafter, a spatial index is constructed on this approximated set of rectangles. The MBR-aware PR-Tree partitioning technique bypasses the step of approximating the rectangle layer by 2D points, which is the main cause of the loss of much of the original geometry data.

Fig. 7 One-tier process of approximating a virtual input spatial dataset by a set of minimum bounding rectangles

If we again assume the same RQ with query point (P) and distance threshold (d), the 1-tier process returns only objects A and D, as depicted in Fig. 8. This matches what visual inspection of the figure shows and is therefore more accurate than techniques based on the 2-tier process. Similarly, assume again that we need to retrieve the three nearest neighbours of the query point P shown in Fig. 9. The query returns objects B, C, and D as the nearest neighbours of the specified query point, which also matches visual inspection.
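The difference between the two layers for a brute-force kNN query can be sketched as follows. The code is a hedged illustration: the function names and the three toy MBRs are hypothetical, and the distance from the query point to an MBR is assumed to be the minimum Euclidean distance to the rectangle, which the text does not specify.

```python
import math


def point_rect_distance(p, rect):
    """Minimum Euclidean distance from point p to an axis-aligned
    rectangle (x_min, y_min, x_max, y_max); 0 if p lies inside it."""
    dx = max(rect[0] - p[0], 0.0, p[0] - rect[2])
    dy = max(rect[1] - p[1], 0.0, p[1] - rect[3])
    return math.hypot(dx, dy)


def knn_point_layer(mbrs, p, k):
    """Brute-force kNN over the centroids (2-tier approximation)."""
    def dist(item):
        rect = item[1]
        c = ((rect[0] + rect[2]) / 2, (rect[1] + rect[3]) / 2)
        return math.dist(c, p)
    return [name for name, _ in sorted(mbrs.items(), key=dist)[:k]]


def knn_mbr_layer(mbrs, p, k):
    """Brute-force kNN over the MBRs themselves (1-tier approximation)."""
    ranked = sorted(mbrs.items(), key=lambda it: point_rect_distance(p, it[1]))
    return [name for name, _ in ranked[:k]]


mbrs = {
    "A": (2.0, -0.5, 3.0, 0.5),   # small, moderately close
    "B": (1.0, -1.0, 11.0, 1.0),  # long; edge is close, centre is far
    "C": (0.0, 3.0, 1.0, 4.0),
}
P = (0.0, 0.0)
nearest_points = knn_point_layer(mbrs, P, 2)
nearest_mbrs = knn_mbr_layer(mbrs, P, 2)
```

In this toy setup the long rectangle B is the closest object by edge distance but is missed by the centroid-based query, which mirrors the kind of mismatch described above.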

Fig. 8 Range query operation over the MBRs layer

Fig. 9 kNN query operation over the MBRs layer

The idea of the MBR-aware PR-Tree is to treat the MBR of each input shape as a 4D point (X_min, Y_min, X_max, Y_max), which is defined in SpatialHadoop as a new data type that stores the MBR of each input shape. The procedure starts by identifying how many shapes one partition (leaf) can contain (B). Then, the MBR of each input shape is calculated. Thereafter, the total number of available shapes is counted to determine whether they fit into one partition. If they do not, a priority leaf \(\nu_{p}\) is generated to store the B shapes with minimal x-coordinates. The other three priority leaves are produced in the same way. These steps are performed recursively until all shapes are assigned to partitions.

Algorithm 3 illustrates the steps of the proposed MBR-aware PR-Tree partitioning technique. It starts by calculating the maximum number of shapes that can fit in one partition (leaf), B. Then, each input shape is converted to a 4D point. After that, a root node is created and the total number of shapes is checked. If it is less than or equal to B, the algorithm generates a single priority leaf \(\nu_{p}\). Otherwise, it generates a priority leaf \(\nu_{p}^{x_{\min}}\) that stores the B shapes with minimal x-coordinates. The algorithm then proceeds in the same way to produce the other three priority leaves. If the root node has more than 4B shapes, two sub-PR-Trees and four priority leaves are produced. The algorithm applies these steps recursively until no shapes remain to be assigned.

figure c
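A compact Python sketch of this kind of recursive partitioning is given below. It is not a transcription of Algorithm 3: it follows the general Priority R-tree construction, and several details are assumptions made for illustration, namely that the X_max and Y_max priority leaves take the shapes with the largest values, that the remaining shapes are split at the median of one of the four dimensions chosen in round-robin order, and that a remainder of at most B shapes is stored as a single leaf. The name `pr_tree_partitions` is hypothetical.

```python
from typing import List, Tuple

# An MBR treated as a 4D point: (x_min, y_min, x_max, y_max).
Rect = Tuple[float, float, float, float]


def pr_tree_partitions(rects: List[Rect], capacity: int,
                       depth: int = 0) -> List[List[Rect]]:
    """Return the leaves (partitions) of a PR-tree-style decomposition.
    `capacity` is B, the maximum number of shapes per leaf."""
    if not rects:
        return []
    if len(rects) <= capacity:
        return [list(rects)]           # everything fits in one leaf

    remaining = list(rects)
    leaves: List[List[Rect]] = []

    # Four priority leaves: extreme shapes in each of the 4 dimensions.
    # (dimension index, take the largest values?) -- assumed ordering.
    for dim, largest in ((0, False), (1, False), (2, True), (3, True)):
        if not remaining:
            break
        remaining.sort(key=lambda r: r[dim], reverse=largest)
        leaves.append(remaining[:capacity])
        remaining = remaining[capacity:]

    if not remaining:
        return leaves
    if len(remaining) <= capacity:
        leaves.append(remaining)       # assumed shortcut for a small rest
        return leaves

    # Two sub-trees: split the rest at the median of one dimension,
    # cycling through the four dimensions with the recursion depth.
    split_dim = depth % 4
    remaining.sort(key=lambda r: r[split_dim])
    mid = len(remaining) // 2
    leaves += pr_tree_partitions(remaining[:mid], capacity, depth + 1)
    leaves += pr_tree_partitions(remaining[mid:], capacity, depth + 1)
    return leaves


rects = [(float(i), float(i), float(i + 1), float(i + 1)) for i in range(10)]
parts = pr_tree_partitions(rects, capacity=2)
```

Every shape ends up in exactly one leaf and no leaf exceeds B shapes. The sketch makes no attempt to equalise leaf sizes or to hit a requested number of partitions.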

4 Experimentation and Discussion

The different SpatialHadoop partitioning techniques were evaluated on an Amazon Hadoop cluster with ‘m.xlage’ servers. Real datasets extracted from OpenStreetMap, namely Buildings (28.2 GB), Roads (25.9 GB), Lakes (9.7 GB), Cities (1.4 GB), and Sports (590 MB) [29], were used to test all partitioning and indexing techniques.

It is clear that plotting a non-indexed file is time-consuming, as it takes much longer than plotting an indexed one. In contrast, the proposed MBR-aware PR-Tree has the shortest plotting time because it preserves the spatial proximity of the shapes. Table 1 lists the number of partitions generated by the different partitioning techniques. The MBR-aware PR-Tree, 2DPR-Tree, KD-Tree, Z-curve, and Hilbert techniques respect the desired number of partitions, whereas Quadtree, STR, and STR+ do not. The highlighted numbers in Table 1 do not meet the desired number of partitions.

Table 1 Number of partitions generated by each partitioning technique

In Z-curve, Hilbert curve, STR, 2DPR-Tree and MBR-aware PR-Tree, a boundary object is assigned only to the partition with which it has the maximal intersection. In contrast, in Quadtree, STR+, and KD-Tree, objects located on the boundaries between partitions are allocated to all of the overlapping partitions. This affects the size of the indexed datasets, as documented in Table 2, which lists the sizes of the different indexed datasets in megabytes (MB). In addition, the proposed MBR-aware PR-Tree preserves the spatial proximity of the shapes; it not only meets the required number of partitions but also guarantees that all partitions have the same number of objects and, in turn, roughly the same size. For these reasons, it yields the smallest indexed dataset sizes, which are marked in bold. Accordingly, the proposed algorithm offers good load balancing when working on such partitions. Conversely, Quadtree, STR+, and KD-Tree have the largest indexed dataset sizes, which are marked in green, because they generate a large number of partitions. As a result, they waste considerable storage space on the generated partitions.
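The two boundary-handling policies can be contrasted in a short Python sketch. The function names are hypothetical, and intersection is measured here by overlap area, which is one plausible reading of "maximal intersection" rather than a detail confirmed by the text.

```python
def intersection_area(a, b):
    """Overlap area of two rectangles (x_min, y_min, x_max, y_max)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def assign_max_intersection(obj, partitions):
    """Single assignment: index of the partition with the largest overlap,
    so the object is stored once."""
    return max(range(len(partitions)),
               key=lambda i: intersection_area(obj, partitions[i]))


def assign_replicated(obj, partitions):
    """Replication: indexes of every partition the object overlaps,
    so the object is stored once per overlapping partition."""
    return [i for i, p in enumerate(partitions)
            if intersection_area(obj, p) > 0]


partitions = [(0, 0, 5, 10), (5, 0, 10, 10)]
boundary_obj = (4, 1, 8, 2)   # straddles the line x = 5

single = assign_max_intersection(boundary_obj, partitions)
copies = assign_replicated(boundary_obj, partitions)
```

Replication stores the straddling object twice, which is consistent with the larger indexed file sizes reported for the techniques that replicate boundary objects.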

Table 2 Indexed datasets sizes in MB for different techniques

Figure 10 shows the indexing time on the real datasets (Sports, Cities, Lakes, Roads, and Buildings) for the different partitioning techniques. The proposed MBR-aware PR-Tree has the best indexing time. Figure 11a–d shows the range query execution time for query window areas ranging from 0.01% to 50% of the dataset area. As the query window area increases, the Quadtree performance rapidly becomes better than that of the other techniques. The results shown in Fig. 11a–d confirm our earlier claims: the MBR-aware PR-Tree, 2DPR-Tree, and KD-Tree answer the RQ with a query window area equal to 50% of the Buildings dataset area in 110, 111 and 120 s, respectively. By comparison, Quadtree takes approximately twice as long to answer the same query.

Fig. 10 Indexing time for the real datasets (Sports, Cities, Lakes, Roads, and Buildings)

Fig. 11 Range query execution time for query window areas of 0.01%, 1%, 10%, and 50% on indexed real datasets a Sports, b Cities, c Lakes, and d Roads

As mentioned earlier, all other SpatialHadoop partitioning techniques are based on a 2-tier process that approximates all objects in the input spatial datasets by a set of 2D points, which loses much of the original object data. Spatial indexes are then constructed on this approximated set of 2D points, and all spatial operations are performed on these indexes. Such a 2-tier approximation process therefore produces inaccurate results for the spatial operations. In contrast, the proposed MBR-aware PR-Tree partitioning technique is based on a 1-tier process that approximates all objects in the input spatial dataset by their MBRs, without any further level of approximation or loss of object data. In this way, it guarantees more accurate query results. Figures 12 and 13 show, for instance, an RQ on the Cities dataset with query window areas of 0.00005% and 0.001%, respectively, for the proposed MBR-aware PR-Tree (on the right) and the other state-of-the-art techniques (on the left). The objects marked in orange on the left are not returned by the query when the proposed MBR-aware PR-Tree is used, which agrees with the discussion above.

Fig. 12 Range query with a random query window area 0.00005% of Cities dataset

Fig. 13 Range query with a random query window area 0.001% of Cities dataset

Figure 12b differs from Fig. 12a in only one object, whereas Fig. 13b differs from Fig. 13a in 10 objects. The reason lies in the query window size: the larger the window, the more differences appear. Table 3 supports this observation about the accuracy of the returned objects. It lists the number of objects returned for ten randomly selected rectangular areas (R1–R10) with different RQ window areas (0.01–50%) on different datasets. For the Sports dataset, a small file with a small number of objects, there is no major difference between the proposed MBR-aware PR-Tree and the other techniques, except for the RQ with a 50% query window area in R10, where the proposed algorithm returns 160,799 objects, which is more accurate than the other techniques by 173 objects. However, for the Roads and Buildings datasets (large files with a huge number of objects), there are many differences in the number of retrieved objects for different range areas, especially for queries with range areas of 10% and 50%.

Table 3 Difference between returned objects of the proposed MBR-aware PR-Tree and other techniques on indexed real datasets (a) Sports, (b) Cities, (c) Lakes, and (d) Roads

Figure 14a–d shows the kNN query performance on different real datasets. Quadtree outperforms the other techniques on kNN queries because it divides the input datasets into a large number of small partitions. However, the MBR-aware PR-Tree has the best performance among the techniques that respect the desired number of partitions. In addition, the proposed MBR-aware PR-Tree not only meets the required number of partitions but also guarantees that all partitions have the same number of objects and, in turn, roughly the same size. It therefore offers good load balancing when working on such partitions. Quadtree, in contrast, lacks this property, which would considerably degrade its performance on very large input datasets.

Fig. 14 kNN query execution time for Sports, Cities, Lakes and Buildings

5 Conclusion

In this paper, an enhanced PR-Tree partitioning technique for SpatialHadoop is proposed for big spatial data operations. Various state-of-the-art SpatialHadoop partitioning techniques have been experimentally evaluated against the proposed MBR-aware PR-Tree. The experimental results show that Quadtree, STR, and STR+ take longer to create the index than the other techniques. In addition, for RQs on large input datasets, the performance of all other techniques degrades considerably, especially for large query window areas. In contrast, the MBR-aware PR-Tree creates the index faster and generates the desired number of partitions. RQs based on the MBR-aware PR-Tree index have the best execution time for all dataset sizes, since the index preserves the spatial proximity of the input objects. Moreover, the performance of kNN queries based on the MBR-aware PR-Tree index improves as the k value and the dataset size grow. More importantly, the proposed MBR-aware PR-Tree guarantees a higher accuracy in terms of the objects retrieved. In future work, more experiments will be conducted on kNN queries with larger k values, with the goal of further improving query response time and query result accuracy.