1 Introduction

Geospatial information is usually stored and managed in the form of vector data and raster data. However, such traditional data models cannot support complicated spatial queries efficiently if the queries contain structural and semantic requirements. For example, “find a lake surrounded by a forest land and a desert” is a common task in spatial pattern analyses for ecologists, but they still do the work manually, because traditional GIS systems cannot adequately handle these kinds of queries. The traditional approach involves performing multi-way spatial joins [1] across data tables (such as lakes, forests, and deserts) to obtain candidate sets, which are then further filtered down based on spatial relationship criteria, such as being surrounded by other entities. However, due to the large number of candidates generated from these joins, query efficiency could decrease significantly. One reason for this issue lies in the fact that traditional spatial data models tend to group the same type of objects together without taking into account the spatial semantics formed by neighboring objects, such as lakes and forests. These data models have difficulty fully capturing and expressing the neighboring relationships, even though they are quite common in nature.

A direct idea is to use a graph model to describe the neighboring relationships of geospatial objects, i.e., to treat each object as a vertex and to describe the neighboring relationship between two objects using an edge. Recently, researchers have conducted extensive studies on extracting spatial knowledge and constructing knowledge graphs (KGs). Several KGs that contain geospatial objects have been constructed, such as WorldKG [2], YAGO2 [4, 5], and YAGO2geo [3, 6]. However, these KGs either do not express neighboring relationships explicitly or do not quantify them. For instance, in WorldKG (Fig. 1a), point-like objects (sf:Point), line-like objects (sf:LineStrings), and polygon-like objects (sf:Polygons) are considered geospatial objects (geo:SpatialObjects), yet their neighboring relationships remain unexpressed. In YAGO2geo (Fig. 1b), geo:SpatialObject is used to express geospatial objects, while geo:sfWithin and geo:sfTouches are used to express containment and contiguity relationships between objects, but these relationships are not further quantified. In YAGO2 (Fig. 1c), schema:Place represents geographical regions and schema:neighbors expresses neighboring relationships between two regions; however, it does not quantify how close the regions are. In this paper, we propose the fine-grained geospatial knowledge graph (FineGeoKG), which is capable of capturing and quantifying neighboring relationships between geospatial objects.

Fig. 1 Ontologies of recent geospatial KGs

As Fig. 2 shows, FineGeoKG focuses on describing the neighboring relations between objects. We define such relations as strong geospatial relations (SGRs). Fig. 2a shows the natural layout of objects \(\{o_1,o_2,\ldots ,o_6\}\), where some objects are neighbors: \(o_1\) borders on \(o_2\), \(o_5\) intersects with \(o_1\), and \(o_3\) and \(o_4\) are close. We say that such objects have SGRs. In FineGeoKG (Fig. 2b), the vertices (or entities) indicate objects, and an edge with the label “sgr” indicates that two objects have an SGR. For example, the edge \(r_{34}\) between \(o_3\) and \(o_4\) has the label “sgr”. In addition, the properties of an “sgr” edge indicate the topological and directional characteristics of the relation. For example, the closeness of \(o_3\) and \(o_4\) is quantified by their distance \(l=0.5~\)km, and their relative direction is quantified by \(-\,10^\circ\). FineGeoKG differs from WorldKG [2], YAGO2 [4, 5], YAGO2geo [3, 6], and other existing geospatial knowledge graphs in that it captures fine-grained spatial information (i.e., neighboring relations among ground objects), whereas the others focus on adding coarse-grained spatial information (for example, geographical coordinates) to a general-purpose knowledge graph.

Fig. 2 A toy example of FineGeoKG

When constructing FineGeoKG, it is crucial to define SGRs and to find the SGRs between objects. In this paper, we first define six types of SGRs, considering that there are three types of geospatial objects (i.e., point, polyline, and polygon) from the perspective of geometry. The problem of finding SGRs is in fact a geospatial interlinking problem [7], which aims at searching for object pairs that have specific spatial relations. We improve the geospatial interlinking algorithm in [7] so that our new algorithm finds SGRs faster. We then define SGR queries, which aim at finding all subgraphs of FineGeoKG that satisfy given spatial patterns. To support SGR queries, we improve the binary join algorithm for subgraph matching based on an index for SGR edges. We conduct experiments on real datasets to evaluate the performance of our algorithms. We also demonstrate the usefulness of FineGeoKG by executing structural SGR queries and semantic SGR queries on the constructed FineGeoKG. Both types of queries are subgraph matching queries [8] that contain structural and semantic requirements. The query results can help researchers analyze spatial patterns of ground objects and the layouts of facilities.

Our contributions are as follows:

  • We define the fine-grained geospatial knowledge graph (FineGeoKG) which is characterized by SGRs. The SGRs can capture the spatial coherences among geospatial objects.

  • We propose a fast geospatial interlinking algorithm which can find out SGRs between objects.

  • We design an index for SGR edges and propose an algorithm for answering SGR queries.

  • We conduct experiments to evaluate the proposed algorithms and to demonstrate the usefulness of FineGeoKG.

2 Related Work

Knowledge graph technology is evolving toward multimodal data organization and understanding, large-scale dynamic knowledge graph representation learning and pre-training, and the integration of neural and symbolic approaches for knowledge updating and reasoning [9]. Knowledge graph technology based on reinforcement learning [10] has inherent advantages in addressing challenges such as data annotation difficulties and reasoning explainability, and it is also becoming a hot research topic. However, these general knowledge graphs lack sufficient focus on geographic spatial semantics, which hinders the application of knowledge graphs in GIS, mobile recommendation systems, and location-based services (LBS).

Building a graph with the ability to describe geographic scenes is the most crucial step in transforming GIS from an information service to a knowledge service [11]. To this end, in recent years, researchers have conducted extensive research on geospatial knowledge extraction and knowledge graph construction techniques, resulting in the development of geospatial knowledge graphs represented by WorldKG [2], YAGO2 [4, 5], and YAGO2geo [3, 6]. We have already discussed the limitations of these three KGs and the advantages of FineGeoKG proposed in Sect. 1.

2.1 Geospatial Knowledge Graph

Knowledge modeling In terms of spatio-temporal knowledge modeling, existing techniques typically start by constructing an ontology and then integrate the knowledge into a general knowledge graph through entity alignment, knowledge graph completion, and other methods. Researchers extract knowledge triplets from OpenStreetMap and establish alignment relationships between the WorldKG ontology and the DBpedia ontology, integrating geospatial knowledge into DBpedia [2]. Some researchers have also expanded the concepts of geographic entities and related properties based on the GeoSPARQL ontology [12], constructing knowledge graphs that include geographic administrative entities and spatial relationships [13, 14]. These KGs do not have the ability to express spatial relationships between entities effectively, and the entities mainly represent administrative regions such as provinces and cities, rather than the fine-grained geographic objects such as lakes, roads, and bus stations focused on in this paper.

Knowledge acquisition In terms of geospatial knowledge acquisition, existing techniques extract entities and relationships with geospatial attributes from various sources such as internet text (e.g., Wikipedia), GIS data, and images (e.g., historical map photos). For internet text, researchers have defined geospatial knowledge extraction rules and utilized natural language processing techniques to extract knowledge triplets from the text according to these rules [3]. For GIS data, technologies like TripleGeo [15] and GeoTriples [16] can transform GIS data in formats such as shapefiles into knowledge triplets that comply with the GeoSPARQL standard [12]. Regarding image data such as historical map photos, researchers have proposed methods to extract information such as intersections and administrative boundaries from the images [17]. The aforementioned techniques are all methods for acquiring coarse-grained geospatial knowledge from different data sources. The entities in FineGeoKG proposed in this paper are fine-grained geographic objects, namely, ground objects.

Identification of spatial relationships Some researchers have studied the generation of scene graphs from remote sensing imagery, focusing on special scenes (for example, the airports) that contain small objects (for example, tarmac, runway, etc.) [18]. G. Papadakis et al. [7] determined the existence of basic spatial topological relationships between entities based on the intersection matrix [19], considering the intersecting relationships between the internal regions, contour lines, and external regions of two geospatial entities. Bin Wu et al. [20] proposed a relation attribute proximity graph to express the spatial patterns between objects for analyzing connectivity and other features. In sum, these studies mainly focus on expressing and calculating basic spatial relationship types such as containment and adjacency, and have not yet proposed a method for quantitatively expressing spatial relationships. Axel-Cyrille proposed the ORCHID method [21], which uses the Hausdorff distance to measure the distance between two polygon objects. When the distance is less than a certain threshold, it is considered that there is a link between the two polygons. This method focuses solely on efficiently discovering spatial relationships between polygon objects. However, in this paper, six strong spatial relationships between geographic objects are defined, enabling a more comprehensive expression of semantic aspects of adjacency, orientation, and distance in terms of topology.

2.2 Queries Driven by Geospatial Relationships

Multi-way spatial join The SGR query proposed in this paper aims to identify closed regions formed by specific types of objects arranged according to a certain spatial pattern. A traditional approach to solving this problem is to use the multi-way spatial join technique [1]. Two-way join is the most basic spatial join query, where it retrieves all pairs of geospatial objects from two sets of objects, R and S, that satisfy a certain join condition (e.g., intersection). Traditional research on spatial joins has focused on reducing disk I/O operations [22]. With advancements in hardware technology, current research has shifted toward in-memory spatial joins and spatial joins based on multi-core processors [23]. The multi-way spatial join extends the two-way join. It finds tuples \((r_1, r_2, \ldots , r_k)\) that satisfy the join condition from multiple object sets \(R_1\), \(R_2\), ..., \(R_k\). The join condition can be represented as a graph structure, where nodes represent sets and edges represent the join condition between two sets. Depending on the structure of the graph, different types of queries can be formulated, such as tree queries and clique queries [1]. Researchers have proposed various multi-way join methods, including algorithms based on map-reduce [24] and the Spark platform [25]. These spatial join approaches organize objects of the same type into separate sets and then perform join operations on these sets. If these approaches are used to solve SGR queries, the query efficiency is not high. The reason is that the spatial relationships between adjacent features are relatively deterministic, but these methods require a significant amount of time to recompute these relationships each time.

Location-based subgraph matching From the perspective of knowledge graph query processing techniques, SGR queries are a type of subgraph matching query [8]. In recent years, researchers have proposed various indexing techniques and query algorithms, such as SS-Tree [26], Hilbert-Encoding [27], and Riso-Tree [28], to address such query tasks. N. Mamoulis et al. use Hilbert space-filling curves to encode geospatial entities and propose a filtering algorithm that effectively reduces the search space for selection and join queries [27]. Lei Zou et al. propose SS-trees, which integrate both ordinary semantic features and geospatial features to solve query problems on spatial RDF graph structures [26]. This technique has been applied in the gStore graph database [29]. M. Sarwat et al. investigate location-based semantic queries, propose two indexes, Riso-Tree [28] and SPA-Graph [30], and further build the Spindra system [31], which incorporates both indexes. These techniques can effectively address GeoSPARQL queries [12] or specific location-based spatial queries, but they cannot be directly applied to SGR queries. Recently, an efficient subgraph matching algorithm, SUFF, was proposed [32]. This algorithm establishes a filter library for common subgraph patterns and utilizes the filter library during the depth-first search to prune and rapidly reduce the search space. In the experimental section of this paper, we compare the performance of SUFF with that of our proposed algorithm. Because our index structure is constructed using spatial information, our algorithm outperforms SUFF as the data scale increases.

3 Problem Definitions

In this section, we formally define SGR and FineGeoKG.

Definition 1

(Strong Geospatial Relation (SGR)) Two ground objects, \(o_x\) and \(o_y\), have a strong geospatial relation (SGR) if they satisfy one of the following proximity conditions.

  • \(o_x\) and \(o_y\) are both polygons. If they intersect, touch, or if \(o_x\) contains \(o_y\) (or \(o_y\) contains \(o_x\)), they have an SGR.

  • \(o_x\) and \(o_y\) are a polygon and a polyline, respectively. If they intersect, touch, or if \(o_x\) contains \(o_y\), they have an SGR.

  • \(o_x\) and \(o_y\) are a point and a polygon. If the point falls into the polygon, they have an SGR.

  • \(o_x\) and \(o_y\) are two polylines. If they intersect, they have an SGR.

  • \(o_x\) and \(o_y\) are a point and a polyline. If the minimum distance between the point and the polyline is less than \(\epsilon\), where \(\epsilon\) is a very small threshold, they have an SGR.

  • \(o_x\) and \(o_y\) are two points. If their distance is less than \(\epsilon\), they have an SGR.

Note that “distance” in Definition 1 refers to the Euclidean distance in the plane.
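The two distance-based conditions of Definition 1 can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, points are assumed to be (x, y) tuples in planar coordinates, and a polyline is assumed to be a list of at least two vertices. The four conditions that involve polygons or polyline intersection require general geometric predicates and are not covered here.

```python
import math

def point_point_distance(p, q):
    """Euclidean distance between two points given as (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def point_segment_distance(p, a, b):
    """Minimum Euclidean distance from point p to the segment ab."""
    ax, ay = a
    dx, dy = b[0] - ax, b[1] - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: a and b coincide
        return point_point_distance(p, a)
    # Projection parameter of p onto the line through a and b,
    # clamped to [0, 1] so that the nearest point stays on the segment.
    t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return point_point_distance(p, (ax + t * dx, ay + t * dy))

def point_polyline_distance(p, polyline):
    """Minimum distance from p to a polyline (list of >= 2 vertices)."""
    return min(point_segment_distance(p, a, b)
               for a, b in zip(polyline, polyline[1:]))

def has_sgr_point_point(p, q, eps):
    """Point-point condition: distance strictly less than eps."""
    return point_point_distance(p, q) < eps

def has_sgr_point_polyline(p, polyline, eps):
    """Point-polyline condition: minimum distance strictly less than eps."""
    return point_polyline_distance(p, polyline) < eps
```

Both predicates use a strict comparison, following the “less than \(\epsilon\)” wording of the definition.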

Definition 2

(Fine-Grained Geospatial Knowledge Graph (FineGeoKG)) A fine-grained geospatial knowledge graph (FineGeoKG) \(KG = (E, R, L, P, V)\) consists of a set of entities E representing fine-grained ground objects, a set of relations R representing SGRs between two entities, a set of labels L indicating the classes of entities and relations, a set of properties P with their corresponding property value domains V.

In a FineGeoKG, an entity can have multiple labels, and a relation must have at least the “sgr” label to capture spatial coherences, as Fig. 2b shows. An entity or a relation can have multiple properties, and each property value must belong to the corresponding domain.
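As a rough illustration of Definition 2, the following sketch models entities and “sgr” relations as labeled property records. The class and method names are assumptions made for this example, the property keys "l" and "alpha" follow the notation of Fig. 2b, and the checking of property values against their domains V is omitted.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    eid: str
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)

@dataclass
class Relation:
    source: str
    target: str
    labels: set = field(default_factory=lambda: {"sgr"})
    properties: dict = field(default_factory=dict)

@dataclass
class FineGeoKG:
    entities: dict = field(default_factory=dict)   # eid -> Entity
    relations: list = field(default_factory=list)  # list of Relation

    def add_entity(self, eid, labels=(), **properties):
        self.entities[eid] = Entity(eid, set(labels), dict(properties))

    def add_sgr(self, source, target, **properties):
        """Add a relation that carries the mandatory "sgr" label."""
        if source not in self.entities or target not in self.entities:
            raise KeyError("both endpoints must be entities of the graph")
        rel = Relation(source, target, {"sgr"}, dict(properties))
        self.relations.append(rel)
        return rel

    def neighbors(self, eid):
        """Entities connected to eid by an "sgr" relation, in either direction."""
        out = set()
        for r in self.relations:
            if "sgr" in r.labels:
                if r.source == eid:
                    out.add(r.target)
                elif r.target == eid:
                    out.add(r.source)
        return out

# The edge r_34 of Fig. 2b; the entity labels are hypothetical.
kg = FineGeoKG()
kg.add_entity("o3", labels={"Building"})
kg.add_entity("o4", labels={"Building"})
r34 = kg.add_sgr("o3", "o4", l=0.5, alpha=-10.0)
```

Storing relations in a flat list keeps the sketch short; an adjacency index would be the natural choice for larger graphs.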

Table 1 Properties used to quantify six types of SGRs

Table 1 lists the properties used to quantify the six types of SGRs. Note that the table only presents typical methods for quantifying six types of SGR. Other methods can also be used to quantify SGR depending on the application context and requirements.

  • Polygon–Polygon The length (l) of the shared border between \(o_x\) and \(o_y\) quantifies the relation topologically. The direction interval, namely, \(\alpha =\left[ \alpha _s,\alpha _e\right]\), quantifies the relation directionally (see Footnote 1).

  • Polygon–Polyline The length (l) of the polyline (\(o_y\)) inside the polygon \(o_x\) quantifies the relation topologically. The clockwise angle (\(\alpha\)) from the north direction to the intersection line quantifies the relation directionally (see Footnote 2).

  • Polygon–Point The distance (d) from point \(o_y\) to the center of polygon \(o_x\) quantifies the relation topologically. The clockwise angle (\(\alpha\)) from the north direction to the segment \(o_xo_y\) quantifies the relation directionally.

  • Polyline–Polyline The direction interval (\(\alpha = \left[ \alpha _s,\alpha _e\right]\)) quantifies the relation directionally, where \(\alpha _s\) and \(\alpha _e\) are the clockwise angles from the north direction to the two intersecting segments (see Footnote 3). Since it is difficult to quantify the relation topologically, we do not set the corresponding property.

  • Polyline–Point The minimum distance (d) from point \(o_y\) to polyline \(o_x\) quantifies the relation topologically. The clockwise angle (\(\alpha\)) from the north direction to the segment \(o_yp\) quantifies the relation directionally, where p is the point on \(o_x\) nearest to \(o_y\).

  • Point–Point The distance (d) between \(o_x\) and \(o_y\) quantifies the relation topologically. The clockwise angle (\(\alpha\)) from the north direction to the segment \(o_xo_y\) quantifies the relation directionally.

Note that other definitions of properties can be adopted according to different application scenarios.
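For the Point–Point case, the two properties can be computed as in the sketch below. It assumes planar coordinates with the y-axis pointing north and reports the angle as a signed value in (−180°, 180°], which is consistent with the \(-\,10^\circ\) shown in Fig. 2b but is otherwise an assumption; the function names are hypothetical.

```python
import math

def clockwise_angle_from_north(p, q):
    """Clockwise angle in degrees from north (+y) to the segment p -> q.

    The result lies in (-180, 180]; negative values are counterclockwise.
    """
    dx, dy = q[0] - p[0], q[1] - p[1]
    # atan2(dx, dy) measures the angle from the +y axis, positive toward +x.
    return math.degrees(math.atan2(dx, dy))

def quantify_point_point(o_x, o_y):
    """Properties d (distance) and alpha (direction) of a Point-Point SGR."""
    d = math.hypot(o_y[0] - o_x[0], o_y[1] - o_x[1])
    return {"d": d, "alpha": clockwise_angle_from_north(o_x, o_y)}
```

For example, a point due east of \(o_x\) yields \(\alpha = 90^\circ\) and a point due west yields \(\alpha = -90^\circ\).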

4 Constructing Fine-Grained Geospatial Knowledge Graph

There are two steps in constructing a FineGeoKG. First, we add the ground objects into the KG as entities (or vertices). We can employ image segmentation and target detection techniques to extract ground objects from remote sensing images. According to their geometries, vegetation, and other characteristics, we add labels and property values to them. Second, we add the relations (or edges) to the KG. If two vertices have an SGR, we interlink them. The geospatial interlinking problem is defined as follows.

Definition 3

(Geospatial Interlinking Problem) Given a set O of polygon objects, polyline objects, and point objects, the geospatial interlinking problem is to find which objects have SGRs and to quantify these SGRs in their topological and directional aspects.

4.1 Baseline Geospatial Interlinking Algorithms

R-tree-based method The geospatial interlinking problem is a kind of spatial join problem [1]. A straightforward solution is as follows. First, we organize the objects using an R\(^*\)-tree. A polygon object is abstracted as its minimum bounding rectangle (MBR). A polyline object is abstracted as a set of MBRs, each of which encloses one segment of the polyline. A point object is abstracted as a degenerate MBR whose corner points all coincide with the point itself. The objects are inserted into the R\(^*\)-tree. Next, for each object \(o_i \in O\), we find the objects that may have SGRs with it by using spatial range queries on the R\(^*\)-tree. If \(o_i\) is a polygon object, the query range is its MBR. If \(o_i\) is a polyline object, the query range is its MBR set. If \(o_i\) is a point, the query range is the MBR that is the circumscribed square of a circle with \(o_i\) as its center and \(\epsilon\) as its radius. Finally, we verify the SGRs between the query results and \(o_i\), and return the objects that pass the verification. For each such object \(o_j\), we quantify the SGR according to Table 1 and add a relation \(r_{ij}\) with the label “sgr” and the property values into the knowledge graph.
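The MBR abstractions and query ranges described above can be sketched as follows. The function names are hypothetical, and the linear scan in range_query is only a stand-in for an R\(^*\)-tree range query: it returns the same objects but without the logarithmic search behavior of a real index.

```python
def mbr_of(points):
    """Minimum bounding rectangle (x1, y1, x2, y2) of a list of (x, y) vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def polyline_mbrs(vertices):
    """One MBR per segment of the polyline."""
    return [mbr_of([a, b]) for a, b in zip(vertices, vertices[1:])]

def point_mbr(p):
    """Degenerate MBR whose corners coincide with the point."""
    return (p[0], p[1], p[0], p[1])

def point_query_range(p, eps):
    """Circumscribed square of the circle with center p and radius eps."""
    return (p[0] - eps, p[1] - eps, p[0] + eps, p[1] + eps)

def mbrs_intersect(a, b):
    """True if the two rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query(indexed, query):
    """Linear-scan stand-in for an R*-tree range query.

    indexed is a list of (object_id, mbr) pairs; a polyline contributes
    one pair per segment.
    """
    return sorted({oid for oid, mbr in indexed if mbrs_intersect(mbr, query)})

# Hypothetical objects: a polygon, a two-segment polyline, and a point.
indexed = [("lake", mbr_of([(0, 0), (4, 0), (4, 3), (0, 3)]))]
indexed += [("road", m) for m in polyline_mbrs([(3, 2), (6, 2), (6, 8)])]
indexed += [("stop", point_mbr((10, 10)))]
```

The objects returned by range_query are only candidates; each still has to be verified against Definition 1.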

Grid-based method Another method is the grid-based algorithm [7], which constructs a grid over the whole space and computes the cells covered by every object. In the grid, the cell length is \(\Delta _x\) and the cell width is \(\Delta _y\), where \(\Delta _x\) and \(\Delta _y\) are determined by the average size of all objects. We assign a pair of integers \((n_x,n_y)\) to each cell as its identifier, where \(n_x\) and \(n_y\) indicate the cell’s position. The cells covered by a polygon object \(o_i\) are the cells covered by \(o_i\)’s MBR. Assuming the lower left corner point and the upper right corner point of an object’s MBR are \((x_1,y_1)\) and \((x_2,y_2)\), the cells covered by the MBR are the cells with \(n_x\in \lfloor x_1 / \Delta _x \rfloor .. \lceil x_2 / \Delta _x \rceil\) and \(n_y \in \lfloor y_1 / \Delta _y \rfloor .. \lceil y_2/ \Delta _y \rceil\). The cells covered by a polyline \(o_i\) are the cells crossed by \(o_i\). The cells covered by a point \(o_i\) are the cells covered by the circle with \(o_i\) as its center and \(\epsilon\) as its radius. A mapping table maintains the correspondence between each cell and the objects covering it. In the table, the keys are the identifiers of cells, and the corresponding values are object identifier lists that record the objects covering the cells. The algorithm is similar to the R-tree-based algorithm; the difference is that we retrieve objects using the mapping table rather than the R-tree. For each object \(o_i\in O\), we get the lower left corner point and the upper right corner point of its MBR. Next, for each cell \(g_{uv}\) that \(o_i\)’s MBR covers, we obtain its corresponding object list from the mapping table and add the objects to the candidate set. For each candidate, as in the R-tree-based algorithm, we verify the SGR and insert a new relation into the knowledge graph.
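A minimal sketch of the mapping table and the candidate retrieval is given below. It covers only objects that are represented by an MBR, uses the floor/ceiling cell range exactly as stated in the text, and uses hypothetical function names and object identifiers.

```python
import math
from collections import defaultdict

def cells_covered_by_mbr(mbr, dx, dy):
    """Identifiers (n_x, n_y) of the cells covered by an MBR (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = mbr
    nx_lo, nx_hi = math.floor(x1 / dx), math.ceil(x2 / dx)
    ny_lo, ny_hi = math.floor(y1 / dy), math.ceil(y2 / dy)
    return [(nx, ny)
            for nx in range(nx_lo, nx_hi + 1)
            for ny in range(ny_lo, ny_hi + 1)]

def build_mapping_table(mbrs, dx, dy):
    """Map each cell identifier to the list of objects covering the cell."""
    tbl = defaultdict(list)
    for oid, mbr in mbrs.items():
        for cell in cells_covered_by_mbr(mbr, dx, dy):
            tbl[cell].append(oid)
    return tbl

def candidates(oid, mbrs, tbl, dx, dy):
    """Objects that share at least one cell with the given object."""
    found = set()
    for cell in cells_covered_by_mbr(mbrs[oid], dx, dy):
        found.update(tbl.get(cell, ()))
    found.discard(oid)  # an object is not a candidate for itself
    return found

# Hypothetical MBRs on a grid with unit cells.
mbrs = {
    "a": (0.2, 0.2, 0.8, 0.8),
    "b": (0.5, 0.5, 1.5, 1.5),
    "c": (5.2, 5.2, 5.8, 5.8),
}
tbl = build_mapping_table(mbrs, 1.0, 1.0)
```

Because the upper bound uses the ceiling, the cell range is conservative: it can include one extra row or column of cells, which adds candidates but never misses one.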

4.2 Fast Geospatial Interlinking Algorithm

To find out SGRs faster, we propose Algorithm 1, which improves the grid-based algorithm in three aspects. Firstly, we recognize the SGRs while constructing the mapping table rather than constructing the table beforehand. Secondly, we follow the “point\(\rightarrow\)polyline\(\rightarrow\)polygon” order to check the objects. Thirdly, to find the cells covered as accurately as possible, we propose a method which can find the cells passed through by a segment exactly.

Algorithm 1 Fast geospatial interlinking

Processing point objects In Algorithm 1, after setting the unit sizes \((\Delta _x,\Delta _y)\) and initializing the mapping table tbl, we check the point objects (i.e., the for loop from line 2 to line 7) in order to find the SGRs between two point objects (lines 3–5) and add the objects into tbl (lines 6–7). As Fig. 3 shows, the searching range of the point object \(o_i\) is the set of grey cells that are covered by the circle with center \(o_i\) and radius \(\epsilon\). In Algorithm 1, the function GetSearchRange() calculates the lower left identifier \((x_1,y_1)\) and the upper right identifier \((x_2,y_2)\) of the grey cells (line 3), and the function GetCandidObjs() gets the candidate objects from the lists corresponding to the grey cells in tbl (line 4); see the grey tuples in Fig. 3. We then determine the point objects that have SGRs with \(o_i\) and update the edges in the knowledge graph KG (line 5). Next, we insert \(o_i\) into tbl. In Fig. 3, the arrow illustrates the inserting operation. The function GetCellPosOfPoint() finds the identifier (u, v) of the cell into which \(o_i\) falls (line 6), and \(o_i\) is then added to the corresponding list \(L_{uv}\) (line 7).
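The point-processing loop described above can be sketched as follows. This is not the listing of Algorithm 1, which is not reproduced here: the Python function names are hypothetical counterparts of GetSearchRange() and GetCellPosOfPoint(), and the search range is taken to be the cells covered by the bounding square of the circle.

```python
import math
from collections import defaultdict

def cell_of_point(p, dx, dy):
    """Identifier (u, v) of the cell into which point p falls."""
    return (math.floor(p[0] / dx), math.floor(p[1] / dy))

def search_range_of_point(p, eps, dx, dy):
    """Lower left and upper right cell identifiers around the eps-circle of p."""
    u1, v1 = cell_of_point((p[0] - eps, p[1] - eps), dx, dy)
    u2, v2 = cell_of_point((p[0] + eps, p[1] + eps), dx, dy)
    return u1, v1, u2, v2

def interlink_points(points, eps, dx, dy):
    """Find Point-Point SGRs while the mapping table is being built."""
    tbl = defaultdict(list)
    edges = []
    for oid, p in points.items():
        # Search first: only points inserted earlier can be candidates.
        u1, v1, u2, v2 = search_range_of_point(p, eps, dx, dy)
        cands = [c
                 for u in range(u1, u2 + 1)
                 for v in range(v1, v2 + 1)
                 for c in tbl.get((u, v), ())]
        # Verify the candidates against the Point-Point condition.
        for other in cands:
            q = points[other]
            d = math.hypot(p[0] - q[0], p[1] - q[1])
            if d < eps:
                edges.append((other, oid, {"d": d}))
        # Insert afterwards: a point is stored in exactly one cell.
        tbl[cell_of_point(p, dx, dy)].append(oid)
    return edges, tbl

# Hypothetical point objects on a grid with unit cells.
points = {"o1": (0.95, 0.5), "o2": (1.05, 0.5), "o3": (3.0, 3.0)}
edges, tbl = interlink_points(points, eps=0.2, dx=1.0, dy=1.0)
```

Because each point searches before it is inserted, every pair of points is examined at most once.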

Fig. 3 Processing point objects

Processing polyline objects First, we check each polyline object \(o_i\) in order to find the SGRs between \(o_i\) and the point objects, as well as the SGRs between \(o_i\) and the polyline objects already in tbl, i.e., the for loop from line 8 to line 15. On the one hand, the searching range for point objects is the set of cells covered by the rectangles extended from the segments of \(o_i\). Since the closest distance between an SGR-neighbor point and \(o_i\) must be smaller than \(\epsilon\), we generate a rectangle for each segment by taking the segment as the central axis and extending it by a width of \(\epsilon\) on both the left side and the right side. As the left figure of Fig. 4 shows, \(o_i\) consists of two segments, and the rectangles are generated by this extending method. The searching range is the set of grey cells covered by the rectangles. On the other hand, since an SGR-neighbor polyline must intersect \(o_i\), the searching range for polyline objects is the set of cells crossed by \(o_i\) (the red cells in the center figure), which are contained in the grey cells. Thus, the GetSearchRange() function calculates the identifier set R of the grey cells (line 9). The algorithm gets the candidate objects from tbl according to the identifiers (lines 10–11). Line 12 picks out the real SGR-neighbors from the candidates and adds edges to KG. Second, we insert the polyline object \(o_i\) into tbl. The function GetCellPosOfPolyline() calculates the identifier set UV of the cells crossed by \(o_i\) (the red cells in the center figure). The algorithm inserts \(o_i\) into the object lists corresponding to these identifiers. The arrows in the right figure illustrate the inserting operations.
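The rectangle extended from a segment can be constructed as in the following sketch. The function names are hypothetical. The second function returns the cell range of the rectangle's axis-aligned bounding box, which is a conservative superset of the cells actually covered by a rotated rectangle and is used here only to keep the example short.

```python
import math

def extended_rectangle(a, b, eps):
    """Corners of the rectangle with segment ab as central axis and half-width eps."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        raise ValueError("segment endpoints must differ")
    # Left-hand unit normal of the segment, scaled to length eps.
    nx, ny = -dy / length * eps, dx / length * eps
    return [(a[0] + nx, a[1] + ny), (b[0] + nx, b[1] + ny),
            (b[0] - nx, b[1] - ny), (a[0] - nx, a[1] - ny)]

def bounding_cell_range(corners, cell_dx, cell_dy):
    """Cell range (u1, v1, u2, v2) of the bounding box of the given corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (math.floor(min(xs) / cell_dx), math.floor(min(ys) / cell_dy),
            math.floor(max(xs) / cell_dx), math.floor(max(ys) / cell_dy))
```

For a horizontal segment from (0, 0) to (2, 0) and \(\epsilon = 0.5\), the rectangle spans y from −0.5 to 0.5 over the full length of the segment.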

Fig. 4 Processing polyline objects

Processing polygon objects First, we check each polygon object \(o_i\) in order to find its SGR-neighbors in the shape of points, polylines, and polygons, i.e., the for loop from line 16 to line 23. The searching range for point objects and polyline objects is the set of cells covered by \(o_i\), since a point SGR-neighbor must fall inside the polygon and a polyline SGR-neighbor must fall inside the polygon partly or completely. See the grey cells in the left figure of Fig. 5. The searching range for polygon objects is the set of cells crossed by the boundary segments of \(o_i\), since a polygon SGR-neighbor must touch \(o_i\). See the red cells in the center figure, which are contained in the grey cell set. The function GetSearchRange() gets the identifiers R of the grey cells, and then the algorithm gets the candidates C from tbl according to the identifiers (lines 18–19). After picking out the real SGR-neighbors, we add new edges to KG (line 12). The searching range of a polygon object includes all the cells covered by its internal region and its edges (i.e., the grey cells shown in the left figure of Fig. 5), and the polygon object itself is represented by the cells covered by its edges (i.e., the red cells shown in the middle figure of Fig. 5). When two objects \(o_x\) and \(o_y\) are both polygons, if the searching range cell set of \(o_x\) intersects the self-representing cell set of \(o_y\), the algorithm determines that there may be an SGR (i.e., intersection, touching, or containment) between \(o_x\) and \(o_y\). A polyline object itself is represented by the set of cells it crosses (i.e., the red cells shown in the middle figure of Fig. 4). When one object is a polygon (\(o_x\)) and the other is a polyline (\(o_y\)), if the searching range cell set of \(o_x\) intersects the self-representing cell set of \(o_y\), the algorithm likewise determines that there may be an SGR (i.e., intersection, touching, or containment) between \(o_x\) and \(o_y\).

Second, we insert the polygon object \(o_i\) into tbl. The function GetCellPosOfPolygon() calculates the identifiers of the cells crossed by \(o_i\)’s boundary (i.e., the red ones in the center figure), and the algorithm inserts \(o_i\) into the object lists corresponding to these identifiers (lines 22–23). We avoid inserting \(o_i\) into the object lists of the cells that are covered only by the interior area of \(o_i\), for example, the cells with identifiers (2, 2) and (2, 3). There are two reasons. One reason is that polygon objects often have large interior areas in natural environments, so this strategy reduces the number of duplicate entries of \(o_i\) in tbl. The other reason is that it is sufficient to store \(o_i\) in the lists of the cells crossed by its boundary: two polygons might have an SGR if some of their boundary segments fall into the same cell.

Time complexity analysis The three for loops in Algorithm 1 respectively handle three types of objects: point, polyline, and polygon. Within each for loop, one object is processed at a time. During the processing of each object \(o_i\), the most time-consuming step is to find the objects that truly have SGR with \(o_i\) from the candidate set. The time consumed in this step depends on the size of the candidate set. Assuming the average size of the candidate set is k, the time complexity of the algorithm is O(kn), where n is the total number of objects.

Fig. 5 Processing polygon objects

Calculating cells crossed by a segment In Algorithm 1, a basic operation is to calculate the cells crossed by a segment. This operation is required at lines 9, 13, 17, and 21. There are three steps. First, we delimit the spatial bounds of the cells, namely, \(x_\textrm{start}\), \(x_\textrm{end}\), \(y_\textrm{start}\) and \(y_\textrm{end}\). Figure 6 shows the four bounds of the cells crossed by segment S, whose endpoints are \(s_\textrm{start}\) and \(s_\textrm{end}\) (black points). Second, we derive the intersections of S with the vertical lines and the horizontal lines that lie within the bounds. As the figure shows, the vertical lines are \(x=x_\textrm{start} + \Delta _x\) and \(x=x_\textrm{start} + 2\times \Delta _x\), and the horizontal line is \(y = y_\textrm{start} + \Delta _y\). The blue squares illustrate the intersections, which are numbered \(s_1\), \(s_2\) and \(s_3\) in the order of their x-coordinates. Third, we calculate the midpoints of the consecutive sub-segments obtained by dividing S at the intersections. In the figure, the consecutive sub-segments are \(s_\textrm{start}s_1\), \(s_1s_2\), \(s_2s_3\), and \(s_3s_\textrm{end}\), and the red triangles illustrate their midpoints. The cells into which the midpoints fall are the cells we are looking for; see the red cells in the figure.

Fig. 6 Calculate cells crossed by a segment

Algorithm 2 Calculate cells crossed by a segment

Algorithm 2 summarizes the procedure for calculating the identifiers UV of the cells crossed by a segment S. Line 1 uses GetBounds() to calculate the four bounds of the cells. Line 2 uses GetIntersections() to calculate the intersections of S with the vertical lines and horizontal lines. Lines 3–5 calculate the midpoint \(p_\textrm{mid}\) of the first sub-segment \(s_\textrm{start}s_1\) by using GetMidPoint(), get the identifier (u, v) of the cell into which \(p_\textrm{mid}\) falls, and then add (u, v) to the result set UV. The loop (lines 6–9) calculates the midpoints of the sub-segments from \(s_1s_2\) to \(s_{k-1}s_k\) and gets the identifiers of the cells into which the midpoints fall. Lines 10–12 calculate the midpoint of \(s_ks_\textrm{end}\) and add the corresponding identifier to the result set. Although Algorithm 2 is called by Algorithm 1 during its execution, the time that Algorithm 2 consumes depends only on the number of intersections between the segment and the horizontal and vertical lines of the grid. It is independent of the total number of objects n. Therefore, the time complexity of Algorithm 2 can be considered as O(1) with respect to n.
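The midpoint technique summarized above can be sketched in a few lines. This is an illustrative sketch rather than the listing of Algorithm 2: the function name is hypothetical, the grid is assumed to be anchored at the origin, and the intersections are ordered by their parameter t along the segment, which yields the same sub-segments as ordering by x-coordinate. Segments that pass exactly through a grid corner are not treated specially.

```python
import math

def cells_crossed_by_segment(s_start, s_end, dx, dy):
    """Identifiers (u, v) of the grid cells crossed by the segment."""
    (x0, y0), (x1, y1) = s_start, s_end
    # Parameters t at which the segment meets a vertical or horizontal
    # grid line lying strictly between its endpoints.
    ts = set()
    if x1 != x0:
        lo, hi = sorted((x0, x1))
        for k in range(math.floor(lo / dx) + 1, math.ceil(hi / dx)):
            ts.add((k * dx - x0) / (x1 - x0))
    if y1 != y0:
        lo, hi = sorted((y0, y1))
        for k in range(math.floor(lo / dy) + 1, math.ceil(hi / dy)):
            ts.add((k * dy - y0) / (y1 - y0))
    cuts = [0.0] + sorted(t for t in ts if 0.0 < t < 1.0) + [1.0]
    # The midpoint of each sub-segment identifies the cell it lies in.
    cells = []
    for t0, t1 in zip(cuts, cuts[1:]):
        tm = (t0 + t1) / 2.0
        mx, my = x0 + tm * (x1 - x0), y0 + tm * (y1 - y0)
        cell = (math.floor(mx / dx), math.floor(my / dy))
        if cell not in cells:
            cells.append(cell)
    return cells
```

For a configuration like that of Fig. 6, with two vertical crossings and one horizontal crossing, the segment from (0.5, 0.5) to (2.5, 1.5) on a unit grid is split into four sub-segments, one per crossed cell.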

Discussion on the order of object processing In Algorithm 1, we process each object in the order of “point\(\rightarrow\)polyline\(\rightarrow\)polygon”. Let’s assume the point object collection is \(O_\textrm{point}=\{o_1,o_2\}\), the polyline object collection is \(O_\textrm{polyline}=\{o_3,o_4\}\), and the polygon object collection is \(O_\textrm{polygon}=\{o_5,o_6\}\). The order in which the algorithm processes the objects is \(o_1\), \(o_2\), \(o_3\), \(o_4\), \(o_5\), \(o_6\). This order is chosen to ensure the efficiency and correctness of the algorithm.

To explain the rationale behind this order, let’s review the processing steps for each object. When processing each object \(o_i\), (1) we first identify the candidate object set that may have SGR with it. This involves determining the search range based on the object’s position and selecting all the cells covered by this search range. In the mapping table tbl, the objects in the lists associated with cell identifiers serve as the candidate objects. (2) Next, we determine the objects that truly have SGR with \(o_i\) according to Definition 1. We connect them to \(o_i\) in the FineGeoKG and add properties to the edges following the methods listed in Table 1. (3) Finally, we add the object \(o_i\) itself to tbl. At this point, we need to identify all the cells covered by \(o_i\) and add \(o_i\) to the lists associated with these cell identifiers as keys. We refer to this collection of cells as the representation of \(o_i\).
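The three steps can be sketched as a single function. This is an illustrative sketch only: the mapping table is modeled as a Python dictionary from cell identifiers to lists of object identifiers, and `search_cells`, `repr_cells` and the predicate `has_sgr` (standing in for the test of Definition 1) are hypothetical inputs supplied by the caller.

```python
from collections import defaultdict

def process_object(obj_id, search_cells, repr_cells, tbl, has_sgr):
    """Process one object against the mapping table tbl and return its SGR partners."""
    # Step (1): collect candidates from the cells covered by the search range.
    candidates, seen = [], set()
    for cell in search_cells:
        for other in tbl.get(cell, ()):
            if other not in seen:
                seen.add(other)
                candidates.append(other)
    # Step (2): keep only the candidates that truly have SGR with the object.
    linked = [other for other in candidates if has_sgr(obj_id, other)]
    # Step (3): register the object under every cell of its representation.
    for cell in repr_cells:
        tbl[cell].append(obj_id)
    return linked

# Usage: o1 is a point stored in one cell; o3 is processed afterwards.
tbl = defaultdict(list)
first = process_object("o1", {(0, 0), (0, 1)}, {(0, 0)}, tbl, lambda a, b: True)
second = process_object("o3", {(0, 0), (1, 0)}, {(1, 0), (2, 0)}, tbl, lambda a, b: True)
```

Because an object is added to the table only after its own candidates have been collected, each pair of objects is examined once, when the later of the two is processed.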

Considering the performance of the algorithm, we aim to minimize the number of candidate objects obtained in step (1). In other words, we want the intersection between the cells used to represent the object and the cells used to represent the search range to be as small as possible. To achieve this goal, a heuristic approach is to process point objects before polyline objects. For point objects, we use a single cell to represent them and may need to use four cells to represent their search range (see Footnote 4). For polyline objects, the set of cells used to represent the search range is only slightly larger than the set used to represent the objects themselves. Therefore, when determining the existence of SGR between point objects and polyline objects, it is better to consider the point objects as the objects being searched.

Figure 7 illustrates two scenarios with different processing orders: Fig. 7a shows the order of processing point objects before polyline objects, i.e., processing \(o_1\) first and then \(o_3\), while Fig. 7b shows the order of processing polyline objects before point objects, i.e., processing \(o_3\) first and then \(o_1\). In Fig. 7a, when processing \(o_3\), the algorithm gets the cells covered by its search range (gray area). Since the cell containing \(o_1\) is not covered, \(o_1\) does not become a candidate object. In Fig. 7b, when processing \(o_1\), the algorithm gets the cells covered by its search range (gray square area). One of the cells representing \(o_3\) (red cells) is covered by the gray area, causing \(o_3\) to be mistakenly included as a candidate object. This example illustrates that processing polyline objects first leads to some incorrect point objects being included in the candidate set.

Fig. 7 Processing point and polyline objects in different orders

Considering the correctness of the algorithm and the characteristics of tbl, we must process point objects before polygon objects, because processing polygon objects first would prevent certain correct point objects from being selected as candidates. Figure 8 illustrates two scenarios with different processing orders: Fig. 8a shows processing point objects before polygon objects, i.e., processing \(o_2\) first and then \(o_5\), while Fig. 8b shows processing polygon objects before point objects, i.e., processing \(o_5\) first and then \(o_2\). In Fig. 8a, when processing \(o_5\), the algorithm gets the cells covered by the search range (gray area). Since the cell containing \(o_2\) is covered, \(o_2\) becomes a candidate object. However, in Fig. 8b, when processing \(o_2\), the algorithm needs to determine in which polygons \(o_2\) might be enclosed, but this cannot be directly determined using tbl. Although the list associated with the identifier of the red cell will contain \(o_5\), we cannot determine whether \(o_2\) is inside \(o_5\) based on the information available in tbl. This example illustrates that processing polygon objects first makes it impossible to determine whether a point object has SGR with a polygon object.

Fig. 8 Processing point and polygon objects in different orders

In conclusion, to ensure the performance and correctness of the algorithm, we have determined the processing order as “point\(\rightarrow\)polyline\(\rightarrow\)polygon”.

5 Answering SGR Queries

Using the FineGeoKG, we can perform various queries, including graph pattern matching and graph navigation [8]. For ecological environment analyses, it would be useful to offer SGR queries, which aim at finding object groups in the FineGeoKG that match given geospatial patterns. In this section, we define the SGR query problem (Sect. 5.1) and design algorithms to answer SGR queries (Sect. 5.2).

5.1 Problem Definitions

In ecological environment analyses, a geospatial pattern means a specific layout of neighboring geospatial objects. Ecologists would like to find groups of objects that satisfy a certain pattern. This task can be converted to a query problem on the FineGeoKG. We model the patterns as query graphs. The formal definition of an SGR query graph is as follows.

Definition 4

(SGR Query Graph) An SGR query graph is a small heterogeneous graph consisting of nodes and edges. There are three types of nodes which indicate polygon objects, polyline objects and point objects. The edges indicate the SGR relations between nodes. The conditions on the edges indicate the query ranges of property values.

Figure 9 shows an SGR query graph. It specifies that the desired result should consist of two polygon objects, one polyline object and one point object. The detailed requirements are: (i) two polygon objects \(q_1\) and \(q_2\) should have an SGR relation, the length l of their intersection line should be between 5 km and 10 km, and the relative direction \(\alpha\) of the intersection line should be within \((90^\circ ,180^\circ )\); (ii) the polygon object \(q_2\) should have an SGR relation with a polyline object \(q_3\), the length l of their intersection line could be any value, and the relative direction \(\alpha\) of \(q_3\) w.r.t. \(q_2\) could be any value too; (iii) the polyline object \(q_3\) should have an SGR relation with a point object \(q_4\), the distance d between \(q_3\) and \(q_4\) should be between 2 km and 3 km, and the relative direction \(\alpha\) of \(q_4\) w.r.t. \(q_3\) should be within \((290^\circ ,350^\circ )\).

Fig. 9 An SGR query graph

According to the SGR query graph, we should find out the object groups that can meet the requirements. The formal definition of the SGR query problem is as follows.

Definition 5

(SGR Query) Given a FineGeoKG KG and a query graph P, an SGR query finds all the object groups \(\{O_1,O_2,\ldots ,O_k\}\) in KG that satisfy P.

Figure 10 shows an object group \(\{o_1,o_2,o_3,o_4\}\) satisfying the query graph in Fig. 9. As Fig. 10a shows, \(o_1\) and \(o_2\) have an SGR relation, the length of their intersection line is 6 km, which is within the query range (5 km, 10 km), and the relative direction of the line is \((100^\circ ,160^\circ )\), which is also within the query range \((90^\circ ,180^\circ )\). The objects \(o_3\) and \(o_4\) have an SGR relation, the distance from \(o_4\) to \(o_3\) is 2.5 km, which is within the query range (2 km, 3 km), and the relative direction of \(o_4\) w.r.t. \(o_3\) is \(330^\circ\), which is within the query range \((290^\circ ,350^\circ )\). The objects \(o_1\) and \(o_3\) have an SGR relation. Figure 10b also shows the corresponding subgraph in the KG.
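The range checks in this example can be written down as a small sketch. The dictionary layout and the names `QUERY_EDGES` and `satisfies` are assumptions made for illustration; the ranges are treated as open intervals, a direction given as an interval must lie entirely inside the query range, and ranges that wrap around \(360^\circ\) are not handled.

```python
# Conditions of the query graph in Fig. 9; an empty dict means "any value".
QUERY_EDGES = {
    ("q1", "q2"): {"l": (5.0, 10.0), "alpha": (90.0, 180.0)},
    ("q2", "q3"): {},
    ("q3", "q4"): {"d": (2.0, 3.0), "alpha": (290.0, 350.0)},
}

def satisfies(props, conditions):
    """Check the property values of one KG edge against the query ranges."""
    for name, (low, high) in conditions.items():
        value = props[name]
        # A property is either a single number or an interval (a, b).
        a, b = value if isinstance(value, tuple) else (value, value)
        if not (low < a <= b < high):
            return False
    return True

# Property values read from the example of Fig. 10.
edge_12_ok = satisfies({"l": 6.0, "alpha": (100.0, 160.0)}, QUERY_EDGES[("q1", "q2")])
edge_34_ok = satisfies({"d": 2.5, "alpha": 330.0}, QUERY_EDGES[("q3", "q4")])
```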

Fig. 10 An object group satisfying the query graph in Fig. 9

An SGR query should find all the object groups satisfying the query graph, rather than just the one group shown in Fig. 10. Since the FineGeoKG often contains a large number of nodes and edges, it is time-consuming to enumerate candidate subgraphs and verify whether they match the query graph. Thus, in the next section, we propose an index to improve SGR query performance.

5.2 Query Algorithm

The SGR query problem is in fact a special subgraph matching problem. Considering the characteristics of the query, we propose an algorithm based on the binary join method [33, 34]. Firstly, we briefly introduce the straightforward binary join method. Secondly, we design an index that can support the binary join method for SGR queries. Thirdly, we introduce the proposed query algorithm.

Fig. 11 Naïve binary join

Naïve binary join method The naïve binary join method selects an edge set for each edge requirement in the query graph. Each edge set is stored as a table, and the final results are obtained by natural joins between the tables. As Fig. 11 shows, according to the edge requirement (i) in the query graph, we find the edges satisfying (i) and insert them into table \(T_{12}\), which takes vertices \(q_1\) and \(q_2\) as attributes. In the same way, according to (ii) and (iii), we create tables \(T_{23}\) and \(T_{34}\), respectively. Next, we do natural joins between tables. If two tables \(T_i\) and \(T_j\) have the same attribute \(q_x\), i.e., two edges share the same vertex, we join each tuple from \(T_i\) with each tuple from \(T_j\) when they have the same value on attribute \(q_x\). For example, we perform a natural join between \(T_{12}\) and \(T_{23}\) and obtain a result table. Then we perform a natural join between the result table and \(T_{34}\). The tuples in the final result table are the object groups satisfying the query graph.
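As a rough illustration of the naïve method, the sketch below stores each edge table as a list of rows and joins the tables pairwise. The function `natural_join`, the table names `t_i`, `t_ii`, `t_iii` (one per edge requirement) and the object identifiers are hypothetical.

```python
def natural_join(left, right):
    """Natural join of two tables, each given as a list of {attribute: value} rows."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    # Hash the right table on the shared attributes, then probe with the left one.
    buckets = {}
    for row in right:
        buckets.setdefault(tuple(row[a] for a in shared), []).append(row)
    joined = []
    for row in left:
        for match in buckets.get(tuple(row[a] for a in shared), []):
            joined.append({**row, **match})
    return joined

# One table per edge requirement (i), (ii), (iii) of the query graph.
t_i = [{"q1": "o1", "q2": "o2"}, {"q1": "o7", "q2": "o8"}]
t_ii = [{"q2": "o2", "q3": "o3"}]
t_iii = [{"q3": "o3", "q4": "o4"}, {"q3": "o9", "q4": "o4"}]

groups = natural_join(natural_join(t_i, t_ii), t_iii)
```

Here only the group (o1, o2, o3, o4) survives both joins. Every table is materialized in full before any join, which is the cost that the index described next is meant to reduce.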

Worst case optimal join method The worst-case optimal join (WC) method is a subgraph matching approach whose running time is bounded by the worst-case size of the query result; that is, it is asymptotically optimal on the inputs that produce the largest outputs, rather than on every possible input. Instead of joining two edge tables at a time, it extends partial matches one query vertex at a time and intersects the candidate sets given by all query edges incident to that vertex, which avoids large intermediate tables.

Index of SGR edges In the binary join method, we must find the edges (for example, the edges in \(T_{12}\)) that meet an edge requirement (for example, \({\text {edge}}_i\) in the query graph Q; see Footnote 5). A straightforward way is to check every edge in the knowledge graph KG and select the satisfactory ones; however, this procedure is time-consuming when KG has a large number of edges. To make it faster, we propose an index for the edges that have “sgr” labels.

Fig. 12 Index of SGR queries

The left part of Fig. 12 illustrates the index. Since there are six types of SGR relations, the index consists of six sub-indexes, i.e., the polygon–polygon sub-index, the polygon–polyline sub-index, the polygon-point sub-index, the polyline–polyline sub-index, the polyline-point sub-index, the point–point sub-index. Each sub-index organizes the SGR edges with corresponding vertex types. For example, the polygon–polygon sub-index organizes the SGR edges that have two polygon objects as their vertices. In each sub-index, there are two data structures, i.e., a tree and a bloom filter.

The tree is used to index the edges according to the values of their properties. For example, since the polygon–polygon edges have two properties l and \(\alpha\), we use an R\(^*\)-tree to index the edges on both properties. As another example, since the polyline–polyline edges have only one property \(\alpha\), we use a B\(^+\)-tree to index these edges on \(\alpha\). The bloom filter is used to determine whether an edge exists. The key of an edge is made by concatenating the identifiers of its two vertices. We build the bloom filter from the keys of all the edges. In the query algorithm proposed later, we use the bloom filter to verify whether there is an edge connecting vertices \(o_i\) and \(o_j\) in the knowledge graph KG.

Algorithm 3 Building the index of SGR edges

Algorithm 3 summarizes the procedure of building the index. Firstly, we initialize the bloom filter and the tree of each sub-index, and we assign a distinct number to each of the six sub-indexes (line 1). Next, we check each edge \(r_i \in KG.R\) (i.e., the for loop), where KG.R denotes the edge set of the knowledge graph KG. We insert \(r_i\) into the corresponding sub-index if it has an “sgr” label (line 3). Line 4 gets the type of \(r_i\), which is determined by the geometry types of its vertices. Line 5 inserts \(r_i\) into the tree (i.e., \(idx_\textrm{type}.tree\)) of \(idx_\textrm{type}\). Line 6 gets the vertices \(o_1\) and \(o_2\) of \(r_i\). Line 7 generates the key by concatenating \(o_1.id\) and \(o_2.id\), which are the identifiers of \(o_1\) and \(o_2\). Line 8 inserts the key into the bloom filter (i.e., \(idx_\textrm{type}.bf\)) of \(idx_\textrm{type}\). In Algorithm 3, the for loop inserts one edge into the index in each iteration and repeats m times, where m is the total number of edges. In each iteration, the time complexity of inserting the edge into the tree is \(O(\log m)\), since a tree manages m/6 edges on average. The time complexity of inserting the edge’s key into the bloom filter is \(O(d)\), where d is the number of hash functions, because each hash function is evaluated once. Typically, d is a small constant. Therefore, the time complexity of Algorithm 3 is \(O(m\log m)\).
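A compact sketch of this index-building loop is given below. It is not the authors' code: a sorted Python list on the single property \(\alpha\) stands in for the R\(^*\)-tree or B\(^+\)-tree, the bloom filter is a small hand-written one built on SHA-256, and the names `BloomFilter`, `edge_type`, `edge_key` and `build_index`, the input tuple layout, and the choice to sort the two identifiers before concatenating them are all assumptions.

```python
import bisect
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):  # evaluates each hash function once
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):  # False means "definitely absent"
        return all(self.bits[pos] for pos in self._positions(key))

TYPE_ORDER = {"point": 0, "polyline": 1, "polygon": 2}

def edge_type(type1, type2):
    """Sub-index identifier derived from the geometry types of the two vertices."""
    return tuple(sorted((type1, type2), key=TYPE_ORDER.get))

def edge_key(id1, id2):
    """Key made by concatenating the two identifiers (sorted here, by assumption)."""
    a, b = sorted((id1, id2))
    return f"{a}|{b}"

def build_index(edges):
    """edges: iterable of (label, (id1, type1), (id2, type2), alpha)."""
    index = {}
    for label, (id1, t1), (id2, t2), alpha in edges:
        if label != "sgr":
            continue
        sub = index.setdefault(edge_type(t1, t2), {"tree": [], "bf": BloomFilter()})
        bisect.insort(sub["tree"], (alpha, edge_key(id1, id2)))  # tree stand-in
        sub["bf"].add(edge_key(id1, id2))
    return index

index = build_index([
    ("sgr", ("o1", "polygon"), ("o2", "polygon"), 120.0),
    ("sgr", ("o3", "polyline"), ("o4", "point"), 330.0),
    ("other", ("o5", "point"), ("o6", "point"), 10.0),
])
```

Note that inserting into a sorted list is linear in its length, so the stand-in does not reproduce the \(O(\log m)\) insertion cost of a real tree; it only mirrors the structure of the loop.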

Improved binary join method To make the binary join faster, we propose an improved algorithm based on the above index. With the support of the index, the algorithm can select edges faster and reduce the number of natural joins. The algorithm does edge matching operations in the order of edge priorities. When processing an SGR query, an edge in the query graph has a higher priority if its vertices have not been hit yet. A hit vertex set VS records the vertices of the edges that have been processed already. For an edge \({\text {edge}}_x\), if both vertices of \({\text {edge}}_x\) are not in VS, the priority of \({\text {edge}}_x\) is 2. If only one vertex of \({\text {edge}}_x\) is not in VS, the priority of \({\text {edge}}_x\) is 1. If both vertices of \({\text {edge}}_x\) are in VS, the priority of \({\text {edge}}_x\) is 0. At the beginning, VS is empty and every edge in the query graph has the highest priority 2. We can choose any edge as the first edge for matching. After each edge matching operation, we update VS and choose the next edge with the highest priority for matching.

As an example, we introduce the procedure of determining the edge matching order of the query graph \(Q'\) in Fig. 12. To illustrate all types of the edge matching operations in the improved algorithm, we use a new query graph \(Q'\) obtained by adding two edges (i.e., \({\text {edge}}_{iv}\) and \({\text {edge}}_{v}\)) to the query graph Q in Fig. 11. The right bottom part of Fig. 12 shows the changing process of the priorities. In each iteration, i.e., each column in the figure, the highest priority is marked in red. At the beginning, all of the edges have the same priority 2, and we choose \({\text {edge}}_{i}\) for matching. After selecting the matching edges from KG, VS is updated to \(\{q_1,q_2\}\) and the edges’ priorities are updated too. The \({\text {edge}}_{iii}\) has the highest priority and we choose it for matching. After matching \({\text {edge}}_{iii}\), VS is updated to \(\{q_1,q_2,q_3,q_4\}\) and the priorities are updated too. The \({\text {edge}}_{v}\) becomes the edge with the highest priority. After matching \({\text {edge}}_{v}\), VS is updated to \(\{q_1,q_2,q_3,q_4,q_5\}\) and the priorities are updated. The \({\text {edge}}_{ii}\) and \({\text {edge}}_{iv}\) have the same priority. We choose \({\text {edge}}_{ii}\) as the next edge for matching. Therefore, the matching order of \(Q'\) is as follows.

$$\begin{aligned} {\text {edge}}_{i} \rightarrow {\text {edge}}_{iii} \rightarrow {\text {edge}}_{v} \rightarrow {\text {edge}}_{ii} \rightarrow {\text {edge}}_{iv}. \end{aligned} \tag{1}$$
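The priority rule can be reproduced with a few lines of Python. The sketch below is illustrative: the function name `matching_order` is hypothetical, ties are broken by the order in which the edges are listed, and the endpoints of the five edges of \(Q'\) are taken from the worked example in this section.

```python
def matching_order(edges):
    """edges: dict mapping an edge name to its two query vertices."""
    remaining = dict(edges)
    hit = set()    # the hit vertex set VS
    order = []
    while remaining:
        # Priority of an edge = number of its vertices that are not in VS yet.
        best = max(remaining, key=lambda e: sum(v not in hit for v in remaining[e]))
        order.append(best)
        hit.update(remaining.pop(best))
    return order

Q_PRIME = {
    "i": ("q1", "q2"),
    "ii": ("q2", "q3"),
    "iii": ("q3", "q4"),
    "iv": ("q1", "q4"),
    "v": ("q2", "q5"),
}
order = matching_order(Q_PRIME)  # ['i', 'iii', 'v', 'ii', 'iv']
```

Python's `max` returns the first maximal element, so the edge listed earliest wins a tie, which matches the choices made in the example.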

Next, we introduce the matching operations in detail. The matching operation depends on the priority of the edge. Assume that the edge currently being matched is \({\text {edge}}_{x}\). If its priority is 1 or 2, the procedure of the matching operation is as follows. To find the satisfactory edges in KG, we use the corresponding sub-index. With the support of the tree in the sub-index, we can quickly retrieve the edges that meet the property requirements. To store the matching edges, we create a table with attributes \((q_x^1,q_x^2)\), where \(q_x^1\) and \(q_x^2\) are the vertices of \({\text {edge}}_x\).

If the priority of \({\text {edge}}_x\) is 0, there are two cases of matching operations. (1) The \(q_x^1\) and \(q_x^2\) act as attributes in two tables T and \(T'\), respectively. In this case, we join each tuple \(t_i \in T\) and each tuple \(t_j \in T'\) if the edge \(o_io_j\) exists in KG. (2) The \(q_x^1\) and \(q_x^2\) are in the same table T. In this case, we simply remove the tuples with non-existent edges. In both cases, we need to verify whether an edge exists in KG. To verify the existence of an edge \(o_io_j\), we use the bloom filter of the corresponding sub-index. If the key of \(o_io_j\) cannot pass the bloom filter, \(o_io_j\) is definitely not an edge in KG. If the key can pass the bloom filter, \(o_io_j\) might be an edge in KG, since a bloom filter can report false positives, and we further check whether it actually exists.

To illustrate the whole procedure of answering an SGR query, we use the example in Fig. 12. According to the matching order in Eq. 1, the matching operations are as follows.

  1. We select the edges that can match \({\text {edge}}_i\) by using the tree of the polygon–polygon sub-index, and then we create a table \(T(q_1,q_2)\) to store the edges.

  2. We select the edges that can match \({\text {edge}}_{iii}\) by using the tree of the polyline-point sub-index, and then create a table \(T(q_3,q_4)\) to store the edges.

  3. We select the edges that can match \({\text {edge}}_{v}\) by using the tree of the polygon-point sub-index, and then create a table \(T(q_2,q_5)\) to store the edges. Since \(T(q_1,q_2)\) and \(T(q_2,q_5)\) have the same attribute \(q_2\), we do a natural join between the two tables and obtain a new table \(T(q_1,q_2,q_5)\).

  4. We do the matching operation for \({\text {edge}}_{ii}\). Since its vertex \(q_2\) acts as an attribute in \(T(q_1,q_2,q_5)\) and its vertex \(q_3\) acts as an attribute in \(T(q_3,q_4)\), we join each tuple \(t_i\) in the first table with each tuple \(t_j\) in the second table if the edge \(t_i[q_2]t_j[q_3]\) exists. To verify the existence of the edge, we check whether the key of \(t_i[q_2]t_j[q_3]\) can pass the bloom filter of the polygon–polyline sub-index. After the join operation, we obtain a new table \(T(q_1,q_2,q_5,q_3,q_4)\).

  5. We do the matching operation for \({\text {edge}}_{iv}\). Since its vertices \(q_1\) and \(q_4\) are in \(T(q_1,q_2,q_5,q_3,q_4)\), we remove every tuple \(t_i\) if the edge \(t_i[q_1]t_i[q_4]\) does not exist in KG or the edge \(t_i[q_1]t_i[q_4]\) does not meet the requirements on the properties.

The tuples remaining in \(T(q_1,q_2,q_5,q_3,q_4)\) are the query results. Each tuple consists of a group of objects that can match the query graph \(Q'\).

Algorithm 4 Improved binary join algorithm

Algorithm 4 summarizes the procedure of answering an SGR query. Line 1 computes the edge matching order QE of the query graph \(Q'\). In the for loop, the edges in QE are processed one by one. Line 3 gets the type of \({\text {edge}}_x\). Line 4 gets the vertices (i.e., \(q_x^1\) and \(q_x^2\)) of \({\text {edge}}_x\). Line 5 gets the table \(tbl_1\) where \(q_x^1\) acts as an attribute and line 6 gets the table \(tbl_2\) where \(q_x^2\) acts as an attribute. If there is no table having attributes \(q_x^1\) or \(q_x^2\) (line 7), we find the edges that meet the property requirements with the support of the tree in the corresponding sub-index, and create a new table \(tbl(q_x^1,q_x^2)\) to store the edges found (line 8). The tbl is added into the table set T (line 9). If there is a table having attribute \(q_x^1\) while there is no table having \(q_x^2\) (line 10), we find the satisfactory edges with the support of the tree, and create a new table \(tbl(q_x^1,q_x^2)\) to store the edges found (line 11). Line 12 does a natural join between \(tbl_1\) and tbl and adds the result table into T. In the same way, we deal with the case that there is a table having attribute \(q_x^2\) while there is no table having \(q_x^1\) (lines 13–15). If tables containing \(q_x^1\) and \(q_x^2\) both exist and they are different (line 17), we join each tuple in \(tbl_1\) with each tuple in \(tbl_2\) when the edge \(tbl_1[q_x^1]tbl_2[q_x^2]\) exists. To verify the existence of an edge, we use the bloom filter in the corresponding index (line 18). Line 19 removes \(tbl_1\) and \(tbl_2\) from T and adds the result table tbl into T. If \(q_x^1\) and \(q_x^2\) are in the same table (line 20), we remove the tuples which have non-existent edges (line 21).
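The control flow of the improved binary join can be sketched as below. Several simplifications are assumed and are not part of the paper: the knowledge graph is a dictionary from vertex pairs to edge properties, stored with the same orientation as the query edges; a linear scan (`select_edges`) stands in for the tree lookup; an exact dictionary lookup (`edge_matches`) stands in for the bloom filter test followed by verification; and the query graph is connected. All names and the sample data are hypothetical.

```python
def select_edges(kg, q_edge):
    """Stand-in for the tree lookup: scan for edges of the right type and properties."""
    return [pair for pair, props in kg.items()
            if props["type"] == q_edge["type"] and q_edge["ok"](props)]

def edge_matches(kg, q_edge, a, b):
    """Stand-in for the bloom filter check plus exact verification."""
    props = kg.get((a, b))
    return (props is not None and props["type"] == q_edge["type"]
            and q_edge["ok"](props))

def improved_binary_join(kg, ordered_query_edges):
    tables = []  # each table: {"attrs": set of query vertices, "rows": list of dicts}
    for q_edge in ordered_query_edges:
        qa, qb = q_edge["vertices"]
        ta = next((t for t in tables if qa in t["attrs"]), None)
        tb = next((t for t in tables if qb in t["attrs"]), None)
        if ta is None or tb is None:  # priority 2 or 1: select edges via the index
            new_rows = [{qa: a, qb: b} for a, b in select_edges(kg, q_edge)]
            old = ta if ta is not None else tb
            if old is None:
                tables.append({"attrs": {qa, qb}, "rows": new_rows})
            else:  # natural join on the vertex that was already hit
                shared = qa if ta is not None else qb
                old["rows"] = [{**r, **n} for r in old["rows"] for n in new_rows
                               if r[shared] == n[shared]]
                old["attrs"] |= {qa, qb}
        elif ta is not tb:  # priority 0, vertices in two different tables
            ta["rows"] = [{**r, **s} for r in ta["rows"] for s in tb["rows"]
                          if edge_matches(kg, q_edge, r[qa], s[qb])]
            ta["attrs"] |= tb["attrs"]
            tables = [t for t in tables if t is not tb]
        else:  # priority 0, both vertices in the same table
            ta["rows"] = [r for r in ta["rows"]
                          if edge_matches(kg, q_edge, r[qa], r[qb])]
    return tables

KG = {
    ("a1", "a2"): {"type": "polygon-polygon", "l": 6.0},
    ("a3", "a4"): {"type": "polygon-polygon", "l": 7.0},
    ("c1", "p1"): {"type": "polyline-point", "d": 2.5},
    ("c2", "p3"): {"type": "polyline-point", "d": 2.2},
    ("a2", "p2"): {"type": "polygon-point"},
    ("a1", "p1"): {"type": "polygon-point"},
    ("a2", "c1"): {"type": "polygon-polyline"},
}
QUERY = [  # the five edges of Q' in the matching order i, iii, v, ii, iv
    {"vertices": ("q1", "q2"), "type": "polygon-polygon", "ok": lambda p: 5 < p["l"] < 10},
    {"vertices": ("q3", "q4"), "type": "polyline-point", "ok": lambda p: 2 < p["d"] < 3},
    {"vertices": ("q2", "q5"), "type": "polygon-point", "ok": lambda p: True},
    {"vertices": ("q2", "q3"), "type": "polygon-polyline", "ok": lambda p: True},
    {"vertices": ("q1", "q4"), "type": "polygon-point", "ok": lambda p: True},
]
RESULT_TABLES = improved_binary_join(KG, QUERY)
RESULT = RESULT_TABLES[0]["rows"]
```

On this toy graph the distractor edges (a3, a4) and (c2, p3) are eliminated at the third and fourth steps, leaving a single object group.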

Time complexity analysis In Algorithm 4, each iteration processes one edge from the query graph Q. The loop iterates h times, where h denotes the number of edges in Q. When processing the edge \({\text {edge}}_x\), the algorithm first uses the tree-based index to find all edges that satisfy the query conditions, with a time complexity of \(O(\log m)\), where m is the total number of edges. This step creates a table tbl with attributes \(q_x^1\) and \(q_x^2\). Let t represent the average number of tuples in tbl. Next, the algorithm identifies tables with attributes containing \(q_x^1\) or \(q_x^2\) and performs table joins. The table joins in Line 12 and Line 15 occur between the newly created table tbl and an existing table, with a time complexity of O(ts), where s is the average number of tuples in the existing tables. The table join in Line 18 happens between two existing tables and has a time complexity of \(O(s^2)\). In fact, the maximum value of s is the number of subgraphs in KG that can match \(Q-\{{\text {edge}}_x\}\) (i.e., the query graph with \({\text {edge}}_x\) removed). Estimating this value is not straightforward as it depends on the characteristics of KG and the query graph. Therefore, the time complexity of Algorithm 4 is \(O(h(\log m + ts))\) or \(O(hs^2)\), as \(O(\log m + ts)\) and \(O(s^2)\) cannot be simply combined or replaced by each other.

6 Experimental Analysis

6.1 Experimental Setup

Experimental environments We implemented the proposed algorithm and the comparative algorithms using Java. All the algorithm experiments were conducted on a machine with an Intel(R) Xeon(R) W-2123 CPU (3.60 GHz), 32 GB of memory, and a Windows 10 x64 operating system. All the code and datasets have been uploaded to GitHub (see Footnote 6).

Datasets We evaluate the algorithms by utilizing geospatial objects collected from OpenStreetMap (OSM) in Shanghai. Table 2 summarizes the statistics of the four datasets (\(D_{1k}\), \(D_{10k}\), \(D_{100k}\), \(D_{1m}\)). Each dataset consists of various geospatial objects, such as polygons (e.g., buildings), polylines (e.g., highways), and points (e.g., restaurants). The subscript in the dataset name indicates its approximate size. Each dataset (e.g., \(D_{10k}\)) contains approximately ten times as many objects as the previous one (e.g., \(D_{1k}\)).

Table 2 Statistics of the datasets

6.2 Experiments of Geospatial Interlinking Algorithms

To construct a FineGeoKG, the key problem is to find out the SGRs between geospatial objects and interlink them. Considering the specific application of ecological spatial pattern analysis, we set the parameter \(\epsilon\) to 1 km. The value of \(\epsilon\) determines the number of point–point SGRs and point-polyline SGRs. Since \(\epsilon\) is set to a small value, the count of these two types of SGRs is relatively low, as shown in Fig. 14b.

In this section, we compare the performances of the three algorithms for geospatial interlinking, namely, the R-tree-based method (RBA), the grid-based method (GBA), and the method we proposed in Algorithm 1 (FSGR). When implementing RBA, we set the node capacity of the R-tree to 10.

Fig. 13 Comparison of geospatial interlinking algorithms in terms of time and space consumption

Figure 13a compares the time consumption of the GBA, RBA, and FSGR algorithms. As the dataset size increases, the time consumption of the algorithms also increases. The proposed FSGR algorithm is faster than RBA and GBA. The reason is that RBA and GBA set the search range for finding candidate objects to the Minimum Bounding Rectangle (MBR) of the polygon, or to the collection of MBRs of all segments in the polyline, whereas FSGR sets the search range for a polygon to all the cells it covers and sets the search range for a polyline by expanding each segment using \(\epsilon\) and considering all the cells it covers. Therefore, FSGR has a more accurate search range than RBA and GBA, which greatly reduces the chance of incorrect objects being selected into the candidate set and makes FSGR faster.

Figure 13b compares the space consumption of the GBA, RBA, and FSGR algorithms. As the dataset size increases, the memory consumption of the algorithms also increases. FSGR consumes less memory than GBA and RBA. The memory consumption of the three algorithms mainly comes from the indexes they use. The index of GBA requires more grid cells to represent point objects, polyline objects, and polygon objects, whereas FSGR improves upon GBA and uses as few cells as possible to represent these three types of objects. The index of RBA needs to store a tree structure that indexes all objects, which occupies more space than the mapping table used in FSGR and GBA.

Figure 14a illustrates the number of SGRs identified across various datasets. On average, in \(D_{1k}\) and \(D_{10k}\), each object exhibits relations with one or two other objects, whereas in \(D_{100k}\) and \(D_{1m}\), each object displays relations with three or four objects. Furthermore, Fig. 14b provides insights into the distribution of SGRs of different types within \(D_{1m}\). Notably, the largest number of SGRs corresponds to polyline–polyline relations, followed by polyline–polygon relations. Point-polyline SGRs come in third place due to the composition of objects in \(D_{1m}\): polyline objects account for 64%, polygon objects account for 24%, and point objects account for 12% (as shown in Table 2).

Fig. 14 Comparison of the number and types of SGRs

In order to assess the performance of the three algorithms in detecting SGRs of different types, we partition each dataset into three subsets, each containing a specific object type. For instance, \(D_{1k}\) is divided into \(D_{1k}^1\), \(D_{1k}^2\), and \(D_{1k}^3\), where the superscript denotes the object type. Superscript 1 represents “point”, superscript 2 represents “polyline”, and superscript 3 represents “polygon”. Thus, \(D_{1k}^1\), \(D_{1k}^2\), and \(D_{1k}^3\) respectively correspond to the point objects, polyline objects, and polygon objects from \(D_{1k}\). These segmented datasets allow us to evaluate the algorithms’ performance in identifying point–point SGRs, polyline–polyline SGRs, and polygon–polygon SGRs. To evaluate other types of SGRs, such as point-polyline SGRs, we merge \(D_{1k}^1\) and \(D_{1k}^2\) into a single dataset denoted as \(D_{1k}^{12}\). Similarly, we create \(D_{1k}^{13}\) and \(D_{1k}^{23}\).

Fig. 15 Performance comparison of algorithms in searching different types of SGRs (ms)

Figure 15 presents the time consumption and the number of SGRs discovered by the three algorithms across different datasets. It is important to note that when computing SGRs involving different types of objects (e.g., point-polygon SGRs), we deliberately avoid identifying SGRs of the same object type (e.g., point–point SGRs). In the figures, the time consumption of FSGR is lower compared to RBA and GBA. When comparing the six types of SGRs, the time consumption for point–point SGRs is the lowest. This can be attributed to the relatively fewer point objects and the lower memory requirements of the index structure. On the other hand, the time consumption for polyline–polyline SGRs is the highest, as it involves a larger number of polyline objects and requires more memory for indexing. Additionally, the computation of spatial relations and SGR properties for polygons generally takes more time compared to points and polylines. However, since the datasets contain fewer polygons than polylines, the time consumption for polygon–polygon SGRs is not significantly high.

Fig. 16 Memory consumption of algorithms in searching different types of SGRs (MB)

To further evaluate the memory consumption of these algorithms when discovering different types of SGRs, we used the same dataset partitions as in the time-based tests; the results are shown in Fig. 16. The search for different types of SGRs has minimal impact on the memory space occupied by the three indexes, with RBA occupying the most space and FSGR the least. However, due to variations in the number of objects of each type, the point–point SGRs consume the least space, while the polyline–polyline SGRs consume the most.

6.3 Experiments of Spatial Queries on FineGeoKG

To observe the impact of query graph complexity on algorithm performance, we introduce the parameter r to represent the number of edges in the query graph (r = 1, 2, 4, 8). We set various typical structures for query graphs with different numbers of edges. Based on these structures, we randomly assign one type (point, polygon, polyline) to each node. Figure 17 shows some query graphs used in the experiment. Each query graph is queried 10 times and the average execution time is taken as the time consumption.

Fig. 17 Some query graphs used in the experiment

We use the Binary join (BJ), Worst case (WC) and SUFF [32] as the baseline algorithms to compare with our proposed algorithm (IB). BJ and WC are traditional algorithms for solving subgraph matching problems. SUFF is chosen because it is currently the state-of-the-art (SOTA) method for solving subgraph matching problems.

Table 3 Comparison of time consumed for index building (seconds)

Since both the SUFF and IB algorithms involve the creation of indexes, Table 3 compares the time consumed for building the indexes. As the dataset size increases, the time required for index creation also increases. For the same dataset size, building the IB index takes significantly less time than building the SUFF index. Both IB and SUFF utilize bloom filters in their indexes, but IB only requires six bloom filters to be created (as shown in Fig. 12), while SUFF needs to create a large number of bloom filters to record all common subgraph patterns.

Table 4 illustrates the memory consumption of BJ, WC, SUFF, and IB during runtime. As the dataset size increases, the memory consumption also increases. BJ, WC, and SUFF are sensitive to changes in r because as r increases, BJ and WC need to perform join operations on more and larger tables in memory, and SUFF needs to load more bloom filters into memory for filtering operations. On the other hand, IB is insensitive to changes in r because it uses the same index regardless of the value of r. Its index is relatively small: it contains six bloom filters and six trees. The trees index all SGR edges, resulting in a space complexity of \(O(m)\), where m is the total number of edges. During program execution, the entire index can be loaded into memory.

Table 4 Comparison of memory consumption by structural queries (MB)

Figure 18 compares the time consumed by BJ, WC, SUFF, and IB when solving structural queries, with the number of edges in the query graph (r) being 1, 2, 4, and 8. When the query graph is relatively simple, such as having only one edge (\(r=1\)) or two edges (\(r=2\)), IB is faster than BJ and WC. To more clearly show how much faster IB is than the baseline algorithms in different situations, we use the following formula to compare the running time of different algorithms.

$$\begin{aligned} {\text {ratio}}_{\textrm{algo}_1,\textrm{algo}_2} = \frac{T_{\textrm{algo}_2} - T_{\textrm{algo}_1}}{T_{\textrm{algo}_2}}. \end{aligned} \tag{2}$$

In this formula, \({\text {ratio}}_{\textrm{algo}_1,\textrm{algo}_2}\) measures how much faster \({\text {algo}}_1\) is than \({\text {algo}}_2\), where \(T_{\textrm{algo}_1}\) and \(T_{\textrm{algo}_2}\) are the times taken by \({\text {algo}}_1\) and \({\text {algo}}_2\), respectively. A positive ratio means that \({\text {algo}}_1\) is faster, and a negative ratio means that it is slower. As Fig. 18a shows, when \(r=1\), \({\text {ratio}}_\textrm{IB,BJ}\) and \({\text {ratio}}_\textrm{IB,WC}\) are largest on the \(D_{1k}\) dataset, at 72.53% and 94.16%, respectively. As Fig. 18b shows, when \(r=2\), they are largest on the \(D_{10k}\) dataset, at 64.67% and 87.02%, respectively. As Fig. 18c shows, when \(r=4\), they are largest on the \(D_{10k}\) dataset, at 80.11% and 84.83%, respectively. As Fig. 18d shows, when \(r=8\), they are largest on the \(D_{10k}\) dataset, at 50.89% and 44.77%, respectively. In summary, compared with the baseline algorithms (BJ and WC), IB has a significant speed advantage on small datasets (\(D_{1k}\), \(D_{10k}\)), but the advantage is less pronounced on large datasets. In particular, when the query graph is relatively complex, the time consumed by IB even slightly exceeds that of the baseline algorithm (\(r=8\), \(D_{100k}\), Fig. 18d). Additionally, the time consumption of IB, BJ, and WC follows roughly the same trend as the dataset size increases, because all three involve join operations between tables.
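The ratio in Eq. (2) is straightforward to compute. The sketch below uses made-up timings purely for illustration (they are not measurements from Fig. 18), and the function name `speedup_ratio` is our own.

```python
def speedup_ratio(t_algo1, t_algo2):
    """Relative time saved by algo1 with respect to algo2, as in Eq. (2).

    Returns a fraction: 0.75 means algo1 needs 75% less time than algo2;
    a negative value means algo1 is slower than algo2.
    """
    if t_algo2 <= 0:
        raise ValueError("t_algo2 must be positive")
    return (t_algo2 - t_algo1) / t_algo2


# Illustrative timings in milliseconds (hypothetical, not from the paper).
faster = speedup_ratio(25.0, 100.0)   # algo1 takes a quarter of the time
slower = speedup_ratio(120.0, 100.0)  # algo1 is slower, so the ratio is negative
```

Because the ratio is normalized by \(T_{\textrm{algo}_2}\), it is bounded above by 1 (100%) but unbounded below, so a large slowdown produces a large negative value.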

Fig. 18 Comparison of time consumed by structural queries (ms)

The basic idea of SUFF differs from that of IB, BJ, and WC: it builds and uses common subgraph filters to reduce the search space. SUFF is faster than the other three algorithms on small datasets (\(D_{1k}\)), but its speed decreases significantly as the dataset size increases. IB outperforms SUFF when handling complex query graphs (\(r=4, r=8\)) on large datasets (\(D_{1m}\)). The speed improvement of IB over the baseline methods is attributed to its index over SGR edges, which comprises bloom filters and trees.

When the query graph becomes more complex, with four edges (\(r=4\)) or eight edges (\(r=8\)), IB performs at a speed similar to that of BJ and WC, and their time consumption curves exhibit similar trends. As the number of join operations between tables increases, the performance advantage that IB gains through indexing is gradually overshadowed by the time consumed by the joins. SUFF is faster than the other three algorithms on small datasets (\(D_{1k}\), \(D_{10k}\)), but its performance deteriorates as the dataset size increases. This is primarily because SUFF stores its bloom filters in files, so larger datasets lead to a large number of file read operations.

Figure 19 compares the time consumed by IB, BJ, and WC in handling semantic queries. Since SUFF cannot directly handle semantic queries, it was not included in this experiment. The number of edges in the query graph (r) is still set to 1, 2, 4, and 8. When the query graph is not complex (i.e., \(r = 1, 2, 4\)), IB takes less time than BJ and WC across different dataset sizes, and the time trends of IB are similar to BJ and WC. However, when the query graph becomes complex (i.e., \(r = 8\)), IB performs better than BJ and WC only on small datasets (\(D_{1k}\), \(D_{10k}\)), but its performance declines as the dataset size increases. The reason for this phenomenon is that on small datasets, IB can leverage its indexes to gain performance advantages. However, as the dataset size increases, the size of the tables also increases, and the time consumed by join operations between tables outweighs the time saved by utilizing the indexes.

Fig. 19 Comparison of time consumed by semantic queries (ms)

Figure 20 presents the structural query graphs and the corresponding results obtained from the FineGeoKG of Shanghai. The figure includes four structural queries, namely \(Q_1\), \(Q_2\), \(Q_3\), and \(Q_4\). Each query graph is accompanied by an example result (\(R_1\)) shown to its right, in which the matched objects are highlighted in different colors. Such structural queries help ecologists identify groups of geospatial objects that conform to specific spatial patterns.

Fig. 20 Structural queries on FineGeoKG

Figure 21 displays the semantic query graphs and their corresponding results. The figure shows four queries, namely \(Q_5\), \(Q_6\), \(Q_7\), and \(Q_8\), with their query graphs depicted in Fig. 21a, c, e, g. Each query graph consists of vertices labeled with specific semantic categories, such as waterway, park, and supermarket, and its edges represent ranges of property values. These query graphs are more complex than those shown in Fig. 20. The panels to the right of the query graphs present example query results. Such semantic queries help researchers in urban planning analyze facility layouts and distributions.

Fig. 21 Semantic queries on FineGeoKG

7 Conclusions

In this paper, we propose FineGeoKG, a geospatial knowledge graph characterized by SGRs. SGRs capture the neighboring relations between ground objects, which are underemphasized in existing knowledge graphs. Using FineGeoKG, ecologists can find groups of ground objects that match given spatial patterns. To extract SGRs quickly, we improve the geospatial interlinking algorithm and evaluate it on real datasets. To answer SGR queries quickly, we improve the binary join method based on the proposed SGR edge index. To illustrate the usefulness of FineGeoKG, we demonstrate the results of structural queries and semantic queries. An interesting future direction is to design efficient algorithms that answer more complicated spatial pattern queries by exploiting the characteristics of FineGeoKG.