Our approach to aligning the boundaries of polygons builds upon our previous work on automatically repairing the small gaps and overlaps in planar partitions where all polygons are stored independently (following the Simple Features paradigm (OGC 2006), e.g. in a shapefile). In Arroyo Ohori et al. (2012), we used a constrained triangulation (CT) as a supporting structure because, as explained below, it permits us to fill the whole spatial extent of the datasets with triangles, and these triangles then allow us to easily identify the gaps and overlaps between different polygonal datasets. The idea is to label each triangle with the label of the polygon it decomposes: gaps will have no label, and regions where polygons overlap will have more than one. Repairing implies relabelling triangles so that each triangle has one and only one label.
We first briefly describe in Sect. 3.1 the original algorithm for repairing planar partitions. Then, in Sect. 3.2, we describe our extensions to this algorithm so that the boundaries of datasets can be properly aligned when the datasets are formed by several polygons and when the gaps/overlaps are large. We also propose two modifications to our approach (in Sects. 3.3, 3.4) so that two common conflation issues faced by practitioners can be solved: how to perform horizontal conflation of polygons against a boundary represented as a line, and spatial extent conflation.
Using a CT to repair a planar partition
The workflow of Arroyo Ohori et al. (2012) is illustrated in Fig. 4 and is as follows:
the CT of the input segments forming the polygons is constructed;
each triangle in the CT is labelled with the identifier of the polygon inside which it is located;
problems are detected by identifying triangles with no label or with two or more labels;
gaps/overlaps are fixed locally with the most appropriate label;
modified polygons are reconstructed and returned in a GIS format (e.g., a shapefile).
Constrained triangulations. A constrained triangulation (CT) permits us to decompose one or more polygons into non-overlapping triangles; Fig. 5 shows an example. Notice that no edges of the triangulation cross the constraints (the boundaries of the polygon). It is known that any polygon (even one with holes) can be triangulated without adding extra vertices (de Berg et al. 2000; Shewchuk 1997). In our original approach, the triangulation was performed by constructing a CT of all the segments representing the boundaries (exterior and interior) of each polygon.
If, as in Fig. 5, two polygons are adjacent by one edge e, then e will be inserted twice. Doing this is usually not a problem for triangulation libraries because they ignore points and segments at the same location (as is the case with the solution we use, see Sect. 4). Likewise, when edges are found to intersect, they are split with a new vertex created at the intersection point. These are the only vertices that are added during the conflation process.
Labelling triangles. The labels are assigned to the triangles by first labelling the triangles adjacent to the edges of each polygon, and then visiting all the triangles with graph-based algorithms (i.e., depth-first search) without traversing constrained edges (the original boundaries of the input polygons). Triangles located inside the convex hull of the dataset, but not decomposing any polygons, are labelled with a special label ‘universe’. See Arroyo Ohori et al. (2012) for the details.
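This label propagation can be sketched as follows on a toy triangulation represented as an adjacency graph. The triangle ids, the seed placement, and the breadth-first variant of the traversal are our own simplifications: the original method seeds from the triangles adjacent to each polygon's boundary edges and uses a depth-first search over a real CT.

```python
from collections import deque

def label_triangles(adjacency, constrained, seeds):
    """Spread polygon labels across triangles without ever crossing
    a constrained edge (an original polygon boundary).

    adjacency   -- dict: triangle id -> neighbouring triangle ids
    constrained -- set of frozensets {t, n}: neighbour pairs that are
                   separated by a constrained edge
    seeds       -- list of (triangle id, label); overlapping polygons
                   simply contribute several labels to a triangle
    """
    labels = {t: set() for t in adjacency}
    for seed, label in seeds:
        visited = {seed}
        queue = deque([seed])
        labels[seed].add(label)
        while queue:
            t = queue.popleft()
            for n in adjacency[t]:
                # never traverse an original polygon boundary
                if frozenset((t, n)) in constrained or n in visited:
                    continue
                visited.add(n)
                labels[n].add(label)
                queue.append(n)
    return labels
```

On a strip of four triangles with a single constrained edge between triangles 1 and 2, a polygon seed on one side and a 'universe' seed on the other each label exactly the two triangles on their own side of the constraint.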
Identifying problems: gaps and overlaps. If the set of input polygons forms a planar partition, then every triangle will be labelled with one and only one label. Problems are easily identified: gaps are formed of triangles having no label, and overlaps of triangles having two or more labels.
Repairing problems: relabelling triangles. Repairing a gap or an overlap simply involves relabelling the triangles with an appropriate label; the label assigned should be the same as that of one of the three neighbours, otherwise regions can become disconnected. Arroyo Ohori et al. (2012) propose six repair operations. Four of them use triangles as a base: the label assigned is based on that of the three neighbouring triangles, for example the label present in the largest number of adjacent triangles is assigned (this method is used in Fig. 4e). Triangle-based operators are fast (purely local operations that are performed in constant time) and modify the area of each input polygon the least. However, the shape of the resulting polygons can be significantly different from the original (because of the ‘spikes’ that are created).
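A minimal sketch of one such triangle-based operator follows; the names and the single-label voting rule are our simplifications, and ties as well as the other three triangle-based operators are not modelled.

```python
from collections import Counter

def repair_triangle_based(labels, adjacency):
    """One pass of a triangle-based repair: every problematic
    triangle (no label, or two or more labels) takes the label held
    by the largest number of its (at most three) neighbours.
    New labels are collected first and applied at the end, so the
    pass does not depend on the visiting order."""
    pending = {}
    for t, ls in labels.items():
        if len(ls) == 1:               # already has exactly one label
            continue
        votes = Counter()
        for n in adjacency[t]:
            if len(labels[n]) == 1:    # only unambiguous neighbours vote
                votes[next(iter(labels[n]))] += 1
        if votes:
            pending[t] = {votes.most_common(1)[0][0]}
    labels.update(pending)
    return labels
```

Several passes may be needed when a problem region is more than one triangle wide, since a triangle can only take a vote from an unambiguous neighbour.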
The other two methods use regions of adjacent triangles with equivalent sets of labels, which is slower but generally yields results in which the polygons have fewer spikes. Figure 6 shows one example where each of the eight problematic regions of Fig. 4a is assigned one label. The label is obtained with the longest boundary method, i.e., the boundary of the problem region is first decomposed according to the label of the triangles incident to it (but outside the problem region), and second the label is that of the longest portion of the boundary.
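The longest-boundary rule can be sketched as follows; the flat list-of-edges input is our simplification of walking the region's boundary in the triangulation.

```python
def longest_boundary_label(boundary_edges):
    """Label a whole problem region with the longest-boundary rule:
    the region boundary is decomposed by the label of the triangle
    on its outside, and the label owning the greatest total length
    of boundary wins.

    boundary_edges -- list of (edge_length, outside_label) pairs
    """
    totals = {}
    for length, label in boundary_edges:
        totals[label] = totals.get(label, 0.0) + length
    return max(totals, key=totals.get)
```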
Notice that these repair operations can be used one after the other: for instance, if the repair according to the largest number of adjacent triangles has a tie, then this is resolved by using another method (or by choosing one randomly)—this was used in Fig. 4e.
The extensions necessary for horizontal conflation of polygons
To align boundaries of polygons, we extend and modify the repair algorithm described above. The extended algorithm to align the boundaries of two or more datasets is described in Algorithm 1 and contains three improvements: (1) priority list of datasets; (2) gaps and overlaps treated differently; (3) subdivision of gap regions by adding extra constrained edges.
Priority list of datasets
The first extension is that we use a new operator based on a priority list of datasets. It is a generalisation of the concept of a reference dataset often used for edge-matching. We extend this concept so that several datasets can be used all at once (instead of performing edge-matching with only two datasets). The input of the algorithm has a priority list with all the datasets involved, ordered based on a given criterion. The criterion is usually the accuracy of the datasets, but others can be used. In a given problem region between two adjacent datasets, the one with the lowest accuracy (priority) should be moved/modified. The first dataset in the list is the master and the others are its slaves; for instance, referring to Fig. 7, the list would be [M, T, S], since we assume that M has the highest accuracy, and S the lowest.
Gaps and overlaps handled differently
The gap and overlap regions are handled differently, and this allows us to modify the boundaries of datasets in a way that is consistent with the master-slave paradigm. A second modification is that the labels assigned to each triangle are formed by a tuple of the dataset and the unique identifier of the polygon: (dsid, pid). For an overlap region, the label used to relabel all triangles is that whose dsid is the highest in the priority list, e.g., in Fig. 7c the three overlapping regions (red regions) are filled with polygons from the dataset M (having labels \(m_1, m_2\) and \(m_3\)). For a gap, this label is the lowest in the priority list; the candidate labels are those adjacent to the gap region. If a gap region is adjacent to more than one polygon from the same dataset (see for instance Fig. 7b, where both \(s_1\) and \(s_2\) could be assigned to the region), then the ID can be determined with one of the repair methods mentioned above, such as the longest boundary.
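These two rules can be sketched together as follows; the ranking helper and the input encoding are ours, and a tie between several polygons of the same dataset (which would fall back to an operator such as longest boundary) is not modelled.

```python
def choose_label(candidates, priority, is_overlap):
    """Pick one (dsid, pid) label for a problem region using the
    dataset priority list: an overlap keeps the polygon of the
    highest-priority dataset, while a gap is filled by the
    lowest-priority adjacent one.

    candidates -- labels covering (overlap) or adjacent to (gap)
                  the region, as (dsid, pid) tuples
    priority   -- dataset ids ordered from master to last slave
    """
    rank = {ds: i for i, ds in enumerate(priority)}
    pick = min if is_overlap else max   # lowest rank = highest priority
    return pick(candidates, key=lambda label: rank[label[0]])
```

For the priority list [M, T, S] of Fig. 7, an overlap covered by M and S keeps the M polygon, while a gap adjacent to T and S is filled by the S polygon.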
However, aligning boundaries with this approach gives unsatisfactory results for gaps, that is, we obtain results that are far from what a human being would do manually. For instance, in Fig. 7, the gap between the datasets is entirely assigned to the polygon \(s_2\), while we could argue that some parts should be assigned to \(s_1\) and \(t_1\). Observe that the overlap regions do not suffer from this problem since the constrained edges divide the region; in Fig. 7c the top overlap region between M and S is divided into two sub-regions by the constraints, and each gets the appropriate label (\(m_1\) and \(m_2\)).
Subdividing gap regions
To improve the labels assigned to gaps, we propose a heuristic to subdivide the gap regions into subregions. This is achieved by inserting extra constrained edges in the triangulation (as shown in Fig. 8). These constrained edges are not new edges, but rather existing edges of the triangulation that are flagged as constraints (thus no complex geometric operations are involved). Observe that while a generic constrained triangulation was sufficient to perform validation and repair in our original work (Arroyo Ohori et al. 2012), here a constrained Delaunay triangulation (CDT) is required. A CDT is a triangulation for which the triangles are as equilateral as possible (Chew 1987). Since we use the edges of the triangles (and their lengths), having well-shaped triangles is an advantage.
For a given gap region, we proceed as follows (see Algorithm 2). First, we visit each vertex v on the boundary of the gap (there are nine in Fig. 8a) and identify split vertices: vertices whose number of incident constrained edges in the CDT is more than 2. This allows us to identify where different polygons of the same dataset are adjacent (e.g., vertices a and d in Fig. 8a) and also where different datasets are adjacent (e.g., vertices b, c and e in Fig. 8a).
Second, for each split vertex, we try to insert one constrained edge inside the gap region (which means here that the edge is not on the boundary of the region). An edge both of whose endpoints are split vertices is favoured; if there is none, then the shortest edge is inserted as a constraint. Only edges that are shorter than a user-defined maximum length are inserted as constraints. It is possible that for a candidate vertex no constraint is inserted, either because there are no incident edges inside the gap region (e in Fig. 8a), or because the constraint is already present. In Fig. 8a, the extra constraint bd is added twice; the second insertion does not modify the CDT.
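The choice of the edge to promote at a split vertex can be sketched as follows; the input encoding and the max_len parameter name are ours, while the preference order and the user-defined maximum length follow the text.

```python
def pick_extra_constraint(candidate_edges, split_vertices, max_len):
    """Choose one interior edge of the gap region, incident to a
    split vertex, to flag as a constraint: an edge joining two
    split vertices is preferred; otherwise the shortest one.
    Edges longer than max_len are never used.

    candidate_edges -- (length, u, v) edges inside the gap region
                       incident to the split vertex
    split_vertices  -- vertices with more than 2 incident
                       constrained edges
    """
    usable = [e for e in candidate_edges if e[0] <= max_len]
    preferred = [e for e in usable
                 if e[1] in split_vertices and e[2] in split_vertices]
    pool = preferred or usable
    return min(pool) if pool else None   # min: shortest edge first
```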
The algorithm for labelling gap regions is then used as described in Sect. 3.1, but now each subregion is processed separately; after the insertion of the extra constraint in Fig. 8a there are thus four gap regions. In Algorithm 1, the relabelling of regions is not applied to the CDT directly (since the result of one relabelling could influence another, creating a dependence on the order in which the triangles are visited). Instead, triangles that have been relabelled are saved in a separate list, and only after all the triangles have been visited are the new labels applied. Because the insertion of extra constraints isolates subregions, not all subregions can be directly relabelled. A subregion can be relabelled only if it is adjacent to two or more datasets. If a subregion is adjacent to only one (e.g., the lower-right orange region in Fig. 8a), then it is not repaired until one of its adjacent subregions is relabelled (Fig. 8d). This implies that several ‘passes’ over all the triangles have to be performed; each pass tries to find a new label for triangles having no label, and these labels are applied at the end of the pass.
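The multi-pass relabelling can be sketched at the level of subregions; the callback-based encoding is ours (in Algorithm 1 the passes run over the triangles of the CDT), and the requirement that a subregion touch two or more datasets before being repaired would live inside the callback.

```python
def relabel_in_passes(labels, adjacent, relabel):
    """Repeated passes over the gap subregions: a subregion is
    given a label derived only from its already-labelled
    neighbours, and all new labels of a pass are applied at its
    end, so the result does not depend on the visiting order.

    labels   -- dict: subregion id -> label, or None (unrepaired)
    adjacent -- dict: subregion id -> neighbouring subregion ids
    relabel  -- callback(neighbour_labels) -> label, or None if
                the subregion cannot be repaired yet
    """
    changed = True
    while changed:
        pending = {}
        for r, current in labels.items():
            if current is not None:
                continue
            neighbour_labels = {labels[n] for n in adjacent[r]
                                if labels[n] is not None}
            new = relabel(neighbour_labels)
            if new is not None:
                pending[r] = new
        labels.update(pending)   # apply only at the end of the pass
        changed = bool(pending)
    return labels
```

In a chain of subregions where only the first is labelled, the label reaches the last one after as many passes as the chain is long, mirroring the behaviour of Fig. 8d.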
Our heuristic has the added benefit of connecting lines that are the boundaries of polygons and/or datasets, and of better preserving the area of the polygons, since gap regions are split and the subregions are assigned to different polygons. We demonstrate in Sect. 4 the results we obtained with real-world datasets.
Edge-matching to a linear boundary
In practice, international and delimiting boundaries between countries and administrative entities are often available as lines, because these are agreed upon by the neighbouring parties. A trivial modification of our algorithm can be made for such cases: the linear boundary is converted to a polygon. As shown in Fig. 9, this can be easily done by first offsetting the line by a distance d (to either side of the line), and then linking the two lines (this can be seen as a ‘half-buffer’). We demonstrate in Sect. 4 how this performs with real-world datasets.
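A minimal sketch of this ‘half-buffer’ follows. The per-vertex offsetting along the left normal is our simplification: it assumes the offset line does not self-intersect, which a robust buffer operation in a GIS library would handle.

```python
import math

def half_buffer(line, d):
    """Offset every vertex of a polyline by distance d along the
    left normal of the local direction, then link the original and
    offset lines into one closed ring (a 'half-buffer').

    line -- list of (x, y) vertices of the boundary line
    """
    offset = []
    for i, (x, y) in enumerate(line):
        ax, ay = line[max(i - 1, 0)]       # previous vertex (or self)
        bx, by = line[min(i + 1, len(line) - 1)]  # next vertex (or self)
        dx, dy = bx - ax, by - ay
        length = math.hypot(dx, dy)
        # left normal of the averaged direction at this vertex
        offset.append((x - dy / length * d, y + dx / length * d))
    # walk the original line forward and the offset line backward
    return line + offset[::-1]
```

For a horizontal line, the result is simply a thin rectangle of height d sitting on the line.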
Spatial extent conflation
Another minor modification to Algorithm 1 allows us to perform what we refer to as spatial extent conflation. This is used to ensure that a set of polygons fits exactly in a spatial extent polygon that represents, for instance, the entity higher in the hierarchy. It is a common problem for practitioners who deal with administrative and territorial units, among others. Each country is divided into administrative units at different administrative levels in a hierarchy. In the INSPIRE directive, the hierarchy goes from the national level to the 6th level (sub-city level in most cases) (INSPIRE 2014). All the units at one level should fit exactly inside their parent unit, and there should not be any gaps.
Figure 10 shows the main modification to our algorithm: the polygon representing the spatial extent becomes a hole of a larger polygon (whose shape is irrelevant; it can be obtained by enlarging the axis-aligned bounding box of all the polygons by 10 %, for instance).
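Such an outer rectangle can be derived from an enlarged bounding box, for example as follows (a sketch; the 10 % factor follows the text, while the function name is ours):

```python
def enlarged_bbox(points, factor=0.1):
    """Axis-aligned bounding box of all input vertices, enlarged on
    every side by a fraction of its width/height (10 % by default).
    The resulting rectangle serves as the outer ring whose hole is
    the spatial-extent polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    dx = (max(xs) - min(xs)) * factor
    dy = (max(ys) - min(ys)) * factor
    return (min(xs) - dx, min(ys) - dy, max(xs) + dx, max(ys) + dy)
```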
The only modification necessary to Algorithm 1 is that a new label, let us call it spatialextent, is assigned to the triangles decomposing this polygon and it is assigned the highest priority. Triangles having this label are ignored during the process of reconstruction of the polygons.
In practice, the spatial extent polygon is not always a single polygon. As Fig. 11 shows, municipalities or counties are often disconnected (e.g., due to islands) and have inner rings (holes). Thus, the input polygon can be obtained by first generating a large rectangle containing all the spatial extent polygons, and then computing the Boolean difference between the two.
Observe that aligning polygons to a spatial extent allows us to align the boundaries of very large hierarchical datasets, since we can align them individually at each level. As shown in Fig. 12, if we begin the alignment at the highest level (the first level being the spatial extent of the polygons of the second level), and then continue iteratively to the lower administrative levels using the previously generated results (i.e., in Fig. 12 the seven polygons are used as spatial extents for the third level, and so on), then the memory footprint of the alignment should remain relatively small. That is, the process is bounded by the size of the largest polygon and its decomposition one level lower. We demonstrate this in the next section with real-world datasets.