Solving the horizontal conflation problem with a constrained Delaunay triangulation
Datasets produced by different countries or organisations are seldom properly aligned and contain several discrepancies (e.g., gaps and overlaps). This problem has been so far almost exclusively tackled by snapping vertices based on a user-defined threshold. However, as we argue in this paper, this leads to invalid geometries, is error-prone, and leaves several discrepancies along the boundaries. We propose a novel algorithm to align the boundaries of adjacent datasets. It is based on a constrained Delaunay triangulation to identify and eliminate the discrepancies, and the alignment is performed without moving vertices with a snapping operator. This allows us to guarantee that the datasets have been properly conflated and that the polygons are geometrically valid. We present our algorithm, our implementation (based on the stable and fast triangulator in CGAL), and we show how it can be used it practice with different experiments with real-world datasets. Our experiments demonstrate that our approach is highly efficient and that it yields better results than snapping-based methods.
KeywordsConflation Edge-matching Triangulation Boundary alignment Data quality
We focus in this paper on one aspect of horizontal conflation (see Fig. 1): given two or more datasets composed of polygons (e.g., representing provinces, municipalities or cadastral parcels), we want to modify the boundaries of (some of) the polygons and “align” them so that gaps and overlapping areas are eliminated. As further explained in Sect. 2, this problem has been almost exclusively tackled by first isolating the boundaries of the datasets from the polygons they are incident to, and second using line-based algorithms to eliminate the discrepancies: corresponding real-world entities are matched using a threshold (a maximum searching distance), and then the alignment is performed by moving the lines to the same location (often referred to as “snapping”). We see several problems with this approach (we elaborate on these in Sect. 2). First, in our experience, finding a matching threshold that will resolve all discrepancies in a dataset is often impossible—a too small value prevents some lines to be matched, and a too large value creates invalid geometries somewhere else. Second, this problem is amplified if we consider that the edges are not in isolation but are the boundaries of polygons: the risk of collapsing parts of polygons to lines is high since it is tempting to use a large threshold to solve problems. Third, there are no reliable mechanisms to verify if the snapping was successful and what were the consequences of it on the geometries.
We present in Sect. 3 a novel algorithm to align the boundaries of adjacent datasets. Our algorithm differs significantly from other ones since it considers the polygon as its primitive (and does not isolate lines or focus solely on them), does not use a threshold to match features, and the alignment is performed without moving/snapping geometries. Instead, the discrepancies are identified and eliminated by using a constrained Delaunay triangulation (CDT) as a supporting structure. This allows us to guarantee that the datasets have been properly conflated and that the polygons are geometrically valid. We can furthermore indicate to the user what changes were made to the polygons.
The algorithm we present in this paper is an extension of our previous work for automatically repairing planar partitions: We filled small gaps and overlaps (often not visible to the user) by using a triangulation-based method (Arroyo Ohori et al. 2012). Our extensions are threefold: (1) gaps and overlaps are eliminated in a manner this is consistent with how practitioners conflate datasets, i.e., by modifying first geometries with the lowest accuracy; (2) large problem areas (such as those in Fig. 1a) are partitioned and each part is repaired independently, which ensures that the polygons involved to fix the area are not modify too much (the errors are distributed) and that the connection between features are preserved; (3) we propose a modification that allows us to perform what we call spatial extent alignment, e.g., to ensure that a set of polygons representing municipalities (or other administrative entities) fit exactly in a larger polygon representing their province. The latter improvements has the added benefits of allowing us to potentially align the boundaries of very large datasets, since we align them at different levels—from the neighbourhoods in a city to provinces in a country—and we can thus subdivide datasets into parts that fit in memory. We report in Sect. 4 on our implementation of the algorithm (it is based on the stable and fast triangulator in CGAL1) and on the experiments we have made with different real-world datasets in Europe. We also demonstrate how boundary lines, often used in practice, can be incorporated in our approach. Finally, we discuss in Sect. 5 the shortcomings of our method and future work.
2 Related work
The specific issue we are addressing in this paper—the horizontal conflation of datasets formed by polygons—is rarely explicitly discussed in the scientific literature. The algorithms implemented in commercial software and used by practitioners for this case are the generic ones developed for either edge-matching or for vertical conflation.
Edge-matching is a vague term that has no agreed-upon definition in geographical information systems (GIS). It is most often used to refer to linear features (e.g., roads or rivers) on both sides of a border (e.g., international or provincial) that need to be connected. Although its name implies otherwise, it usually means both finding the corresponding entities (matching) and aligning them (fixing the discrepancies). The matching between two features is performed by finding the closest pairs (of either vertex-vertex or vertex-edge), up to a given distance (a threshold, which is usually defined by the user). If two vertices/edges are matched, they are then snapped together so that they become the same object in the resulting dataset. Snapping is usually performed with one dataset being the reference dataset, that is, when matching, the geometries are moved to the ones in the reference dataset. The latter does not move because it is usually known that it is of higher accuracy. When the accuracy of both datasets is the same, or is not known, the errors can be distributed: two vertices to be snapped are then moved to their mid-point. The INSPIRE Directive mandates to use both methods (with and without reference dataset) for the harmonisation of European datasets (INSPIRE 2008, Annex B).
As shown in Fig. 2, if edge-matching is used for with polygons as input, then it implies that the vertices/edges on the boundary of each dataset are snapped to those in the adjacent dataset—the concept of lines being the boundaries of polygons is therefore ignored.
While snapping yields satisfactory results for simple cases, Arroyo Ohori et al. (2012) demonstrate that for more complex ones it is often impossible or impractical to find a threshold applicable to the whole dataset, and that it is prone to errors that cause invalid geometries. There is often no single value that can be used for the whole dataset since it requires an almost uniform distribution of points, and discrepancies (gaps/overlaps) that are of the similar sizes. Observe the result in Fig. 2c for instance, which snaps vertices from the top dataset to the closest geometry (vertex or line) in the reference dataset (bottom one). Even if other rules were used (e.g., snapping points only to other points), the results would still not be satisfactory. Also, the decision to snap or not parts of lines/polygons is often binary: All the vertices/edges closer than the threshold are snapped, the others not. While in theory the threshold value is linked to the accuracy of a dataset, in practice users do not always know how to translate the accuracy into a value, and if they choose the wrong value then their resulting dataset will not be properly edge-matched. Notice that while INSPIRE clearly states that each Thematic Working Group will define the appropriate threshold (INSPIRE 2008), this is in our experience wishful thinking since the geographical datasets related to one theme usually come from different sources that have very different accuracies.
If we consider that in practice datasets very rarely contain only one polygon, snapping increases the risk of creating topologically invalid polygons and intersection polygons (because the boundaries between the polygons will also be snapped to the closest geometry); Fig. 3 shows an example.
Klajnšek and Žalik (2005) consider, as is our case, datasets containing several polygons and find matching points and lines by using threshold distances. However, they do not align the boundaries, but instead simply removed from the dataset the matched geometries (since their aim is to create a dissolved dataset). The removals significantly simplify the problem and avoid creating intersections.
2.2 Vertical conflation
The term vertical conflation was coined by Saalfeld Saalfeld (1988) in the early 1980s for the integration of two datasets representing the same region (Ruiz et al. 2011). One dataset is always the reference (the master), and the other one (the slave) is aligned by first finding corresponding matching points (by using a threshold value) and then applying rubber sheeting, i.e., a function to move all the other points according to the deviation of the control points. Gillman (1985) and Saalfeld (1988) use a triangulation to subdivide the slave dataset into subregions (triangles) and apply an affine transformation to each; the process is repeated iteratively until the displacement is minimal. Rubber sheeting is meant to work with points and lines, and the concept of polygons (and their preservation) is not present. These could be reconstructed from the resulting lines, but there is n guarantee that the polygons will be preserved.
Beard and Chrisman (1988) use ideas of rubber sheeting for the horizontal conflation of map sheets, i.e., datasets having only horizontal and vertical boundaries. The threshold to find matches is applied to points within a certain distance of the boundaries, and the lines incident to these are moved based on a rubber-sheeting function. To avoid moving all the points inside the dataset, a second threshold is used: only vertices closer to the boundaries are involved. There is no guarantee that after a movement the polygon or lines are valid (or that they are free of intersections).
Doytsher (2000) tackles the exact same problem as we do by first matching vertices (also using a tolerance) and then moving all the vertices in the dataset according to a rubber-sheeting transformation. The RMS errors of the transformation are minimised, and while it is claimed that the topological properties of the polygons are preserved, it is not proved and we do not see how this could be the case (since there are no constraints when moving vertices).
3 Aligning the boundaries of datasets with a triangulation-based algorithm
Our approach to aligning the boundaries of polygons builds upon our previous work to automatically repair the small gaps and overlaps in planar partitions where all polygons are stored independently (with the Simple Features paradigm (OGC 2006), a shapefile for instance). In Arroyo Ohori et al. (2012), we used a constrained triangulation (CT) as a supporting structure because, as explained below, it permits us to fill the whole spatial extent of the datasets with triangles, and then the triangles allow us to identify easily the gaps and overlaps between different polygonal datasets. We use the idea of labelling each triangle with the label of the polygon it decomposes: gaps will have no labels and regions where polygons overlap will have more than one label. Repairing implies relabelling triangles so that each triangle has one and only one triangle.
We first briefly describe in Sect. 3.1 the original algorithm for repairing planar partitions. Then, in Sect. 3.2, we describe our extensions to this algorithm so that the boundaries of datasets can be properly aligned when the datasets are formed by several polygons and when the gaps/overlaps are large. We also propose two modifications to our approach (in Sects. 3.3, 3.4) so that two common conflation issues faced by practitioners can be solved: how to perform horizontal conflation of polygons against a boundary represented as a line, and spatial extent conflation.
3.1 Using a CT to repair a planar partition
the CT of the input segments forming the polygons is constructed;
each triangle in the CT is labelled with the identifier of the polygon inside which it is located;
problems are detected by identifying triangles with no label or with two or more labels;
gaps/overlaps are fixed locally with the most appropriate label;
modified polygons are reconstructed and returned in a GIS format (e.g., a shapefile).
If, as in Fig. 5, two polygons are adjacent by one edge e, then e will be inserted twice. Doing this is usually not a problem for triangulation libraries because they ignore points and segments at the same location (as is the case with the solution we use, see Sect. 4). Likewise, when edges are found to intersect, they are split with a new vertex created at the intersection point. These are the only vertices that are added during the conflation process.
Labelling triangles. The labels are assigned to the triangles by first labelling the triangles adjacent to the edges of each polygon, and then visiting all the triangles with graph-based algorithms (i.e., depth-first search) without traversing constrained edges (the original boundaries of the input polygons). Triangles located inside the convex hull of the dataset, but not decomposing any polygons, are labelled with a special label ‘universe’. See Arroyo Ohori et al. (2012) for the details.
Identifying problems: gaps and overlaps. If the set of input polygons forms a planar partition, then every triangle will be labelled with one and only one label. Problems are easily identified: gaps are formed of triangles having no label, and overlaps of triangles having two or more labels.
The other two methods use regions of adjacent triangles with equivalent sets of labels, which is slower but generally yields results in which the polygons have less spikes. Figure 6 shows one example where each of the eight problematic regions of Fig. 4a is assigned one label. The label is obtained with the longest boundary method, i.e., the boundary of the problem region is first decomposed according to the label of the triangles incident to it (but outside the problem region), and second the label is that of the longest portion of the boundary.
Notice that these repair operations can be used one after the other, for instance if first the repair according to the largest number of adjacent triangles has a tie, then this is solved by using another method (or randomly choose one)—this was used in Fig. 4e.
3.2 The extensions necessary for horizontal conflation of polygons
To align boundaries of polygons, we extend and modify the repair algorithm described above. The extended algorithm to align the boundaries of two or more datasets is described in Algorithm 1 and contains three improvements: (1) priority list of datasets; (2) gaps and overlaps treated differently; (3) subdivision of gap regions by adding extra constrained edges.
Priority list of datasets
Gaps and overlaps handled differently
The gap and overlap regions are handled differently, and this allows us to modify the boundaries of datasets in a way that is consistent with the master-slave paradigm. A second modification is that the labels assigned to each triangle are formed by a tuple of the dataset and the unique identifier of the polygon: (dsid, pid). For an overlap region, the label used to relabel all triangles is that whose dsid is the highest in the priority list, e.g., in Fig. 7c the 3 overlapping regions (red regions) are filled with polygons from the dataset M (having labels \(m_1, m_2\) and \(m_3\)). For a gap, this label is the lowest in the priority list; the candidate labels are those adjacent to the gap region. If a gap region is adjacent to more than 1 polygon from the same dataset (see for instance Fig. 7b where both \(s_1\) and \(s_2\) could be assigned to the region), then the ID can be determined with the one of the repair methods mentioned above, such as the longest boundary.
However, aligning boundaries with this approach gives unsatisfactory results for gaps, that is we obtain results that are far from what a human being would manually do. For instance, in Fig. 7, the gap between the datasets is entirely assigned to the polygon \(s_2\), while we could argue that some parts should be assigned to \(s_1\) and \(t_1\). Observe that the overlap regions do not suffer from this problem since the constrained edges divide the region; in Fig. 7c the top overlap region between M and S is divided into two sub-regions by the constraints, and each gets the appropriate label (\(m_1\) and \(m_2\)).
To improve the labels assigned to gaps, we propose a heuristic to subdivide the gap regions into subregions. This is achieved by inserting extra constrained edges in the triangulation (as shown in Fig. 8). The constrained edge is not new edges, but rather existing edges that are labelled as constraints (thus no complex geometric operations are involved). Observe that while a generic constrained triangulation was sufficient to perform validation and repair in our original work (Arroyo Ohori et al. 2012), here a constrained Delaunay triangulations (CDT) is required. A CDT is a triangulation for which the triangles are as equilateral as possible (Chew 1987). We use the edges of the triangles (and their lengths), and having well-shaped triangles is an advantage.
For a given gap region, we proceed as follows (see Algorithm 2). First, we visit each vertex v on the boundary of the gap (there are 9 in Fig. 8a) and identify split vertices: vertices whose number of incident constrained edges in the CDT is more than 2. This allow us to identify where different polygons of the same dataset are adjacent (e.g., vertices a and d in Fig. 8a) and also where different datasets are adjacent (e.g., vertices b, c and e in Fig. 8a).
Second, for each split vertex, we try to insert one constrained edge inside the gap region (which means here that the edge is not on the boundary of the region). An edge whose both ends are split vertices is favoured, if there are none then the shortest edge is inserted as a constraint. Only edges that are shorter than a user-defined maximum length are inserted as constraints. It is possible that for a candidate vertex no constraint is inserted, either because there is no incident edges inside the gap region (e in Fig. 8a), or that the constraint is already present. In Fig. 8a, the extra constraints bd is added twice; the second insertion does not modify the CDT.
The algorithm for labelling gap regions is then used as described in Sect. 3.1, but now each subregion is processed separately; after the insertion of the extra constraint in Fig. 8a there are thus four gap regions. In the Algorithm 1, the relabelling of regions are not applied to the CDT directly (since the result of one could influence another, creating dependence on the order in which the triangles are visited). Instead, triangles that have been relabelled are saved in a separate list, and only after all the triangles have been visited are the new labels applied. Because the insertion of extra constraints isolates subregions, not all subregions can be directly relabelled. A subregion can be relabelled only if it is adjacent to two or more datasets. If a subregion is adjacent to only one (e.g., the lower-right orange region in Fig. 8a), then it is not repaired until one of its adjacent is relabelled (Fig. 8d). This implies that several ‘passes’ over all the triangles have to be performed, each pass tries to find a new label for triangles having 0 label and these are applied at the end of the pass.
Our heuristic has the added benefit of connecting lines that are the boundaries of polygons and/or datasets and of preserving better the area of polygons since gap regions are split and subregions are assigned to different polygons. We demonstrate in Sect. 4 the results we obtained with real-world datasets.
3.3 Edge-matching to a linear boundary
In practice, international and delimiting boundaries between countries and administrative entities are often available as lines, because these are agreed upon by the neighbouring parties. A trivial modification of our algorithm can be made for such cases: the linear boundary is converted to a polygon. As shown in Fig. 9, this can be easily done by first offsetting the line by a distance d (to either side of the line), and then linking the two lines (this can be seen as a ‘half-buffer’). We demonstrate in Sect. 4 how this performs with real-world datasets.
3.4 Spatial extent conflation
Figure 10 shows the main modification to our algorithm: the polygon representing the spatial extent becomes a hole of a larger polygon (whose shape is irrelevant, it can be obtained by enlarging the axis-aligned bounding box of all the polygons by 10 % for instance).
Observe that aligning polygons to a spatial extent allows us to align the boundaries of very large hierarchical datasets since we can align them individually at each level. As shown in Fig. 12, if we begin the alignment at the highest level (the first level being the spatial extent of the polygons of the second level), and then continue iteratively to the lower administrative level using the previously generated results (i.e., in Fig. 12 the seven polygons are used as spatial extents for the thirdrd level, and so on), then the memory footprint of the alignment should remain relatively small. That is, the process is bounded by the size of the largest polygon and its decomposition one level lower. We demonstrate in the next section an example with real-world examples.
4 Implementation and experiments
Our prototype is open-source (under a GPL license) and can be freely downloaded at www.github.com/tudelft3d/pprepair... It can read most GIS formats, perform the alignment of the boundaries, and return to the user the input files with the boundaries modified; all the attributes of the polygons are preserved. Furthermore, as shown in Fig. 13, the user obtains an overview of which polygons were modified, and the modifications are also exported as a GIS file (in the form of polygons showing parts that were added/removed from the datasets). This is something practitioners often mention as useful (“how much was my dataset modified?”) and that is rather difficult to accomplish with traditional snapping-based alignment methods. The only solution is to perform a Boolean difference between the input dataset and the output, which is computationally expensive and prone to floating-point errors.
We have tested our implementation with several real-world datasets that are publicly available, and we report on some of these in the following. Our experiments cover a broad range of alignment problems that practitioners have to solve. Furthermore, our implementation handles MultiPolygons and polygons with holes, which are very common in practice when dealing with administrative subdivisions and other datasets related to the territory.
4.1 Experiment #1: Conflation of polygons
We have been able to align the boundaries successfully with our implementation, that is all output polygons are valid and there are no gaps nor overlaps along the boundary of France and its neighbours. With FME, there are several cases where there are gaps and overlaps, and several parts of polygons have collapsed, creating invalid geometries. We show in Fig. 15 the results for some specific locations. First, observe that in Fig. 15b there are still gaps and overlaps. These are mostly caused because only slave vertices are snapped to the master primitives without the insertion of extra vertices. With our implementation, the gaps are split into different regions and the alignment is performed without leaving any problematic regions. Second, observe in Fig. 15e how snapping makes the slave move to the wrong boundary (that of an administrative unit that is not on the boundary of the dataset), and how the very large gap is only partly aligned (some parts of it are within the given tolerance, others not). A large tolerance would have most likely solved this issue, but would also have modified the boundaries of polygons unacceptably more in other parts of the dataset. Third, it can be seen in Fig. 15h that part of the boundary between two French polygons is collapsed to the boundary of the master dataset; in our implementation this boundary is left intact. Fourth, in Fig. 15k, an enclave is snapped to the master dataset, which actually separates the polygon containing the enclave into unconnected parts.
4.2 Experiment #2: Conflation with an international boundary
As shown in Fig. 16, we have extracted the international boundary from the VMAP0 and used it as input. The results of the aligned GEOFLA is exactly the same as in the previous section.
4.3 Experiment #3: Spatial extent
We have performed a hierarchical conflation with the hierarchy of a dataset from the Statistics Netherlands 9. The dataset, let us call it CBS2009, is shown in Fig. 17 and is publicly available10. The 441 municipalities are divided into 2542 districts, and these are further divided into 11,574 neighbourhoods (many of these objects are MultiPolygons, as explained in Sect. 3.4). Each polygon has an identifier linking it to its parent. The conflation was performed in two steps: (1) all the district polygons with a given identifier are aligned to their respective municipality polygon (thus 441 times); (2) all the neighbourhood polygons with a given identifier are aligned to their respective (aligned and modified) district polygon (thus 2542 times). As an indication, the whole process—2983 boundary alignments—took 9m50s (4m45s computation; 5m05s for reading/writing to disc the datasets). While there were no very large errors, 98/441 municipalities and 2213/2542 districts were not properly subdivided (often by only a millimetre or less).
We have proposed a new algorithm to perform the horizontal conflation, and we have shown that it is suitable for several cases that practitioners face in their work. It can be used with polygons and lines, and supports the alignment of polygons to a spatial extent. Furthermore, the conflation can be performed with a local criteria, instead of a global one (the user-tolerance is usually for the whole dataset).
Our approach is highly efficient (since it is based on a highly optimised triangulator and only the labelling of triangles is involved), it avoids the pitfalls of choosing the appropriate threshold (if it even exists), and perhaps more importantly, it guarantees that valid geometries are constructed, which allows practitioners to use the output for further analysis. Anyone who has tried—and perhaps failed—to find the appropriate snapping threshold for a given dataset by using trial-and-error will recognise that our approach has great benefits. One can get an overview of which polygons were modified (including what part(s) of these, and their exact geometry), and collapsed geometries are completely avoided (which happen often with large snapping tolerance).
We plan in the future to modify the algorithm so that the conflation of only lines is possible, e.g., for highways or rivers crossing a boundary. These would be “linked”, with edges of the CT, although the use of a tolerance would still be necessary. We also plan to add more repair functions, particularly one where we can edge-match two polygons without the notion of a master and a slave and thus distribute errors. Triangles could be used to find the centreline of a region (Bader and Weibel 1998).
The Computational Geometry Algorithms Library: http://www.cgal.org.
These units are called in Europe ‘Nomenclature of Territorial Units for Statistics’, or NUTS.
The European directive that ensures that the spatial data infrastructures of the Member States are compatible with each others and usable.
Geospatial Data Abstraction Library: www.gdal.org.
Computational Geometry Algorithms Library: www.cgal.org.
CBS (Centraal Bureau voor de Statistiek).
- Bader M, Weibel R (1998) Detecting and resolving size and proximity conflicts in the generalization of polygonal maps. In: Proceedings 18th international cartographic conference, Stockholm, SwedenGoogle Scholar
- Chew L (1987) Constrained delaunay triangulations. In: Proceedings of the third annual symposium on computational geometry, pp 215–222Google Scholar
- Davis M (2003) Java conflation suite. Tech. rep., Vivid Solutions, draft. http://www.vividsolutions.com/jcs/
- Gillman DW (1985) Triangulations for rubber sheeting. In: Proceedings autocarto 7, pp 191–199Google Scholar
- INSPIRE (2008) Drafting team “data specifications”—deliverable d2.6: methodology for the development of data specifications. Tech. Rep. Identifier: D2.6_v3.0, infrastructure for spatial information in EuropeGoogle Scholar
- INSPIRE (2014) D2.8.i.4 data specification on administrative units—technical guidelines. Tech. Rep. Identifier: D2.8.I.4_v3.1, infrastructure for spatial information in EuropeGoogle Scholar
- Lynch MP, Saalfeld A (1985) Conflation: automated map compilation—a video game approach. In: Proceedings Auto-Carto VII, pp 343–352Google Scholar
- OGC (2006) OpenGIS implementation specification for geographical information—simple feature access. Open Geospatial Consortium inc., document 06-103r3Google Scholar
- Shewchuk JR (1997) Delaunay refinement mesh generation. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, PittsburgGoogle Scholar
- Yuan S, Tao C (1999) Development of conflation components. In: Proceedings of geoinformatics. Ann Harbour, pp 1–13Google Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.