Method overview
The global localization method described in this paper uses 3D laser scans as sensor input. This input is matched against a 2D map of an urban environment that contains building outlines as well as the street network. Maps of this type can be retrieved from various sources, such as Google Maps, official cadastral maps, or the OpenStreetMap project, which is used for the evaluation in this article.
The sensor data used for localization comes from a 3D laser scanner. Only distance data is used, although appearance data in the form of laser intensities is often also available. Since the localization problem discussed here is a 2D template matching problem, the initially 3D sensor data is reduced to a 2D representation by extracting vertical planar segments from the data and condensing them into a set of line segments representing presumed building outlines. Matches between these building edges and the 2D building outline map are computed using a fast and simple template matching procedure known from image processing. Since the template matching problem in localization has special properties that standard procedures do not take into account, the results of this approach can be improved upon. Information from the building map and the street network is also used to further reduce the number of candidates valid for subsequent processing. The remaining candidate poses are then refined by a variation of the chamfer matching procedure that accounts for visibility considerations particular to the laser data matching problem and penalizes matches where buildings absent from the sensor data appear in the corresponding map section. The result of this computation is used to rank the candidates and either extract the top candidate as the estimated pose or pass a ranked set of candidates to further processing, e.g., the initialization of a SLAM system. The sequence of processing steps is illustrated in Fig. 1.
Point cloud processing and building outline segment detection
Before the template matching problem of localizing the robot on the building outline map can be addressed, the input data must be reduced to a set of lines representing the presumed building outlines in the sensor’s field of view. To this end, the very dense point clouds are first reduced in size. For this, the rectangular cuboid approximation framework (RMAP) (Khan et al. 2014) is used to convert the point cloud into an occupancy grid consisting of cuboid cells at a lower resolution, which also reduces the number of noisy observations. In this data structure, normal vectors can be computed efficiently for each occupied cuboid cell. Since we are interested in building outlines, and the roll and pitch angles of a robot travelling on the street can be assumed to be known, vertical surfaces can be extracted from the occupancy grid by selecting cuboid cells whose normal vector is parallel to the ground plane.
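To illustrate this selection step, the following minimal NumPy sketch (not the RMAP implementation; the threshold on the vertical normal component is an assumed parameter) filters occupied cells by their normal direction:

```python
import numpy as np

def select_vertical_cells(centers, normals, max_vertical_component=0.1):
    """Select occupied cuboid cells whose normal vector is (nearly)
    parallel to the ground plane, i.e. whose z component is close to zero.

    centers: (N, 3) array of cuboid cell centers
    normals: (N, 3) array of unit normal vectors, one per cell
    """
    mask = np.abs(normals[:, 2]) < max_vertical_component
    return centers[mask], normals[mask]
```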
These vertically oriented cuboid cells are then projected to the ground plane by setting their z coordinate to zero, and the number of cells per area unit is counted. The result is a histogram of the vertically oriented cuboid cells within the sensor range of the robot. To extract building outlines from this representation, the normal information from the point cloud should be preserved, since only points with similar normal directions can belong to a common planar surface. We use this information by binning the yaw angles of the cuboid cells and creating a separate histogram for each angle range. In each of these histograms, line segments are extracted using the Probabilistic Hough Transform (Matas et al. 2000). Parallel line segments with small distances between them and collinear lines with small gaps are merged to reduce noise in the resulting set of edges. The building outline extraction process is illustrated in Fig. 2, which shows both the histograms of oriented cuboids and the line segments computed from them.
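The projection and yaw-binning step can be sketched as follows (a simplified NumPy illustration; the grid size, resolution, and bin count are assumed values, and the subsequent Hough-based segment extraction, e.g. with OpenCV's `HoughLinesP`, is omitted):

```python
import numpy as np

def oriented_histograms(centers, normals, n_yaw_bins=8,
                        cell_size=0.5, grid_shape=(100, 100)):
    """Project vertically oriented cuboid cells to the ground plane and
    accumulate one 2D occupancy histogram per yaw-angle bin.

    Assumes cell coordinates are non-negative relative to the grid origin.
    """
    yaw = np.arctan2(normals[:, 1], normals[:, 0])            # in [-pi, pi]
    # map yaw to a bin index; the modulo wraps yaw = pi into bin 0
    bins = ((yaw + np.pi) / (2 * np.pi) * n_yaw_bins).astype(int) % n_yaw_bins
    hists = np.zeros((n_yaw_bins,) + grid_shape)
    ij = (centers[:, :2] / cell_size).astype(int)             # drop z (projection)
    for (i, j), b in zip(ij, bins):
        if 0 <= i < grid_shape[0] and 0 <= j < grid_shape[1]:
            hists[b, i, j] += 1
    return hists
```

Line segments would then be extracted independently from each of the `n_yaw_bins` histograms, so that only cells with similar normal directions contribute to a common line.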
Adapting directional chamfer matching to the localization problem
After the building outlines have been extracted from the laser data, retrieving the robot pose in the building map becomes a template matching problem. Chamfer matching (Barrow et al. 1977) is a well-established method for template matching that is especially suitable for finding correspondences between sets of line segments. This section describes the idea of chamfer matching and extensions of its original cost function that adapt it to the problem of matching templates for localization.
Chamfer matching is designed to find a transformation of a query edge map in the robot coordinate frame \(U = \left\{{\mathbf{u}}_i\right\} , i = 1, \ldots , n\) such that it optimally matches a section of a template edge map \(V = \left\{ {\mathbf{v}}_i\right\} , i = 1, \ldots , m\) in the map coordinate frame. This transformation is a 2D Euclidean transformation \({\mathbf{s}} \in SE(2)\), with \({\mathbf{s}} = (\theta , t_x, t_y)\). It can be interpreted as a pose of the robot in the coordinate frame of the map, where the location is given by \((t_x, t_y)\) and the heading by \(\theta \). The effect of this transformation on the robot measurements is a rotation followed by a translation,
$$\begin{aligned} {\mathbf{W}}({\mathbf{x}};\, {\mathbf{s}}) = \left( \begin{matrix} \cos (\theta ) & - \sin (\theta )\\ \sin (\theta ) & \cos (\theta ) \end{matrix} \right) {\mathbf{x}} + \left( \begin{matrix} t_x \\ t_y \end{matrix}\right) \end{aligned}$$
(1)
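For illustration, the transformation of Eq. (1) can be written compactly as follows (a minimal NumPy sketch; function and variable names are our own):

```python
import numpy as np

def warp(points, s):
    """Apply the 2D Euclidean transform s = (theta, t_x, t_y) of Eq. (1)
    to an (N, 2) array of points: rotation followed by translation."""
    theta, tx, ty = s
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return points @ R.T + np.array([tx, ty])
```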
The optimal alignment of the query edge map with the template map is the result of the transformation which minimizes a distance function d between the two maps
$$\begin{aligned} \hat{{\mathbf{s}}} = \mathop {\arg\!\min }\limits_{{\mathbf{s}} \in SE(2)} d\left( {\mathbf{W}}(U,{\mathbf{s}}),V \right). \end{aligned}$$
(2)
In the following, let the transformed query edge set \({\mathbf{W}}(U, {\mathbf{s}})\) be denoted by \(\hat{U}\).
Different distance functions can be used. For standard chamfer matching, the distance function is the mean, over all points in the query edge map, of the distance to the closest template edge point
$$\begin{aligned} d_{CM} \left( \hat{U}, V\right) = \frac{1}{n} \sum _{\hat{\mathbf{u}}_i \in \hat{U}} \min _{{\mathbf{v}}_j \in V} \left| \hat{\mathbf{u}}_i - {\mathbf{v}}_j \right| . \end{aligned}$$
(3)
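A direct, brute-force evaluation of Eq. (3) can be sketched as follows (illustrative NumPy code; practical implementations use distance transforms rather than pairwise distances):

```python
import numpy as np

def chamfer_cost(U_hat, V):
    """Chamfer distance of Eq. (3): mean over transformed query edge
    points of the distance to the nearest template edge point.

    U_hat: (n, 2) transformed query edge points
    V:     (m, 2) template edge points
    """
    # pairwise Euclidean distances, shape (n, m)
    d = np.linalg.norm(U_hat[:, None, :] - V[None, :, :], axis=2)
    return d.min(axis=1).mean()
```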
For edge maps consisting of linear segments, it is more robust and efficient to consider the orientations of the edges, and to penalize matches between edge points with different directions. This reasoning leads to the distance function of directional Chamfer matching (DCM) (Liu et al. 2010)
$$\begin{aligned} d_{DCM} \left( \hat{U}, V\right) = \frac{1}{n} \sum _{\hat{\mathbf{u}}_i \in \hat{U}} \min _{{\mathbf{v}}_j \in V} \left| \hat{\mathbf{u}}_i - {\mathbf{v}}_j \right| + \lambda \left| \phi (\hat{\mathbf{u}}_i) - \phi ({\mathbf{v}}_j)\right| , \end{aligned}$$
(4)
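Eq. (4) admits a similar brute-force sketch (illustrative NumPy code; treating edge orientations as undirected, so that the orientation penalty is taken modulo \(\pi\), is our assumption, consistent with the minimal-rotation distance described in the text):

```python
import numpy as np

def dcm_cost(U_hat, phi_u, V, phi_v, lam=1.0):
    """Directional chamfer cost of Eq. (4).

    phi_u, phi_v: one edge orientation (radians) per point; the penalty
    is the minimal rotation between undirected edge directions.
    """
    d = np.linalg.norm(U_hat[:, None, :] - V[None, :, :], axis=2)
    dphi = np.abs(phi_u[:, None] - phi_v[None, :]) % np.pi
    dphi = np.minimum(dphi, np.pi - dphi)       # minimal rotation, modulo pi
    return (d + lam * dphi).min(axis=1).mean()
```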
where an edge orientation \(\phi \) is determined for each edge point, and the distance between two orientations is the minimal rotation necessary to align them. In applications where it is acceptable to discretize the space of edge orientations, the optimization (2) can be solved efficiently by precomputing a distance transform tensor that contains the cost contribution for each query edge point. This approximation is formulated in the fast directional chamfer matching (FDCM) method (Liu et al. 2010). In cases where the template and query edge maps can be represented as sets of linear segments, the summation of individual contributions per point can be replaced by computations involving only the end points of the line segments, using an integral distance transform.
These cost functions are designed for the task of finding simple query edge maps in template edge maps derived from cluttered images. It is expected that, for a good match between template and query edge map transformation, each edge in the query edge map is close to a matching edge in the template edge map. All edges at larger distances are not considered for cost computation. For the application of localizing a set of building edges in a building outline map, where, due to the structured nature of typical building maps, there can be many areas that are similar to parts of what the robot sensors observe, it is desirable to also penalize matches where some part of the template that should exist in the query is not there. This is illustrated in Fig. 3, which shows two possible transformations of a template edge map, both of which result in the same (D)CM cost values, but one of them is clearly a worse match than the other, since the building edges derived from the scan do not contain a building that would be expected to be observed.
While this information about which edges of the template map should be matched to edges in the query map is not available in a general template matching task, an estimate of the expected observation for the localization task can be generated by extracting all the lines visible in the map from a given robot pose. We denote this set of edges visible from a position \((t_x, t_y)\) by \(V_e(t_x, t_y)\). With this definition, a forward cost function that takes only the expected observations for a given robot position into account can be defined as
$$\begin{aligned} d_{f}\left( \hat{U}, V, {\mathbf{s}}\right) = \frac{1}{n} \sum _{\hat{{\mathbf{u}}}_i \in \hat{U}} \min _{{\mathbf{v}}_j \in V_e(t_x, t_y)} \left| \hat{{\mathbf{u}}}_i - {\mathbf{v}}_{j}\right| . \end{aligned}$$
(5)
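The visible-edge set \(V_e(t_x, t_y)\) can be approximated for illustration by brute-force ray casting (a deliberately simple sketch; the number of rays is an assumed parameter, and an efficient implementation would use an acceleration structure such as a BSP tree instead):

```python
import numpy as np

def visible_segments(pos, segments, n_rays=360):
    """Approximate V_e(t_x, t_y): cast rays from pos and keep every map
    segment that is the closest hit of at least one ray.

    segments: (M, 2, 2) array of 2D segment endpoints
    Returns the set of visible segment indices.
    """
    pos = np.asarray(pos, float)
    visible = set()
    for ang in np.linspace(0, 2 * np.pi, n_rays, endpoint=False):
        d = np.array([np.cos(ang), np.sin(ang)])
        best_t, best_i = np.inf, None
        for i, (a, b) in enumerate(segments):
            e = b - a
            # solve pos + t*d = a + u*e via Cramer's rule
            denom = d[0] * (-e[1]) - d[1] * (-e[0])
            if abs(denom) < 1e-12:              # ray parallel to segment
                continue
            rhs = a - pos
            t = (rhs[0] * (-e[1]) - rhs[1] * (-e[0])) / denom
            u = (d[0] * rhs[1] - d[1] * rhs[0]) / denom
            if t > 1e-9 and 0.0 <= u <= 1.0 and t < best_t:
                best_t, best_i = t, i
        if best_i is not None:
            visible.add(best_i)
    return visible
```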
Furthermore, knowledge about the expected observation also makes it possible to define a reverse cost function that describes the extent to which the expected observation \(V_e\) is represented in the actual observation \(\hat{U}\)
$$\begin{aligned} d_{r}\left( \hat{U}, V, {\mathbf{s}}\right) = \frac{1}{\left| V_e(t_x, t_y)\right| } \sum _{{\mathbf{v}}_i \in V_e(t_x, t_y)} \min _{\hat{{\mathbf{u}}}_j \in \hat{U}} \left| \hat{{\mathbf{u}}}_j - {\mathbf{v}}_{i}\right| . \end{aligned}$$
(6)
Finally, the forward cost (5) and reverse cost (6) can be combined to form a cost function that is symmetric in the expected template edge map and the query map
$$\begin{aligned} d_{s}\left( \hat{U}, V, {\mathbf{s}}\right) = \frac{1}{2} \left( d_{f}\left( \hat{U}, V, {\mathbf{s}}\right) + d_{r}\left( \hat{U}, V, {\mathbf{s}}\right) \right) . \end{aligned}$$
(7)
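Given a precomputed set of expected-visible points, the three cost functions can be evaluated as simple nearest-neighbour averages (an illustrative NumPy sketch; here the reverse cost is normalized by the number of expected-visible points, and both point sets are assumed to be dense samplings of their edge maps):

```python
import numpy as np

def forward_cost(U_hat, Ve):
    """Forward cost: mean distance from each observed edge point to the
    nearest expected-visible template point."""
    d = np.linalg.norm(U_hat[:, None, :] - Ve[None, :, :], axis=2)
    return d.min(axis=1).mean()

def reverse_cost(U_hat, Ve):
    """Reverse cost: mean distance from each expected-visible template
    point to the nearest observed edge point."""
    d = np.linalg.norm(Ve[:, None, :] - U_hat[None, :, :], axis=2)
    return d.min(axis=1).mean()

def symmetric_cost(U_hat, Ve):
    """Symmetric cost: average of forward and reverse cost."""
    return 0.5 * (forward_cost(U_hat, Ve) + reverse_cost(U_hat, Ve))
```

The reverse cost is what penalizes the situation of Fig. 3: expected-visible points with no nearby observation drive it up even when the forward cost is low.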
The latter three cost functions can be extended with a directional term analogously to (4).
Computing the optimization (2) for these latter cost functions is significantly more complex than for the cost functions (3) and (4), as the set of visible edges, which constitutes the template edge map used in the cost computation, depends on the translation of the considered coordinate frame transformation. This means that computing a distance transform tensor, which is independent of the coordinate transformation and enables the efficient computation in the FDCM approach, is not possible when the area covered by the template map is large. Even though visibility analysis can be implemented efficiently using a Binary Space Partition (BSP) tree (Fuchs et al. 1983), a brute-force optimization of (2) with any of the cost functions \(d_f\), \(d_r\), or \(d_s\) can be prohibitively expensive. For this reason, this work adopts a heuristic approach and assumes that minimizers of these cost functions, even if they are not global minimizers of \(d_{DCM}\), still result in low values of this simpler cost function. Under this assumption, the FDCM method can be used in a first pass to generate a set of pose candidates \(C = \{{\mathbf{c}}_i\} = \{(\theta _i, t_{x, i}, t_{y, i})\}, i=1,\ldots , n_C\) whose values of \(d_{DCM}\) lie within a given factor of its global minimum. Only for these transformations are the visible lines computed and the more complex cost functions evaluated.
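The two-pass heuristic can be sketched as follows (illustrative code; the threshold factor and the `expensive_cost` callback, which would evaluate a visibility-aware cost such as \(d_s\), are assumptions):

```python
import numpy as np

def rerank_candidates(candidates, dcm_costs, expensive_cost, factor=1.5):
    """Two-pass candidate evaluation: keep only pose candidates whose
    fast DCM cost lies within `factor` of the global minimum, evaluate
    the expensive visibility-aware cost only for those, and return
    (cost, candidate) pairs ranked by the expensive cost."""
    dcm_costs = np.asarray(dcm_costs, float)
    keep = np.flatnonzero(dcm_costs <= factor * dcm_costs.min())
    scored = [(expensive_cost(candidates[i]), candidates[i]) for i in keep]
    scored.sort(key=lambda t: t[0])
    return scored
```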
Filtering position candidates using OpenStreetMap information
The number of poses considered as valid localization candidates can be restricted further with additional knowledge available from the building map. For instance, poses that lie inside buildings can be discarded. Furthermore, if, as in our case, the robot travels along the road, poses that are farther than a given distance from any edge of the road network can be discarded as well. For the experiments carried out in this paper, we consider only pose candidates that are less than 12 m away from street elements in the OpenStreetMap network. Many of the candidate poses generated by the first FDCM optimization are invalid, either because they lie inside a building or because they are too far from a marked road, and thus do not have to be considered for further evaluation. This is illustrated in Fig. 4, which shows the positions of candidate poses for an input scene that produces many matches within the area considered for localization. The figure visualizes which points are considered valid candidates and which are discarded based on the criteria laid out above.
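These filtering criteria can be sketched as follows (illustrative code; the 12 m default mirrors the text, while the even-odd polygon test and all names are our own assumptions):

```python
import numpy as np

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment (a, b)."""
    p, a, b = (np.asarray(x, float) for x in (p, a, b))
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / max(np.dot(ab, ab), 1e-12), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def point_in_polygon(p, poly):
    """Even-odd ray-casting test for p inside a polygon (vertex list)."""
    x, y = p
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

def valid_candidate(pose, buildings, roads, max_road_dist=12.0):
    """A pose (theta, t_x, t_y) is kept if its position lies outside
    every building polygon and within max_road_dist metres of at least
    one road segment."""
    p = (pose[1], pose[2])
    if any(point_in_polygon(p, poly) for poly in buildings):
        return False
    return any(point_segment_distance(p, a, b) <= max_road_dist
               for a, b in roads)
```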