Navigating by Touch: Haptic Monte Carlo Localization via Geometric Sensing and Terrain Classification

Legged robots deployed in extreme environments may be unable to rely on cameras and laser scanners due to darkness, airborne obscurants, or sensor damage. In these conditions, proprioceptive sensing will continue to work reliably. In this paper, we propose a purely proprioceptive localization algorithm which fuses information from both geometry and terrain class to localize a legged robot within a prior map. First, a terrain classifier computes the probability that a foot has stepped on a particular terrain class from sensed foot forces. Then, a Monte Carlo-based estimator fuses this terrain class probability with the geometric information of the foot contact points. Results are demonstrated showing this approach operating online and onboard an ANYmal B300 quadruped robot traversing a series of terrain courses with different geometries and terrain types over more than 1.2 km. The method keeps the localization error below 20 cm using only the information coming from the feet, IMU, and joints of the quadruped.

Fig. 1 An ANYmal robot (Hutter et al. (2016)) in a sewer with two feet in a slippery, wet depression and two feet on a dry, elevated area. With prior information about terrain type and geometry, it is possible for the robot to localize in the world using only touch. This would be extremely useful in dark and foggy environments (Image courtesy of RSL/ETH).

Extreme environments are often filled with dust, dirt and water vapor, which can significantly impair a robot's vision system. Additionally, camera or laser sensor failure may leave only proprioceptive sensors (i.e., IMU and joint encoders) at the robot's disposal.
Blind quadrupedal locomotion has achieved impressive levels of reactive robustness without requiring vision sensors (Focchi et al. (2020); Lee et al. (2020)). However, without the ability to also localize proprioceptively, a robot would still be incapable of executing goals and completing missions or inspections.
Availability of a prior map is a reasonable assumption for inspection tasks in engineered environments. In our previous work (Buchanan et al. (2020)), we demonstrated that proprioceptive localization could be achieved online when prior geometric knowledge about the environment is available. However, the terrain morphology must include sufficiently informative features (e.g., irregular steps) such that a robot can use them as landmarks for localization.

To overcome this limitation, in this paper we propose a novel localization system that uses Sequential Monte Carlo at its core to fuse geometric as well as tactile semantic information from a terrain classifier to effectively and reliably localize the ANYmal B300 quadruped robot (Hutter et al. (2016)) using limited sensing data.

Contributions
The contributions of this paper can be summarized as follows:
- A novel proprioceptive legged robot localization system that fuses terrain geometry and semantics. The localization is performed in all 6-DoF of the robot, instead of the x, y and yaw dimensions as in previous works. To the best of the authors' knowledge, this is also the first such system that performs localization using semantics while completely blind.
- A terrain classification method employing signal masking in the 1D convolutional modules, making it possible to process variable-length signals from footsteps without the need to truncate or pad them.
- Extensive testing on an ANYmal B300 quadruped robot in three experiments including different geometries (both 2.5D and 3D) and terrain types, for a total duration of 2.5 h and more than 1.2 km of traveled distance. In the 2.5D experiments, localization error is kept down to 10 cm on feature-rich terrain and on average below 20 cm on all terrain while exploiting terrain semantics. The localization is used online to effectively guide the robot towards its planned goals. The algorithm was also demonstrated to work with 3D maps, localizing by probing vertical surfaces.
The remainder of this document is structured as follows: Section 2 summarizes relevant research in the fields of terrain classification, legged haptic localization and the related problem of in-hand tactile localization; Section 3 defines the mathematical background of the legged haptic localization problem; Section 4 describes our proposed haptic localization algorithm; Section 5 describes the implementation details to deploy our algorithm on a quadruped platform; Section 6 presents the experimental results collected using the ANYmal robot; Section 7 provides an interpretation of the results and discusses the limitations of the approach; finally, Section 8 concludes the paper with final remarks.

Related Works
Pioneering work which exploits a robot's legs, not just for locomotion, but also to infer terrain information such as friction, stiffness and geometry has been presented by Krotkov (1990). This idea has recently been revisited to perform terrain vibration analysis or employ terrain classification to improve locomotion parameter selection. Since we are interested in using terrain classification for localization, we cover the most relevant works applied to legged robots in Section 2.1. Works on proprioceptive localization in manipulation and legged robots are described in Sections 2.2 and 2.3, respectively.

Tactile Terrain Classification
The first tactile terrain classification method for walking robots was presented by Hoepflinger et al. (2010) and concerned experiments with a single leg detached from a robot's body. This seminal work utilized force measurements and motor currents to successfully distinguish between four terrain types. These results paved the way for the application of terrain classification methods on actual legged robots.
More recently, Kolvenbach et al. (2019) also used a single leg on a real, standing ANYmal robot to differentiate between four different types of soil by probing. Two types of feet (point and planar) were used to collect force, torque and IMU measurements, which were processed by an SVM classifier. The system showed that tactile information can be used to differentiate between visually similar soils, but the probing action is not particularly practical, as it takes over control of the robot and prevents it from performing other tasks at the same time.
A terrain classification system that can operate while the robot walks was presented by Wellhausen et al. (2019). At the beginning of operation, their legged robot is trained to assign a terrain negotiation cost based on force/torque sensors. Once operating, their system assigns a terrain negotiation cost from images, based on previous feet-to-image correspondences and terrain classification from proprioceptive sensors. The ability to predict terrain negotiation from images is then used to plan the robot's movement and avoid high-cost terrain.
However, in complete darkness, which is our intended domain, vision-based sensors are of limited use for terrain classification, so we focus on purely proprioceptive sensing. In our previous work (Bednarek et al. (2019a)), we showed how more complex and data-intensive deep learning models can be used to increase terrain classification accuracy. This system achieved an accuracy of almost 98 % with six terrain classes and measurements taken from force-torque sensors during a statically stable walk. However, the approach was limited to fixed-length input signals, and thus did not generalize well to irregular walking patterns, different speeds, or walking on slopes, as these produce variable-length signals. The dependency on fixed-length inputs requires truncation and consequently information loss. In this work, we develop a novel masking mechanism in the convolutional layers, so as to process signals of variable length.
More recently, Lee et al. (2020) showed a completely different approach to terrain classification for locomotion. Their idea is to use an end-to-end deep learning controller based on proprioceptive signals to adapt the gait to rough terrains. Such a system does not explicitly perform tactile terrain classification, but an internal representation of the terrain type is implicitly stored inside the memory of the network. In our work, we opt for a modular approach that is not based on end-to-end deep learning. We develop a deep learning module that explicitly returns a terrain class, which can then be used by the localization estimator.

Tactile Localization in Manipulation
Tactile localization involves the estimation of the 6-DoF pose of an object (of known shape) in the robot's base frame by means of kinematics of the robot's fingers and its tactile sensors. The pose of the object (in robot base coordinates) is inferred by matching the set of all positions of the sensed contacts (also in base coordinates) with the 3D model of the object being grasped.
Since the object can have any shape, the probability distribution of its pose given tactile measurements can be multimodal. For this reason, tactile localization has typically been addressed using Sequential Monte Carlo (SMC) methods, a subfamily of which are called particle filters (Fox et al. (2001)). In contrast to Gaussian-based methods, SMC can maintain a discrete approximation of an arbitrary distribution by generating many hypotheses (often called particles) from a known proposal distribution. If the number of particles is large enough and the proposal distribution covers the state space well (i.e., it captures the areas of high density of the unknown underlying distribution), the target distribution will be well approximated by the set of state particles and their associated importance weights. SMC is, however, sensitive to the dimension of the state space, which should be low enough to avoid combinatorial explosion or particle depletion. State-of-the-art methods aim to reduce this dimensionality and to sample the state space efficiently. In the SLAM community, FastSLAM achieved this through careful design of the proposal distribution and adaptive importance resampling to avoid particle depletion (Montemerlo et al. (2007)).

Vezzani et al. (2017) proposed an algorithm for tactile localization using the Unscented Particle Filter (UPF) and tested it on the iCub robot (equipped with contact sensors at the fingertips) to localize four different objects in the robot's reference frame. The algorithm is recursive and can process data in real time as new measurements become available. The object and the robot's base are assumed to be static, allowing the pose to be estimated as a fixed parameter. This assumption allows the use of a window of past measurements to better address the sparsity of the measurements, which consist of a series of single-finger touches. For legged haptic localization, the assumption of both a static robot and static terrain does not always hold, and more general methods are required.

Chalon et al. (2013) proposed a particle filtering method for online in-hand object localization that tracks the pose of an object while it is being manipulated by a fixed-base arm. The estimated pose is subsequently used to improve the performance of pick-and-place tasks (e.g., by placing a bottle in an upright position). The particles representing the object's pose are initialized from a Gaussian distribution around the true initial pose of the object, acquired by a vision system (used only at the start). The particle weights are updated by penalizing finger/object co-penetration and the distance between the object and the fingertip that detected the contact. Finally, their system selects the particle with the highest weight as the output estimate.
To adapt the tactile localization problem to legged robots, one can instead imagine the robot as a hand trying to "grasp" the ground. The objective of legged haptic localization is estimating the pose of the robot relative to a fixed object (in this case the terrain) by means of its "fingers" (in this case the robot's feet), instead of the pose of the object being grasped relative to the robot. This problem is generally harder, because a legged robot cannot envelop the entire terrain at once as a hand can do with an object. This implies more uncertainty, as the robot needs to constantly move to collect enough informative samples.

Haptic Localization of Legged Robots
The first example of haptic localization applied to legged robots is from Chitta et al. (2007). In their work, they presented a proprioceptive localization algorithm based on a particle filter for the LittleDog, a small electric quadruped. The robot was commanded to perform a statically stable gait over a known irregular terrain course, using a motion capture system to feed the controller. While walking, the algorithm approximated the probability distribution of the base state with a set of particles. The state was reduced from six to three dimensions: linear position on the xy-plane and yaw. Each particle was sampled from the uncertainty of the odometry, while the weight of a particle was determined by the L2 norm of the height error between the map and the contact location of the feet. The algorithm was run offline on eight logged trials of 50 s each.

Schwendner et al. (2014) demonstrated haptic localization on a wheeled robot with protruding spikes, which act as a set of passive and rigid legs. The spikes detected contacts with the ground, which were compared to a prior 2.5D elevation map. The unique design of their robot enabled multiple contact measurements at a high rate for each wheel. They used the many contacts to perform plane fitting against the prior map and improve localization over larger, flatter terrain. They also performed terrain classification, but with the use of a camera, which we do not require in our proposed work. They demonstrated an average position error of 39 cm in five experiments, each approximately 100 m long.
In Buchanan et al. (2020), we presented an SMC method that estimates the past trajectory at every step instead of estimating just the most recent pose. This enabled the estimation of poses which are globally consistent. Furthermore, the localization was performed for the full 6-DoF of the robot, instead of just the x, y and yaw dimensions as in Chitta et al. (2007) and Schwendner et al. (2014). The localization system was experimentally demonstrated online and onboard an ANYmal robot and used in a closed-loop navigation system to successfully execute a planned path, with a localization accuracy of 10 cm on feature-rich terrain. When walking on flat areas, the localization uncertainty increased due to the lack of constraints on the xy-plane. This behavior was expected, but undesirable. We are therefore motivated to use the terrain classification techniques described in Section 2.1 to incorporate more information.

Let $\mathbf{T}_k = [\mathbf{p}_k, \mathbf{R}_k]$ be the robot's pose at time $k$, composed of the base position in the world frame $\mathbf{p}_k \in \mathbb{R}^3$ and orientation $\mathbf{R}_k \in SO(3)$. With a slight abuse of notation, we will use the same symbol for its homogeneous matrix form $\mathbf{T}_k \in SE(3)$.

Quadruped State Definition
We assume that for each timestep $k$, an estimate of the robot pose $\tilde{\mathbf{T}}_k$ and its covariance $\Sigma_k \in \mathbb{R}^{6\times 6}$ are available from an inertial-legged odometric estimator, such as Bloesch et al. (2018); Fink and Semini (2020); Hartley et al. (2020). The uncertainties for the rotation manifold are maintained in the Lie tangent space, as in Forster et al. (2017). We also assume that the locations of the robot's end effectors in the base frame are known from forward kinematics, and their binary contact states $\mathcal{K}_k = (\kappa_{LF}, \kappa_{RF}, \kappa_{LH}, \kappa_{RH}) \in \mathbb{B}^4$ from inverse dynamics. Finally, where available, a foot sensor measures the force acting on each foot. For simplicity, we neglect any uncertainties due to inaccuracy in joint encoder readings or limb flexibility; therefore, the propagation of the uncertainty from the base to the end effectors is straightforward to compute. For brevity, the union of the aforementioned states (pose, contacts, and forces) at time $k$ will be referred to as the quadruped state $\mathcal{Q}_k$.

Prior Map
Our approach can localize against 2.5D terrain elevation maps as well as full 3D maps. Terrain classification is meant to be carried out while the robot is walking, with no dedicated probing actions; therefore, 2.5D maps are augmented with a terrain class category for each cell. This enables our method to overcome the degeneracy caused by featureless geometries (e.g., flat ground, which is uninformative about the robot position on the xy-plane). Point clouds are used for 3D maps and only contain geometric information. To distinguish between the three types of map, we will refer to $\mathcal{M}$ when 2.5D only, $\mathcal{M}_3$ when 3D, and $\mathcal{M}_C$ when 2.5D augmented with class information, respectively. An example of an $\mathcal{M}_C$ map colorized by terrain class is shown in Figure 2.
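As a concrete illustration, the sketch below shows (in Python, with assumed field names and layout) how such a class-augmented 2.5D map could be stored and queried per cell; the paper does not prescribe a particular data structure.

```python
import numpy as np

class ClassElevationMap:
    """Minimal 2.5D grid sketch: one elevation and one terrain class per cell."""

    def __init__(self, elevation, classes, resolution, origin_xy):
        self.elevation = elevation            # (H, W) heights in meters
        self.classes = classes                # (H, W) integer class labels
        self.resolution = resolution          # cell edge length in meters
        self.origin = np.asarray(origin_xy)   # world xy of cell (0, 0)

    def index(self, xy):
        """Map a world xy position to the enclosing grid cell."""
        return tuple(((np.asarray(xy) - self.origin) / self.resolution).astype(int))

    def height_at(self, xy):
        return self.elevation[self.index(xy)]

    def class_at(self, xy):
        return self.classes[self.index(xy)]
```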

Estimation Objective
Our goal is to use the sequence of quadruped states $\mathcal{Q}_{1:k}$ and their corresponding uncertainties $\Sigma_{1:k}$ to estimate the most likely sequence of robot states $\mathcal{X}_k = (\mathbf{T}_1, \dots, \mathbf{T}_k)$ up to time $k$:

$$\mathcal{X}_k^* = \operatorname*{arg\,max}_{\mathcal{X}_k} \; p(\mathcal{X}_k \mid \mathcal{Q}_{1:k}, \mathcal{M})$$

such that the likelihood of the contact points being on the map is maximized.

Proposed Method
To perform localization, we sample a predefined number of particles at regular intervals from the pose distribution provided by the odometry (as described in Section 4.1) and we compute the likelihoods of the measurements by comparing each particle to the prior map, so as to update the weights of the particle estimator (see Section 4.2).

Locomotion Control and Sampling Strategy
Blind locomotion requires cautious footstep placement, hence we opt for a statically stable gait, which guarantees stability at all times, even when the motion is stopped mid flight phase. Since only one leg can be moved at a time, as soon as the swing leg touches the ground the robot enters a four-point support phase; at this time, the quadruped state estimate $\mathcal{Q}_k$ and the estimated terrain class $\tilde{c}_k$ for the given foot position are collected. Then, a new set of particles is sampled in a manner similar to Chitta et al. (2007):

$$\mathbf{T}_k^i \sim \mathcal{N}\!\left(\mathbf{T}_{k-1}^i \, \Delta\tilde{\mathbf{T}}_k,\; \Sigma_k\right)$$

where $\Delta\tilde{\mathbf{T}}_k = \tilde{\mathbf{T}}_{k-1}^{-1} \tilde{\mathbf{T}}_k$ is the pose increment measured by the onboard state estimator between times $k-1$ and $k$.
At time $k$, the new particles are thus sampled from a normal distribution centered at the pose estimated from the odometry, with its corresponding covariance. Since roll and pitch angles are observable from inertial-legged estimators, their estimates have low uncertainty. In practice, this allows us to reduce the number of necessary particles along these two dimensions, mitigating issues related to high-dimensional sampling in 6-DoF.
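A minimal sketch of this propagation step, assuming an additive pose parameterization [x, y, z, roll, pitch, yaw] rather than the full SE(3) composition used in the paper:

```python
import numpy as np

def propagate_particles(particles, delta_pose, sigma, rng=None):
    """particles: (N, 6) poses; delta_pose: (6,) odometry increment
    (treated additively here, a simplification of SE(3) composition);
    sigma: (6,) per-axis standard deviations from the odometry covariance."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.normal(0.0, sigma, size=particles.shape)
    # Roll and pitch are observable from the IMU, so sigma[3] and sigma[4]
    # can be kept small, which mitigates high-dimensional sampling issues.
    return particles + delta_pose + noise
```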

Measurement Likelihood Model for 2.5D Data
We use the same measurement model for 2.5D measurements as in our previous work (Buchanan et al. (2020)). The measurement likelihood is modeled as a univariate Gaussian centered at the local elevation of each cell, with a manually selected variance of 1 cm. Given a particle state $\mathbf{T}_k^i$, the estimated position of a contact in world coordinates for an individual foot $f$ is defined as the concatenation of the estimated robot base pose and the location of the end effector in base coordinates:

$$\mathbf{d}_k^{i,f} = \mathbf{T}_k^i \,\hat{\mathbf{d}}_k^{f}$$

Thus, the measurements and their relative likelihood functions for the $i$-th particle and a specific foot $f$ are (Figure 3, left):

$$p(z_k^f \mid \mathbf{T}_k^i, \mathcal{M}) \propto \exp\!\left(-\frac{\left(d_{k,z}^{i,f} - \mathcal{M}(d_{k,x}^{i,f}, d_{k,y}^{i,f})\right)^2}{2\sigma_z^2}\right)$$

where $d_{k,z}^{i,f}$ is the vertical component of the estimated contact point location in world coordinates of foot $f$, according to the $i$-th particle, and $\mathcal{M}(d_{k,x}^{i,f}, d_{k,y}^{i,f})$ is the corresponding map elevation at the xy coordinates of $\mathbf{d}_k^{i,f}$.
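A sketch of this likelihood, reusing the ClassElevationMap above; the 1 cm value from the text is applied here as the Gaussian's standard deviation:

```python
import numpy as np

def likelihood_2p5d(contact_world, grid_map, sigma_z=0.01):
    """contact_world: (3,) estimated foot contact point in the world frame;
    grid_map: a ClassElevationMap as sketched earlier."""
    residual = contact_world[2] - grid_map.height_at(contact_world[:2])
    return np.exp(-0.5 * (residual / sigma_z) ** 2) / (np.sqrt(2.0 * np.pi) * sigma_z)
```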

Measurement Likelihood Model for 3D Data
Our method can incorporate contact events from 3D probing. This is useful for areas where the floor does not provide enough information to localize. In this case, the robot can probe walls and 3D objects with its feet. To better model this situation, we represent the prior map $\mathcal{M}_3 \in \mathbb{R}^{3\times N}$ by a 3D point cloud with $N$ points. The likelihood of a particular contact point is computed using the Euclidean distance between the foot and the nearest point in the map. This likelihood is again modeled as a zero-mean Gaussian evaluated at the Euclidean distance between the estimated contact point $\mathbf{d}_k^{i,f}$ and its nearest neighbor on the map, with variance $\sigma_d^2$:

$$p(z_k^f \mid \mathbf{T}_k^i, \mathcal{M}_3) \propto \exp\!\left(-\frac{\left\|\mathbf{d}_k^{i,f} - \mathrm{kd}(\mathcal{M}_3, \mathbf{d}_k^{i,f})\right\|^2}{2\sigma_d^2}\right)$$

where $\mathrm{kd}(\mathcal{M}_3, \mathbf{d}_k^{i,f})$ is the function that returns the nearest neighbor of $\mathbf{d}_k^{i,f}$ on the map $\mathcal{M}_3$, computed from its k-d tree (Figure 3, middle).
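A sketch using SciPy's k-d tree; sigma_d is an assumed value, since the variance is left symbolic in the text:

```python
import numpy as np
from scipy.spatial import cKDTree

def likelihood_3d(contact_world, map_tree, sigma_d=0.02):
    """map_tree: a cKDTree built once over the (N, 3) map point cloud."""
    dist, _ = map_tree.query(contact_world)  # nearest-neighbor distance, kd(M3, d)
    return np.exp(-0.5 * (dist / sigma_d) ** 2) / (np.sqrt(2.0 * np.pi) * sigma_d)

# Example: map_tree = cKDTree(map_points)  # map_points: (N, 3) prior map
```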

Terrain Classification
We define the haptic terrain classification function $g: \mathbb{S} \mapsto \mathbb{C}$ as the function that associates an element from the signal domain $\mathbb{S}$ to an integer from the class codomain $\mathbb{C}$. The set $\mathbb{S} := \{s : s \in \mathbb{R}^{l(s)\times 6}\}$, where $l(s)$ is the length of the signal $s$, includes sequences of variable-length signals such as the forces and torques (6 values in total) generated by a robot's foot touchdown event while taking a step. The set $\mathbb{C}$ is defined as the integers from 0 to $K-1$, where $K$ is the total number of terrain classes that the robot is expected to be walking on (in our case, $K = 8$).
Having defined the problem as such, we introduce the following: a classification method $h: \mathbb{S} \mapsto \mathbb{C}$, which approximates the function $g$ (as an implementation of $h$ we used a neural network); a dataset consisting of a list of pairs $D := [(s_i, c_i)]$, where $s_i \in \mathbb{S}$ and $c_i \in \mathbb{C}$, divided into training and validation subsets with a ratio of 80:20; and a training process formulated as the approximation of the function $g$ by the function $h$ through minimization of the cross-entropy between the probability distributions generated by these functions.
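A toy PyTorch sketch of this setup (80:20 split, cross-entropy objective); the tensor shapes and random stand-in data are assumptions, and real footstep signals have variable length, handled by the masking described later:

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, random_split

# Stand-in data: 100 six-channel force/torque "steps" (padded to a common
# length purely for brevity) and K = 8 terrain classes.
signals = torch.randn(100, 40, 6)
labels = torch.randint(0, 8, (100,))
dataset = TensorDataset(signals, labels)

n_train = int(0.8 * len(dataset))                  # 80:20 split
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
criterion = nn.CrossEntropyLoss()                  # cross-entropy between h and g
```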

Terrain Class Measurement Likelihood Model
To localize with the support of terrain classes, we represent the prior map as a 2-dimensional grid whose cells are associated with a terrain class. An example is provided in Figure 2, where each cell is colorized according to its class.
Measurements are represented as a piecewise cost function. In the case where the estimated terrain class $\tilde{c}_k$ for a given foot position $\mathbf{d}_k^{i,f}$ matches the class at that location in the prior map, $c = \mathcal{M}_C(d_{k,x}^{i,f}, d_{k,y}^{i,f})$, the probability is a constant value: the maximum of a zero-mean univariate Gaussian with a manually selected variance of 5 cm (which is the width of the foot). If the estimated class does not agree with the expected class in the map, the same Gaussian distribution is used to model the likelihood, evaluated at the distance to the closest point in the map with the estimated class:

$$p(\tilde{c}_k \mid \mathbf{T}_k^i, \mathcal{M}_C) \propto \exp\!\left(-\frac{\left\|\mathbf{d}_k^{i,f} - \mathrm{nc}(\mathcal{M}_C, \mathbf{d}_k^{i,f})\right\|^2}{2\sigma_c^2}\right)$$

The function $\mathrm{nc}(\mathcal{M}_C, \mathbf{d}_k^{i,f})$ returns the nearest 2D point in the map $\mathcal{M}_C$ to the 2D foot position that has the estimated class $\tilde{c}_k$. This latter case is shown in Figure 3, right.
We assume elevation and terrain class measurements $(z_k, \tilde{c}_k)$ are conditionally independent. Therefore, their joint probability can be computed as the product of the individual likelihoods:

$$p(z_k, \tilde{c}_k \mid \mathbf{T}_k^i, \mathcal{M}_C) = p(z_k \mid \mathbf{T}_k^i, \mathcal{M}_C)\; p(\tilde{c}_k \mid \mathbf{T}_k^i, \mathcal{M}_C)$$
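A sketch of the piecewise class likelihood and the conditional-independence product, reusing the ClassElevationMap above; nearest_cell_with_class is a hypothetical brute-force stand-in for nc(·), and the 5 cm value from the text is used as the Gaussian's standard deviation:

```python
import numpy as np

def likelihood_class(contact_xy, pred_class, grid_map, sigma_c=0.05):
    """Piecewise likelihood: the Gaussian's peak on a class match,
    otherwise a Gaussian of the distance to the nearest matching cell."""
    peak = 1.0 / (np.sqrt(2.0 * np.pi) * sigma_c)
    if grid_map.class_at(contact_xy) == pred_class:
        return peak
    d = nearest_cell_with_class(grid_map, contact_xy, pred_class)  # nc(M_C, d)
    return peak * np.exp(-0.5 * (d / sigma_c) ** 2)

def nearest_cell_with_class(grid_map, xy, cls):
    """Hypothetical helper: distance to the closest cell carrying class cls
    (assumes at least one such cell exists in the map)."""
    ij = np.argwhere(grid_map.classes == cls)                # (M, 2) cell indices
    cells_xy = ij * grid_map.resolution + grid_map.origin    # approximate centers
    return np.min(np.linalg.norm(cells_xy - np.asarray(xy), axis=1))

# Conditional independence: joint = likelihood_2p5d(...) * likelihood_class(...)
```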

Implementation
The block diagram of our system is shown in Figure 4. The internal estimator on the ANYmal robot, TSIF (Bloesch et al. (2018)), provides the odometry for the particle estimator, while the neural network estimates the terrain class. This information is compared against the prior map to provide an estimate of the robot's trajectory, $\mathcal{X}_k^*$. Pseudocode for the particle estimator (green block in Figure 4) is listed in Algorithm 1. At time $k$, the estimates of the terrain class $\tilde{c}_k$ and the robot pose $\tilde{\mathbf{T}}_k$ are collected. The pose estimate is used to compute the relative motion $\Delta\tilde{\mathbf{T}}_k$, propagate forward the state of each particle, and draw a sample from the distribution centered at $\Delta\tilde{\mathbf{T}}_k$ with covariance $\Sigma = \mathrm{diag}(\sigma_x^2, \sigma_y^2, \sigma_z^2, \sigma_\phi^2, \sigma_\theta^2, \sigma_\psi^2)$.
The weight of a particle is then updated by multiplying it by the likelihood that each foot is in contact with the map and with the measured terrain class. In our implementation, we modify the likelihood functions from Equation 5 as:

$$\hat{p}(z_k^f \mid \mathbf{T}_k^i, \mathcal{M}) = \max\!\left(p(z_k^f \mid \mathbf{T}_k^i, \mathcal{M}),\; w_{\min}\right)$$

where $w_{\min}$ is a minimum weight threshold, so that outlier contact measurements do not immediately lead to degeneracy. Resampling is triggered when the variance of the weights rises above a certain threshold. This is necessary to avoid dispersion of the particle set across the state space, with many particles having low weight. By triggering this process only when the variance of the weights increases, the particles can first track the dominant modes of the underlying distribution.
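A sketch of the weight update with the $w_{\min}$ floor and the variance-triggered resampling; both threshold values here are assumptions:

```python
import numpy as np

def update_weights(weights, foot_likelihoods, w_min=1e-3):
    """foot_likelihoods: (N, n_feet) per-contact likelihoods; flooring at
    w_min keeps a single outlier contact from zeroing out a particle."""
    weights = weights * np.prod(np.maximum(foot_likelihoods, w_min), axis=1)
    return weights / np.sum(weights)

def maybe_resample(particles, weights, var_threshold=1e-4, rng=None):
    """Importance resampling, fired only when the weight variance grows."""
    if np.var(weights) < var_threshold:
        return particles, weights
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(weights), size=len(weights), p=weights)
    return particles[idx], np.full(len(weights), 1.0 / len(weights))
```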

Particle Statistics
The pose estimate for the $k$-th iteration, $\mathbf{T}_k^*$, is computed from the weighted mean of all the particle poses. However, as we showed in our previous work, the particle distribution is often multi-modal. This motivated us to selectively update different dimensions of the robot pose. If the variances of the particle positions in the x and y axes are low (i.e., $\sigma_x$ and $\sigma_y$ are below a threshold), we assume a well-defined estimate and update the robot's full pose. However, if they are high, we update only the z component of the robot's location, whose variance is always low as the robot keeps contact with the ground.
In practice, with the terrain classification we found the terrain course was sufficiently detailed to keep the particle position variance in the xy-plane low; therefore, z-only updates were rare. The threshold we used was a standard deviation of 10 cm, which corresponds to the typical uncertainty in our experiments.
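A sketch of this selective update, using the 10 cm standard-deviation threshold stated above; the pose layout and helper names are assumptions:

```python
import numpy as np

def selective_update(particles, weights, previous_estimate, std_threshold=0.10):
    """Full update only when the weighted xy spread is below the threshold;
    otherwise keep the previous estimate and refresh z alone."""
    mean = np.average(particles, axis=0, weights=weights)
    var_xy = np.average((particles[:, :2] - mean[:2]) ** 2, axis=0, weights=weights)
    if np.all(np.sqrt(var_xy) < std_threshold):
        return mean                          # well-defined cluster: full pose
    estimate = np.array(previous_estimate, dtype=float)
    estimate[2] = mean[2]                    # degenerate in xy: z-only update
    return estimate
```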
To avoid particle degeneracy, importance sampling can be performed in areas with higher likelihood. For example, if a grass terrain is detected, some particles could be injected into every grass region in the map. Further investigation of the benefits of importance sampling is left for future work.

Terrain Classifier Network
The neural network architecture used for the terrain classification module is depicted in Figure 5 and consists of three consecutive components: convolutional, recurrent, and predictive. Both the convolutional and recurrent components process variable-length data. Thus, for the convolution part to work properly, masking of the padded values must be used to prevent these values from affecting the forward-pass result.
In Bednarek et al. (2019b), we showed that both convolutional (if fixed-length signals are used) and recurrent neural components can be successfully used to solve the terrain classification problem based on signals from force and torque sensors. Moreover, in Bednarek et al. (2019a) we introduced a model that uses both convolutional and recurrent components, achieving the best results in terrain classification. The limitation of this architecture is that it can process only fixed-length signals, so for the needs of our application we adapted the model to process variable-length signals by adding an intermediate-layer masking module.
The first component consists of two residual layers (ResLay). The ResLay used in our work is an adaptation of the layer proposed by He et al. (2016), with the 2D convolutions replaced by 1D ones and with masking support. The recurrent component uses two bidirectional layers (Bidir) with two GRU (Cho et al. (2014)) cells in each. The output of the recurrent component is an average of the two resulting hidden states of the last Bidir. The final output of the neural network is produced by the predictive component, which takes the recurrent component's output and, using two fully connected layers (FC), produces a probability distribution from which the classification is taken.

Fig. 5 The main blocks of the neural architecture are the convolutional (conv), recurrent (GRU bidir), and fully connected (FC) blocks. The masking mechanism used in the variable-length signal processing blocks takes the appropriate masks, marked in green, orange, and red. These masks correspond to the individual signal lengths, taking into account the initial length and the reduction in size by convolution with stride equal to 2.
The consecutive layers of the model are presented in Figure 5. Each convolution block executes the following operations: batch normalization (Ioffe and Szegedy (2015)), dropout (Srivastava et al. (2014)), and the ELU activation function (Clevert et al. (2016)). All are modified to support masking. We used a kernel size of 5 in each convolution layer. The output of each ResLay block is half the length of its input thanks to the use of stride in the convolutions (marked as /2). Dropout is also used in every Bidir and FC layer (which also use batch normalization).
The model uses a dropout rate of 0.3 and a batch normalization momentum of 0.6. The proposed neural network has 1,374,920 trainable parameters.
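To make the masking mechanism concrete, below is a minimal PyTorch sketch of one masked stride-2 convolution step. The kernel size 5, stride 2, dropout 0.3 and batch-norm momentum 0.6 follow the text; everything else is an assumption, and the batch-norm statistics here still include padded positions, a simplification of a fully masked implementation:

```python
import torch
from torch import nn

class MaskedConvBlock(nn.Module):
    """One masked stride-2 1D convolution step: padded time steps are zeroed
    before the convolution so they cannot leak into valid outputs, and the
    mask is downsampled in lockstep with the stride."""

    def __init__(self, c_in, c_out, kernel=5, stride=2, p_drop=0.3):
        super().__init__()
        self.conv = nn.Conv1d(c_in, c_out, kernel, stride=stride, padding=kernel // 2)
        self.bn = nn.BatchNorm1d(c_out, momentum=0.6)
        self.drop = nn.Dropout(p_drop)
        self.act = nn.ELU()
        self.stride = stride

    def forward(self, x, mask):
        # x: (B, C, T); mask: (B, 1, T) with 1.0 on valid steps, 0.0 on padding
        x = self.conv(x * mask)
        mask = mask[:, :, ::self.stride][:, :, : x.shape[-1]]  # track length /2
        return self.act(self.drop(self.bn(x))) * mask, mask

# Example: y, m = MaskedConvBlock(6, 32)(torch.randn(2, 6, 40), torch.ones(2, 1, 40))
```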

Training
The learning process was carried out using the k-fold cross-validation methodology, with $k = 5$. The AdamW optimizer of Loshchilov and Hutter (2019) was used to minimize the loss function, with the following parameters: learning rate 5e-4 with exponential decay, and weight decay 1e-4 with cosine decay.
The learning process lasted for at least 1000 epochs, after which training continued for as long as the result had improved within the last 100 epochs. The size of each mini-batch was 256.
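A sketch of the stated optimizer configuration; the decay rate and the placeholder module are assumptions, and PyTorch has no single built-in cosine schedule for the weight-decay term, so that part is only noted:

```python
import torch

model = torch.nn.Linear(6, 8)  # placeholder for the classifier network
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-4)
lr_schedule = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
# gamma is assumed; the cosine decay on the weight-decay term is omitted here.
```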

Experimental Results
We extensively evaluated the performance of our algorithm in three different experiments, each one targeting a different type of localization. These are described in more detail in Sections 6.2, 6.3, and 6.4.

Evaluation Protocol
There are two different modalities in our algorithm: HL-G (Haptic Localization Geometric), which uses only geometric information, and HL-GC (Haptic Localization Geometric and Class), which uses both geometric and terrain class information.
HL-G was tested in Experiments 1 and 2 using 2.5D and 3D prior maps, respectively. HL-GC was tested in Experiment 3 with 2.5D maps augmented with terrain class information. We also compared the performance against the HL-G modality in Experiment 3.

Evaluation Metrics
We quantitatively evaluated localization performance by computing the mean of the Absolute Translation Error (ATE) as described by Sturm et al. (2012):

$$\mathrm{ATE} = \frac{1}{T}\sum_{k=1}^{T} \left\| \mathrm{trans}\!\left(\mathbf{T}_k^{-1}\,\tilde{\mathbf{T}}_k\right) \right\|$$

where $\mathbf{T}_k$ and $\tilde{\mathbf{T}}_k$ are the robot's ground truth and estimated poses, respectively. In contrast to Sturm et al. (2012), we do not perform alignment of the trajectories, as the ground truth and estimated poses are represented in the same coordinate system. A qualitative evaluation was also performed for Experiments 1 and 2 by assessing the ability of the robot to reach its planned goals or end-effector targets while using the localization online. This demonstrated the benefit of the localization when used in the loop with the onboard motion planner.
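A sketch of this metric over matched pose pairs; per the text, no alignment is applied:

```python
import numpy as np

def mean_ate(gt_positions, est_positions):
    """Both inputs: (T, 3) matched positions expressed in the same world frame."""
    return float(np.mean(np.linalg.norm(gt_positions - est_positions, axis=1)))
```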

Ground Truth and Prior Map
The ground truth trajectories were collected by motion capture systems at 100 Hz. Both the robot and the terrain courses were tracked with millimeter accuracy via reflective markers installed on them. At the start of each experiment, the relative position of the robot within the map was measured using ground truth and used for initialization only. Thereafter, the pose of the robot was estimated using the particle filter. To account for initial errors, particles at the start were sampled from a Gaussian centered at the initial robot pose with a covariance of 20 cm.
The prior maps were captured with survey grade laser scanners (Leica BLK-360 and SURPHASER 100HSX) which provide registered point clouds with sub-centimeter accuracy.

Experiment 1: 2.5D Terrain Course
In this experiment, the robot was commanded to navigate between four walking goals at the corners of a rectangle. One of the edges required crossing a 4.2 m terrain course composed of a 12° ascending ramp, a 13 cm high chevron pattern, an asymmetric composition of uneven square blocks, and a 12° descending ramp (Figure 6). After crossing the wooden course, the robot returned to the starting position across a portion of flat ground, which tests how the system behaves in feature-deprived conditions. While blind reactive locomotion has been developed by a number of groups, including Di Carlo et al. (2018) and Focchi et al. (2020), our blind controller was not sufficiently reliable to cross this terrain course, so we resorted to using the statically stable gait from Fankhauser et al. (2018), which used a depth camera to aid footstep planning. However, the localization was performed without access to any camera information.
To demonstrate repeatability, we performed five trials of this experiment, for a total distance traveled of more than 0.5 km and a duration of 1 h 13 min. A summary of the experiments is presented in Table 1, where HL-G shows an overall improvement of between 50 % and 85 % in the Absolute Translation Error (ATE) compared to the onboard state estimator. The ATE is 33 cm on average, which reduces to 10 cm when evaluating only the feature-rich portion of the experiments (i.e., the terrain course traversal).

Fig. 6 Experiment 1: ANYmal haptic localization experiments. The robot traverses the terrain, turns 90 deg right and comes back to the initial position, passing through the flat area. The goals given to the planner are marked by the dark red disks, while the planned route is a dashed green line (one goal is out of the camera field of view). The world frame W is fixed to the ground, while the base frame B is rigidly attached to the robot's chassis. The mutual pose between the robot and the terrain course is bootstrapped with the motion capture.
For trials 1 and 2, the robot was manually operated to traverse the terrain course, completing two and four loops, respectively. In trials 3-5, the robot was commanded to follow the rectangular path autonomously. In these trials, the haptic localization algorithm was run online in closed loop and effectively guided the robot towards the goals (Figure 7). Using only the prior map and the contact events, the robot stayed localized in all the runs and successfully tracked the planned goals. This can be seen in Figure 7, where the estimated trajectory (in blue) diverges from the ground truth on the xy-plane when the robot is walking on flat ground. This is due to growing uncertainty from the lack of geometric information; however, the covariance reduces significantly and the cluster mean re-aligns with the ground truth when the robot reaches the terrain course. Figure 8 shows in detail the estimator performance for each of the three linear dimensions and yaw. Since absolute position and yaw are unobservable, the drift on these states is unbounded. In particular, the error of the odometry filter (TSIF, Bloesch et al. (2018); magenta dashed line) is dominated by upward drift (due to kinematics errors and impact nonlinearities, see the third plot) and yaw drift (due to IMU gyro bias, see the bottom plot). This drift is estimated and compensated for by the haptic localization (blue solid line), allowing accurate tracking of the ground truth (green solid line) in all dimensions. This can be noted particularly at the four peaks in the z-axis plot, where the estimated trajectory and ground truth overlap; these times coincide with the robot standing at the top of the terrain course.

Experiment 2: Online Haptic Exploration on Vertical Surfaces
The second experiment involved a haptic wall-following task, with the robot starting in front of a wall at an uncertain location. The particles were again initialized with 20 cm position covariance. To test the capability to recover from an initial error, we applied a 10 cm offset in both x and y from the robot's true position in the map. At the start, the robot was commanded to walk 1 m to the right (negative y direction) and press a button on the wall. To accomplish the task, the robot needed to "feel its way" by alternating probing motions with its right front foot and walking laterally to localize inside the room. The probing motions were pre-scripted so that with each step to the right, the robot probed both in front and to its right. The whole experiment was executed blindly with the static controller from Fankhauser et al. (2018).
As shown in Figure 9, the robot was able to correct its localization and complete the task of touching the button. The initial probe to the front reduced uncertainty in the robot's x direction, reducing the particles to an ellipsoid elongated along y. As the robot moved, uncertainty in the y direction increased slightly. By touching the wall on the right, the robot re-localized in all three dimensions, in much the same way as a human following a wall in the dark would. The re-localization allowed the robot to press the button, demonstrating the generalization of our algorithm to 3D. This would enable a robot to localize by probing a known piece of mining machinery, for example, allowing it to perform maintenance tasks. The final position error was [7.7, −3.7, −0.2] cm in the x, y and z directions, respectively.

Experiment 3: Terrain Classification
In the third experiment, we demonstrate localization using terrain semantic information, tested on a custom-designed terrain course. Multiple 1 × 1 m tiles of different terrain materials were placed over a 3.5 × 7 m area. The course included a 20 cm high platform with two ramps covered in different terrain materials, as shown in Figure 10. The terrain types used were gum, carpet, PVC, foam, sand, artificial grass, ceramic and gravel.
For the terrain classifier training, we gathered an additional dataset of the robot walking on the different patches, consisting of 8773 steps. The dataset was split into 7018 training and 1775 testing samples. The network was trained following the steps described in Section 5.2.1. We obtained a classification accuracy of 94.12 %. The mean and standard deviation of the accuracy were estimated from k-fold cross-validation to be 94.03 % and 0.09, respectively.

Fig. 9 The top row shows the robot performing the experiment, while the bottom row shows the particle distribution and localization estimate. The particle set is colorized by normalized weight according to the jet colormap (i.e., dark blue = lowest weight, dark red = highest weight). First, on the bottom left, the robot has an initial distribution with poses equally weighted. The robot makes a forward probe and then moves to the right. Now the particles are distributed as an ellipse with high uncertainty to the left and right of the robot. Then, the robot makes a probe to the right and touches an obstacle; the particle cloud collapses into a tight cluster. Since the robot is now localized, it is able to complete the task of pressing the button on the wall.

The robot was equipped with sensorized feet from Valsecchi et al. (2020) and autonomously walked between pre-programmed waypoints placed over the entire course, including several passes over the ramp. Large sections of the trajectory were only on the flat terrain tiles, forcing the algorithm to rely mostly on terrain classification for localization. Unlike Experiment 1, the robot was able to walk completely blind and no exteroception was used for footstep planning. A statically stable gait was used such that only one foot was in the air at a time.
To demonstrate repeatability, we performed three trials of this type, for a total distance traveled of more than 0.7 km and a duration of 1 h 7 min. We compare results produced using HL-G (Geometry) and HL-GC (Geometry and Terrain Classification). As the majority of the terrain course is flat, there is not enough information for geometry-only localization to be continuously accurate. Only when terrain class information is also used can the robot localize in all parts of the terrain course.
A summary of the experiments is presented in Table 2, where HL-GC shows an overall improvement of between 14 % and 56 % in the Absolute Translation Error (ATE) compared to HL-G. Using only prior knowledge of the terrain geometry and class, the robot stayed localized in all the runs and bounded the linearly growing drift of the state estimator. This can be seen in Figure 11, where the estimated trajectory (in red) stays near the ground truth trajectory (green). In areas with large patches of the same material, such as the gravel (dark blue) and ceramic (yellow), there is not enough information to localize in the xy-plane and the localization drifts. When the robot crosses the boundary into a new terrain type, the localization is able to correct. Figure 12 shows in detail the estimator performance for each of the three linear dimensions and yaw. As in Experiment 1, the error of the odometry filter (TSIF, magenta dashed line) is dominated by upward and yaw drift. This drift is estimated and compensated for by the haptic localization (red solid line), allowing accurate tracking of the ground truth (green solid line) in all dimensions.

Discussion
The results presented in Sections 6.2 and 6.3 demonstrate that terrain with a moderate degree of geometric complexity (such as Figure 6) already provides enough information to bound the uncertainty of the robot's location. The effectiveness of a purely geometric approach is obviously limited by the actual terrain morphology in a real world situation, which would need to contain enough features such that all the DoF of the robot are constrained once the robot has touched them.
In the case where there is not enough geometric information, we have shown in Section 6.4 that terrain semantics can be used to localize. With sufficiently diverse terrain types (as shown in Figure 10), boundary crossings from one terrain type to another provide enough information to correct for drift in the xy-plane. A video showing all of these experiments is attached as supplementary material.

Fig. 11 A: At a boundary crossing, the particle mean diverges from the ground truth. However, as the particle cloud nears the ramp, the geometric information gives higher likelihood to the particles in the center of the terrain. B: Another boundary crossing, which triggers good localization updates at two separate crossings.

Figure 13 shows the evolution of the particles up to the first half of the terrain course for Experiment 1, Trial 2. As the robot walks through, the particle cluster becomes concentrated, indicating good convergence to the most likely robot pose.

Analysis of Particle Distribution on Geometric Terrain
In the third subfigure, it can be noted how the probability distribution over the robot's pose follows a bimodal distribution, visible as two distinct clusters of particles. This situation justifies the use of particle filters, as they are able to model the non-Gaussian distributions which can arise from a particular terrain morphology. In this case, the bimodal distribution is caused by the two identical gaps between the chevrons. In such situations, a weighted average of the particle cluster would lead to a poor approximation of the true pose distribution; therefore, the selective update strategy described in Section 5.1 is crucial to reject such an update.

Figure 14 shows data from Experiment 3, Trial 1. We compare results from HL-G, HL-GC and HL-C (Haptic Localization using class data only). We can see that even with only class information, this method is able to keep the localization error bounded in the xy-plane (the mean ATE for the class-only trajectory was 0.63 m). In the third subplot from the top, the class-only trajectory drifts upward in a similar way to TSIF in Figure 12. This is because of the absence of any measurement in z; here our method relies on the proprioceptive state estimator.

Analysis of Particle Distribution on Terrain Class
Further analysis of the effect of terrain class on localization is shown in Figure 13. Here, we show the evolution of particles from HL-GC in Experiment 3, Trial 3. The particles, which initially are normally distributed around the starting position, quickly converge along the z axis as the floor elevation information is used. Once the boundary transition from green to red occurs, the particles correct for drift along the direction of travel.

Conclusion
We have presented a haptic localization algorithm for quadrupedal robots based on Sequential Monte Carlo methods.

Fig. 13 Evolution of two particle sets during experiments. The particle set is colorized by normalized weight according to the jet colormap (i.e., dark blue = lowest weight, dark red = highest weight). The green line indicates the ground truth trajectory. Top (Experiment 1): A) At the start, all the particles have the same weight and are normally distributed around the starting position. B) After a few steps on the ramp, the robot pose is well estimated in the x and z directions, but there is uncertainty in y. C) When the robot approaches the chevron, the particle set divides into two clusters, indicating two strong hypotheses as to the robot pose. D) After a few more steps on the chevron, the robot is fully localized and the particles are tightly clustered. Bottom (Experiment 3): As the robot walks from left to right, the particle cloud makes two terrain class transitions. As the robot crosses the first transition, the cloud becomes narrower along the direction of travel as error along this axis is corrected.
The algorithm can fuse geometric information (in 2.5D or 3D) as well as terrain semantics to localize against a prior map. We have demonstrated that even using only geometric information, walking over a non-degenerate terrain course containing slopes and interesting geometry can reduce localization error to 10 cm. Our method also works if the robot probes vertical surfaces, measuring its environment in full 3D. Finally, we have shown how, in areas of even sparser geometric information, terrain semantics can be used to augment this geometry.
The proposed approach demonstrated an average of 20 cm position error over all areas of a terrain course with different terrain classes and geometries. The ability to localize purely proprioceptively is valuable for repetitive autonomous tasks in vision-denied conditions, such as inspections of sewage systems. The method could also serve as a backup localization system in case of sensor failure, enabling a robot to complete its task and return to base.

Fig. 14 Difference in results when terrain class information and elevation are selectively removed. The median ATE using only class data was 0.63 m, using only elevation data 0.23 m, and using both 0.14 m.