1 Introduction

Identifying traversable paths and, accordingly, taking proper control actions is a fundamental requirement for a mobile robot to safely and autonomously navigate in non-urban environments. This capability is needed in several robotic applications where autonomous vehicle navigation plays a crucial role, such as search and rescue (Delmerico et al. 2019), precision farming (Yandun Narváez et al. 2018) and planetary exploration (Hewitt et al. 2017). Several methods have been proposed for autonomous navigation of ground vehicles in unstructured environments. Among them, terrain traversability analysis is a widely adopted approach, which has recently gained momentum thanks to the development of learning-based methods (Guastella and Muscato 2021; Borges et al. 2022).

Some works address terrain traversability analysis as a traversal cost regression problem, using inverse reinforcement learning (Pflueger et al. 2019; Zhu et al. 2019) to derive a map of costs for a subsequent path planning phase. Other approaches identify the terrain type (Rothrock et al. 2016; Gonzalez and Iagnemma 2018) or pose the problem as a binary classification task (i.e. “traversable” or “non-traversable”) (Chavez-Garcia et al. 2018; Holder and Breckon 2018) or as a per-region regression task (Palazzo et al. 2020).

One of the main limitations of existing learning-based approaches tackling traversability estimation is the need for annotated training data. Regardless of the specific formulation of traversability, it is necessary to collect a large set of images, generally requiring a robot to be operated on the target environment (or a similar one). Then, a human operator needs to analyze each individual frame and mark traversable areas. A second limitation of existing approaches is the lack of strategies to use traversability predictions for navigation. Although existing methods show promising results in terms of prediction accuracy, it is not obvious how to use their outputs to drive a ground vehicle.

In this work, we propose a method to tackle both limitations, by (1) training a model on a combination of annotated synthetic data, used for supervised training, and unannotated real data, used for unsupervised domain adaptation; (2) developing a simple, yet effective navigation strategy using the estimated traversability.

In particular, a synthetic dataset is first generated using the MIDGARD (Vecchio et al. 2022) simulation environment. Training is then carried out by combining self-supervision (Caron et al. 2020) and domain adaptation (Ganin and Lempitsky 2015) techniques to ensure that the model, trained in a supervised way on synthetic images and annotations, behaves correctly when processing real-world data.

Furthermore, we propose a navigation algorithm that employs the egocentric outcome of the traversability prediction by defining a control law to directly steer the vehicle towards the safest area. On-field and simulated tests show that the proposed approach is capable of autonomously exploring unknown environments, while avoiding obstacles and harsh terrains.

To summarize, the contributions of our paper are the following:

  • We create a photorealistic synthetic dataset for traversability estimation in non-urban environments, with computer-generated annotations for supervised training of learning-based models.

  • We propose a deep learning method for traversability estimation from RGB images, employing supervised training on synthetic data, self-supervised pre-training of the traversability predictor, and unsupervised adaptation on real data, thus relieving the user from the burden of manual annotation. Performance analysis shows that the proposed model outperforms existing approaches when real annotations are used.

  • We perform both simulated and on-field validation of the proposed method and define an autonomous navigation approach based on the traversability outcome.

2 Related work

2.1 Terrain traversability analysis

Traversability anticipation has been proposed as a long-range prediction problem based on visual data since early works (Howard et al. 2006; Hadsell et al. 2008), as an alternative to the limited perception range of LIDARs or stereo cameras. More recent deep-learning models perform image segmentation for terrain classification, either as a binary (i.e., traversable or not) (Holder and Breckon 2018) or as a multi-class classification problem (Rothrock et al. 2016; Valada et al. 2016; Maturana et al. 2018). However, the literature does not offer an established approach for translating the traversability outcome into driving commands for the vehicle. The most common solution is to project scene classification output from the image plane to a polar top-view map of the vehicle’s surroundings (Howard et al. 2006; Hadsell et al. 2008; Maturana et al. 2018), thus falling back into a path planning problem. Other works leveraging inverse reinforcement learning aim at directly inferring traversability costmaps and planning trajectories from expert demonstrations and the environment point cloud (Zhang et al. 2018; Zhu et al. 2020), but rely heavily on 3D LIDARs rather than passive RGB cameras.

Besides terrain traversability analysis, end-to-end methods have recently been proposed for off-road navigation, where a direct mapping from exteroceptive data (typically a combination of geometric and visual data) to driving commands is learned (Pan et al. 2020; Nguyen et al. 2020). Successful examples have also been reported on the navigation of multi-rotors in outdoor unstructured environments, such as forests (Smolyanskiy et al. 2017; Giusti et al. 2016). However, end-to-end methods tend to make it more difficult to grasp the relationship between the perceived environment and the chosen driving action, due to the lack of interpretable intermediate representations.

Inspired by Palazzo et al. (2020), our approach provides an immediate interpretation of a scene, without the need to construct a top-view costmap, that can be directly used to drive the vehicle towards traversable regions within the camera field of view. The proposed approach mainly differs from Palazzo et al. (2020) in the usage of computer-generated annotations on a synthetic dataset, bridging the gap between synthetic and real images by means of a self-supervision approach (specifically adapted to the problem at hand; see Sect. 3.5) and unsupervised domain adaptation (Sect. 3.7). Additionally, we improve the model architecture by pre-training and freezing the feature extraction backbone network, significantly speeding up training, while increasing the complexity of the traversability estimation network to compensate for the extraction of more generic backbone features (Sect. 3.4). Finally, we replace the “greedy” navigation algorithm in Palazzo et al. (2020) with a more robust approach (Sect. 4), and provide an extensive analysis of the performance of our approach in a real-world scenario (Sect. 6).

2.2 Synthetic data collection in simulated environments

Training on synthetic data has proven to be a suitable alternative to training on real-world data for autonomous navigation. Recent work has focused on the creation of large-scale, high-resolution synthetic datasets (Haltakov et al. 2013; Richter et al. 2016; Gaidon et al. 2016; Richter et al. 2017; Müller et al. 2021), as well as the development of embodied simulators for training (Vecchio et al. 2022; Skinner et al. 2016; Kolve et al. 2017; Dosovitskiy et al. 2017; Shah et al. 2018; Xia et al. 2018; Savva et al. 2019; Song et al. 2020; Kadian et al. 2020). Several of these simulation platforms have been used to support dataset generation for navigation tasks, both in structured (Dosovitskiy et al. 2017; Savva et al. 2019) and unstructured (Shah et al. 2018; Song et al. 2020) environments:

  • AirSim (Shah et al. 2018) is an open source simulator which supports software- and hardware-in-the-loop simulation, providing both an interactive mode for live agents’ training and a data collection mode.

  • OAISYS (Müller et al. 2021) is a photorealistic terrain simulation pipeline for unstructured outdoor environments, built on top of Blender. It is designed for the collection of synthetic datasets and is capable of generating large varieties of scenes with automatic annotations in terms of instance segmentation, semantic segmentation, and depth.

  • MIDGARD (Vecchio et al. 2022) features an interactive mode as well as a data collection mode for creating automatically-annotated datasets. It provides a wide variety of sensors including depth, semantic and instance segmentation, and a traversability-annotator tool.

In this work, we use MIDGARD to automatically collect the synthetic training dataset, as described in Sect. 5.2.

Fig. 1

The proposed framework. Real and synthetic images are fed to a pre-trained feature extractor. Visual embeddings are then fed to a traversability estimation module. The traversability estimator is first pre-trained through self-supervision on both real and synthetic images, and then trained in a supervised way using only annotations for the synthetic data. Features from both domains are fed to a domain classification layer, trained to distinguish between real and synthetic features: gradients estimated through backpropagation are altered by a gradient reversal layer (GRL) (Ganin and Lempitsky 2015) to maximize classification loss and match feature distributions between real and synthetic images

2.3 Domain adaptation

It is recognized that the availability of annotated data is often limited by the required efforts for their collection. Moreover, the performance of deep learning models typically degrades when applied to a data distribution that does not match the training one (either real or synthetic), which represents a further difficulty in the usage of pre-trained models on a custom task. Several transfer learning and domain adaptation methods have therefore been proposed in the literature (Wang and Deng 2018), which aim at dealing with the distribution shift between different data domains. In this work, we focus on unsupervised domain adaptation: we assume that we have an annotated source dataset, on which a model can be trained supervisedly, and an unannotated target dataset, on which we intend to ultimately employ the trained model. A straightforward approach to unsupervised domain adaptation treats it as a classification task on the target domain, using pseudo-labels estimated for target samples by a model trained on the source domain (Yan et al. 2017; Saito et al. 2017; Zhang et al. 2015; Long et al. 2016). The success of these approaches usually depends on the similarity between the source and target distributions, and thus on how accurate pseudo-labels are. Other approaches aim, instead, at minimizing the difference between feature statistics on the two domains: many of these methods are based on introducing a maximum mean discrepancy loss term to the training objective (Borgwardt et al. 2006; Ghifary et al. 2014; Long et al. 2015; Zellinger et al. 2017). A similar objective can be pursued by leveraging generative adversarial networks (GANs) (Goodfellow et al. 2014; Liu and Tuzel 2016; Yoo et al. 2016; Shrivastava et al. 2017; Bousmalis et al. 2017). Inspired by the adversarial competition of GANs, the gradient reversal layer (GRL) method (Ganin and Lempitsky 2015) learns an intermediate representation that is designed to maximize domain classification error, so that similar features are extracted from the two domains. In this work, we opt for this approach due to its simplicity and its recent success in complex domain adaptation scenarios (Palazzo et al. 2020; Bellitto et al. 2020).

2.4 Self-supervised learning

Model self-supervision aims at learning features from unannotated data, in an attempt to alleviate the need for human-crafted ground truth and reduce the performance gap with supervised network pre-training (Chen et al. 2020; He et al. 2020; Misra and Maaten 2020). Some approaches formulate self-supervision as an instance discrimination task (Dosovitskiy et al. 2015), where each image in the dataset is considered as a single class. Other methods pose the problem by defining a contrastive loss (Hadsell et al. 2006) that attempts to extract similar features from an image and its transformations, while pushing away features from different images. This removes the notion of instance classes by directly comparing image features, enforcing invariance across features of corresponding transformations. Since comparing all possible image pairs in a large dataset is computationally challenging, many works approximate the loss by reducing the number of comparisons to random subsets of images during training (Chen et al. 2020; He et al. 2020; Wu et al. 2018). Among these, a recent approach (Caron et al. 2020) proposes to learn features by “swapping assignments between multiple views” (SwAV) of the same image. In this work, we adapt SwAV to make it suitable to our traversability task, in order to initialize the model with features that apply to both real and synthetic data.

3 Traversability prediction

3.1 Overview

An overview of the proposed architecture for traversability prediction is shown in Fig. 1. A feature extraction backbone is used to compute compact representations of the input images. Visual features are then processed by two model branches: one estimating the traversability, initially pre-trained with self-supervision on both real and synthetic images, and then fine-tuned in a supervised way on synthetic images only; the other is, instead, trained to distinguish between real and synthetic images, thus supporting unsupervised domain adaptation to unannotated real images.

3.2 Problem formulation

Following the definition in Palazzo et al. (2020), we pose the traversability estimation as a vector regression problem, where each component of the target vector indicates the traversability score of a corresponding region in the input image. More in detail, we divide the input RGB images into a set of vertical bands and regress an array of traversability scores related to the traversable horizon within each band. Although this formulation may seem to oversimplify the traversability estimation problem, it properly drives the vehicle to avoid potentially dangerous terrain areas, since there is no need to know any further information on the environment beyond the traversable horizon within each band. The navigation capabilities of the robotic platform are implicitly taken into account during the annotation process (e.g., the maximum traversable surface steepness). Automatic annotation for synthetic data is described in Sect. 5.2.

We tackle the problem as an unsupervised domain adaptation task, where we enforce the model, trained in a supervised way on a source synthetic dataset, to generalize to an unannotated target dataset of real images.

More formally, given a tensor \(\textbf{I} \in \mathcal {I}\) of size \(C \times H \times W\) representing an RGB image, this is divided into a set of k vertical bands: \(\left\{ \textbf{I}_1, \textbf{I}_2, \dots , \textbf{I}_k \right\} \), with:

$$\begin{aligned} \textbf{I}_i = \textbf{I}_{\left[ 0:C-1, 0:H-1,\frac{iW}{k}:\frac{(i+1)W}{k}-1 \right] }, \end{aligned}$$
(1)

where \(\textbf{I}_{\left[ \cdot \right] }\) denotes subtensor indexing and the  :  operator selects a range of coordinates along the corresponding dimension. Each resulting portion has size \(C \times H \times \frac{W}{k}\): if W is not divisible by k, the width of each vertical band is suitably rounded. The number of vertical bands k is chosen to be an odd number, as the band position within the image frame determines a potential direction to drive the vehicle (see Sect. 6). Indeed, when k is odd, the presence of a central band avoids an undesired oscillatory behavior during navigation when the platform is supposed to go straight. Figure 2 shows an example of the resulting division for a given input image. Traversability scores for the image subregions are encoded by a vector \(\textbf{t} \in \mathbb {R}^k\), with each component ranging from 0 (not traversable) to 1 (fully traversable).
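As a concrete illustration of the band split in Eq. (1), the short sketch below divides an image tensor into k vertical bands and pairs it with a target score vector; the function name and the use of NumPy are ours, not part of the original pipeline.

```python
import numpy as np

def split_into_bands(image: np.ndarray, k: int = 9):
    """Split a C x H x W image into k vertical bands (cf. Eq. 1).

    np.array_split takes care of rounding band widths when W is not
    divisible by k.
    """
    return np.array_split(image, k, axis=2)

# Toy example: a dummy 3 x 128 x 228 image with one score per band in [0, 1]
image = np.random.rand(3, 128, 228)
bands = split_into_bands(image, k=9)
t = np.random.rand(9)  # target traversability vector t
print([b.shape for b in bands], t.shape)
```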

Fig. 2

Example of image division with \(k=9\) vertical bands; for each band, a traversability score indicates the position (in terms of image height) of the closest non-traversable elements within the band. Image from Palazzo et al. (2020)

Given an input image, our objective is to estimate a set of traversability scores \(\tilde{\textbf{t}}\) that approximate target values \(\textbf{t}\), by means of a deep model implementing a function \(f: \mathcal {I} \rightarrow [0,1]^k\). Given an input image \(\textbf{I}\), f is trained to estimate \(\tilde{\textbf{t}} = f(\textbf{I})\) that is as close as possible to the actual \({\textbf {t}}\).

Unlike the approach introduced in Palazzo et al. (2020), we do not aim to learn f in a supervised way on a real dataset, i.e., by training the model with the correct manually-annotated traversability scores. Instead, we assume that data used for training our model can be split into a source domain \(\mathcal {D}_s\) and a target domain \(\mathcal {D}_t\). The source domain \(\mathcal {D}_s\) includes pairs \((\textbf{I}, \textbf{t})\) of synthetic images with computer-generated traversability scores. The target domain \(\mathcal {D}_t\), instead, includes real images for which no manual annotations are available during training.

3.3 Feature extraction backbone

A DeepLabV2 (Chen et al. 2016) backbone, based on ResNet-101 (He et al. 2016) and pre-trained on COCO-Stuff 10k (Caesar et al. 2016), is employed to extract visual features from an input RGB image. During training, we freeze (i.e., do not fine-tune) the parameters of the backbone. This choice aims at reducing overfitting on relatively small datasets, due to the complexity of the backbone. As an additional benefit, pre-computing image features speeds up the training process. Our experiments (Sect. 5) show that freezing the feature extraction backbone indeed improves performance compared to fine-tuning the entire architecture.
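The freezing strategy amounts to disabling gradients for the backbone and running it in inference mode; a minimal sketch follows, using torchvision's DeepLabV3/ResNet-101 backbone as a stand-in, since the exact DeepLabV2 weights pre-trained on COCO-Stuff 10k are not bundled with common libraries.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet101

# Stand-in for the DeepLabV2/ResNet-101 backbone used in the paper
backbone = deeplabv3_resnet101(weights="DEFAULT").backbone

# Freeze all backbone parameters: no fine-tuning, and features can be
# pre-computed once per image to speed up training.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

with torch.no_grad():
    feats = backbone(torch.randn(1, 3, 128, 228))["out"]
print(feats.shape)  # (1, 2048, h, w) feature map fed to the traversability head
```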

3.4 Traversability estimation

The architecture of our traversability estimation network is designed to process the features extracted from the DeepLabV2 backbone and regress a traversability score for each image portion. In detail, it receives input features from the DeepLabV2 backbone and processes them through a cascade of convolutional layers, aimed at gradually increasing the number of features while reducing spatial dimensions; a final fully-connected layer with k-dimensional output predicts traversability scores. Architectural details of the layers in the traversability estimation network are presented in Table 1. Each convolutional layer is followed by batch normalization and ReLU activation.

Table 1 Architectural details of the traversability estimation network in the proposed model
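A minimal sketch of such a head is shown below; the number of stages and channel widths are illustrative placeholders (the exact configuration is given in Table 1), plain convolutional blocks stand in for the residual layers, and the final sigmoid that bounds outputs to [0, 1] is our assumption based on the problem formulation in Sect. 3.2.

```python
import torch.nn as nn

def conv_block(c_in, c_out, stride=2):
    # Conv -> BatchNorm -> ReLU, as described in Sect. 3.4
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class TraversabilityHead(nn.Module):
    def __init__(self, in_channels=2048, k=9):
        super().__init__()
        # Project backbone features, then gradually increase channels
        # while shrinking spatial size (illustrative sizes).
        self.features = nn.Sequential(
            conv_block(in_channels, 512, stride=1),
            conv_block(512, 768),
            conv_block(768, 1024),
        )
        self.pool = nn.AdaptiveMaxPool2d((2, 2))
        self.fc = nn.Linear(1024 * 2 * 2, k)

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.fc(x.flatten(1)).sigmoid()  # k scores in [0, 1]
```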

3.5 Self-supervised initialization

Inspired by Palazzo et al. (2020), we employ domain adaptation (described in the next section) to simultaneously train our model to perform traversability prediction on synthetic images and to adapt itself to perform the same task on real images: the objective is to encourage the model to learn features that work equally well on both the synthetic source domain and the real-world target domain. However, unlike (Palazzo et al. 2020), we do not fine-tune the DeepLabV2 backbone during training; hence, the representation it extracts does not adapt to the specific characteristics of the target domain. To mitigate this issue, we perform a self-supervised initialization step, where we pre-train our model on both real and synthetic datasets, without using traversability annotations, in order to learn features applicable to both domains before the supervised training phase.

The self-supervision approach we employ in this work is SwAV (Caron et al. 2020). SwAV is a clustering-based method with the objective of learning a representation of input data such that it is possible to predict the cluster assignment of one view of an image from the representation of another view, thus enforcing consistency between features extracted from variants of the same image.

Formally, given two image representations \(\textbf{z}_s\) and \(\textbf{z}_t\), computed from different views of the same image, a model is trained to compute their “codes” (i.e., cluster assignments) \(\textbf{q}_s\) and \(\textbf{q}_t\) by matching them to a set of learnable cluster prototypes \(\left\{ \textbf{c}_1,..., \textbf{c}_K \right\} \).

The self-supervision objective is a “swapped” cluster assignment prediction, where the model is trained to predict \(\textbf{q}_s\) from \(\textbf{z}_t\) and \(\textbf{q}_t\) from \(\textbf{z}_s\), by minimizing the cross-entropy between the target code and the probability distribution obtained by projecting features on the set of prototype vectors:

$$\begin{aligned} \mathcal {L}\left( \textbf{z}_t , \textbf{z}_s \right) = -\sum _k \textbf{q}_s^{(k)} \log \textbf{p}_t^{(k)} -\sum _k \textbf{q}_t^{(k)} \log \textbf{p}_s^{(k)}, \end{aligned}$$
(2)

where \(\textbf{p}_t^{(k)}\) (and, similarly, \(\textbf{p}_s^{(k)}\)) is computed as:

$$\begin{aligned} \textbf{p}_t^{(k)} = \frac{\exp {\frac{1}{\tau }\textbf{z}_t^\top \textbf{c}_k}}{\sum _{k'}\exp {\frac{1}{\tau }\textbf{z}_t^\top \textbf{c}_{k'}}}, \end{aligned}$$
(3)

with \(\tau \) being a temperature parameter to “flatten” the softmax distribution. Intuitively, \(\textbf{p}_t^{(k)}\) estimates the probability that feature \(\textbf{z}_t\) is associated to cluster k: the objective of training is to ensure that such a probability is maximal for the cluster corresponding to the code of the other view of the image, \(\textbf{q}_s\). As a result, the model learns to project both views to similar representations, hence learning to extract reusable features that take into account visual context and semantics.
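For reference, the swapped-prediction objective of Eqs. (2) and (3) can be written in a few lines of PyTorch; the sketch below assumes the codes \(\textbf{q}_s, \textbf{q}_t\) are already available (SwAV computes them with a Sinkhorn-based assignment step, omitted here), that features and prototypes are L2-normalized, and that the temperature value is only illustrative.

```python
import torch
import torch.nn.functional as F

def swapped_prediction_loss(z_s, z_t, q_s, q_t, prototypes, tau=0.1):
    """SwAV swapped-assignment loss (cf. Eqs. 2-3).

    z_s, z_t: (B, D) L2-normalized features of two views of the same images.
    q_s, q_t: (B, K) cluster codes of the two views (assumed precomputed).
    prototypes: (K, D) learnable cluster prototype vectors c_1..c_K.
    """
    p_s = F.softmax(z_s @ prototypes.T / tau, dim=1)  # Eq. 3 for view s
    p_t = F.softmax(z_t @ prototypes.T / tau, dim=1)  # Eq. 3 for view t
    # Each view's code is predicted from the *other* view's features (Eq. 2)
    loss = -(q_s * p_t.log()).sum(dim=1) - (q_t * p_s.log()).sum(dim=1)
    return loss.mean()
```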

In this work, we introduce a variant of SwAV, which differs from the original formulation in two main aspects.

First, SwAV is specifically designed for image classification and assumes that all patches within an image share the same cluster assignment, since they all refer to a single depicted object. In our case, this hypothesis is counter-productive, as different patches within the same image may exhibit significantly different traversability properties. We therefore replace the original “same image, same cluster” hypothesis with an assumption of local feature similarity, motivated by the observation that horizontally contiguous patches are likely to exhibit similar visual characteristics (see Fig. 2). By focusing on local continuity, our approach encourages the extraction of features that are coherent within smaller spatial regions, thus enhancing the network’s sensitivity to subtle variations in terrain. This assumption is not absolute: neighboring patches occasionally differ in traversability, but treating local similarity as a general rule rather than a strict law still provides an effective heuristic for learning discriminative features in complex outdoor environments, without being overly constrained by the occasional exceptions.

As a second difference from the original formulation, we apply SwAV’s clustering-based procedure to backbone features rather than image patches. This is motivated by the observation that the pre-trained DeepLabV2 is already able to correctly classify pixels from both real and synthetic input images (see Fig. 3), thanks to the photorealism of our synthetic data. This finding also supports our decision to freeze the DeepLabV2 feature extractor. However, a visualization of DeepLabV2 features with t-SNE (Maaten 2014), in Fig. 4, shows that features extracted from synthetic and real images are markedly clustered in different regions of the projected space, emphasizing a distribution shift between the two sets of data: this may cause issues for the traversability estimation model, which is trained from scratch and might overfit the training distribution, negatively affecting its generalization to the other domain.

Given these premises, we apply SwAV using features extracted by the backbone as input and enforcing local similarity between features computed by the last convolutional layer of the traversability estimation model.

Formally, given a \(F\times H\times W\) feature map, where F denotes the number of features and H and W the spatial dimensions, at each iteration of self-supervised training we extract a small \(P \times Q\) region, where \(P < Q\) to make the shape of the region approximately horizontal. Then, we sample two randomly-sized patches within that region, and apply the SwAV cluster-assignment procedure to features from each patch.
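A possible sketch of this sampling step is given below; the region and patch sizes follow the values reported in Sect. 5.3 (patches between 2×2 and 4×4 drawn from a 4×6 area), while the helper names and the mean-pooling of each patch into a single feature vector are our simplifications.

```python
import random

def sample_local_patch_pair(feature_map, region_hw=(4, 6), patch_range=(2, 4)):
    """Sample two patches from one roughly horizontal region of an F x H x W map.

    feature_map: torch tensor of backbone/head features.
    The two patches play the role of the two 'views' in the SwAV objective,
    so that consistency is enforced only between spatially close features.
    """
    _, H, W = feature_map.shape
    ph, pw = region_hw  # P < Q: wider than tall, i.e. approximately horizontal
    top = random.randint(0, H - ph)
    left = random.randint(0, W - pw)
    region = feature_map[:, top:top + ph, left:left + pw]

    def random_patch():
        h = random.randint(patch_range[0], min(patch_range[1], ph))
        w = random.randint(patch_range[0], min(patch_range[1], pw))
        y = random.randint(0, ph - h)
        x = random.randint(0, pw - w)
        # Mean-pool the patch into one feature vector for the SwAV loss
        return region[:, y:y + h, x:x + w].mean(dim=(1, 2))

    return random_patch(), random_patch()
```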

Fig. 3

Segmentation outputs of the employed DeepLabV2 backbone on synthetic and real image samples, demonstrating the high realism of synthetic images and the accuracy of the segmentation model

3.6 Supervised training on source domain

The traversability estimation branch of the model is trained in a supervised way on synthetic data. Since we formulated our estimation problem as a regression task, we optimize model parameters by minimizing the mean square error (MSE) loss between the estimated traversability vector \(\tilde{\textbf{t}}\) and the correct \(\textbf{t}\) for a given image:

$$\begin{aligned} \mathcal {L} = \sum _{(\textbf{I}, \textbf{t})} \sum _{j=1}^k \left( t_{j} - \tilde{t}_j \right) ^2 , \end{aligned}$$
(4)

where each pair \((\textbf{I}, \textbf{t})\) consists of an input image and the corresponding target traversability values, and \(\tilde{\textbf{t}} = f(\textbf{I})\).

We additionally extend the objective function by including the safety-preserving loss term introduced in Palazzo et al. (2020). Let \(t_i\) and \(\tilde{t}_i\) be, respectively, the ground-truth traversability value for a certain image subregion and the value predicted by the model: the sign of \(\left( t_i - \tilde{t}_i \right) \) is significant from a safety perspective. If \(t_i < \tilde{t}_i\), the model has estimated a certain path to be more traversable than it actually is, which is behavior we want to discourage. Indeed, it is preferable to sacrifice accuracy and provide more conservative predictions than to make overly optimistic decisions that may lead to collisions or overturning.

The total loss, including \(L_2\) regularization, thus becomes:

$$\begin{aligned} \mathcal {L}_s = \sum _{(\textbf{I}, \textbf{t})} \sum _{j=1}^k \left[ \left( \tilde{t}_j - t_{j} \right) ^2 + \alpha \max \left( 0, \tilde{t}_j - t_j \right) ^2 \right] + \rho \left\Vert \varvec{\theta } \right\Vert _2^2 \end{aligned}$$
(5)

where \(\alpha \) weighs the importance between prediction accuracy and conservativeness, while \(\rho \) controls the strength of regularization on model parameters \(\varvec{\theta }\).
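A sketch of the resulting supervised objective, with the \(\alpha = 1.5\) value reported in Sect. 5.3 and the \(L_2\) term delegated to the optimizer’s weight decay (a common PyTorch convention), is shown below; averaging over the batch instead of summing is our choice.

```python
import torch

def traversability_loss(t_pred, t_true, alpha=1.5):
    """MSE plus the safety-preserving term of Eq. (5), without the L2 part.

    t_pred, t_true: (B, k) predicted and target traversability scores.
    Over-optimistic predictions (t_pred > t_true) are penalized twice:
    by the squared error and by the additional hinge-like safety term.
    """
    sq_err = (t_pred - t_true) ** 2
    safety = torch.clamp(t_pred - t_true, min=0.0) ** 2
    return (sq_err + alpha * safety).sum(dim=1).mean()
```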

Fig. 4

Distributions of DeepLabV2 features on real and synthetic images, projected through t-SNE. The significant distribution shift justifies the employment of a self-supervision approach on the features themselves, rather than on input images

3.7 Unsupervised domain adaptation

Self-supervision on synthetic and real data helps to learn initial features for both annotated and unannotated data; however, this does not guarantee that feature activations for inputs from the two domains correspond. Hence, a model trained on synthetic images may not generalize to real images due to the persisting distribution shift.

For this reason, it is necessary to push the feature distributions together, so that the model behaves in a similar way regardless of the domain of the input images. As in Palazzo et al. (2020), we employ gradient reversal layers (GRL) (Ganin and Lempitsky 2015) to accomplish this.

Thus, alongside the traversability estimation branch, trained in a supervised way on the synthetic source domain, we introduce a separate domain classification branch. As shown in Fig. 1, this classifier receives intermediate features computed from both the source (synthetic) and target (real) domains and aims at discriminating whether an input image comes from one or the other. The key idea of the approach consists in pushing the model to learn intermediate features that prevent the domain classification branch from succeeding at its task. Intuitively, if the source and target feature distributions cannot be distinguished, the traversability estimation branch should work equally well on either domain. This mechanism is implemented by introducing a gradient reversal layer which changes the signs of gradients (appropriately scaled by a \(\lambda \) hyperparameter) of the domain classification loss. As a result, during training the domain classifier attempts to correctly distinguish the two domains, while features extracted before the GRL are driven to make such classification fail, thus becoming domain-agnostic.

More formally, we can define the training set for the domain classifier as:

$$\begin{aligned} \mathcal {D}_d = \left\{ \left( \textbf{I}, l_s \right) \right\} _{\left( \textbf{I}, \textbf{t} \right) \in \mathcal {D}_s} \cup \left\{ \left( \textbf{I}, l_t \right) \right\} _{\textbf{I} \in \mathcal {D}_t} , \end{aligned}$$
(6)

where \(l_s\) and \(l_t\) are employed as domain labels — in practice, they are assigned the values 0 and 1.

Let \(h(\textbf{I})\) be the intermediate features extracted for input image \(\textbf{I}\), and \(g(h(\textbf{I}))\) the output of the domain classifier, which can be interpreted as the likelihood of the input belonging to one of the two domains. The domain classifier is trained with standard binary cross-entropy loss:

$$\begin{aligned} \mathcal {L}_d = - \sum _{\left( \textbf{I}, l \right) \in \mathcal {D}_d} \Big [ l \log ( g(h(\textbf{I}))) + ( 1-l ) \log ( 1 - g(h(\textbf{I})) ) \Big ] , \end{aligned}$$
(7)

with \(l \in \{0,1\}\) being the domain label associated to input \(\textbf{I}\).

In our model, intermediate features are extracted at the output of the last residual layer (see Table 1) and are spatially reduced from \(1024\times 8\times 8\) to \(1024\times 2\times 2\) through adaptive max pooling. GRL is inserted at this point, and is followed by the domain classifier, i.e., a multi-layer perceptron with ReLU activations and hidden layers of sizes 1024 and 256. The output of the domain classifier is a scalar value, constrained between 0 and 1 by a sigmoid activation.
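The gradient reversal mechanism and the domain classifier described above can be sketched as follows; the \(\lambda = 0.1\) scaling, the adaptive pooling to \(2\times 2\) and the 1024/256 hidden sizes come from the text, while the rest is a standard GRL implementation and should not be read as the authors’ exact code.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class DomainClassifier(nn.Module):
    def __init__(self, in_features=1024 * 2 * 2, lamb=0.1):
        super().__init__()
        self.lamb = lamb
        self.pool = nn.AdaptiveMaxPool2d((2, 2))  # 1024 x 8 x 8 -> 1024 x 2 x 2
        self.mlp = nn.Sequential(
            nn.Linear(in_features, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 1), nn.Sigmoid(),  # likelihood of one domain
        )

    def forward(self, feats):  # feats: (B, 1024, 8, 8) intermediate features
        x = self.pool(feats).flatten(1)
        x = GradReverse.apply(x, self.lamb)  # reversed gradients reach the features
        return self.mlp(x).squeeze(1)
```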

4 Navigation control

In order to properly assess the effectiveness of the proposed traversability prediction approach in a navigation setting, we hereby propose a strategy to translate the predicted traversability into control commands to the vehicle.

Intuitively, the most straightforward way to identify vehicle direction is to select the band with the highest predicted traversability score. However, if the selected band is next to low-score ones, the vehicle may bump into a close obstacle or move in between two poorly traversable areas. To address this limitation, we design a simple, yet effective, selection strategy, reported in Algorithm 1: given the vector \(\tilde{\textbf{t}} = \left\{ t_1, \dots , t_k\right\} \) of predicted traversability scores, the vehicle is directed towards band \(i_{opt}\) such that \(t_{i_{opt}}\) is the highest score which satisfies the following boolean expression:

$$\begin{aligned} t_{i_{opt}} - t_{i_{opt}-1}< \delta \wedge t_{i_{opt}} - t_{i_{opt}+1} < \delta , \end{aligned}$$
(8)

where \(\delta \) is a configurable parameter. This rule ensures that the chosen direction is traversable not only in the selected band, but also in the adjacent ones (whose scores may be lower by at most \(\delta \)), thus leaving room for trajectory adjustments. In our experiments, setting \(\delta =0.2\) provided a fair trade-off between an overly optimistic and an overly conservative band selection.

Algorithm 1

Band selection algorithm
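A minimal sketch of the band selection rule follows, using the \(\delta =0.2\) and \(t_{crit}=0.15\) values reported in this section; band indices are 0-based here, boundary bands are excluded, and returning None stands for triggering the recovery mode.

```python
def select_band(scores, delta=0.2, t_crit=0.15):
    """Return (i_opt, t_opt) according to Eq. (8), or None for recovery mode.

    scores: list of k predicted traversability scores (k odd, 0-based indices).
    A band qualifies if its neighbors are lower by at most delta; the
    leftmost and rightmost bands are never selected.
    """
    k = len(scores)
    candidates = [
        i for i in range(1, k - 1)              # exclude boundary bands
        if scores[i] - scores[i - 1] < delta
        and scores[i] - scores[i + 1] < delta
    ]
    if not candidates:
        return None                             # recovery: rotate in place
    i_opt = max(candidates, key=lambda i: scores[i])
    if scores[i_opt] < t_crit:
        return None                             # recovery: rotate in place
    return i_opt, scores[i_opt]
```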

Inspired by Loquercio et al. (2018), the band position \(i_{opt}\) within the image frame provides the direction of the vehicle, whereas the traversability score \(t_{opt}\) modulates the linear forward velocity: the lower the traversability score, the slower the vehicle has to move, and vice versa. Note that the boundary bands (i.e., the leftmost and the rightmost ones) are purposefully excluded in the band selection as they would lead to an inherently unsafe choice, since we lack traversability estimates outside of the camera field of view.

If it is not possible to select a proper band or if the identified traversability score \(t_{opt}\) is lower than a critical score \(t_{crit}=0.15\), we enable a recovery mode: the vehicle starts to slowly rotate in place at 0.1 rad/s in order to find alternative viable paths.

According to the above design principles, the linear velocity v and the angular velocity \(\omega \) of the vehicle are computed through Eqs. 9 and 10.

$$\begin{aligned} v&= \alpha (t_{opt} - t_{crit}) \end{aligned}$$
(9)
$$\begin{aligned} \omega&= \beta (\lceil k/2 \rceil - i_{opt}) \end{aligned}$$
(10)

The linear velocity is proportional to the difference between the critical score and the chosen band score. The \(\alpha \) coefficient is set according to the maximum velocity of the robot. The angular velocity is, instead, proportional to the position of the band with respect to the middle band. The \(\beta \) term is set according to the maximum angular velocity of the platform. In our tests we set \(\alpha \) and \(\beta \), respectively, to 0.6 and 0.1, thus obtaining \(v\in [0, 0.51]\) m/s and \(\omega \in \{\pm 0.3,\pm 0.2,\pm 0.1,0\}\) rad/s (since \(k=9\) in our case).
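With the selected band, the control law of Eqs. (9) and (10) reduces to two multiplications; in the sketch below the band index is 0-based (the central band of \(k=9\) has index 4, corresponding to band \(\lceil k/2 \rceil = 5\) in the 1-based notation of Eq. 10), and the parameter values are those reported above.

```python
def control_command(i_opt, t_opt, k=9, alpha=0.6, beta=0.1, t_crit=0.15):
    """Linear and angular velocity from the selected band (cf. Eqs. 9-10)."""
    v = alpha * (t_opt - t_crit)              # slower on less traversable terrain
    omega = beta * ((k - 1) // 2 - i_opt)     # steer towards the selected band
    return v, omega

# Example: central band (index 4 for k = 9) with a high score drives straight
print(control_command(i_opt=4, t_opt=0.9))    # -> (0.45, 0.0)
```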

5 Traversability results

In this section, we first introduce the datasets employed in our work: the simulated environment in which traversability annotations are automatically generated, and the real dataset introduced in Palazzo et al. (2020) that we employ for unsupervised domain adaptation.

Then, we evaluate the accuracy of our traversability prediction approach on two different training setups. We first assess model performance with standard supervised training on annotated real images. This analysis allows us to establish an upper bound on the expected accuracy of the proposed approach. Then, we evaluate the quality of traversability predictions when training our model in a supervised way on synthetic images and unsupervisedly on real images. A thorough experimental protocol is followed to evaluate the impact of each component of the proposed architecture.

5.1 Real dataset acquisition

We hereby introduce the acquisition protocol of the real dataset employed in our experiments. Details can be found in Palazzo et al. (2020).

Video sequences are recorded by teleoperating an unmanned rubber-tracked ground robot employed for navigation in rough outdoor environments. The robot is equipped with a ZED stereo camera by Stereolabs acquiring \(1280\times 720\) RGB images of the terrain in front of the vehicle at 15 fps. Only images from the right stream of the stereo system are employed. The original data acquisition provides video sequences for on-road and off-road (terrain) scenarios. In this work, we focus on the off-road scenario, containing 419 selected images, since our simulated environment targets non-urban scenes.

Data annotation on this dataset was performed by a human operator. However, the approach presented in this paper does not employ manual annotations for training: we only use them to carry out performance analysis and to run supervised experiments on real images.

Fig. 5

Left: An overview of the map used to collect the synthetic dataset. The map features a grass-covered meadow, some rocky areas and some denser forest regions. It also includes a small lake and river. Right: Some sample assets used to populate the synthetic scene. The top three rows contain medium to large assets used as obstacles, while the last row contains ground assets (grass and leaves) and a terrain texture sample

5.2 Synthetic data acquisition

The synthetic dataset is generated using the MIDGARD simulator (Vecchio et al. 2022), which produces photorealistic images and provides the tools for automatic data annotation. The dataset was collected in a custom version of MIDGARD’s native meadow scene, consisting of a large map (\(3,600~\hbox {m}^{2}\)) replicating the features of the real-world dataset. The synthetic dataset was collected in several hand-guided navigation sessions: in each session we introduced some variability in terrain deformation and vegetation distribution. An overview of the entire map and sample elements (obstacles and terrain types) included in the scene are provided in Fig. 5.

We acquire a total of 2271 RGB frames, each with engine-generated traversability annotations, by simulating rays propagated from the robot and detecting collisions with objects in the scene, as exemplified in Figs. 6 and 7.

In detail, the annotation process is carried out in three stages:

  1. A set of non-traversable object types is defined (e.g., rocks, trees, branches) and automatically marked with a NonTraversable flag, which can be imagined as an invisible overlay over the selected scene elements.

  2. A set of rays is projected from the robot perspective: if a ray impacts an object/surface marked as NonTraversable, the corresponding portion of the camera view is annotated accordingly.

  3. If a ray impacts a ground patch with an average slope angle greater than a threshold, it is also annotated as non-traversable, regardless of the NonTraversable flag. The threshold is based on the robot’s climbing capabilities: in our experiments, we set it to 25°.

To simplify the acquisition and annotation process, we render \(512\times 512\) frames with a camera field of view of \(90^{\circ }\), then both the frames and the annotations are cropped to 16:9 aspect ratio. To cover the entire \(90^{\circ }\times 90^{\circ }\) field of view of the captured frames, we split the view into 9 sectors, each covering a range of \(10^{\circ }\), both horizontally and vertically, and cast 3 trace-lines for each sector, resulting in an approximation gap between traces of \(3.33^{\circ }\), for a total of 729 traces. A sector is considered traversable if all the traces in that sector intersect a traversable surface; otherwise, that sector is considered non-traversable. The 27 vertical traces cast for each horizontal sector are used to detect the maximum traversable horizon height (Fig. 8).
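As a rough reconstruction of the aggregation step (not the MIDGARD code), the sketch below turns a grid of per-trace hit results into k per-band scores; the boolean grid layout and the normalization of the horizon height are our assumptions.

```python
import numpy as np

def band_scores_from_traces(traversable, k=9, traces_per_band=3):
    """Aggregate a (rows x cols) boolean trace grid into k band scores.

    traversable[r, c] is True if the trace in row r (0 = bottom of the view)
    and column c hit a traversable surface. For each band (a group of adjacent
    trace columns), the score is the normalized height of the contiguous
    traversable region starting from the bottom, i.e. the traversable horizon.
    """
    rows, _ = traversable.shape
    scores = np.zeros(k)
    for band in range(k):
        cols = traversable[:, band * traces_per_band:(band + 1) * traces_per_band]
        row_ok = cols.all(axis=1)     # a row counts only if all its traces pass
        horizon = 0
        while horizon < rows and row_ok[horizon]:
            horizon += 1
        scores[band] = horizon / rows
    return scores
```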

In order to prevent the presence of near-duplicate frames, we ensured that any two consecutive frames in the dataset exhibited a minimum linear or angular displacement, following the same criteria as in Palazzo et al. (2020). Overall, it took about 1.5 h of hand-guided navigation in the simulator to collect the synthetic dataset of 2271 computer-annotated frames.

Fig. 6

A simplified aerial-view graphics of the line tracing to detect non-traversable regions in the simulated frame

Fig. 7

Visualization of trace-rays cast from the robot perspective for automatic annotation of synthetic images: aerial view of the robot agent (left) and robot camera perspective (right)

5.3 Model training and evaluation procedure

Input images are pre-processed by resizing the shortest side to 128 pixels (keeping aspect ratio) and standardizing each color channel. After feature extraction by the DeepLabV2 backbone, the resulting feature maps have a shortest spatial dimension of 17: input images from the real dataset produce feature maps of size \(17\times 29\), while feature maps from synthetic images are \(17\times 17\). When performing self-supervision, we modify SwAV to predict the same cluster assignments for feature patches of spatial size between \(2\times 2\) and \(4\times 4\) (with random aspect ratio) extracted from a \(4\times 6\) area.

Fig. 8

Examples of the automatic annotation process on the synthetic dataset

The training procedure for self-supervision employs a mini-batch SGD optimizer (batch size: 16), with cosine learning rate annealing from 0.6 to 0.0006, for 400 training epochs.

As for traversability estimation, we train our model with the Adam optimizer for 5000 epochs, using a linear learning rate schedule from \(10^{-4}\) to \(10^{-8}\). Following Palazzo et al. (2020), we set the \(\alpha \) hyperparameter for the safety-preserving loss term to 1.5. Weight decay factor \(\rho \) is \(5\cdot 10^{-4}\), and the \(\lambda \) hyperparameter that scales gradient reversal for domain adaptation is set to 0.1. This training setting is applied to both real and synthetic datasets.
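For reference, the optimization settings above translate to roughly the following PyTorch setup; the model below is a placeholder, and the scheduler classes are our choice of how to realize the described cosine and linear schedules.

```python
import torch

model = torch.nn.Linear(4, 9)  # placeholder for the actual networks

# Self-supervised pre-training: mini-batch SGD (batch size 16) with cosine
# learning rate annealing from 0.6 to 0.0006 over 400 epochs
ssl_opt = torch.optim.SGD(model.parameters(), lr=0.6)
ssl_sched = torch.optim.lr_scheduler.CosineAnnealingLR(
    ssl_opt, T_max=400, eta_min=0.0006)

# Traversability training: Adam for 5000 epochs, linear LR decay from
# 1e-4 to 1e-8, weight decay rho = 5e-4
opt = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
sched = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=1.0, end_factor=1e-8 / 1e-4, total_iters=5000)
```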

Since the real dataset consists of a sequential stream of frames captured while operating the robot, we define the training and test splits by using the first 80% of the sequence for training and the rest for test, thus minimizing the risk of near-duplicates in the two sets. For consistency, we also split the synthetic dataset in the same way.

The model accuracy is computed on the real-dataset test set, in terms of mean absolute error (MAE), estimated by averaging the absolute errors of traversability scores predicted in each image, and then averaging the results over all test images. In order to account for random model initialization and to better assess differences in performance, we report the mean and standard deviation of MAE over 10 runs of the training and evaluation protocol.

Our model is implemented using the PyTorch library. We use an NVIDIA Titan X (Pascal) GPU for training, and an on-board NVIDIA Jetson TX1 (Maxwell) GPU for inference on the robot, which is able to process 15 frames per second.

Fig. 9

A comparison between the output of the methods in Table 3 and the ground-truth for real images. From left to right we have: (1) input, (2) ground truth, (3) base (training on synthetic only), (4) domain adaptation on unlabeled real data, (5) self-supervised pre-training, (6) self-supervision + domain adaptation

5.4 Performance analysis in the supervised setting

We hereby present the results achieved by our model (and variants thereof) when it is trained in a supervised way on real images. This setup represents an ideal case, with manual annotations available on a dataset created from real acquisitions. In the next section, we will show how our proposed approach, with unsupervised domain adaptation on real images, compares to these results.

Table 2 shows the average MAE for several supervised training configurations of our approach, namely, when synthetic data are used alongside the real ones (Synth.) and when self-supervised initialization, as described in Sect. 3.5, is performed (Self-sup.). As a reference for comparison, we report the results from the traversability estimation model in Palazzo et al. (2020). For a thorough comparison, since in Palazzo et al. (2020) the feature extraction backbone is also fine-tuned during training, we first carry out a similar experiment by disabling parameter freezing. It is worth noting that, when not freezing the backbone, our approach obtains results comparable with Palazzo et al. (2020); a slight difference in accuracy is due to architectural changes and random model initialization.

Table 2 Performance evaluation, in terms of average MAE on the traversability scores, in the supervised setup with manual annotations on real images

The results obtained when introducing the proposed training enhancements show that our model is able to outperform the baseline by a statistically significant margin. In this setup, with the availability of real-image ground-truth annotations during training, the impact of employing synthetic images and of performing self-supervision is limited. This can be expected, since the error signal provided by real images is probably the most important factor that drives learning. We can also notice that backbone freezing has a positive impact on results, reducing model complexity and leading our approach to better generalize on test data. Therefore, we enable backbone freezing in all the following experiments on unsupervised domain adaptation.

5.5 Performance analysis in the unsupervised domain-adaptation setting

In this setting, we evaluate the accuracy of our model when no annotations on the real dataset are available. Table 3 shows the average MAE when the model is supervisedly trained on synthetic images only and demonstrates how performance varies when we gradually integrate self-supervision (“Self-sup.”) and unsupervised domain adaptation (“Adapt.”). Note that when no self-supervision or domain adaptation is employed (as in the case of Palazzo et al. (2020), which we include in the comparison as a baseline), the model is simply trained on synthetic images and tested on real images.

Table 3 Performance evaluation, in terms of average MAE on the traversability scores, in the domain adaptation setup, where no manual annotations on real images are available

Results show that the model successfully learns traversability features from synthetic images and, unsupervisedly, from real images. It is interesting to note that results without any form of adaptation (i.e., when only synthetic images are used) are already relatively accurate, demonstrating the realism of the proposed simulation framework. Moreover, self-supervision and unsupervised domain adaptation each independently improve the accuracy of the model. The full variant of our approach further reduces the traversability estimation error, achieving an average MAE of 0.104, which is very close to the value of 0.089 obtained when real images are supervisedly used at training time.

The outputs of different setups of our model applied to three sample images are compared in Fig. 9, qualitatively showing the improved traversability prediction of our final model. When training on synthetic data and directly testing on real data (“Base” column), model predictions exhibit large estimation errors; the integration of domain adaptation alone does improve results on simple terrain (e.g., second row). Self-supervision has a visibly positive impact on the results, though it appears to cause overly optimistic predictions in some regions. Integrating self-supervision and domain adaptation further improves predictions, keeping them close to ground-truth annotations but making them more conservative and safe.

6 Navigation results

In this section we evaluate the performance of the proposed traversability-driven navigation method in both on-simulation and on-field settings.

6.1 On-simulation navigation results

Simulated experiments are designed to evaluate the traversability-driven navigation approach in several challenging scenarios and to compare it to the state of the art. The experiments are carried out in the MIDGARD simulation environment for total control over scene features and geometry.

We first qualitatively evaluate our navigation approach in three challenging scenarios where traversability estimation may lead to erroneous navigation outcomes:

  • Steep slope occluding horizon. A steep upslope or downslope may limit the estimated traversability value due to the sky covering a large part of the view, causing the agent to stop. Figure 10 presents an extreme example where the horizon is fully occluded. We report the predicted traversability maps in three key points: (1) upslope, before reaching the top of the hill; (2) at the top of the hill; (3) downslope, while going down the hill. In all of these cases, the estimated traversability never falls below the critical threshold of \(t_{crit} = 0.15\), at which the control algorithm stops the agent. Note that this behavior also depends on the camera setup, which in our experiments is tilted downward with an angle of \(5^{\circ }\), replicating the setup of the real-world robot.

  • Mid-air obstacles, such as horizontal branches in front of the vehicle camera, may limit visibility and, consequently, the traversability estimation. The example in Fig. 11 shows how the proposed approach suitably lowers the traversability scores as the agent gets closer to the overhead branch up to a point where they reach the critical threshold and the vehicle stops. Suitable camera positioning, at the front top of the robot, is necessary to have occlusions appearing in the middle/lower portion of the frame, in order to be suitably marked as non-traversable.

  • Narrow path. Narrow paths represent another critical navigation scenario, where small mistakes may cause serious damage to the robot. To validate our navigation algorithm in such a setting, we crafted a scene consisting of a small traversable path between two rocky walls, as shown in Fig. 12. In this test, as exemplified in the sequence of frames captured from the virtual camera, the robot is able to effectively travel the path, without exhibiting any oscillatory behavior that might divert it from the optimal course.

Fig. 10

Traversability prediction in steep slope setting. The top image shows a side-view of the hill with the acquisition locations marked, shown below in order with the corresponding traversability predictions. In all frames, the traversability range is always above the \(t_{crit}=0.15\) threshold

Fig. 11

Examples of traversability estimated when facing a horizontal obstacle in mid-air. Each row of images shows a side view of the scene, the agent’s camera view and predicted traversability

Fig. 12

Top: aerial view of the scene in our narrow path navigation experiment. Bottom: two frames, with corresponding traversability prediction, at two different points of the path

We then perform a quantitative assessment of navigation performance, in terms of average traveled distance and time, and compare it to the state of the art. More specifically, we compare our approach with an end-to-end reinforcement learning navigation method (Xie et al. 2017), which employs a D3QN agent to perform obstacle avoidance from RGB and depth inputs. For a fair comparison to our method (that relies only on RGB), we leave out intermediate depth estimation from Xie et al. (2017) and work on RGB only. The reward function used to train the agent is the same as described in Xie et al. (2017). At each step, the agent receives a reward computed as \(r = vT \cos {\omega }\), where v and \(\omega \) are the local linear and angular velocity, and T is the time between each step. Additionally, the agent receives a negative reward of -10 when a collision is detected.

In addition, we also test two variants of our navigation approach. Our first variant is designed to investigate the possible presence of a correlation bias between distance and position of elements in the 2D image, which might tend to mark areas in the bottom part of the camera view as traversable, possibly erroneously. To assess the impact of this bias, we apply a depth-based post-prediction criterion over the maximum estimated traversability: points in the scene which are further than a certain depth threshold are marked as non-traversable, regardless of the predicted traversability score (we refer to this variant as depth-constrained). Under this constraint, overconfident predictions about far elements in the scene are prevented. In our experiments, we set the depth threshold to 15 m, which is within the range supported by the real ZED camera.

Our second variant is designed to assess the effect of the proposed navigation approach, compared to a simpler alternative which applies a manually-set convolutional kernel to identify the ideal band. In detail, each traversability score \(s_i\) is updated as \(s_i \leftarrow s_i - \text {abs}(c) \), where \(c\) is obtained by applying a \([-0.5, 1, -0.5]\) convolutional kernel at vector location \(i\). Intuitively, the effect of this operation is to prefer locations that a) have a large traversability score to start with, and b) have similar scores in the neighbor bands. The band with the largest updated score is then selected as the direction towards which the robot should move. Linear and angular velocities are set as per Eq. 9 (with \(t_{crit} = 0\)) and Eq. 10.
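The “naïve” kernel-based rule of this second variant can be sketched in a few lines; NumPy and the function name are ours, while the kernel and the update follow the text.

```python
import numpy as np

def naive_band_selection(scores):
    """Greedy variant: penalize bands whose neighbors differ, then take the max."""
    kernel = np.array([-0.5, 1.0, -0.5])
    # 'same'-mode convolution yields one value c_i per band (zero padding at borders)
    c = np.convolve(scores, kernel, mode="same")
    updated = np.asarray(scores) - np.abs(c)
    return int(np.argmax(updated))
```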

Results, in terms of average traveled distance and navigation time before a collision over 200 navigation episodes, are given in Table 4: our method significantly outperforms the end-to-end learning approach in Xie et al. (2017) in both metrics. Interestingly, including the maximum depth constraint seems to yield worse performance: this may be due to the reduced traversable horizon forced by the distance threshold, which limits navigation options. As for our experiment with the “naïve” navigation rule, results show that this approach, as can be expected, yields low navigation performance. From a manual inspection of the robot’s behavior in this setting, choosing the target direction “greedily” often leads the robot to either be surrounded by non-traversable regions or oscillate between nearby bands, due to the lack of a criterion for ensuring consistency of traversability scores across nearby bands. It is interesting to note that, quantitatively, this causes the robot to navigate for shorter distances before running into obstacles, while counter-intuitively increasing the duration of the simulation, since frequently switching directions leads to spending more time performing in-place rotations than linear displacements.

Our findings thus suggest that accurately assessing the height of the traversable region in 2D projected images adequately supports complex navigation tasks, while being more interpretable than end-to-end navigation approaches.

Table 4 Performance evaluation of navigation methods, in terms of average navigated distance and average navigation duration, before a collision is detected

6.2 On-field navigation results

On-field experiments have been carried out in a previously unseen real-world scenario to evaluate the performance of the traversability prediction model, especially in terms of sim-to-real transfer. The test environment features small to large rocks, tree branches and trunks, and high vegetation, thus being far more challenging than the real-world dataset used in training. The tracked vehicle and hardware setup introduced in Sect. 5.1 and adopted for the dataset acquisition (Palazzo et al. 2020) are also used for the on-field test campaign.

The proposed approach allows the vehicle to smoothly navigate without experiencing any oscillatory behavior in the angular velocity, despite its discretization, thanks to the stable outcome of the band selection algorithm over time. More in detail, we observe that the selected band position is stable or smoothly transitioning to adjacent bands in consecutive video frames during navigation. A video showing some sessions of autonomous navigation during the on-field tests performed in the considered environment can be found in the supplementary material.

On-field experiments highlight the model’s ability to properly identify traversable areas even in scenes differing from the training dataset, thus demonstrating the generalization capability and robustness of the model. Some examples are reported in Fig. 13, showing convincing predictions of the traversable horizon for the observed scenarios.

Performance decreases when dealing with very close vegetation, resulting in occasional misclassification of non-traversable obstacles such as rocks, thus posing a potential risk to the vehicle. Some failure cases are reported in Fig. 14. We argue that this behavior is caused by the limited number of similar examples in both real and synthetic datasets: as a consequence, obstacles occluded by vegetation can be mistakenly perceived as traversable. Comparing the results in Fig. 13 to those in Fig. 14, it can be observed how the model is still able to recognize the traversable path as long as it has enough contextual awareness and the visual appearance of obstacles is not overly occluded by vegetation.

Fig. 13

Successful traversability inference examples from the on-field tests: real images (left), related non-traversable area prediction in blue (right)

Fig. 14

Failure examples from the on-field tests: real images (left), related non-traversable area prediction in blue (right)

We have also compared our method with the reinforcement learning approach in Xie et al. (2017). While the latter provides appreciable results, thanks to retraining performed in simulation, it also occasionally resulted in collisions with rocks or walls. These navigation mistakes can be ascribed to the inability of the model to deal with changes in lighting conditions or to a potential delay in the agent's decision making, which results in trajectories colliding with obstacles. A video showing sample scenarios where Xie et al. (2017) fails, whereas our approach succeeds, can be found in the supplementary material. However, it is worth recalling that it is hard to properly identify potential sources of failure, since Xie et al. (2017) does not produce a predicted traversability outcome, but only the navigation control action. Thus, in the provided video we show the on-board camera view during navigation up to the moment before the robot bumps into an obstacle.

7 Conclusions

In this paper, we introduced a novel method for traversability prediction based on self-supervision and domain adaptation. To enhance the capabilities of the model, we created a large, fully annotated, synthetic dataset in a simulated environment, relieving users from the burden of manually labeling a real dataset. The proposed approach outperforms previous methods for traversability estimation (Palazzo et al. 2020), while relaxing the need for supervision on real data.

On-field tests, performed in an unseen outdoor and unstructured scenario, confirm the effectiveness of the proposed method to accurately estimate traversability, thus enabling the vehicle to autonomously navigate in the target environment.

Future developments include enhancing model results when dealing with unseen situations, such as non-traversable obstacles (e.g., rocks) occluded by vegetation. A possible solution is to integrate traversability predictions with confidence scores from a segmentation model, such as DeepLabV2, in order to detect critical regions (where prediction confidence is expected to be low). Uncertain regions can then be treated as non-traversable when the prediction confidence falls below a threshold.

A related research direction concerns the limitations of the proposed approach for domain adaptation. Indeed, while numerical experiments and on-field tests demonstrated that our method effectively transfers knowledge from synthetic scenarios to real (and even unseen) environments, our preliminary results show that the model fails in the presence of extreme domain changes, e.g., when performing supervised training on a synthetic “volcanic” scene and domain adaptation on a grassy environment with previously-unseen obstacles such as trees.

In terms of the formulation of the training objective, we also intend to address the lack of distinction between traversability errors made in the lower part of the image (i.e., closer to the robot) and those made in the upper part. A possible solution would be to weigh the training samples based on the ground-truth traversability scores, giving more importance to lower ones; to this aim, we should first ensure a uniform distribution over the traversability values collected during the hand-guided navigation sessions.

Finally, we intend to investigate sensor fusion approaches to integrate RGB data with depth cameras and/or LIDAR scans, to further improve the accuracy of the model in the presence of ambiguous visual features, as well as to explore model architectures and learning methodologies for point cloud inputs—e.g., 3D convolutions (Huang and You 2016), graph neural networks (Shi and Rajkumar 2020), transformers (Guo et al. 2021). The proposed approach could be further extended by integrating it with learning-based path planners, such as an end-to-end reinforcement learning–based navigation agent.

8 Supplementary information

A video showing some sessions of autonomous navigation during the on-field tests performed in the considered environment can be found in the supplementary material.