1 Introduction

Given an RGB monocular image, monocular depth estimation (MDE) is the task of estimating the distance of each pixel of that image from the camera. We often refer to this distance as the depth of the pixel and to the output of an MDE model as a depth map. The depth prediction task has traditionally been solved by leveraging deep convolutional neural networks (CNNs), both in a supervised [1,2,3] and self-supervised [4, 5] fashion. Popular datasets for this task are NYU-v2 [6], containing mainly indoor images, and KITTI [7], Make3D [8], and Mid-Air [9], containing mainly outdoor images. Recently, large meta-datasets for self-supervised MDE learning have also been proposed by appropriately combining different datasets [10].

MDE finds application in a number of different use cases, ranging from autonomous vehicles and robotic systems to 3D architectural modeling and terrestrial surveys. Furthermore, it turns out to be of key importance to the flight of unmanned aerial vehicles (UAVs) [11,12,13], which often can only rely on a single front camera as the main sensing device [11]. Such cameras are used not only for collision avoidance, which is the primary task, but also for other outdoor tasks such as object detection, 3D reconstruction, and digital terrain model (DTM) generation [12]. As such, efficient depth estimation during drone flight has become essential in several industrial applications, including visual navigation in industrial platforms [14], aerial surveillance and safety [15], and aerial object detection and crowd counting [4]. Unfortunately, due to cost and design constraints, it is often impractical to equip UAVs with dedicated sensors able to provide depth maps directly. The installed cameras cannot sense distance, which is crucial for a large number of tasks, hence the need to estimate the depth of objects from a single monocular camera.

Recent advances in the field of image processing [16] have paved the way for new approaches to MDE that make use of the vision transformer (ViT) architecture instead of plain CNNs to estimate per-pixel depth. Differently from a CNN, a ViT model makes use of multi-head attention (MHA) layers as its core component. In particular, in a ViT model the original image is decomposed into a series of non-overlapping patches, and each block of the model alternates between in-patch operations and between-patch operations via the MHA mechanism, which provides a global receptive field at every layer of the network. Due to this and to their scalability to ever-larger datasets, ViTs are quickly becoming the state-of-the-art in computer vision when pre-trained on sufficiently diverse datasets [16], and they have also been successfully exploited for dense prediction tasks such as MDE.

A prominent example in this category is the dense prediction transformer (DPT) [17], an architecture which yields substantial improvements on dense prediction tasks. Just like other transformer-based architectures, it delivers its best results when a large amount of training data is available, and can be fine-tuned on smaller datasets after an initial training. DPT is of particular interest for monocular depth estimation due to its strong pre-training stage, executed on a dataset significantly larger than those used by the previous state-of-the-art [17], which allows it to obtain high zero-shot performance on smaller benchmarks. However, most of the results obtained so far [17] refer to classical indoor datasets or to outdoor datasets acquired with ground cameras (e.g., mounted on top of a moving car). In this paper, we instead evaluate whether ViT-based architectures for MDE can also benefit drone flight scenarios, by fine-tuning DPT on the Mid-Air dataset [9], a multi-purpose synthetic dataset for low-altitude drone flights, achieving state-of-the-art performance after fine-tuning.

In a further analysis, we assess for the first time whether such transformer-based models for dense prediction are susceptible to adversarial attacks, which are already known to affect CNNs. Several works [18,19,20] bring evidence that CNN-based models are vulnerable to adversarial examples, i.e., attacks that can significantly alter the output of the model. Adversarial attacks can be obtained in two different ways: either as small perturbations added to each input image, in some cases not even visible to the human eye [18]; or in the form of more visible patches of various sizes and shapes, which can be applied to several images at different locations, affecting the output of a model in a near-universal fashion [21]. An example of the latter case is found in [22], where the authors attack an image classifier by training a small patch which, once placed anywhere within the field of view of the classifier, causes it to output a targeted class. An attack akin to this one, but targeting a convolutional model which performs MDE, is described in [23]. In general, a number of analogous attacks, coming in different flavors, are described in the literature. However, the impact of such patch-based attacks on transformers has received little attention and remains largely unexplored.

1.1 Contributions of this work

We provide two major contributions in this paper. First, we validate that fine-tuning DPT models can provide state-of-the-art performance in depth estimation during drone flight, reaching or surpassing the accuracy of video-based models while starting from a single static frame. Second, we probe the hypothesis, rather widespread in the literature, that vision transformers are less vulnerable to adversarial attacks than CNN-based models. To this end, we conduct a series of experiments, devising patch attacks against DPT and CNNs for depth estimation. Our results suggest that such attacks are still possible against DPT, yet less effective than against CNNs. We show that adversarial patches, trained in the same fashion against DPT and CNNs for MDE, yield higher error rates on the latter. Together, these two results indicate that ViT-based models such as DPT are superior to convolutional architectures, both in terms of accuracy and robustness, when considering UAVs and general flight scenarios.

1.2 Organization of the paper

The rest of the paper is organized as follows. In Sect. 2, we provide a brief overview of related works, concerning adversarial attacks for dense targets, and neural architectures for MDE. Next, in Sect. 3, we describe the DPT architecture, the Mid-Air dataset, and the design of our patch adversarial attack. We provide extensive experimental results in Sect. 4, before concluding and highlighting possible future works in Sect. 5.

2 Related works

2.1 Adversarial attacks for structured targets

In this section we focus on adversarial attacks and defenses for structured computer vision targets, i.e., object detection, semantic segmentation, and depth estimation. For a broader overview of the field, we refer the reader to a number of comprehensive surveys, including [24,25,26].

Although several earlier works had already explored the possibility of fooling classifiers applied to spam detection [27,28,29], the renewed interest in adversarial attacks dates back to [30], where Szegedy et al. showed that CNNs for classification could be attacked by applying perturbations to input images and thus fooled into a misclassification. Since this weakness was discovered, a wide range of different attacks has been introduced in the literature, targeting deep networks designed not only for classification, but also for object detection [31,32,33,34] and semantic segmentation [35].

Adversarial attacks are usually performed by computing an additive perturbation to the input image, obtained by solving a problem of loss maximization [24]. The resulting perturbed images can fool deep learning models into making wrong predictions, often with high reliability, and in some cases can even be 'universal' [36], meaning they are able to fool a network on any image and can transfer to different architectures, irrespective of their composition or their weights.
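
As a concrete illustration of this loss-maximization view, the following minimal sketch (our own example, not taken from the cited works; `model`, `x`, and `y` are assumed to be a differentiable PyTorch classifier, a batch of images in [0, 1], and the corresponding labels) computes a one-step additive perturbation by gradient ascent on the loss:

import torch
import torch.nn.functional as F

def one_step_perturbation(model, x, y, epsilon=8 / 255):
    # Compute the loss on the clean input and take its gradient w.r.t. x.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Move each pixel in the direction that increases the loss (gradient ascent),
    # then clamp back to the valid image range.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()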

Perturbations can come in many forms, from simple noise added to the sample, to a single pixel [37] or a patch [20,21,22]. Whereas the former two are dangerous in that the perturbed image can look very similar to the original one (and changes might be imperceptible to the human eye), patch-based attacks represent a more critical threat because a trained patch can be printed and applied to real-world scenes, thus allowing physical attacks [18, 19].

Even though the robustness of ViTs and their variants against adversarial attacks is still far from being widely investigated, recent works [38] show that transformers are slightly less vulnerable to adversarial attacks than CNNs on classical classification benchmarks, although their robustness to universal attacks and to attacks for dense prediction tasks is still unexplored. Our results seem to substantiate this hypothesis, bringing evidence that CNNs are indeed more vulnerable to patch attacks than ViTs for dense prediction.

2.2 Monocular depth estimation

In this section, we introduce and briefly discuss the MDE task. For a deeper analysis of the subject we refer the reader to more comprehensive surveys like [39,40,41].

MDE is the task of estimating the distance of every pixel of an image from a monocular camera. Recovering a depth map, i.e., the depth information for each pixel, appears to be an easy task for humans, or when stereo images are available, since depth can be inferred from additional cues such as multiple viewpoints or the scale of known objects. In the case of monocular images, the task is far from trivial, as only the pixel intensities can be leveraged, and it has been tackled with deep learning methods, which have yielded more promising results than hand-crafted ones [42]. In particular, MDE has traditionally been solved using CNNs and, more recently, architectures based on the ViT family [17]. These models have been trained in both supervised and self-supervised fashion. Specifically, the DPT variant we consider [17] is pre-trained on MIX-6, a large dataset consisting of more than 1.5 million annotated images.

When ground-truth data are not available, self-supervised approaches have achieved state-of-the-art results by adopting different strategies, such as replacing explicit depth supervision with easier-to-obtain binocular stereo footage [4], devising alternative loss functions [5], or building image reconstruction networks or generative adversarial networks (GANs) [43].

3 Methods

In this section, we review the main components of our experimental evaluation: MDE and the DPT architecture in Sects. 3.1 and 3.2, the Mid-Air dataset in Sect. 3.3, and our proposed procedure for generating adversarial patches in Sect. 3.4.

3.1 Setup of the problem

In the case of supervised learning, the problem of predicting a depth map from a single RGB image can be viewed as that of estimating a non-linear mapping \(\Psi : {\mathcal {X}} \rightarrow {\mathcal {Y}}\), where \({\mathcal {X}}\) is the domain of RGB images \(x \in {\mathbb {R}}^{H \times W \times 3}\) and \({\mathcal {Y}}\) is the set of depth maps. To this end, we are given a training set \(S = \{(x_i,y_i)\}_{i=1}^N\), where each sample is composed of an RGB image x and the corresponding depth map y. In our scenario, the input image is a frame captured by a camera mounted on top of a flying drone, as described later in Sect. 3.3.

Such a problem is solved by minimizing an objective loss function. Common criteria for this task are the absolute relative error (AbsRel) in (1) and the root mean square error (RMSE) in (2), where \({\hat{y}} = f(x)\) is the output of the MDE model f, and the summation runs over all the indices i of pixels for which a ground truth is available:

$$\begin{aligned} \text {AbsRel}(y, {\hat{y}})&= \frac{1}{N} \sum _i \frac{ |y_i - {\hat{y}}_i |}{y_i}, \end{aligned}$$
(1)
$$\begin{aligned} \text {RMSE}(y, {\hat{y}})&= \sqrt{\frac{1}{N} \sum _i { |y_i - {\hat{y}}_i |}^2}. \end{aligned}$$
(2)
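
For reference, a minimal sketch of these two criteria in PyTorch (assuming `y` and `y_hat` are tensors of ground-truth and predicted depths, with an optional boolean `mask` of valid pixels; names are illustrative) could read:

import torch

def abs_rel(y, y_hat, mask=None):
    # Absolute relative error, Eq. (1), over the valid pixels.
    if mask is None:
        mask = torch.ones_like(y, dtype=torch.bool)
    return (torch.abs(y[mask] - y_hat[mask]) / y[mask]).mean()

def rmse(y, y_hat, mask=None):
    # Root mean square error, Eq. (2), over the valid pixels.
    if mask is None:
        mask = torch.ones_like(y, dtype=torch.bool)
    return torch.sqrt(((y[mask] - y_hat[mask]) ** 2).mean())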

As stated in Sect. 2, the meaning of the distances in the ground truth depends on the type of measurement that was taken when building the dataset (e.g., metric vs. relative distance, dense vs. sparse depth maps). In our experimental evaluation, we mostly consider metric depth maps with no missing values.

3.2 Dense prediction transformer

Fig. 1 Schematic representation of a DPT model applied to the task of monocular depth estimation (adapted and modified from [17])

The DPT model [17] is an encoder–decoder model which can be pre-trained for the MDE task. Differently from standard convolutional models, it employs transformer blocks [16] both in the encoder and the decoder. Since transformer blocks do not modify the spatial resolution of their input, the model also adds specialized components to reassemble the final depth maps at varying sizes, as described next. In this section, we summarize the major components using the same nomenclature as [17], and we refer to [17] for a fuller overview of each module. For simplicity, a schematic overview is also provided in Fig. 1. Because our aim is to evaluate the pre-trained DPT performance on a UAV dataset, we do not modify the architecture of the model with respect to [17].

The first step of the DPT model is to encode the input image into a sequence of tokens, which play a role similar to characters or words in natural language processing applications. To this end, the image is split into non-overlapping patches of size \(16 \times 16\), which are then collected in row-major order and vectorized. Denoting by \(x_i\) the vector describing the i-th patch, each patch is linearly projected to the final embeddings as:

$$\begin{aligned} h_i = \left[ g(x_i) \;\; \Vert \;\; p_i \right] \,, \end{aligned}$$
(3)

where \(g(\cdot )\) can be either a trainable linear projection (standard DPT model) or a pre-trained ResNet-50 encoder (hybrid DPT model), \(p_i\) is a trainable positional embedding, and \(\Vert\) denotes concatenation. An additional trainable token \(h_{\text {aux}}\), inspired by the class token of ViTs, is concatenated at the beginning of the patch tokens to provide an additional degree of freedom to the architecture. The final set of tokens h becomes the input to the encoder part of the DPT architecture.
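
A minimal sketch of this tokenization step, following Eq. (3) with a trainable linear projection as g (the standard variant), is given below; the image size, patch size, and embedding width are illustrative values, not the actual DPT hyper-parameters:

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=384, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        self.p = patch_size
        n_patches = (img_size // patch_size) ** 2
        self.g = nn.Linear(in_ch * patch_size ** 2, dim)            # g(.)
        self.pos = nn.Parameter(torch.zeros(n_patches, dim))        # p_i
        self.readout = nn.Parameter(torch.zeros(1, 1, 2 * dim))     # h_aux

    def forward(self, x):                        # x: (B, 3, H, W)
        B, C, H, W = x.shape
        p = self.p
        # Split the image into non-overlapping p x p patches (row-major) and vectorize them.
        x = x.unfold(2, p, p).unfold(3, p, p)                       # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        # Project each patch and concatenate the positional embedding, as in Eq. (3).
        h = torch.cat([self.g(x), self.pos.expand(B, -1, -1)], dim=-1)
        # Prepend the additional readout token h_aux.
        return torch.cat([self.readout.expand(B, -1, -1), h], dim=1)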

The encoder itself is composed of a stack of transformer blocks, each of which is composed of several transformer layers (details on the hyper-parameters are provided later in Sect. 4.1). A generic transformer layer is built as follows [16]:

$$\begin{aligned} z&= \text {MHA}\left( \text {LN}(h)\right) + h, \end{aligned}$$
(4)
$$\begin{aligned} o&= \text {MLP}\left( \text {LN}(z)\right) + z, \end{aligned}$$
(5)

where LN denotes layer normalization, MHA multi-head attention, and MLP a generic feed-forward network applied to each token independently; the two additions correspond to residual connections.
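
A minimal sketch of such a pre-norm transformer layer, following Eqs. (4) and (5), is given below; the embedding width, number of heads, and MLP ratio are illustrative choices and not the DPT settings:

import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, h):                        # h: (B, tokens, dim)
        x = self.ln1(h)
        z = self.mha(x, x, x, need_weights=False)[0] + h   # Eq. (4)
        return self.mlp(self.ln2(z)) + z                   # Eq. (5)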

The decoder of the DPT model is instead built from 4 separate blocks (called Reassemble blocks), taking as input the output of three intermediate layers of the encoder and its final output (see Fig. 1). Each block outputs an image of resolution \(\frac{H}{s} \times \frac{W}{s}\), where s is 32 for the block acting on the encoder output, and then progressively decreases to 16, 8, and 4 (i.e., blocks closer to the original image are decoded to higher resolutions). These outputs are then progressively aggregated from the lowest-resolution one using so-called Fusion blocks, which apply two residual convolutional layers, followed by an upsampling by a factor of 2, so that the final output of the decoder has shape \(\frac{H}{2} \times \frac{W}{2}\).

The Reassemble block is instead composed of two operations: Read and Resample. Denote by t the input to the block, where \(t_i\) is the i-th token, and there are \(N_{p}+1\) tokens, corresponding to the original \(N_p\) patches and to the readout token. The Read operation merges the readout token into the other \(N_p\) tokens via a linear projection:

$$\begin{aligned} t_i = \text {GELU}\left( W\left[ t_i \Vert t_0 \right] \right) \,, \end{aligned}$$
(6)

where \(t_0\) is the readout token, W is a trainable matrix, and GELU is the Gaussian Error Linear Unit activation function. The resulting \(N_p\) tokens are then rearranged into an image-like shape, which is processed by a strided transposed convolution to reach the target spatial resolution (Resample). To obtain the final depth map estimation, a final head is appended to the output of the decoder, composed of a \(3\times 3\) convolutional layer, a \(1 \times 1\) convolutional layer to project to a single channel, and a final \(2 \times 2\) upsampling operation. Because the DPT is pre-trained on a mixture of datasets whose depth maps have different semantics, a customized scale-invariant loss [17] is adopted, where a scale and a shift value are optimized separately for each dataset and applied to the (unscaled) output of the DPT model, allowing it to generalize even to scenarios that were not considered in the training set.
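
As an illustration, a sketch of the Read operation of Eq. (6) is reported below (the token width is an illustrative value); after this step, the patch tokens can be reshaped into an image-like tensor and fed to the Resample convolution:

import torch
import torch.nn as nn

class Read(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # W followed by the GELU non-linearity, as in Eq. (6).
        self.proj = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())

    def forward(self, t):                        # t: (B, N_p + 1, dim)
        readout = t[:, :1].expand_as(t[:, 1:])   # broadcast t_0 to every patch token
        return self.proj(torch.cat([t[:, 1:], readout], dim=-1))   # (B, N_p, dim)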

3.3 Dataset

As mentioned in Sect. 1, the Mid-Air dataset is used for fine-tuning the model in order to evaluate its performance on in-flight operations. Mid-Air (Montefiore Institute Dataset of Aerial Images and Records) [9] is a multi-purpose synthetic dataset containing outdoor videos captured by manually flying a drone in a virtual environment. Because it is composed of images captured in low-altitude drone flights, it is particularly suited to our needs. Mid-Air contains synchronized data from multiple sensors, for a total of more than 420k video frames with resolution \(1024\times 1024\), simulated in various climate conditions (4 weather setups and 3 different seasons). In order to reduce memory usage, the images were down-scaled to a resolution of \(512\times 512\) during fine-tuning. Two examples of images and corresponding ground truths taken from the dataset are shown in Fig. 2.

Fig. 2 Examples of images and corresponding ground-truth depths from the dataset Mid-Air in two different climate settings

3.4 Adversarial patch attack for MDE

Fig. 3 Schematic description of our patch attack for MDE

Apart from evaluating the performance of DPT on the Mid-Air dataset (Sect. 3.3), we want to test its robustness to a particular type of adversarial attack, namely, patch attacks. A patch [22] is a small image region which is trained to fool a classifier or another model whenever it is applied on top of an image, before the image itself is sent to the model for prediction.

Our method for generating adversarial attacks is inspired by the one proposed in [23], where a patch is randomly initialized as a tensor of size \(R\times R\times C\), where \(R\times R\) is the patch resolution (R being smaller than the smallest side of the image to be attacked) and \(C = 3\) is the number of channels (RGB). The patch is then reduced to a circle with radius R/2 by applying a circular mask to each channel of the patch. The number of trainable parameters in the patch is then approximately \(\pi \times \frac{R^2}{4}\times C\). Finally, the patch is transformed before being applied to an image by superposition, and it is trained so that it provides the highest perturbation to any image in the training set, irrespective of the transformation. This is shown visually in Fig. 3.
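
The initialization and masking step can be sketched as follows (a minimal illustration; variable names are our own):

import torch

R, C = 256, 3                                      # patch resolution and channels
patch = torch.rand(C, R, R, requires_grad=True)    # random initialization

# Circular mask of radius R / 2, applied identically to each channel.
ys, xs = torch.meshgrid(torch.arange(R), torch.arange(R), indexing="ij")
center = (R - 1) / 2
mask = (((ys - center) ** 2 + (xs - center) ** 2) <= (R / 2) ** 2).float().unsqueeze(0)

masked_patch = patch * mask    # roughly pi * R^2 / 4 * C active parameters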

More in detail, we train the patch by maximizing a perturbation loss over a training set (which in our case is taken as a subset of the test data, in order to avoid data leakage from the fine-tuning phase; see Sect. 4), so as to maximize the discrepancy between the depths predicted on attacked images and the original depths:

$$\begin{aligned} P = \underset{P}{\arg \max }\ \;\; {\mathbb {E}}_{T \in {\mathcal {T}}} \left[ \sum _i L(f(x_i + T(P)),~ y_i) \right] , \end{aligned}$$
(7)

where f(x) is the MDE model, \({\mathcal {T}}\) is a family of transformations over the original patch (see below), \(+\) is intended as the application of the transformed patch over the original image (which we do by elementwise summation during training), L is an error metric used to compute discrepancies between predictions and ground truths, and the sum \(\sum _i\) runs over all images in the training set of the adversarial patch. In practice, before the patch is applied to each image (both at training and at test time), the following transformations \({\mathcal {T}}\) are applied to the patch to make it as robust as possible (a sketch of the resulting training loop is given after the list):

  • Re-scaling with a random scale factor \(s^2, s\in [0.45, 0.55]\), meaning that the actual resolution of the applied patch is approximately R/4;

  • Rotation with random angle \(\alpha \in [-20,20]\) (degrees);

  • Horizontal and vertical translation from the center of the image, with random horizontal offset \(x \in [-0.3\times W, 0.3\times W]\) and random vertical offset \(y \in [-0.05\times H, 0.3\times H]\), where W, H are the width and height of the attacked image; the values are negative for left and up offsets, and positive for right and down offsets;

  • Perspective transformation with random independent translations of the four corners with a distortion factor \(d \in [-0.1, 0.1]\) for each coordinate of each of the four corners of the patch.
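
A minimal sketch of the resulting training loop is given below; `random_transform` stands for a routine that samples one combination of the transformations above and pads the transformed patch to the image size, and all names, as well as the use of Adam, are illustrative (the actual settings are discussed in Sect. 4.2):

import torch

def train_patch(model, loader, patch, mask, random_transform, loss_fn,
                epochs=50, lr=500.0):
    # Gradient ascent on the expected error of Eq. (7); the MDE model is frozen.
    optimizer = torch.optim.Adam([patch], lr=lr)
    for _ in range(epochs):
        for x, y in loader:                       # image batch and ground-truth depths
            # Sample a random transformation from T and superpose the patch.
            p = random_transform(patch * mask)    # zero outside the (transformed) patch
            x_adv = (x + p).clamp(0.0, 1.0)
            loss = -loss_fn(y, model(x_adv))      # negate to maximize the error
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return patch.detach()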

A set of losses \(L(\cdot , \cdot )\) has been tested during hyper-parameter tuning in order to maximize the overall error induced by applying the attacks to the validation set. Using the same notation as Sect. 3.1, denote by \(\hat{y} = f(x)\) the output of the MDE model, and assume all summations below are made over all pixels for which a depth ground truth is available (the total number of such pixels is denoted by P, i.e., for dense depth maps like in Mid-Air we will have \(P=HW\)). Apart from the AbsRel (Eq. (1)) and the RMSE (Eq. (2)), we also consider:

  • A mean absolute error (MAE) loss, where the squared error of the RMSE is replaced by the absolute error:

    $$\begin{aligned} L(y, {\hat{y}}) = \frac{1}{P} \sum _i |y_{i} - {\hat{y}}_{i}|\end{aligned}$$
    (8)
  • The Scale Invariant Error (SIE, [1]) loss:

    $$\begin{aligned} L(y, {\hat{y}}) = \frac{1}{P} \sum _i d_{i}^2 - \frac{\lambda }{P^2} \left( \sum _i d_{i}\right) ^2, \end{aligned}$$
    (9)

    where \(d_{i} = \log y_{i} - \log \hat{y}_{i}\) is the per-pixel difference between predicted and ground-truth log-depth maps, and \(\lambda\) is a hyper-parameter.
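
A sketch of this loss (with a small constant added inside the logarithms to avoid numerical issues, and an illustrative value for \(\lambda\)) could read:

import torch

def scale_invariant_error(y, y_hat, lam=0.5, eps=1e-6):
    # Per-pixel log difference d_i, as in Eq. (9).
    d = torch.log(y + eps) - torch.log(y_hat + eps)
    P = d.numel()
    return (d ** 2).sum() / P - lam * d.sum() ** 2 / P ** 2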

In addition, we noted during the fine-tuning phase (see Sect. 4.1) that the high accuracy of the model is obtained by learning to accurately predict short distances (< 200 m), which are prevalent in the dataset. Based on this observation, we also test a variant of the AbsRel loss function, where each pixel is weighted based on its ground truth depth, which we call distance-weighted error (DWE):

$$\begin{aligned} L(y, {\hat{y}}) = \frac{1}{P} \sum _i W({\hat{y}}_{i}) \Vert y_{i} - {\hat{y}}_{i}\Vert . \end{aligned}$$
(10)

The per-pixel weights are computed as a function of the depth using a truncated Normal density function, \(W(a)= {\mathcal {N}}(a; \mu , \sigma )\), whose parameters \(\mu\) and \(\sigma\) are computed from the extreme values of the interval I on which the attack is focused, as \(\mu = \frac{1}{2} (\max (I)+\min (I))\) and \(\sigma = (\max (I)-\min (I))/2\). In particular, we set \(\min (I)=0\) and \(\max (I)=300\) m based on the above reasoning.
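
A sketch of the DWE, following Eq. (10) (with the weight evaluated on the predicted depth, as written in the equation, and a simple truncation that zeroes the weight outside the interval; the exact normalization of the truncated density is our own simplification), could read:

import math
import torch

def dwe_loss(y, y_hat, d_min=0.0, d_max=300.0):
    mu = 0.5 * (d_max + d_min)
    sigma = 0.5 * (d_max - d_min)
    # Normal density of the predicted depth, centered on the interval of interest.
    w = torch.exp(-0.5 * ((y_hat - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    w = w * ((y_hat >= d_min) & (y_hat <= d_max)).float()   # truncation to [d_min, d_max]
    return (w * torch.abs(y - y_hat)).mean()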

The loss that produced the overall best attack performance was the AbsRel. While the DWE showed some ability to focus the attack on a depth interval of interest and to exclude depths outside that interval (by properly tuning the weighting parameters), the error induced by these localized attacks was smaller than the one induced by a global attack on the whole depth map with the AbsRel. Further experiments may bring better results for depth-localized attacks, which we leave for future work.

After training, the patch is evaluated by randomly applying it on top of the remaining part of the test set, and computing its average effect on the predictions. Note that, even if the patch is highly localized, its perturbation is expected to affect the prediction globally, due to the non-linear processing performed by the neural network.

4 Experiments

4.1 Fine-tuning over Mid-Air

Table 1 Results of the fine-tuning procedure on the Mid-Air dataset
Table 2 Additional performance comparisons with other state-of-the-art models, considering a maximum distance of 80 m from the camera
Fig. 4 Results of the fine-tuning procedure, aggregated with respect to the true distance of each pixel

For our experiments we use the pre-trained DPT hybrid model [17], and we fine-tune it on the training part of the Mid-Air dataset. As a baseline, we compare the results obtained with DPT against those of a state-of-the-art Fully Convolutional Residual Network (FCRN) model with a ResNet50 encoder pre-trained on ImageNet [47] and trained on the same training set used for fine-tuning DPT. The pre-trained DPT model was trained by its authors with the procedure described in [17] using the MIX-6 dataset designed by the same authors. In particular, the hybrid variant of DPT includes a ResNet50 encoder pre-trained on ImageNet [47]. The pre-trained weights of the DPT model and the implementation of the model architecture are available online at https://github.com/isl-org/DPT.

The FCRN model was built from pre-trained ResNet50 blocks forming the encoder part of the architecture, followed by a decoding part that brings the intermediate feature maps back to the original input size; the decoder consists of randomly initialized unpooling and convolutional (trainable) layers and a final bilinear upsampling that aligns the output shape to the input shape. For our experiments, the model was trained on a subset of 50,000 images from the Mid-Air dataset for 6 epochs using the scale-invariant loss (9), with different learning rates for the encoder and decoder parts of the network. For the ResNet50 encoder (pre-trained on ImageNet), the learning rate was set to \(10^{-3}\), i.e., 1/10 of the \(10^{-2}\) used for the de-convolution blocks.

The fine-tuning procedure of DPT used for our experiments follows the indications provided by the authors, i.e., we use the validation set to find the optimal scale and shift parameters for the scale-invariant loss on the Mid-Air dataset, in order to align the predictions to the average scale and shift of the dataset ground-truth depths. The tuned scale-invariant loss was then used for fine-tuning DPT for 10 epochs with a learning rate \(lr=10^{-7}\) over the same training set used for the FCRN model. The procedure to compute an estimate of the correct scale and shift parameters is proposed by the authors of DPT [17], and consists of finding the values of scale and shift that minimize the AbsRel over the training set, and then using these values both in training and testing to align the outputs to the ground truths by applying them to the predicted depths as \(d^* = s*d + r\), denoting by s, r the scale and shift, and by \(d, d^*\) the unaligned and aligned output depths, respectively.

This alignment is necessary in order to make use of the (unscaled) spatial information coming from the pre-training of DPT: without it, the unscaled depth maps produced by DPT could be so far from the ground-truth depths that, given the resulting high loss values, the training would essentially start from scratch.
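
As an illustration of this alignment step, the following sketch estimates the scale s and shift r by directly minimizing the AbsRel with a gradient-based optimizer over a set of unaligned predictions and ground truths; the optimizer, learning rate, and number of steps are our own illustrative choices:

import torch

def fit_scale_shift(d_unaligned, y, steps=2000, lr=1e-2):
    # Find s, r minimizing AbsRel(y, s * d + r) over the given set of pixels.
    s = torch.ones(1, requires_grad=True)
    r = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([s, r], lr=lr)
    for _ in range(steps):
        d_star = s * d_unaligned + r
        loss = (torch.abs(y - d_star) / y).mean()   # AbsRel, Eq. (1)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return s.detach(), r.detach()

# At training and test time, the aligned prediction is then d_star = s * d + r.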

The 50,000 training images were selected in order to reflect the variability introduced by the different simulated seasons, so that most of the different climate and lighting variations of the environment could be seen during training, with the exception of the summer season which was not simulated by the authors of Mid-Air.

We show the results of this experimental section in Table 1, and we plot the AbsRel over the test set as a function of the pixel’s distance in Fig. 4.

To ease the reading of the table, we give a quick description of the metrics reported in it.

The Logarithmic RMSE, reported as \(\text {RMSE}_{10}\), is defined as

$$\begin{aligned} \text {RMSE}_{10}(y, {\hat{y}}) = \sqrt{\frac{1}{P}\sum \limits _{i} \left[ \log _{10}(y_{i}) - \log _{10}({\hat{y}}_{i})\right] ^2} \end{aligned}$$

and, unlike the RMSE, it is less influenced by the large absolute differences found in the distant parts of the image, since log-depths are used as arguments. The threshold accuracies, reported in the table as \(\delta _p, ~p\in \{1,2,3\}\), are defined as the fraction of pixels for which both the ratio between predicted and true depth and its inverse are smaller than the threshold \(1.25^p\). These fractions are expressed by the formula:

$$\begin{aligned} \delta _p(y, {\hat{y}}) = \frac{1}{P}\sum \limits _{i} \mathbb {1}\left[ \max \left( \frac{y_{i}}{{\hat{y}}_{i}}, \frac{{\hat{y}}_{i}}{y_{i}}\right) \le 1.25^p \right] , \quad p \in \{1,2,3\} \end{aligned}$$

and higher values of these fractions indicate more accurate models.
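
For completeness, a sketch of these two metrics (assuming dense ground truths, so that all pixels are valid; names are illustrative) could read:

import torch

def rmse_log10(y, y_hat):
    # Logarithmic RMSE (RMSE_10).
    return torch.sqrt(((torch.log10(y) - torch.log10(y_hat)) ** 2).mean())

def threshold_accuracy(y, y_hat, p=1):
    # Fraction of pixels whose depth ratio (in either direction) is within 1.25^p.
    ratio = torch.maximum(y / y_hat, y_hat / y)
    return (ratio <= 1.25 ** p).float().mean()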

For each experiment, we evaluate three different variants by clamping the maximum predicted distance to a given threshold (respectively, 80 m, 500 m, and 1 km). We note that, in all scenarios, the zero-shot DPT has good performance which, however, is not on par with a standard FCRN architecture. After fine-tuning, however, the DPT model obtains state-of-the-art results in all scenarios. For example, for the case where the maximum distance is 500 m, we obtain a \(51\%\) relative improvement in RMSE over the zero-shot version, and a \(13\%\) improvement over the FCRN model. Other metrics show a similar behavior, e.g., we obtain a \(61\%\) improvement in AbsRel over the zero-shot DPT, and a \(29\%\) relative improvement in AbsRel over the FCRN.

It is interesting to observe from Fig. 4 that the poor results of the zero-shot DPT variant stem from a high AbsRel over short (< 300 m) distances, which are quite common in the Mid-Air dataset and which are impacted significantly by the fine-tuning procedure.

In Table 2, we provide an additional benchmark comparison of the architecture with recent state-of-the-art models for MDE, including models that make use of auxiliary data for training. In the case of M4Depth [46], in particular, the training is performed by leveraging a captured input sequence together with the motion information about the sequence itself. DPT is remarkably competitive with this architecture as well, which is the current state-of-the-art on the Mid-Air dataset, while being able to predict the depth map starting from a single frame. The fine-tuned version is also significantly better than alternative state-of-the-art models, including Monodepth2 [5] and ManyDepth [45].

The excellent scores attained by DPT come at the cost of a higher computational complexity, as shown in Table 3, where we provide a comparison in terms of Multiply-and-Accumulate (MAC) operations and millions of parameters. Nevertheless, DPT is still able to predict accurate depth maps in extraordinarily short times (a few tens of milliseconds), even without resorting to last-generation hardware, thanks to the high degree of parallelism allowed by its wide and rather shallow structure [17].

Table 3 Complexity of benchmark models, in terms of MAC operations and millions of parameters. In this table we exclude sequence-based models

4.2 Adversarial attacks to transformers for MDE

Fig. 5 a Trained patch for DPT; b the same trained patch after a random transformation from \({\mathcal {T}}\) is applied on top of it

Fig. 6 Example of a patch trained attacking DPT on Mid-Air images, transformed and applied on an image of the test-set

Different losses and parameter configurations were evaluated through a random-search hyper-parameter tuning before finding the optimal loss and configuration used in our experiments. The best results on the validation set used for the tuning were obtained by training with the AbsRel as training loss, using the Adam optimization algorithm for 50 epochs with a learning rate of 500. The trainable patch was initialized randomly as a parameter tensor of shape \(256\times 256\times 3\), and a circular mask was applied to all channels in order to obtain an approximately circular RGB patch with a diameter of 256 pixels. We show an example of a trained patch for the fine-tuned DPT model in Fig. 5a, an example of a transformed patch in Fig. 5b, and an example of its application on the test set in Fig. 6.

Table 4 Results of the attacks on the fine-tuned DPT model and the FCRN model trained on Mid-Air on a test set of 400 images coming from Mid-Air
Table 5 Percentage variations of the metrics after the attacks for the two models

In Tables 4 and 5, we report the average performance of DPT and the FCRN models on the test set (obtained after removing the 400 images used to train the adversarial patch), both with and without applying the patch itself. By comparing these results with the performance on the unattacked test set, we observe that, on average, the error induced by the adversarial patch on DPT is unsatisfactory from the attacker's perspective, especially when compared with the results obtained on the FCRN model by training a new patch against it with the same configuration used against DPT.

The best configuration yielded effects that, although visible, cannot be interpreted as successful attacks, since the predictions on most input images remain quite accurate (except for the portion of the images covered by the patch) after the application of the adversarial patch. The inputs for which the patch actually causes some visible error are the ones for which accurate estimation is already harder, such as inputs in which fog or snow covers most of the scene. This can be seen in Fig. 7, while Fig. 8 shows an example of a failed attack in a more common Mid-Air setting, where the highest errors are the ones concerning the portion of the image covered by the patch itself. By comparison, the attacks on the FCRN model (Figs. 9 and 10) are significantly more powerful, as can also be seen from the relative errors in Table 4.

Fig. 7 Example of successful attack on DPT with a hard input

Fig. 8 Example of failed attack on DPT with a normal input

Fig. 9 Example of attack on FCRN with a normal input

Fig. 10 Example of attack on FCRN with a hard input

4.3 Transfer experiments

An interesting research question is to determine whether the patch trained for attacking one model can be used to successfully attack a different model. This would amount to a black-box adversarial patch attack, where the architecture of the attacked model is not known and possibly very different from the one for which the patches were trained. In our setting, the two model architectures are indeed very different (transformer vs. CNN), although they both share a pre-trained ResNet50 encoder; hence, if the attack influences the predictions at the embedding extraction stage, we might expect some of its effect to transfer because of these shared embedding blocks.

Table 6 Percentage variations of the metrics after attacking each model with the patch trained for attacking the other

The results reported in Table 6 show that FCRN suffers more in terms of AbsRel error with the patch trained for DPT, while DPT suffers more in terms of all the other metrics; however, the predictions of both models are almost unaffected when the patch trained on the other model is applied.

Examples in Figs. 11 and 12 show that the effect on hard inputs is the same already observed for the attacks on DPT with its own trained patch. This possibly indicates that, on these inputs, the effect the patch induces on the shared encoder part of both architectures is the most influential on the predictions, although verifying this claim is beyond the scope of this experiment and requires further investigation.

Fig. 11 Example of transfer attack on DPT with regular (left) and hard (right) input, using a patch trained to attack FCRN

Fig. 12 Example of transfer attack on FCRN with regular (left) and hard (right) input, using a patch trained to attack DPT

5 Conclusions and future works

In this paper, we evaluated for the first time the performance of a ViT model on an MDE task performed from a flying drone. We showed that the model is capable of good performance even in a zero-shot context, and that it achieves state-of-the-art results after a brief fine-tuning phase. Next, we evaluated its robustness to an adversarial attack (in the form of a pre-trained patch), showing that it possesses a strong degree of robustness against this kind of attack, while competitive fully convolutional models can be fooled reliably. Overall, our results highlight the strong potential of this class of models for performing MDE in scenarios involving low-altitude flying drones, where robustness is also important. Future work will study the performance of the model on a real setup, with the collection of a large dataset of low-altitude flight scenarios, an analysis of the robustness of DPT against physical versions of our patch attack and more general corruptions and attacks, and a study of the transferability of the attacks to a wider range of scenarios, datasets, and architectures.