1 Introduction

Landing on small celestial bodies has become one of the most important challenges in the pursuit of space exploration and exploitation. Several missions in recent years have achieved major milestones towards this goal by following different strategies. The Rosetta mission (Schwehm and Schulz 1999; Glassmeier et al. 2007) deployed a lander (Boehnhardt et al. 2017) whose objective was to obtain information about the comet’s gravitational field and surface properties. The missions described in Kawaguchi et al. (2008), Lauretta et al. (2017) (OSIRIS-REx), and Watanabe et al. (2017) used a different approach and performed Touch-and-Go (TaG) manoeuvres, which consist of a controlled touch-down on the surface of the celestial body followed immediately by a powered ascent.

Interacting with the surface of the body of interest has thus gained prominence in the mission design and analysis component of current space exploration activities, even though it poses some of the most challenging tasks. Landing sequences require a very high degree of autonomy due to the communication latency with ground stations on Earth, so it is key that the spacecraft be able to revise its original instructions based on the inputs received by its on-board sensors.

One of the systems in charge of this is the hazard detection and avoidance (HDA) system, whose typical tasks include (but are not limited to) shadow detection, feature detection, slope estimation, and surface roughness estimation. Two main types of HDA systems can be distinguished: passive and active. Passive systems rely only on camera inputs, while active systems additionally use LiDAR. The difference lies in the way the sensors interact with the environment. A LiDAR, for instance, emits a beam that is registered by the sensor once it bounces off the surface and returns; it is considered an active sensor because it registers an observation originally sourced from the sensor itself. Passive systems, on the other hand, rely on sensors that only receive input observations without proactively emitting anything into the environment. Cameras are passive sensors because they only capture whatever information is already available in the environment they operate in.

Some of the earliest works on HDA systems were presented by Epp and Smith (2007) and Johnson et al. (2008), where the use of active HDA systems for landing on the lunar surface is discussed and analysed, including parametric analyses of sensor performance, trajectory angle, and vehicle hazard tolerance. Although the use of passive systems had already been discussed by Huertas et al. (2006), their performance fell short when it came to slope estimation, as was also concluded in the subsequent analysis performed by Neveu et al. (2015).

Having access to the digital elevation maps (DEMs) generated by LiDAR sensors makes the slope and roughness estimation processes much easier and more accurate, but it shapes the mission design around active sensors that can weigh up to 25 kg and require up to 200 W of power (Huertas et al. 2006). In addition, they offer very restrictive field-of-view (FoV) angles and resolutions and occupy considerable volume inside the spacecraft’s bus. Cameras, on the other hand, offer much more flexibility from a mission design perspective in terms of mass, volume, and power consumption, which is key for missions whose scientific return is highly valuable.

All of the aforementioned pieces of work related to HDA relied on traditional image processing (IP) techniques, which pose the additional challenge of having to adapt computationally expensive techniques to the limited space-proven hardware available for space missions. It was only recently that the use of artificial intelligence (AI) techniques (in particular artificial neural networks (ANNs)) was proposed to tackle this problem (Lunghi et al. 2015), because of their great generalisation properties, which become very relevant in problems where it is impossible to determine in advance the different hazards present in the terminal landing area.

Additionally, ANN techniques are highly computationally efficient, since their working principle is based on polynomial operations, which is a desirable feature when an on-board implementation is considered. In Lunghi et al. (2016), two different architectures are compared: multilayer feedforward and cascade. In that work, information about the altitude and attitude of the spacecraft is also provided as additional input to the HDA algorithm, which then generates the hazard map.

Following the same research line, in Silburt et al. (2019) a convolutional neural network (CNN) was trained to recognise and count lunar craters. The network also showed promising results regarding the transferability of its predictions when tested on a database of images of Mercury instead of the Moon.

In Pasqualetto Cassinis et al. (2020) and Sharma and D’Amico (2020), CNNs were trained to estimate the position and pose of a non-cooperative spacecraft using only monocular vision observations. In Sharma and D’Amico (2020), in particular, the authors also detail how a database generation functionality had to be developed due to the scarcity of resources available for training. In that same year, Tomita et al. (2020) demonstrated the superiority of CNN-based algorithms over traditional techniques.

More recently, Silvestrini et al. (2022) and Pugliatti et al. (2022) have shown how these networks can be used to perform orbital navigation using lunar craters and binary asteroid system geometry, respectively. The authors show how the great generalisation capabilities of such algorithms greatly improve the quality of the predictions made by the networks. The applicability of these algorithms to landing sequences and hazard detection is discussed and, in Pugliatti et al. (2022), a U-shaped network is developed and trained to detect features on small celestial bodies, using an ad hoc data set generation tool.

Because of the benefits they offer, CNNs have been used in many different projects in the last few years. However, one of the main drawbacks of deep learning (DL) architectures is that they require a relatively large amount of data to be trained (Al-Moosawi et al. 2021; Gao et al. 2021). On top of that, since the training scheme follows a supervised approach, the labelling of the available data is an effort required to ensure the quality of the training. These two factors make it very expensive to properly train such a network from scratch.

Considering all the above, and in spite of the amount of research devoted to HDA systems and IP techniques, very few researchers have tackled the problem of combining both fields. In particular, developing algorithms capable of processing passive observations while still accounting for the geometry or roughness of the surface of a celestial body remains largely unexplored. Developing AI-based passive HDA systems is very challenging mainly because it involves predicting 3D geometry from 2D visual inputs. However, finding a solution would greatly benefit future space missions to planetary bodies by providing a low-impact solution (in terms of both cost and mission design) that would increase the autonomy of the spacecraft and its capability to interact with the bodies it explores.

Thus, in this work, passive HDA systems based on CNNs are developed and tested against different image data sets. CNNs offer the benefit of achieving high accuracies at the cost of a more expensive training phase. Three separate layers are proposed for the complete system: shadow detection, feature detection, and slope estimation. The input to the algorithm is the raw image as it would be taken by a camera, without any additional information about the spacecraft’s state. The algorithm developed under this research is named astroHda and has been included in the Astrodynamics Simulator (AstroSim) suite (Peñarroya et al. 2022). Unlike the previously mentioned works, astroHda is intended to be capable of predicting hazards for any small body it could encounter, taking advantage of the generalisation capabilities that deep learning techniques such as CNNs offer.

In the following, Sect. 2 describes the architecture chosen for the neural networks and the techniques exploited for their development. After that, Sect. 3 introduces the training scheme followed and explains how the images used for training and included in the data sets are generated. Section 4 shows the results obtained with the developed networks and analyses the different elements involved in the HDA system; it also explains how safety maps are created and includes network predictions for additionally rendered images not used for training, as well as for images taken by real sensors. Finally, Sect. 5 gathers some conclusions about the results obtained and discusses strengths and weaknesses of the developed techniques.

2 Architecture

CNNs are multilayer feedforward neural networks that apply image analysis filters as convolutions. This architecture is a type of DL method with the capability of very accurately classifying images and detecting patterns or other features in pictures. By stacking several layers, a pixel-by-pixel classification can be obtained as output. In each of these layers, multiple filters are convolved over the layer input to obtain the so-called feature maps. Thus, CNNs are mainly composed of three types of layers:

  1. Convolutional layers, where the image filters are applied in order to identify a certain pattern or feature.

  2. Pooling layers, where the feature maps are reduced to their main attributes, lowering the computational cost while preserving the main characteristics of the map.

  3. Fully connected layers, which are responsible for the final classification and take into consideration all the previously stacked layers.
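As an illustration of how these three layer types fit together, the following minimal PyTorch sketch stacks them; the channel sizes, input resolution, and number of classes are arbitrary assumptions and do not correspond to the architecture used in this work:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal example stacking the three CNN layer types."""
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 75 * 75, n_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)          # (N, 32, 75, 75) for a 300x300 input
        x = x.flatten(start_dim=1)
        return self.classifier(x)

# Example: a batch of four 300x300 RGB images
logits = TinyCNN()(torch.randn(4, 3, 300, 300))
```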

As introduced in Sect. 1, training a DL network from scratch can become computationally expensive. An alternative, proposed in Akçay et al. (2016), is transfer learning. This method takes advantage of pre-trained networks as a baseline for the training of a neural network destined for a different (or similar) task. The weights obtained after training such a network are used as initial values for the training phase of a particular project. Some of the layers are initially frozen so that the first steps of the training do not modify the core hidden layers of the network. The reason why these core layers are frozen is that they contain the key feature maps previously learned by the network, which consist of basic features that can easily be reused across many different image classification tasks. These key features include straight lines, circles, directionalities, or other basic geometric entities. On the other hand, the last layers of a network learn features that become much more case-specific and are, thus, not frozen from the beginning, to allow the pre-trained network to adapt to the new data set. For instance, a network that has been trained to differentiate and classify different object images, such as the one shown in Liu et al. (2014), would offer more detailed features as the layers go deeper. Figure 1 shows how, for deeper layers, the detail and case specificity of the feature maps increase (Zeiler et al. 2014).

Fig. 1

Feature maps at different stages of a network. The feature maps at the initial layers encode edges and directions, i.e. horizontal or vertical lines; the feature maps obtained at the middle of the network visualise textures and patterns; and the feature maps at the last layers depict parts of objects and objects in the image (Vignesh 2020)

Making use of such a technique is paramount when training a neural network for space applications, where the amount of available real data is relatively limited and not pre-processed for supervised learning. The next section will explain how this problem is tackled using both transfer learning and a pipeline for the generation of an artificial data set for the specific study case of a landing sequence in an asteroid environment.

Another of the main problems of deep neural networks is the so-called vanishing gradient problem. In plain architectures, the derivatives with respect to the weights of each layer can be small, and when they are multiplied together (following the chain rule) during back-propagation they become even smaller, to the point where the update of the initial layers loses its effectiveness. In He et al. (2015), the idea of residual layers is proposed, where the vanishing gradient problem is mitigated by adding a skip connection, which bypasses a number of convolutional layers, at every basic block of the network. Figure 2 shows what a skip connection looks like in a basic convolution block. The input, \(x_l\), is bypassed downstream and added to the output of the convolution filters. This allows for much deeper architectures without overloading the training with the computational effort of excess parameters.
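The skip connection of Fig. 2 can be sketched in PyTorch as follows; this is a simplified basic block for illustration only, and the channel count and normalisation details are assumptions rather than an exact reproduction of ResNet-34:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip (identity) connection, as in Fig. 2."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # x_l is stored...
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                      # ...and added to the convolution output
        return self.relu(out)

y = BasicResidualBlock(64)(torch.randn(1, 64, 75, 75))
```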

Taking advantage of this structure, a novel architecture was proposed in He et al. (2015), named residual network (ResNet)-34 after the 34 layers it is composed of (Table 1 gives a summary of the structure of the network). The network was trained using the ImageNet data set (Deng et al. 2009), which consists of 1000 classes (hence the size of the last layer of the network in Table 1) distributed over 1.28 million training images and 50,000 validation images. After more than 500,000 training iterations, the model was found to considerably reduce the top-1 error (by \(3.5\%\)) with respect to its non-residual counterpart, due to the mitigation of the degradation problem.

Fig. 2

Comparison between a traditional convolutional block (left) and a residual block (right)

Table 1 ResNet-34 architecture

For the results shown in this work, ResNet-34 has been chosen as the backbone architecture because, as shown above, it is very effective when applied to semantic segmentation tasks. The higher depths achievable with residual networks greatly benefit the capability of the network to detect certain patterns and shapes without raising the computational burden of the training phase. Several ResNet architectures are available, the most commonly used ones ranging from 18 to 101 layers. ResNet-34, introduced in Table 1, offers a good trade-off for the problem at hand and is also available in the fastai collection of pre-trained networks, which makes its implementation much simpler. The weights with which the network is initialised correspond to those obtained by training it on the ImageNet data set, as introduced in He et al. (2015). This transfer learning approach reduces the computational load needed to fit the network to the particular needs of the study case posed in this paper.
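As a minimal sketch of this initialisation step, the snippet below loads a ResNet-34 pre-trained on ImageNet via torchvision (used here as a stand-in for the fastai collection) and freezes the core layers; the four-label head and the freezing granularity are illustrative assumptions:

```python
import torch.nn as nn
from torchvision import models

# Load ResNet-34 initialised with ImageNet weights (transfer learning baseline).
# Older torchvision versions use `pretrained=True` instead of the `weights` argument.
backbone = models.resnet34(weights="IMAGENET1K_V1")

# Freeze the core layers so early training does not overwrite the generic
# low-level feature maps (edges, circles, directionalities, ...).
for param in backbone.parameters():
    param.requires_grad = False

# Replace, and leave trainable, only the last case-specific layer
# (here: 4 output labels instead of the original 1000 ImageNet classes).
backbone.fc = nn.Linear(backbone.fc.in_features, 4)
```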

3 Data set generation and training

According to Ripley (2007) and Ghilardi et al. (2022), CNN-based approaches require the assessment of the network using three sets:

  • Training set: a set of examples used for learning, that is, to fit the parameters of the classifier.

  • Validation set: a set of examples used to tune the parameters of a classifier, for example, to choose the number of hidden units in a neural network.

  • Test set: a set of examples used only to assess the performance of a fully specified classifier.

Fig. 3

Arbitrary rendered images for comet 67P/Churyumov-Gerasimenko generated using AstroSim. Notice how features (craters and boulders) have been randomly distributed around the surface of the comet to enrich the data set

Since the training scheme follows a supervised approach, the generation of ground truth images is fundamental for the creation of the database. The images used as input to the CNN should be the raw images as they would appear on the camera, i.e. the original mesh with shadows and features (referred to hereafter as the featured mesh).

Thus, the ground truth information required for the astroHda database must include labels for shadowed areas, feature locations, and slope information. The first two can be obtained by pixel subtraction between the featured body and the mesh observed from the same position and with the same orientation but with all of its features removed (referred to as the naked mesh). For the latter, i.e. the slope estimation, the slope at each facet of the mesh is computed, as depicted in Fig. 4, using the local gravity vector as the vertical reference. To compute the gravity vector, and exploiting the fact that a shape model is available, AstroSim is used to compute the gravitational potential at the centre of each facet using a polyhedral gravity model implemented following Werner (1994). The slope estimation can then be tackled in two different manners: with a regression algorithm or with a classification method.
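A purely illustrative sketch of the pixel-subtraction idea, assuming the featured and naked renders are available as grayscale arrays in [0, 1] taken from the same viewpoint; the threshold values and the exact labelling rules are assumptions, not the implemented pipeline:

```python
import numpy as np

def generate_masks(featured: np.ndarray, naked: np.ndarray,
                   diff_thresh: float = 0.05, dark_thresh: float = 0.02):
    """Illustrative mask generation by pixel subtraction between two renders."""
    # Pixels that differ between the two renders are attributed to features
    # (boulders/craters added on top of the base mesh).
    feature_mask = np.abs(featured - naked) > diff_thresh
    # Very dark pixels in the featured render are either shadow or deep space;
    # deep space is taken here as dark in BOTH renders (no surface present).
    deep_space_mask = (featured < dark_thresh) & (naked < dark_thresh)
    shadow_mask = (featured < dark_thresh) & ~deep_space_mask
    return feature_mask, shadow_mask, deep_space_mask
```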

The first option implies predicting the numerical value of the slope at each pixel corresponding to the body’s surface. Since the pixels ultimately need to be translated into “safe” or “dangerous” areas, a simplified approach was chosen, where a threshold is defined to categorise facets as safe or unsafe. Transforming this regression problem into a classification problem also allows the use of a neural network structure similar to the one used for shadow and feature detection.

Fig. 4

Example of the slope computation result for asteroid Lutetia. Around the body, an arbitrary orbit is displayed in orange

The slope threshold was set to \(15^\circ \) in this work based on Wang et al. (2020). Then, the equivalent image of the featured mesh is generated by colouring the mesh accordingly (referred to as the sloped mesh).
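A minimal sketch of the per-facet slope computation and thresholding described above, assuming the facet normals and local gravity vectors (e.g. from the polyhedral model) are already available as arrays; the function names and shapes are illustrative:

```python
import numpy as np

SLOPE_THRESHOLD_DEG = 15.0  # threshold used in this work (Wang et al. 2020)

def facet_slopes(normals: np.ndarray, gravity: np.ndarray) -> np.ndarray:
    """Slope of each facet: angle between the outward normal and the local
    'up' direction, taken as the direction opposite to the gravity vector.

    normals, gravity: arrays of shape (n_facets, 3).
    """
    n_hat = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    up_hat = -gravity / np.linalg.norm(gravity, axis=1, keepdims=True)
    cos_slope = np.clip(np.sum(n_hat * up_hat, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos_slope))

def classify_facets(normals, gravity):
    """Binary safe/unsafe classification used to colour the 'sloped mesh'."""
    return facet_slopes(normals, gravity) > SLOPE_THRESHOLD_DEG  # True = unsafe
```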

Images for the training database were generated using four different body meshes, one crater model, and three boulder models in three random combinations and distributions. To render the images, ten different orbits were simulated for each of the obtained bodies (the naked mesh, the three featured bodies, and the sloped mesh) and twenty pictures were taken from the spacecraft using a nadir-pointing attitude profile.

Fig. 5

Examples of the different base meshes and feature distributions for the training and validation data set

Fig. 6

Examples of the masks generated for an arbitrary rendering of Bennu. Blue represents deep space, red represents dangerous areas, and green represents safe areas

The resulting set contains 2000 images that are then processed to generate the masks used as ground truth for the semantic segmentation training, using the pipeline described before. Examples of the different meshes used (with overlaid feature distributions) can be observed in Fig. 5. The corresponding masks for a rendering of Bennu from the training set are also depicted in Fig. 6, where the labels for deep space, danger, and safe areas are depicted. Table 2 shows the different labels used and their colour codes.

This training data set is divided into training and validation sets (as described at the beginning of Sect. 3) using a random 80–20 split, respectively. This means that 80% of the data set is used to compute the optimal change in the internal weights of the network, and 20% is used as an online cross-check to estimate whether the network is fitting too closely to the images it is being trained on. The rendered images are downsized to a 300 × 300 pixel resolution and normalised using the parameters obtained from ImageNet (He et al. 2015). The reason is that the initial weights for the architectures used are extracted from the ImageNet data set, so the normalisation is performed accordingly, exploiting transfer learning (Akçay et al. 2016) as a way to train the networks developed for this work more efficiently. Additionally, to increase the robustness of the training, the augmentations collected in Table 3 are applied to the data set, and an example of the resulting images is shown in Fig. 7 for the slope estimation layer. There, it becomes clear how the image scaling and the granularity added by the Gaussian noise affect the rendered images.
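A minimal sketch of this preprocessing (resizing to 300 × 300, ImageNet normalisation, Gaussian-noise augmentation, and the 80–20 split); the noise level shown is an arbitrary assumption rather than the value listed in Table 3:

```python
import torch
import torchvision.transforms.functional as TF

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def preprocess(img: torch.Tensor, train: bool = True) -> torch.Tensor:
    """img: RGB tensor of shape (3, H, W) with values in [0, 1]."""
    img = TF.resize(img, [300, 300])                      # downsize to 300 x 300
    img = TF.normalize(img, IMAGENET_MEAN, IMAGENET_STD)  # ImageNet statistics
    if train:
        img = img + 0.01 * torch.randn_like(img)          # Gaussian noise augmentation
    return img

# The random 80-20 split between training and validation could be done with
# torch.utils.data.random_split on the full 2000-image dataset, e.g.:
#   train_set, valid_set = random_split(full_dataset, [1600, 400])
```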

Table 2 Label meaning for each of the layers developed for astroHda
Table 3 Augmentation parameters for training data set
Fig. 7

Example of the augmentations applied to the training set, for the slope estimation layer. In the images shown, the label colouring is superimposed over the original image

The accuracy metric used for the training relies on the Dice metric from fastai (Howard 2018), which builds on traditional Intersection-over-Union measures. The optimiser used is Ranger, a combination of the RAdam and Lookahead optimisers (Wright et al. 2021). As for the loss function, the flattened version of the cross-entropy loss (FlattenedLoss of CrossEntropyLoss) is utilised, following common practice in semantic segmentation problems.

The training scheme, as mentioned before, is based on a transfer learning approach, where the ImageNet weights of ResNet-34 (used as the backbone) are taken as the initial guess. A first stage of 10 epochs is performed, during which only the last layer of the network is available for tuning, followed by a second stage of 12 epochs in which all the weights are unfrozen and free to be modified by the optimiser to achieve a better fit to the data set used for this development.
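A minimal sketch of this two-stage scheme with the fastai API; dls (the segmentation DataLoaders built from the rendered data set) is assumed to exist, and the exact learner configuration shown here is an approximation of the setup described above, not the verbatim implementation:

```python
from fastai.vision.all import (unet_learner, resnet34, ranger,
                               CrossEntropyLossFlat, DiceMulti)

# Segmentation learner with a ResNet-34 backbone pre-trained on ImageNet.
learn = unet_learner(
    dls, resnet34,
    loss_func=CrossEntropyLossFlat(axis=1),  # flattened cross-entropy loss
    metrics=DiceMulti(),                     # Dice-based accuracy metric
    opt_func=ranger,                         # RAdam + Lookahead
)

# Stage 1: backbone frozen, only the head (last layers) is tuned.
learn.freeze()
learn.fit_one_cycle(10)

# Stage 2: all weights unfrozen and fine-tuned on the rendered data set.
learn.unfreeze()
learn.fit_one_cycle(12)
```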

Fig. 8

Examples of the different base meshes and feature distributions for the test data set

The obtained networks are then tested using a new, smaller, data set, which is generated using the same pipeline described before, but with four new base meshes and features. This set consists of 100 images, and its role is to check the accuracy of the developed functionalities when used outside the data set they have been trained with. Examples of the used renderings can be observed in Fig. 8.

4 Results

The networks are tested on each of the aforementioned data sets to evaluate the predictions they are capable of making. To understand how well the developed algorithms can be transferred to a hardware-in-the-loop (HIL) scenario, the networks are also tested on a number of real-world images, even though the training data set included no images of this type.

4.1 Shadow detection

Shadow detection was initially implemented following the Multi-Task Mean Teacher (MTMT) approach proposed by Chen et al. (2020), without further training. In their paper, an MTMT training scheme is implemented in which a CNN is trained to perform three different tasks simultaneously, namely shadow count, shadow detection, and edge detection. Representative results of the obtained predictions can be observed in Fig. 9. That network was trained using real-world images, and in the results obtained it can be seen how the prediction struggles to find the boundary between shadowed and deep-space areas.

Fig. 9

Early shadow detection examples based on the MTMT method by Chen et al. (2020). Input is an image rendered using astroRender. Output is a binary map with ones (white pixels) where shadows are detected

Fig. 10

Shadow detection example. Inputs and masks are overlaid. Labels map danger areas in red, safe areas in green, and deep space in blue

To improve on the results obtained with the MTMT network, a new network (based on the approach introduced in Sect. 2) was developed. The architecture described in Table 1 was adopted, with slight modifications to the last layer so that the output would be reduced to four labels (as introduced in Table 2): illuminated, shadowed, deep space, or undetected.

The labelling methodology chosen was per-pixel labelling, as is typically done in semantic segmentation tasks, where the last layers of the network are passed through a softmax layer that assigns a label to each image coordinate. In the particular case of the shadow detection algorithm, safe areas correspond to zones where the illumination conditions allow the spacecraft to see the surface in detail, hazard corresponds to shadowed or weakly illuminated zones, deep space corresponds to the image pixels outside the limb of the body, and a fourth label, undetected, is given to those pixels that the algorithm cannot classify as any of the previous types. Table 2 gathers the meaning of the different labels for each of the layers developed in this work.
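A minimal sketch of this per-pixel label assignment; the four-channel logits tensor and its channel ordering are illustrative assumptions:

```python
import torch

# logits: network output of shape (batch, 4, H, W),
# with channels ordered, e.g., as [safe, hazard, deep_space, undetected].
logits = torch.randn(1, 4, 300, 300)

probs = logits.softmax(dim=1)   # per-pixel class probabilities
labels = probs.argmax(dim=1)    # (batch, H, W) map of label indices
```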

Fig. 11

Confusion matrix for the shadow detection layer. Performances are normalised with respect to the total number of true labels for each category, indicated by the colorbar on the right-hand side of the figure

Following the training scheme introduced in Sect. 3, the predictions shown in Fig. 10 are obtained. Its left-hand panel shows two examples of the shadow predictions the algorithm is able to perform. The predicted shadows match the target (i.e. the ground truth) very accurately, even in zones where the shape of the shadowed areas becomes highly irregular or around features. On the other hand, the upper example in that figure shows how, when shadows are cast near the body limb, it becomes difficult for the CNN to distinguish between deep space and the shadowed area of the body. In fact, looking at the right-hand panel, where the three worst predictions are gathered, it is clear that most of the loss value of these predictions comes from the inaccurate detection of the boundary between shadowed area and deep space.

A more complete overview of the results obtained for the whole data set is provided by the confusion matrix shown in Fig. 11. There, the predicted labels for each of the images in the data set are compared to the true labels, as described by the masks. Four categories can be identified, namely true positive (TP), true negative (TN), false positive (FP), and false negative (FN), and these quantities are computed for each label. For the shadow detection case, deep space and safe labels are very accurately predicted, and the shadowed label achieves a \(67.78\%\) TP rate. The remaining true shadow pixels are mostly predicted as deep space (\(29.29\%\)), and only \(2.93\%\) are labelled as safe areas.
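A minimal sketch of how such a row-normalised confusion matrix (each row divided by the total number of true pixels with that label) can be computed from flattened label maps; purely illustrative:

```python
import numpy as np

def normalised_confusion_matrix(y_true: np.ndarray, y_pred: np.ndarray,
                                n_labels: int = 4) -> np.ndarray:
    """Rows: true labels, columns: predicted labels, normalised per true label."""
    cm = np.zeros((n_labels, n_labels))
    for t, p in zip(y_true.ravel(), y_pred.ravel()):
        cm[t, p] += 1
    row_totals = cm.sum(axis=1, keepdims=True)
    return cm / np.where(row_totals == 0, 1, row_totals)
```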

After close inspection of the masks generated for the training, it was observed that, when detecting the limb of the body in the original images (using thresholding and pixel subtraction techniques), some of the pixels around the limbs of the body are labelled as danger. This can be better observed in Fig. 14, where an arbitrary mask is displayed together with a more detailed view of the labels assigned to each pixel. A number of danger pixels close to the limb of the body are actually wrongly labelled by the pre-processing and mask generation algorithm.

Nevertheless, the predictions obtained with this layer are capable of correcting these labelling mistakes and actually classify the mislabelled pixels correctly as deep space. On top of that, both the deep space and shadowed labels correspond to hazards in the definitions used for the algorithms developed in this work, so the final set of hazards can be composed from the combination of these two sets. According to Luo et al. (2020), traditional methods for shadow estimation achieve performances of around \(80\%\), while comparable CNN-based approaches can reach up to \(91.79\%\) (in particular, the network proposed there). In the results presented here, deep space and danger predictions (i.e. all the non-safe conditions) together represent \(97.07\%\) of the classified pixels, which is a very high success rate for shadow detection.

Fig. 12

Feature detection example. Inputs and masks are overlaid. Labels map danger areas in red, safe areas in green, and deep space in blue

Fig. 13

Confusion matrix for the feature detection layer. Performances are normalised with respect to the total number of true labels for each category, indicated by the colorbar on the right-hand side of the figure

4.2 Feature detection

Following the results obtained for the shadow detection layer, and given the similarity of the study case, the same network architecture was chosen for the feature detection layer. In this case, safe areas correspond to pixels free of any boulder, crater, or other feature that could pose a problem for the landing; hazards correspond to the opposite, and deep space and undetected keep the same meaning as for the shadow detection layer.

Figure 12 shows the results obtained after training the network. The left-hand picture shows the results on a couple of target images. There, it can be observed how shadowed areas do not trigger any feature detection, since features are not visible there and the ground truth was not set up to train the network to detect features in those areas. Detection in the illuminated sections of the asteroid is very accurate and even yields more natural shapes for the detected features than the original mask, which uses enclosing rectangles instead.

The right-hand image shows the top losses for the network, i.e. the input images with the worst performance with respect to the target mask. All of these cases are images with very poor illumination conditions, where not even the limb of the body can be accurately inferred. These results are in line with expectations and would align well with the shadow detection layer, which would discard those areas anyway.

In Fig. 13, statistical results for the training data set are shown. There it can be observed how, similarly to the shadow detection layer, deep space and safe labels are very accurately estimated. The TP rate for the danger label looks very low, with an overall accuracy of \(37.21\%\) for the data set. When inspecting the results visually (as shown in Fig. 12, for instance), however, it is clear that the accuracies shown in the confusion matrix do not reflect the actual quality of the prediction. Two main reasons have been identified, both related to the way the data set images are pre-processed.

First, \(36.95\%\) of all the true danger labels are predicted as deep space, which is almost as high as the TP rate. As for shadow detection, this is not really a practical concern, since both labels are considered equally hazardous, and the problem stems again from the inaccuracies found in the ground truth mask generation algorithm.

Second, \(25.84\%\) of the danger labels in the ground truth masks are, according to the confusion matrix, wrongly predicted as safe pixels. This is due to the fact that, for simplicity, the features detected in the pre-processing of the masks were framed in rectangles. These rectangles, however, also cover areas where no features can be observed in the original image, mostly next to their corners. This can be seen in Fig. 14b, where rectangles enclose featured areas but do not exactly fit the features’ morphology. Nevertheless, the network learns to adopt more natural shapes for the features and to distinguish the limbs of those features very accurately.

Fig. 14

Ground truth mask used in training for feature detection. Right-hand side shows the inaccuracies in the pre-processing algorithm: features are labelled around the limbs of the body, and rectangles are used to flag features on the surface

4.3 Slope estimation

Figure 15 shows the same structure as discussed for the previous two layers, applied to the slope estimation network. Again, the left-hand image shows how fully shadowed areas are problematic for the estimation of the surface slopes (as seen in the upper row example), whereas the illuminated parts yield better performances. The behaviour observed before, where the network learns to naturalise the shapes given in the target, is present in these results too. The predictions capture the main hazard areas while paying less attention to very small, isolated spots, resulting in a more natural prediction of the overall slope of the body.

Fig. 15

Slope estimation example. Inputs and masks are overlaid. Labels map danger areas in red, safe areas in green, and deep space in blue

Fig. 16

Confusion matrix for the slope estimation layer. Performances are normalised with respect to the total number of true labels for each category, indicated by the colorbar on the right-hand side of the figure

In the top-losses plot, a particular case appears as the worst prediction. Unlike the rest of the worst cases, where illumination conditions play a fundamental role in the accuracy of the estimation, this input image shows a well-lit celestial body. To the human eye, the prediction does not appear to be that far off from reality: the main green corridor in the middle of the asteroid is roughly correct, some of the smaller spots are covered too, and even the limb detection appears to be accurate. If anything, the network seems to have given a conservative prediction, probably influenced by the features present on the right-hand side of the asteroid, which could have raised the loss value.

Unlike the masks generated for shadow and feature detection, the thresholded slope masks are rendered directly with astroRender, so no pixel subtraction takes place here. This results in ground truth masks without any mislabelling due to pre-processing. Thus, the confusion matrix provides a more faithful statistical interpretation of the quality of the slope predictions. Almost no truly dangerous pixels are labelled as deep space (around \(2.19\%\)), and those that are mislabelled mostly lie near the limbs of the body. For the rest of the pixels, which belong to the observable surface of the body, a \(75.09\%\) success rate is achieved for the slope estimation. As seen in Fig. 15a, most of this error comes from the network learning to estimate smoother sloped areas and paying less attention to isolated high-slope surface patches.

4.4 Test set

The second round of results is generated using the test data set, which is completely external to the training of the networks, to assess how robust the predictions are when facing conditions or environments that are relatively new. Figure 17 shows four characteristic examples on which astroHda was tested. Different bodies are rendered and the corresponding images are then given to the CNNs, which obtain predictions for each layer. Separately, the target masks are also generated with the same pre-processing functionalities developed for the data set pipeline and used to create the training data set.

Fig. 17

Test data set results on four different rendered images. On each of the panels, the left-hand side corresponds to the rendered image given as input to the networks, the first column is the target, and the second column is the prediction. The first row corresponds to feature prediction, the second to shadow detection, and the third to slope estimation

The results suggest that the feature and shadow detection capabilities of the networks are robust and adapt well to scenarios they are not familiar with. Remarkably, it can be observed how the algorithm outperforms the labelling technique used for the pre-processing of the data set. For instance, in the lower-left panel of Fig. 17, visual inspection reveals that some features overlooked by the pre-processing functions are nevertheless detected by that layer. Shadow prediction is accurate and identifies the illuminated areas well, even under very constrained Sun-face angles.

On the other hand, functionalities related to the intrinsic geometry of the body are not as well predicted for the test data set images. Slope and body limb estimation struggle to yield satisfactory predictions compared to the training data set. Limb estimation, already one of the main problems in the top-losses plots for all layers, continues to be a challenge under poor illumination conditions and becomes particularly difficult when the geometry of the body is not included in the training of the network. The same holds true for the slope estimation, which is far from the target for most images.

Similarly to what was done in previous sections, Fig. 18 shows statistics obtained from the test data set predictions and Fig. 19 the top-losses plots for that set. As expected, performances are in general worse than those for the training and validation data set, mainly because the shape models and feature population distributions and geometries used to compose the test data set were not employed during training.

Nevertheless, the networks are still capable of predicting shadows and features with accuracies of \(86.5\%\) and \(92.5\%\), respectively. A stronger impact can be observed in the performances for the slope estimation layer, which goes from an overall accuracy of \(89.7\%\) to \(78.0\%\). As has been discussed previously, this layer is the most dependent on the knowledge of the geometry of the bodies at the training phase, thus suffering the most when completely new geometries are used as input for the prediction.

Looking at the top-losses plots also helps understand the conditions that make it more challenging for the networks to perform accurate predictions. Poor illumination conditions are again present in most of the worst cases for the networks. However, it can be seen how the limb estimation is much less accurate now that the networks have no a priori knowledge about the geometry of the body.

These results help confirm that including diverse geometries in the training phase of the algorithm greatly affects the performance of subsequent hazard detection predictions, especially for slope estimation.

Fig. 18

Statistics obtained from the test data set for shadow detection, feature detection, and slope estimation

Fig. 19

Top losses for shadow detection, feature detection, and slope estimation

4.5 Layer composition

The results obtained from each layer have now been shown and discussed; however, they are not the final result that is expected from an HDA system. Instead, the different layers involved in hazard detection must be integrated to generate what is typically called a safety map.

Fig. 20

Example of a rendered image, the evaluation of the different hazard detection layers, and the final safety map composition. Left is the input rendered image, centre shows (from left to right) feature detection, shadow detection, and slope estimation, and right displays the final safety map

For the creation of the safety maps in this work, a priority-based logic is chosen, where pixels flagged as deep space prevail over those flagged as hazard, which in turn prevail over safe pixels. This means that, if a pixel is labelled differently in the three layers described above, e.g. safe in the first layer, hazard in the second, and deep space in the third, it will be considered deep space. Figure 20 shows the result of the composition of the three layers included in this work for an arbitrary image of the data set, and Table 4 shows some examples of how the logic works.
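A minimal sketch of this priority-based composition, assuming each layer has already produced a per-pixel label map with an arbitrary, illustrative encoding SAFE = 0, HAZARD = 1, DEEP_SPACE = 2:

```python
import numpy as np

SAFE, HAZARD, DEEP_SPACE = 0, 1, 2  # illustrative label encoding

def compose_safety_map(feature_map, shadow_map, slope_map):
    """Pixel-wise priority: deep space > hazard > safe."""
    layers = np.stack([feature_map, shadow_map, slope_map])
    safety = np.full(feature_map.shape, SAFE)
    safety[np.any(layers == HAZARD, axis=0)] = HAZARD
    safety[np.any(layers == DEEP_SPACE, axis=0)] = DEEP_SPACE  # highest priority
    return safety
```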

Table 4 Example of the logic followed for layer composition
Fig. 21

Example of the probability maps used to compose a safety map for an image of comet 67P/Churyumov-Gerasimenko

Each of the layers used in astroHda includes a softmax activation function that determines the label for each pixel, which is then used in the generation of the safety maps. However, it is important to remark that this labelling comes from a probability assigned by the softmax function of a layer concerned only with its own purpose (i.e. feature detection, shadow detection, or slope estimation) and unaware of the other layers. To further explain this concept, Fig. 21 includes the probability maps for each of the layers in addition to the target and prediction masks (already shown in Fig. 17). If a pixel were simultaneously identified as shadowed, featured, and sloped, one would need to trace these labels back to their respective levels of certainty in the corresponding layers; based on that certainty (or probability), an overall best fit could be obtained and labelled. For the particular case of a safety map, where what matters is whether a certain pixel represents a hazard or not, this is not needed, since it is ultimately not relevant which type of hazard the spacecraft is looking at, only the fact that it is indeed a hazard.

4.6 HIL

One of the main weaknesses of CNN-based methods for space navigation is the limited amount of experience and testing they have accumulated. Exposing the technology to real-world scenarios, where the developed algorithms need to cope with inputs that have not been computer-generated, helps point out the weaknesses of these technologies and assess whether they are ready to use. Performing HIL tests increases the Technology Readiness Level (TRL) of the product, which is a measure of how close to a final state a certain technology is.

With that in mind, and to further test the trained networks, real images were taken at the Space Hall facility of the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI) in Bremen, imitating the lighting conditions that a lander probe or spacecraft could face in the last approach phase of a descent trajectory. Different feature configurations and illumination conditions were set up to test the algorithm in plausible space scenarios.

The hardware used for the testing consisted of a Raspberry Pi 4 Model B and a Raspberry Pi Camera Module v2. The networks developed were loaded onto the processing unit of the Raspberry Pi, and the image acquisition and safety map generation were executed “on-board”. Although the whole test runs on the Raspberry Pi, without any connection to a computer, the results are not considered processor-in-the-loop (PIL) because the algorithms developed lack any kind of processor-oriented optimisation. The computational time measured from the image acquisition instant until the safety map generation was in the order of seconds, which could be greatly improved by using dedicated hardware (as was done in Pester and Schrittesser (2019)), optimising the input formatting for the networks, or even making more structural changes to the algorithms, such as CNN merging. A support structure for the Raspberry Pi 4 and the camera module was designed and 3D-printed, and a rack was used to suspend the camera above the features. Figure 22 shows an overview of the set-up and the mount for the components used for image acquisition and processing.

Fig. 22

Set-up used at DFKI Bremen for the image acquisition. On the left, a close-up look at the mount used to hold the camera and the processing unit together. On the right, an overview of the Space Hall before completely darkening the room

Fig. 23

Images taken at DFKI, with different feature configurations and illumination conditions

Using the described set-up, several images with different feature configurations and lighting conditions were taken. Some of these images are shown in Fig. 23. Notice how, unlike the images used in the data set, the obtained images are filled by what is meant to be the surface of the small body, i.e. no deep-space areas are present. Using these images as input showed that the inclusion of zooming and noise as augmentation techniques was paramount to obtaining accurate predictions. Figure 24 shows the probability maps obtained for image (a) in Fig. 23 before and after these two particular augmentations were applied. The main effect in the probability masks predicted by the networks is the foggy structure of the initial maps (before noise inclusion), especially in the slope estimation layer. This is due to the noise introduced by the optical sensor and is greatly improved by the inclusion of Gaussian noise in the images of the generated data set. The features and shadows are also improved by these augmentations, and the edges are now better defined.

Fig. 24

Probability maps for the images obtained at DFKI facilities at Bremen before (upper row) and after (lower row) including noise and zoom augmentations for the training data set

By substituting the softmax activation functions of the three layers with a high-pass filter that triggers when the probability predicted by the network is higher than \(40\%\) (a threshold set empirically), some of the remaining noise is removed. The obtained predictions are shown in Fig. 25, where it can be seen that the trained networks yield accurate results for the detection of hazards even for images taken with real hardware. Nevertheless, there is still room for improvement, and it can be observed how poor illumination conditions make it more difficult for the networks to delineate the limbs of the features, as was anticipated during training and testing. Also, the use of entirely different geometric conditions with respect to the data set poses a challenge for the network (even after including zooming in the augmentations), which struggles to fully identify the features and leaves some of their interior areas as safe spots. For these tests, the focus was put on the feature and shadow detection layers, since the floor underneath the features was flat; more testing would be required to check how well the networks can predict slopes in very close approach conditions.
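A minimal sketch of this thresholding step, which replaces the argmax over the softmax output with a fixed probability cut-off; the 40% value follows the text, while the channel ordering is an illustrative assumption:

```python
import torch

def hazard_mask(logits: torch.Tensor, hazard_channel: int = 1,
                threshold: float = 0.4) -> torch.Tensor:
    """Flag a pixel as hazardous when its predicted hazard probability
    exceeds the empirical 40% threshold, instead of taking the argmax.

    logits: network output of shape (batch, n_labels, H, W).
    """
    probs = logits.softmax(dim=1)
    return probs[:, hazard_channel] > threshold  # boolean (batch, H, W) mask
```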

Fig. 25

Predictions obtained for the different images taken at DFKI. Hazards are highlighted in yellow

5 Conclusions

This paper shows how the detection block of a passive HDA system is developed. CNNs are utilised to perform semantic segmentation on images, which are rendered using a dedicated pipeline for space exploration missions around small bodies. This pipeline is capable of working with different shape models, modifying their illumination conditions (based on the Sun position, if needed), and even stochastically distributing populations of features over the surface of the selected body mesh.

With the data set generated, three networks are trained using a transfer learning approach, taking ResNet-34 and ImageNet as the backbone and initial database, respectively. After a few epochs of training, and including some augmentation techniques for the generated database, the networks are capable of producing satisfactory predictions for feature detection, shadow detection, and slope estimation, with limb detection as an additional by-product.

When the networks are tested on a different data set, external to the training phase, the feature and shadow detection capabilities still yield accurate results, but the performance for slope estimation and limb detection is considerably lower, suggesting that the inclusion of the target bodies (or at least their rough geometries) in the training could be key to ensuring the quality of the predictions. The predicted masks of the three trained layers are merged into so-called safety maps, which are a binary representation of the areas with and without hazards on the surface of the body and are meant to be fed to the guidance subsystem of the spacecraft to modify its trajectory if needed. Whether layer-specific pixel predictions or probability maps are the better alternative for composing the safety maps remains an interesting research question and leaves the door open for future work on the topic.

Additionally, HIL tests were performed, in which the networks were tested against images taken by real sensors. The inclusion of Gaussian noise and zooming in the augmentations for the data set proved to be key for the evaluation of these images, and the outcome of these tests, albeit preliminary, is satisfactory.