1 Introduction

Semantic segmentation plays a significant role in unstructured and planetary scene understanding, offering invaluable knowledge to a robotic system or a planetary rover about its surroundings [1]. Through terrain semantic segmentation, robotic systems are able to analyze images or videos and accurately detect and classify multiple features or regions within their environments, allowing superior comprehension and spatial awareness. More specifically, robotic systems are capable of identifying and differentiating various elements including boulders, craters, or even potential obstacles and hazards. This allows semantic information to be used in path planning, enabling the robotic system to navigate challenging landscapes with increased safety. Moreover, accurate semantic segmentation can help recognize potential mineral deposits or geological formations, contributing to scientific research in planetary exploration.

Several studies investigate semantic segmentation in unstructured and planetary scenes using traditional algorithms without learning-based processes, including [2,3,4,5], and machine learning algorithms such as [6,7,8,9]. However, over the last five years, terrain semantic segmentation based on deep neural networks has dominated the literature [10].

Regarding terrestrial unstructured scenes, the authors of [11] and [14] propose semantic segmentation methodologies based on a modified DeepLabV3+ [12, 13] and a U-net with an EfficientNet [15] backbone, respectively, aiming to improve the scene understanding of self-driving vehicles in unstructured environments. Both models were trained and evaluated on the IDD (Indian Driving Dataset) due to its high diversity, achieving satisfactory results in terms of the mean IoU (mean Intersection over Union) metric. In [16], the authors propose a lightweight neural network for terrain semantic segmentation focused on unstructured environments, which is capable of merging multi-scale visual features in order to efficiently group and classify different types of terrain, while a reinforcement learning algorithm utilizes the predicted segmentation maps to plan and guide a robot along safe paths. Similarly, in [17], a real-time terrain mapping method for autonomous excavators is presented, which provides semantic and geometric information about the terrain using RGB images and 3D point cloud data, while a dataset of images from construction sites is designed and utilized. Regarding datasets for terrestrial unstructured environments, in [18, 19], two publicly available datasets were developed for semantic segmentation deep learning models, focusing on self-driving in semi-unstructured or densely vegetated environments. The dataset in [18] was designed for accurate comprehension of scenes with high coverage of grass, asphalt, soil and sand, while the authors of [19] targeted densely vegetated and rough terrain scenes for off-road self-driving scenarios.

Concerning planetary environments, several methodologies have been proposed for feature detection and terrain or scene segmentation, aiming to reinforce and improve planet exploration tasks including landing, rover-based path planning, localization and planet surface investigation. In [20], a modified U-net architecture [21] for rock segmentation on the Martian surface is proposed, which was trained and tested with a Mars-like dataset [22] captured on Devon Island, achieving satisfactory accuracy. In [23], the authors conduct a performance evaluation of rock detection for Mars-like environments using the original and modified versions of the SSD (Single-Shot Detector) [24] neural network, trained with the aforementioned dataset [22]. In [25], a modified Unet++ architecture [26] for rock segmentation in planetary-like environments is proposed, where two rounds of training are performed for the learning process. In the pre-training stage, the proposed architecture is fed with a synthetic dataset created by a proposed algorithm, while in the fine-tuning stage, the architecture is trained using a limited part of the Katwijk beach planetary rover dataset [27]. In [28, 29], the authors conduct a benchmark analysis of Hazard Detection (HD) for planetary landing using several state-of-the-art semantic segmentation models compared with a replicated HD algorithm from NASA’s Autonomous Landing Hazard Avoidance Technology (ALHAT) project. The results showed that the segmentation architectures provide high efficiency in hazard detection, outperforming the ALHAT algorithm in both execution time and accuracy.

Several studies investigate sky and ground segmentation in planetary environments, aiming to refine scene understanding [30, 31]. In [30], an architecture for sky and ground segmentation in planetary scenes is proposed, inspired by U-net and NiN (Network In Network) [32], which was trained in two rounds with the SkyFinder [33] and Katwijk beach planetary rover [27] datasets respectively. On the other hand, in [31], a DeepLabV3+ neural network is utilized for skyline contour identification in a Martian environment, aiming to estimate the rover’s global position.

A significant limitation of deep learning methods in planetary environments is the lack of high-quality available datasets, real or synthetic, compared with datasets for urban or indoor environments [34]. In [34], the authors propose a simulator which is able to construct valuable synthetic scenes of planetary environments including rich metadata, and which is furthermore capable of generating multi-level semantic labels based on pre-defined materials. On the other hand, in [35], the authors propose a large-scale dataset called AI4MARS for terrain semantic segmentation on Mars, aiming to reinforce autonomous navigation on the Martian surface. AI4MARS includes about 35K annotated images captured by the Curiosity, Opportunity and Spirit rovers, while the labeling was conducted by experts with the aid of crowdsourcing using a web-based annotation tool.

A crucial use of terrain classification in planetary environments is path planning optimization [36]. In [37], a terrain segmentation model based on PSPNet [38] is proposed, trained on real rover-based images from Mars and artificial images generated with the Unity3D software, aiming to automate a path planning algorithm on the Martian surface. In [39], the authors propose a methodology for path rerouting using imagery data, depth maps and a CNN-based neural network trained with the Katwijk beach planetary rover dataset, in order to detect and avoid obstacles such as rocks and boulders.

Although several studies investigate rover-based scene recognition on the Martian surface or in planetary environments in general, only a few investigate similar tasks for the lunar surface. Lunar topography includes several features such as rocks, boulders and craters, while the terrain in many areas is quite uneven, with mounds and valleys. Although several studies propose methodologies for crater [40,41,42] or hazard [43] detection and segmentation, they focus on safe landing using remote-sensing images, while there is a deficiency in rock and boulder identification during rover navigation; a quite important issue for smooth and trouble-free navigation.

In this study, a lightweight encoder-decoder neural network (NN) architecture is proposed for rover-based ground segmentation on the lunar surface. The proposed architecture is based on U-net and MobileNetV2 [44], while the training and evaluation processes were conducted using a synthetic dataset with lunar landscape images. The proposed model provides robust results, allowing lunar scene understanding focused on rocks and boulders. The main contributions of the study can be described as follows:

  • Development of a lightweight semantic segmentation model aiming to reinforce the autonomous rover navigation on the lunar surface

  • Investigation of lunar scene understanding through deep learning, using a synthetic dataset for training and a combination of real and synthetic datasets for evaluation

  • Comparison of the model with U-net-based alternatives in different computing setups, demonstrating the superiority of the proposed architecture in terms of accuracy and inference time

  • Demonstration that lunar scene understanding based on semantic segmentation through deep learning shows great potential for autonomous lunar navigation, ensuring safer and smoother navigation on the Moon

2 Materials and methods

Semantic information in unstructured environments provides a contextual understanding of objects and their relationships within an image, enabling machines to recognize and categorize features semantically and reinforcing crucial tasks such as autonomous navigation in unknown planetary scenes. Although the literature includes several studies focused on terrain segmentation in unstructured scenes, there are two main gaps that this study attempts to fill:

  • Semantic segmentation of the lunar surface using rover-based images, in contrast to most studies, which investigate scene understanding through semantic segmentation in terrestrial unstructured environments or on the Martian surface

  • A lightweight semantic segmentation model, capable of being used in systems with low computing resources, providing high efficiency after training with a dataset of limited size

In other words, the scope of this study is to develop a lightweight and robust semantic segmentation model intended for use in lunar surface exploration. Two challenges have to be addressed: the first is the lack of valuable rover-based datasets for the lunar surface, compared with Mars, for which several datasets have been proposed. The second challenge is the size of the model, since most semantic segmentation architectures are computationally expensive.

To address these challenges, a U-net-based architecture is proposed, since U-net is an efficient and accurate neural network that does not require large datasets [45, 46].

More specifically, the proposed architecture is an encoder-decoder architecture in which a modified version of the MobileNetV2 neural network is used as the encoder and a lighter U-net decoder is utilized for the segmentation stage. To speed up the learning process, MobileNetV2 has been pre-trained on ImageNet, a well-known image dataset which includes millions of general-purpose photographs. Thus, during the training process, the pre-trained network “transfers” its acquired “experience” to the model, mitigating the issue of the limited size of the lunar surface dataset.

2.1 Modified U-net architecture

As mentioned above, the proposed architecture is based on U-net, a well-known architecture for semantic segmentation which was initially proposed for medical applications.

The U-shaped model of U-net can be separated into two main components: (a) the encoder, which reduces the image dimensions while increasing the feature maps and learning to classify the desired features, and (b) the decoder, which reconstructs the image dimensions while decreasing the feature maps and performing precise segmentation of the detected features. The U-net decoder is able to segment the detected features by retrieving the topology of the image content through four skip connections between different levels of the encoder. These connections transfer information to the decoder in order to maintain the spatial details of the images and to reconstruct them (Fig. 1).

Fig. 1 U-net architecture

U-net is mainly composed of convolutional (Conv2D) and “BatchNormalization” layers. Regarding the encoder-decoder functionality, the encoder downsamples the image through the “MaxPooling2D” layer and the decoder upsamples the image using the “UpSampling2D” layer, while the “Concatenate” layer creates the skip connections between the encoder and decoder parts. Finally, the “softmax” activation function (Eq. 1) is utilized in order to produce the segmentation map for each input image.

$$\sigma(\vec{z})_{i} = \frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}}$$
(1)

where \(\vec{z}\) is the input vector and \(z_i\) denotes the i-th element of the input vector. The term \(\sum_{j=1}^{K} e^{z_{j}}\) is a normalization over the K classes which ensures that the outputs of the function sum to one and that each output value lies in the range (0, 1). In this study, the K classes are rocks / boulders, sky and ground (background) (see Sect. 2.2).
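For illustration, a minimal NumPy sketch of Eq. 1 for the three classes considered here; the logit values are arbitrary examples, not outputs of the actual model.

```python
import numpy as np

def softmax(z):
    """Softmax activation (Eq. 1): maps K logits to probabilities in (0, 1) that sum to 1."""
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / e.sum()

# Example logits for one pixel and K = 3 classes: rocks/boulders, sky, ground (background)
logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)
print(probs, probs.sum())              # e.g. [0.78 0.17 0.04] 1.0 -> pixel assigned to "rocks"
```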

Although U-net is an accurate semantic segmentation architecture, it exhibits increased inference time and requires a time-consuming training process with considerable experimentation in fine-tuning, since it includes about 31,000,000 trainable parameters. In order to accelerate the training process, the “transfer learning” technique is utilized, using a MobileNetV2 pre-trained on ImageNet as the encoder (Fig. 2).

Fig. 2 Architecture of U-net with MobileNetV2 as encoder

MobileNetV2 is a CNN-based architecture designed to provide high efficiency on mobile devices, and it has been utilized in multiple computer vision tasks including classification, semantic segmentation, object detection, etc. The main MobileNetV2 architecture is composed of 19 residual bottleneck layers, where each bottleneck is based on an inverted residual block. The inverted residual block follows a narrow-wide-narrow approach, using a point-wise convolution with ReLU6, followed by a depth-wise convolution with ReLU6, followed by a linear point-wise convolution. Moreover, a skip connection merges the input of the block with the output through the “Add” layer (Fig. 3). ReLU6, a modification of the well-known activation function ReLU (Rectified Linear Unit), performs the non-linear transformation that allows the model to learn more complex tasks, while outperforming the traditional ReLU in accuracy and execution time [47].

Fig. 3 Inverted residual block architecture
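A minimal Keras sketch of the narrow-wide-narrow inverted residual block described above (expansion point-wise convolution with ReLU6, depth-wise convolution with ReLU6, linear point-wise projection, and an “Add” skip connection). The expansion factor and channel counts are illustrative assumptions, not the exact values used in MobileNetV2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual_block(x, out_channels, expansion=6, stride=1):
    """Narrow-wide-narrow inverted residual block (after Sandler et al., 2018), sketched for illustration."""
    in_channels = x.shape[-1]

    # 1) Expansion: point-wise (1x1) convolution + BatchNorm + ReLU6 widens the representation
    y = layers.Conv2D(expansion * in_channels, 1, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(max_value=6.0)(y)

    # 2) Depth-wise 3x3 convolution + BatchNorm + ReLU6 filters each channel separately
    y = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(max_value=6.0)(y)

    # 3) Linear point-wise projection back to a narrow representation (no activation)
    y = layers.Conv2D(out_channels, 1, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)

    # Skip connection ("Add" layer), only possible when the input and output shapes match
    if stride == 1 and in_channels == out_channels:
        y = layers.Add()([x, y])
    return y
```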

The inverted residual block approach reduces the number of parameters and the computation compared with conventional convolution layers. According to Sandler et al. (2018) [44], for a kernel size k = 3 (3 × 3 depth-wise convolution), the computational cost is about 9 times smaller compared with traditional convolution, without a significant reduction in accuracy.

More specifically, if the input of a traditional convolution is \(h_i \times w_i \times d_i\), where \(h\) and \(w\) are the feature-map dimensions and \(d\) the depth (number of channels), and the output is \(h_i \times w_i \times d_j\), then the computational cost is \(h_i \times w_i \times d_i \times d_j \times k \times k\), where \(k\) is the kernel size, while the corresponding computational cost of the depth-wise separable convolution at the core of an inverted residual block is \(h_i \times w_i \times d_i (k^2 + d_j)\).
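As a rough check of these formulas, a small Python calculation with illustrative values (the feature-map size and channel counts below are arbitrary assumptions) shows the roughly 9-fold reduction for k = 3:

```python
# Illustrative computational-cost comparison for k = 3 (values chosen arbitrarily)
h, w = 60, 60          # feature-map dimensions h_i x w_i
d_i, d_j = 64, 128     # input and output channels
k = 3                  # kernel size

cost_standard = h * w * d_i * d_j * k * k       # traditional convolution
cost_separable = h * w * d_i * (k**2 + d_j)     # depth-wise + point-wise factorization

print(cost_standard / cost_separable)           # ~8.4, close to the ~9x reduction reported in [44]
```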

The combination of the original pre-trained MobileNetV2 as the encoder with the U-net decoder provides a more lightweight architecture of about 8,000,000 trainable parameters, compared with the roughly 31,000,000 of U-net, and it is able to accelerate the training process. However, this architecture remains unsuitable for applications which require high efficiency in terms of inference time, especially for real-time tasks.

To reduce inference time without sacrificing accuracy, an architecture based on a modified MobileNetV2 encoder and a lightweight U-net decoder is proposed.

The modified MobileNetV2 is composed of an initial full convolution layer followed by 13 residual bottleneck layers, instead of the 19 of the original MobileNetV2, since right after block 13 the number of parameters in the original architecture increases sharply from about 92,000 to 155,000. Moreover, to further reduce the computational cost, the depth-multiplier, a positive factor that scales the channels of the depth-wise convolution, was set to 0.35 instead of 1.0, in order to decrease the output channels of the depth-wise convolution layers. It is worth mentioning that for depth-multiplier values less than 1.0, the depth-multiplier is applied to all layers except the last convolution layer.

Concerning the U-net decoder, all the filters of the convolution layers were divided by a factor of 2 in order to accelerate the segmentation stage, while the four skip connections connect the input image and blocks 1, 3 and 6 of the encoder respectively.
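A minimal Keras sketch of how such an encoder-decoder could be assembled: a MobileNetV2 encoder instantiated with `alpha=0.35` (the depth-multiplier) and truncated around block 13, and a lightweight U-net-style decoder with reduced filters and skip connections from the input and blocks 1, 3 and 6. The layer names and filter counts below are assumptions based on the Keras implementation of MobileNetV2, not the exact configuration of the proposed model, and the resulting parameter count will not necessarily match the reported one.

```python
import tensorflow as tf
from tensorflow.keras import layers

IMG_SIZE, NUM_CLASSES = 480, 3   # rocks, sky, ground (background)

def build_model():
    inputs = layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3))

    # Encoder: ImageNet-pretrained MobileNetV2 with depth-multiplier alpha = 0.35,
    # truncated at block 13 (the Keras layer names below are an assumption)
    base = tf.keras.applications.MobileNetV2(
        input_tensor=inputs, alpha=0.35, include_top=False, weights="imagenet")
    skip_names = ["block_1_expand_relu",   # 1/2 resolution
                  "block_3_expand_relu",   # 1/4 resolution
                  "block_6_expand_relu"]   # 1/8 resolution
    skips = [base.get_layer(n).output for n in skip_names]
    x = base.get_layer("block_13_expand_relu").output   # deepest feature map used (1/16)

    def up_block(x, skip, filters):
        """U-net-style decoder step with reduced filters: upsample, concatenate skip, convolve."""
        x = layers.UpSampling2D()(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        return x

    x = up_block(x, skips[2], 128)   # illustrative filter counts (halved w.r.t. a standard decoder)
    x = up_block(x, skips[1], 64)
    x = up_block(x, skips[0], 32)
    x = up_block(x, inputs, 16)      # final skip connection from the input image

    outputs = layers.Conv2D(NUM_CLASSES, 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_model()
print(model.count_params())
```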

The proposed architecture includes about 220,000 trainable parameters, far fewer than the 31,000,000 and 8,000,000 trainable parameters of the original U-net and the original MobileNetV2/U-net respectively.

The proposed architecture with a detailed representation of the layers is presented in Fig. 4, while a more abstract representation is depicted in Fig. 5.

Fig. 4 Proposed architecture

Fig. 5 Proposed architecture for lunar terrain segmentation

2.2 Dataset

As mentioned above, there is a lack of datasets for lunar surface segmentation and, to the best of the authors’ knowledge, there is no rover-based image dataset with real lunar landscapes. In contrast, several datasets for the Martian surface have been proposed.

Thus, a dataset of artificial rover-based images depicting lunar landscapes was utilized for the training and validation of the proposed architecture. The dataset was created by the Space Robotics Group of Keio University in Japan, using Planetside Software's Terragen and a DEM (Digital Elevation Model) based on NASA's Lunar Orbiter Laser Altimeter [48]. It includes about 9,700 artificial images and the corresponding annotated masks, taking into account the following four classes: large rocks, small rocks, sky and ground (background) (Fig. 6):

Fig. 6 Dataset of the lunar surface for semantic segmentation by the Space Robotics Group of Keio University in Japan. The artificial images are presented in the left column and the corresponding masks in the right column

The dataset has several drawbacks, such as reduced accuracy in feature segmentation and a lack of balance between the large-rock and small-rock classes. To deal with the imbalanced classes, the two rock classes were merged into one. Thus, the new dataset includes the following classes: rocks, sky and ground (background).
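A minimal sketch of such a class-merging step, assuming the masks encode the four original classes as integer labels; the specific label indices used here are hypothetical.

```python
import numpy as np

# Hypothetical label indices of the original four-class masks
GROUND, SKY, SMALL_ROCKS, LARGE_ROCKS = 0, 1, 2, 3

def merge_rock_classes(mask: np.ndarray) -> np.ndarray:
    """Map a four-class mask to three classes: ground (0), sky (1), rocks (2)."""
    merged = mask.copy()
    merged[(mask == SMALL_ROCKS) | (mask == LARGE_ROCKS)] = 2   # merge both rock classes
    return merged
```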

Nevertheless, since this is the only publicly available dataset for the lunar surface focused on semantic segmentation, it was utilized in order to train and validate the proposed architecture, aiming to provide a lightweight model for potential use in systems with low computing resources during rover navigation on the lunar surface.

3 Implementation and results

In this section, the implementation of the proposed modified U-net architecture is described, and afterwards the evaluation and results of the model for lunar ground semantic segmentation are presented.

3.1 Training process of modified U-net

The proposed architecture was implemented using Python and the Keras/TensorFlow deep learning library [49], while several Python libraries including NumPy [50], Matplotlib [51] and Scikit-learn [52] were utilized.

The main goal of the architecture is to detect and localize rocks and boulders, while in order to segment the whole scene, three classes are taken into account: rocks, sky and background. The training data, which constitute 70% of the lunar landscape dataset, feed the modified U-net, while the remaining 30% of the dataset is used for validation and testing. The model was trained for 15 epochs using the early stopping technique, with the batch size set to 16. The categorical cross-entropy loss function and the Adam optimizer with a learning rate of 5 × 10−5 were utilized. Regarding the input size, dimensions of 480 × 480 pixels were used, since it was observed that the larger image size provided more refined results than the widely used size of 256 × 256 pixels.
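A sketch of this training configuration in Keras, using the `model` from the architecture sketch in Sect. 2.1; the dataset-loading code is omitted and the early-stopping settings are assumptions, not the exact values used in the study.

```python
import tensorflow as tf

# Optimizer, loss and metric as described above
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"])

# Early stopping on the validation loss (patience value is an assumption)
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

# train_images / train_masks: 480 x 480 inputs and one-hot masks (70% of the dataset);
# val_images / val_masks: part of the remaining 30% used for validation (loading omitted here)
history = model.fit(
    train_images, train_masks,
    validation_data=(val_images, val_masks),
    epochs=15, batch_size=16,
    callbacks=[early_stop])
```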

The training and validation processes were conducted on a machine with an Intel i7 CPU (3.50 GHz × 8 cores), 16 GB of RAM and an NVIDIA GTX 1080 Ti GPU with CUDA version 11.2 enabled.

3.2 Evaluation and results of modified U-net

The proposed architecture was trained and validated using the dice-coefficient, recall, IoU (Intersection over Union) and precision metrics, which are defined by the following formulas:

$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$
(2)
$$\mathrm{Dice} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FN} + \mathrm{FP}}$$
(3)
$$\mathrm{IoU} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP}}$$
(4)
$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$
(5)

where TP stands for true positive, while FN and FP stand for false negative and false positive respectively. The results after the training process are presented in Table 1, while the learning curves of the loss function and dice coefficient are depicted in Fig. 7.
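For reference, a small sketch computing the per-class metrics of Eqs. 2–5 from pixel-wise TP/FN/FP counts; the toy ground-truth and prediction arrays are arbitrary examples.

```python
import numpy as np

def segmentation_metrics(y_true, y_pred, cls):
    """Recall, Dice, IoU and precision (Eqs. 2-5) for one class from pixel-wise TP/FN/FP counts."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    recall    = tp / (tp + fn)
    dice      = 2 * tp / (2 * tp + fn + fp)
    iou       = tp / (tp + fn + fp)
    precision = tp / (tp + fp)
    return recall, dice, iou, precision

# Toy flattened "masks" with classes 0 = ground, 1 = sky, 2 = rocks
gt   = np.array([0, 0, 2, 2, 1, 1, 2, 0])
pred = np.array([0, 2, 2, 2, 1, 1, 0, 0])
print(segmentation_metrics(gt, pred, cls=2))   # metrics for the "rocks" class
```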

Table 1 Loss function, dice-coefficient, recall, IoU (Intersection over Union) and precision after the training process
Fig. 7 Loss and dice coefficient curves during training and validation

As observed in Table 1, the value of the loss function is below 0.1, the dice-coefficient is around 0.80, the recall is close to 1.0, and the IoU and precision are around 0.70 and 0.75 respectively, indicating that the model is able to provide satisfactory results. Moreover, in Fig. 7, the learning curves of the training and validation process for the loss function and dice-coefficient are quite close after the sixth epoch and without fluctuations, indicating that the model does not overfit.

After the training process, the proposed architecture was validated on testing data that are completely unknown to the model, including images from the synthetic dataset as well as real lunar landscape images; the corresponding qualitative results are presented in Figs. 8 and 9.

Fig. 8 Left column: original images from the synthetic lunar surface; middle column: the corresponding annotated masks; right column: predictions of the proposed architecture. For each prediction (row), the IoU (Intersection over Union) metric is presented

Fig. 9 Left column: real images from the lunar surface; right column: predictions of the proposed architecture. For each prediction (row), the IoU (Intersection over Union) metric is presented

As observed in Fig. 8, the proposed architecture provides satisfactory results on testing data with synthetic images, achieving an IoU (Intersection over Union) of about 0.85 or above. It is able to differentiate the sky from the ground region, defining the horizon line with high accuracy, while it precisely predicts the location of small rocks and boulders on the lunar surface. It is not affected by the number of rocks in the scene, since it provides robust results in scenes with no rocks or a single rock (Fig. 8d, e) as well as with multiple small rocks and boulders (Fig. 8c).

Moreover, the proposed architecture achieves respectable results on real rover-based images (Fig. 9a-d), which are quite different in terms of color and illumination compared with the training data. The model is not affected by the camera tilt, being capable of identifying rocks whether the camera targets the horizon (Fig. 9a, b) or the ground (Fig. 9c, d).

Regarding the size of the model, it includes only 220,000 trainable parameters, while the weights file size of the model is about 3.5 MB, which is considered quite small for semantic segmentation models. The model was tested in terms of inference time on a set of images with a size of 480 × 480 pixels using three different computing setups: (a) a GPU-enabled conventional desktop machine, (b) a CPU-only conventional desktop machine and (c) a CPU-only embedded system with quite low resources. The results are presented in Table 2.

Table 2 Inference time (in milliseconds and FPS) of the proposed model in a desktop GPU-enabled and CPU-only conventional desktop computer and in a CPU-only embedded system with low resources

As observed in Table 2, the model provides quite satisfactory inference time on the GPU-enabled machine, achieving 40 ms per image and 25 FPS (frames per second). The model performs sufficiently well without a GPU (CPU-only) on the same machine, providing an inference time of around 100 ms per image and 10 FPS. The model was also tested on a Raspberry Pi 4 with 4 GB of RAM, a CPU-only embedded system with quite low resources, providing an inference time of 1080 ms and 0.92 FPS. Overall, the results are considered respectable, taking into account that image segmentation tasks typically require high-end GPU-enabled machines, and they show that the model can be used on GPU-enabled or CPU-only conventional machines and on embedded systems with low computing resources.
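For reference, a minimal sketch of how such a per-image inference-time benchmark can be measured with the Keras model from the earlier sketch (batch size 1, with a warm-up prediction to exclude initialization overhead); the number of runs and the dummy input are assumptions for illustration only.

```python
import time
import numpy as np

def benchmark(model, n_runs=100, size=480):
    """Average single-image inference time (ms) and the corresponding FPS."""
    image = np.random.rand(1, size, size, 3).astype("float32")   # dummy 480 x 480 input
    model.predict(image, verbose=0)                              # warm-up run
    start = time.perf_counter()
    for _ in range(n_runs):
        model.predict(image, verbose=0)
    ms = (time.perf_counter() - start) / n_runs * 1000.0
    return ms, 1000.0 / ms

ms, fps = benchmark(model)
print(f"{ms:.1f} ms per image, {fps:.2f} FPS")
```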

4 Discussion

In this study, a deep learning architecture for semantic segmentation is proposed, which is able to semantically understand a lunar scene, focused on detecting and classifying rocks and boulders. The main goal of this study is the implementation of a lightweight deep learning model with potential real-time use, in order to increase the safety of rover navigation during a mission on the Moon.

Thus, an encoder-decoder architecture was developed, composed of a modified MobileNetV2 neural network as the encoder and a lightweight U-net decoder. The modified MobileNetV2 includes a full convolution layer followed by 13 residual bottleneck layers and a depth-multiplier factor of 0.35, whereas the original MobileNetV2 includes 19 residual bottleneck layers and a default depth-multiplier factor of 1.0. Concerning the segmentation stage, all the filters of the U-net decoder were divided by a factor of 2, while the skip connections transfer information related to the spatial content of each image from several layers of the encoder (the initial input and blocks 1, 3 and 6).

As presented in Sect. 3.2, the proposed architecture provides robust results, achieving an IoU of about 0.80 or above and detecting and classifying rocks and boulders with satisfactory accuracy in both synthetic and real rover-based images of the lunar surface. To further validate the proposed architecture, it was compared with three similar and widely used encoder-decoder architectures based on U-net:

  • The original U-net

  • The U-net with VGG16 as encoder

  • The U-net with the original MobileNetV2 as encoder

The above architectures were trained and tested under the same parametrization so that a fair and proper evaluation could be conducted.

The proposed architecture has about 220,000 trainable parameters, while the corresponding trainable parameters of U-net, VGG16/U-net and MobileNetV2/U-net are about 31,000,000, 24,000,000 and 8,000,000 respectively. The weights file sizes are about 370 MB for U-net, 285 MB for VGG16/U-net and 97 MB for MobileNetV2/U-net, while the corresponding weights file size of the proposed architecture is about 3.5 MB (Table 3).

Table 3 Parameters and model size of the U-net, VGG16/U-net, MobV2/U-net and the proposed architecture

In Fig. 10, qualitative results from the alternative and the proposed architectures are depicted, while in Table 4 the corresponding IoU scores are presented. It is worth noting that the original MobileNetV2/U-net could not converge with this specific parametrization; thus, in the results below, the proposed architecture is compared with the original U-net and VGG16/U-net.

Fig. 10 First column: original synthetic (a, b, c) and real (d, e, f) lunar images; second column: original U-net model predictions; third column: VGG16/U-net model predictions; fourth column: proposed architecture predictions

Table 4 IoU score in testing data of the original U-net, VGG16/U-net and the proposed model, trained with the same dataset and parametrization

As observed in Fig. 10, all the models produce respectable segmentation results. In Fig. 10a, b, c and d, the proposed model provides similar accuracy in rock segmentation compared with the original U-net and VGG16/U-net, predicting all the important rocks and boulders that could harm a rover during navigation. On the other hand, in Fig. 10e and f, which depict real images of the lunar surface, the proposed architecture provides more refined segmentation results than the alternative models. For instance, in Fig. 10e, the proposed model precisely segments the two main rocks on the ground, whereas the original U-net fails to predict them and VGG16/U-net falsely unifies them into one bigger rock. Similarly, in Fig. 10f, the proposed model and VGG16/U-net produce quite similar results, while the original U-net falsely predicts a large shadow as a rock.

Regarding the evaluation of the models on testing data in terms of intersection over union (IoU), the proposed architecture provides an IoU score of 0.84 (Table 4), outperforming the VGG16/U-net and coming close to the IoU of U-net, which is equal to 0.86. These results demonstrate the superiority of the proposed architecture, since it is about 110 times smaller than the VGG16/U-net and about 140 times smaller than the original U-net, while providing segmentation predictions similar to both alternative architectures.

It is worth noting that, although all the models provide robust results in sky segmentation, defining the horizon line, they are unable to classify the sky as a separate class. This is due to the dataset's lack of color variety in the ground features and the large black shadows present on the ground. Thus, because the images are synthetic, there is no meaningful difference between the sky and the large black areas on the ground. Nevertheless, a refined synthetic rover-based dataset, or even better a dataset with real lunar landscape images, would solve this issue, improving the classification results of all the models.

Regarding the inference time, the models were tested on a large set of images with a size of 480 × 480 pixels in three different computing setups: (a) a GPU-enabled conventional desktop machine, (b) a CPU-only conventional desktop machine and (c) a CPU-only embedded system with quite low resources. The corresponding results are presented in Table 5 and Fig. 11.

Table 5 Comparison in terms of inference time (in milliseconds and FPS) of the original U-net, VGG16/U-net and the proposed model in a desktop GPU-enabled and CPU-only conventional desktop computer and in a CPU-only embedded system with low resources
Fig. 11 Inference time in milliseconds (ms) of the U-net, VGG16/U-net and the proposed model for the GPU-enabled machine, the CPU-only machine, and the Raspberry Pi 4 embedded system

As observed in Table 5 and Fig. 11, the proposed model achieves considerably lower inference time than U-net and VGG16/U-net, while the difference in inference time between the models grows as the computing resources are reduced. On the GPU-enabled machine, the proposed model achieves 43 ms and 23.25 FPS, while the VGG16/U-net provides an inference time of 52 ms (19.23 FPS) and U-net about 100 ms (10 FPS), which is twice the time of the proposed model. On the CPU-only machine, the proposed model provides an inference time of around 100 ms (10 FPS), while the VGG16/U-net and U-net models perform predictions with 640 ms (1.56 FPS) and 850 ms (1.17 FPS) inference time respectively, six and nine times more than the proposed model. Concerning the Raspberry Pi 4 (4 GB of RAM) embedded system, the proposed model achieves an inference time of about 1080 ms (0.92 FPS), which is quite satisfactory since, to the best of our knowledge, this embedded system provides the lowest computing resources on the market, especially for deep learning. In contrast, the VGG16/U-net and U-net models provide 11,120 ms (0.09 FPS) and 19,680 ms (0.05 FPS) inference time, showing that the proposed model is about 11 and 20 times faster on the Raspberry Pi 4 embedded system than the VGG16/U-net and U-net models respectively.

5 Conclusions

In summary, an encoder-decoder architecture for semantic segmentation was developed, aiming to reinforce the safety of rover navigation on the lunar surface. The main goal of this study was the implementation of a semantic segmentation model for the lunar surface, capable of being utilized by embedded systems with low computational resources.

To achieve this goal, a deep learning architecture based on the U-net neural network was developed, since U-net is able to provide respectable results when trained with datasets of limited size [21]. To reduce the computational cost of U-net, a modified MobileNetV2 neural network was used as the encoder, while a lighter version of the U-net decoder was implemented in order to accelerate the segmentation stage. The proposed architecture was fed with a publicly available dataset which includes rover-based synthetic images of the lunar surface. Although this dataset has several limitations and drawbacks, including a lack of color variation and low accuracy in the labeling of features, to the best of the authors’ knowledge it is the only available dataset of the lunar surface intended for training deep learning models.

As a result, the proposed model achieves satisfactory accuracy in scene segmentation on both synthetic images and real rover-based images of the lunar surface, while it includes significantly fewer trainable parameters than the U-net-based alternatives. The proposed architecture was evaluated against the original U-net, the VGG16/U-net and the original MobileNetV2/U-net neural networks, which were trained under the same parametrization. The trainable parameters and weights file sizes of the models showed that the proposed architecture is about 140 times smaller than the original U-net, 110 times smaller than the VGG16/U-net and 36 times smaller than the original MobileNetV2/U-net, while it provides IoU accuracy quite close to that of the original U-net and outperforms the remaining U-net-based alternatives. Moreover, the models were tested in three different computing setups, two conventional machines (GPU-enabled and CPU-only) and an embedded system with low computing resources, showing that the proposed model is considerably faster than U-net and VGG16/U-net in all computing setups and especially on the embedded system.

However, due to the aforementioned drawbacks of the dataset, the proposed model could be further improved, especially in the classification but also in the segmentation task, by adding more classes such as sandy regions, bedrock, craters, etc., using a higher-quality dataset with synthetic or, even better, real rover-based lunar images. Given that a high-quality dataset of the lunar surface is expected to become available in the near future due to the planned missions of NASA’s Artemis program, the proposed architecture has significant potential for lunar scene understanding, ensuring safe and precise navigation, and can contribute to groundbreaking discoveries, expanding the scientific understanding of the Moon.