Toward the recognition of spacecraft feature components: A new benchmark and a new model

Countries are increasingly interested in spacecraft surveillance and recognition, which play an important role in on-orbit maintenance, space docking, and other applications. Traditional detection methods, including radar, have many restrictions, such as excessive costs and energy supply problems. For many on-orbit servicing spacecraft, image recognition is a simple but relatively accurate method for obtaining sufficient position and direction information to offer services. However, to the best of our knowledge, few practical machine-learning models focusing on the recognition of spacecraft feature components have been reported. In addition, it is difficult to find substantial on-orbit images with which to train or evaluate such a model. In this study, we first created a new dataset containing numerous artificial images of on-orbit spacecraft with labeled components. Our base images were derived from 3D Max and STK software. These images include many types of satellites and satellite postures. Considering real-world illumination conditions and imperfect camera observations, we developed a degradation algorithm that enabled us to produce thousands of artificial images of spacecraft. The feature components of the spacecraft in all images were labeled manually. We discovered that direct utilization of the DeepLab V3+ model leads to poor edge recognition. Poorly defined edges provide imprecise position or direction information and degrade the performance of on-orbit services. Thus, the edge information of the target was taken as a supervisory guide and was used to develop the proposed Edge Auxiliary Supervision DeepLab Network (EASDN). The main idea of EASDN is to provide a new edge auxiliary loss by calculating the L2 loss between the predicted edge masks and ground-truth edge masks during training. Our extensive experiments demonstrate that our network performs well both on our benchmark and on real on-orbit spacecraft images from the Internet. Furthermore, the device usage and processing time meet the demands of engineering applications.


Introduction
Spacecraft surveillance and recognition systems have found application in on-orbit maintenance, space docking, and other orbit services. When offering on-orbit services, one of the most significant tasks a satellite has to complete is the recognition of spacecraft feature components. Several countries have developed various spacecraft surveillance and recognition systems since the start of the 21st century. For example, to mitigate the growing risk of space debris, the Air Force Research Laboratory (AFRL) implemented a plan named Autonomous Nanosatellite Guardian for Evaluating Local Space (ANGELS) in 2005. Special satellites were placed in geostationary orbit to conduct continuous surveillance and accurate detection of satellite targets in the near-space area. These satellites extract important attributes of the target and communicate with their own surveillance platform for feedback [1]. In 2014, the U.S. Air Force implemented the Geosynchronous Space Situational Awareness Program (GSSAP), which included four satellites to monitor other satellites in space [2][3][4]. Canada launched a satellite named NEOSSAT in 2013 to detect asteroids in near-Earth space. It was also designed to track high-orbit satellites and space debris [5]. These attempts have two main shortcomings: (1) traditional detection approaches such as radar or infrared observations have cost and energy supply restrictions; (2) they have no on-orbit ability to process the data acquired by the sensors. In other words, they have to communicate with the ground control center, leading to signal delays and substantial human costs.
Compared with traditional detection methods, image recognition is an easy but relatively effective method for obtaining position and direction information. Tang and Zou proposed a non-cooperative spacecraft recognition approach based on the fusion of local features [6]. Zhi et al. and Hu also introduced combined multi-feature metrics for the recognition of spacecraft feature components [7,8]. However, these traditional image-processing studies, which depend on elaborate hand-crafted designs, are geared towards certain types of spacecraft. Moreover, they tested their methods on a few images rather than on a standard dataset. Recently, deep learning technology has achieved great success in the field of artificial intelligence [9]. Convolutional neural networks (CNNs) have yielded outstanding results in the computer vision field and have broken almost all of the previous records [10]. These methods have rendered the recognition of spacecraft feature components feasible both in algorithms and in engineering applications. Zhang et al. used a CNN and a long short-term memory network (LSTM) for space target recognition [11,12]. However, they attempted to recognize a space target based on radar echo signals. To the best of our knowledge, a specialized CNN-based deep learning model has not yet been proposed for the recognition of spacecraft feature components in optical images. Even more importantly, the lack of substantial on-orbit images makes it difficult to train or evaluate such a model. This paper introduces a new labeled dataset with thousands of artificial on-orbit spacecraft images. Base images are derived from 3D Max and STK software, and take both the modeling accuracy and costs into consideration. These images include many types of satellites (such as low earth orbit, medium earth orbit, and geosynchronous earth orbit satellites) and many postures.
A degradation algorithm including geometric transformation, noise, and image blur was developed to simulate on-orbit illumination conditions and imperfect camera observation. Thousands of artificial spacecraft images were generated by applying this degradation algorithm to the base images. The feature components, such as panels of the spacecraft in all degraded images, were annotated manually to evaluate the performance of the recognition algorithms.
Our work mainly focuses on semantic segmentation for the recognition of spacecraft feature components, which involves labeling the images with pixel-level semantic information [13]. Traditional image segmentation methods fall mainly into four classes: (1) threshold segmentation, in which an image is divided into several regions based on thresholds on the pixel values; (2) edge detection segmentation, in which discontinuities in the gray level or color at the edges are used to detect regions and achieve complete image segmentation; (3) regional growth segmentation, the essence of which is to connect pixels with similar features, such as grayscale features, shape features (e.g., scale-invariant feature transform (SIFT) [14], histogram of oriented gradient (HOG) [15]), and other features, although selecting suitable features is challenging; (4) graph theory, which entails mapping an image to a graph and transforming the segmentation of the image into a partition of the graph. This includes the Markov random field method based on graph cut [16] and the random walk method [17].
All of these traditional methods have specific disadvantages; hence, it is not easy to use any one of them to recognize spacecraft feature components across different spacecraft postures, backgrounds, and light conditions. In contrast to these methods, deep learning can automatically learn, extract, and accurately represent data features. CNN-based semantic segmentation establishes a mapping from pixels to semantics without the need for subsequent manual work. FCN [18], SegNet [19], and U-Net [20] are classical models of semantic segmentation using CNNs. E-Net [21] and Link-Net [22] are light networks created specifically for tasks that require low-latency operation; they have fewer parameters and run faster. DeepLab V3+ [23], a state-of-the-art model in the current semantic segmentation domain, achieves high segmentation precision across multiple datasets. However, our initial work showed that the DeepLab V3+ model has poor edge recognition abilities. The feature components of a spacecraft usually have regular shapes. Poorly defined edges provide imprecise position or direction information and lower the performance of on-orbit services.
In this paper, we propose a border-supervised network based on the DeepLab V3+ model to address this problem. We leveraged the edge information of the target as a supervisory guide by proposing the Edge Auxiliary Supervision DeepLab Network (EASDN) algorithm, which calculates the L2 loss between the predicted edge masks and ground-truth edge masks during training. Subsequently, we conducted extensive experiments to demonstrate that our networks perform well both on our benchmark and on real on-orbit spacecraft images from the Internet. Our light version, which is less computationally expensive, can also meet the demands of device usage and time in engineering applications.

Spacecraft feature component dataset
Owing to the particularity of spacecraft, it is difficult to obtain real on-orbit images from the Internet, and those in the public domain are almost all simulation images. Thus, we chose to produce base spacecraft images using three-dimensional (3D) animation rendering and production software. We then augmented these base images with a degradation algorithm to reproduce on-orbit effects. In addition, the spacecraft feature components in all of these images were labeled.

Base images
Zhang et al. established the BUAA-SID 1.0 space target database [24], which is based on 3D models of space targets and utilizes 3D Max software to render and generate a full-view simulation image sequence of a space target. However, the number of images is limited, and the rendering effect does not closely correspond to the real on-orbit environment. Our base on-orbit images of spacecraft stem from 3D Max and STK software, both of which exploit the 3D model of the spacecraft to obtain the simulation image. The difference lies in the image quality and computational cost. It takes a significant amount of time for 3D Max to render a 3D model of the spacecraft using a raytracer, although the quality of the rendered image is high and close to that of real on-orbit satellites. STK, on the other hand, generates simulation images without distributed rendering. Capturing 3D models directly in a space scene from STK is readily achievable and saves time, but the outputs are not realistic. Figure 1(a) is the rendering result produced with 3D Max, and Fig. 1(b) is the result obtained with STK. The 3D model in Fig. 1(b) is composed of regular objects such as balls, cubes, conical surfaces, and so on. They lack textures, reflections, and shading, which are necessary to simulate realistic images.
Fig. 1 Rendering results for satellite "a2100" using 3D Max and STK software.

3D Max
3D models of spacecraft can be built and rendered from essential 3D objects and components using 3D Max software. However, building them from scratch requires surveying the geometric structural parameters of each spacecraft. It is therefore more efficient to take the models from the STK satellite library and import the STK model files (after conversion to the ".lwo" format) into 3D Max software for rendering. In total, 16 types of satellites were rendered by the 3D Max software.

STK
STK software was utilized to simulate the on-orbit operation of a satellite and record high-definition videos with a resolution of 1280 × 960 pixels at 60 frames per second. Color images were extracted from the videos every five frames as STK output images. Seven types of satellites were produced using STK software, with more than 200 images for each satellite.
The 3D Max or STK software produces RGB images, whereas real on-orbit satellite images are usually captured in gray mode. All images are therefore converted to grayscale to reflect the real situation. Because human eyes are most sensitive to green and least sensitive to blue, a reasonable grayscale image can be obtained by calculating the weighted average of the RGB channels according to the following formula [26]:
$$\mathrm{Gray} = 0.299R + 0.587G + 0.114B \tag{1}$$
The resolution of the base images is scaled to 320 × 240 pixels for a uniform setting. The images are scaled by the bicubic interpolation method, which retains the visual effect of the original image as much as possible. This method considers not only the variation in the four neighboring pixels, but also the variation of other surrounding pixels, which has a smoothing effect on the edges of the object [27]. Ultimately, we produced images of 23 satellites, which were used to create 4550 gray images. The spacecraft feature components were labeled on the base images.
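The grayscale conversion can be illustrated with a minimal NumPy sketch. The channel weights used here are the common 0.299/0.587/0.114 luma weights, which are an assumption chosen to match the "green highest, blue lowest" weighting described above; the function name `rgb_to_gray` is hypothetical, and bicubic resizing is left to an image library.

```python
import numpy as np

def rgb_to_gray(rgb: np.ndarray) -> np.ndarray:
    """Weighted average of the R, G, B channels of an H x W x 3 array."""
    # Assumed weights: green contributes most, blue least.
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float64) @ weights

# Example: a 320 x 240 image (width x height) that is pure green.
green = np.zeros((240, 320, 3), dtype=np.uint8)
green[..., 1] = 255
gray_green = rgb_to_gray(green)

# A pure white image maps to full brightness because the weights sum to 1.
white = np.full((240, 320, 3), 255, dtype=np.uint8)
gray_white = rgb_to_gray(white)
```

A pure blue image would come out much darker than the pure green one, which reflects the sensitivity argument made in the text.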

Degradation algorithm
A spacecraft image captured on orbit, found on the Internet, is shown in Fig. 2 [25]. As can be seen from the figure, the optical observation image contains a considerable amount of noise, which is especially obvious in the background of the image. As a result of this noise interference, the background is no longer pure black, and the gray values of these pixels are degraded.
Therefore, to approximate a real on-orbit optical image of the spacecraft, the base images need to be processed by adding simulated illumination (light or sunshine), and the influence of noise, distortion, and geometric transformation must also be considered. The process followed by the degradation algorithm is illustrated in Fig. 3. The synthetic degradation shown in Fig. 3 refers to the process of applying several of the aforementioned degradation operations to a base image to obtain artificial spacecraft images under different conditions. It is worth noting that the noise, blur, and illumination operations are applied only to the base image, so its mask does not need to be changed. For the geometric transformation operation, on the other hand, the base image and its corresponding labeled image must be processed simultaneously. The degradation algorithm also enlarges the dataset of spacecraft feature components, because one base image can yield many artificial on-orbit images (we set this number to 7 for diversity) by randomly applying different operations in different operating sequences.

Illumination
The illumination added here is strongest at the center of the lobe and weaker around it, so the main lobe is the brightest region locally. The relevant expression is given as Eq. (2), where $k$ is the brightness of the center point, $(x_0, y_0)$ are the coordinates of the center point, and $r$ is the radius. When $k$ is too large, the main lobe of illumination tends to be overexposed, thus losing its original texture. If $k$ is too small, the effect of the illumination operation is not conspicuous. In our dataset, the value of $k$ is 128, and the radius is equal to a quarter of the smaller of the width and height of the input image. In our dataset, the background in the base image is almost entirely black, and only the spacecraft has nonzero gray values. Therefore, it is appropriate to set the center point as the centroid of the base image. The image centroid is calculated as Eq. (3):
$$x_0 = \frac{\sum_{x}\sum_{y} x\, f(x, y)}{\sum_{x}\sum_{y} f(x, y)}, \qquad y_0 = \frac{\sum_{x}\sum_{y} y\, f(x, y)}{\sum_{x}\sum_{y} f(x, y)} \tag{3}$$
where $(x_0, y_0)$ is the centroid coordinate, and $f(x, y)$ is the gray value of pixel $(x, y)$.
The effect of illumination is depicted in Fig. 4.
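The centroid-centered illumination can be sketched as follows. The centroid is the intensity-weighted mean position described above; the shape of the lobe is an assumption, since only its parameters ($k$, center, and radius) are described here. This sketch uses a linear falloff from $k$ at the center to zero at distance $r$, and the names `centroid` and `add_illumination` are hypothetical.

```python
import numpy as np

def centroid(img: np.ndarray) -> tuple:
    """Intensity-weighted centroid (x0, y0); img must contain nonzero pixels."""
    ys, xs = np.indices(img.shape)
    total = img.sum()
    return (xs * img).sum() / total, (ys * img).sum() / total

def add_illumination(img: np.ndarray, k: float = 128.0) -> np.ndarray:
    """Add a bright lobe centered on the image centroid (8-bit gray range)."""
    h, w = img.shape
    x0, y0 = centroid(img)
    r = min(h, w) / 4.0                      # radius: quarter of the smaller side
    ys, xs = np.indices(img.shape)
    dist = np.hypot(xs - x0, ys - y0)
    # Assumed lobe shape: linear falloff, brightest at the center, zero beyond r.
    lobe = k * np.clip(1.0 - dist / r, 0.0, None)
    return np.clip(img + lobe, 0.0, 255.0)

# Toy base image: black background with one gray rectangle as the "spacecraft".
base = np.zeros((240, 320), dtype=np.float64)
base[100:140, 150:170] = 100.0
x0, y0 = centroid(base)
lit = add_illumination(base)
```

With a larger `k` the sum saturates at 255 near the center, which corresponds to the overexposure effect described in the text.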

Noise
To ensure that the images in the dataset more closely resemble those of the real spacecraft in orbit, it is necessary to add suitable noise interference to the images with a black background. According to the probability distribution of the noise, the actual image noise can be divided into Gaussian, Rayleigh, gamma, exponential, and uniform noise. Gaussian noise (i.e., normal noise) is often employed in practice because it is mathematically easy to handle in both the spatial and frequency domains. The probability density function of the Gaussian random variable $z$ is given by the following formula [28]:
$$p(z) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(z - \mu)^2}{2\sigma^2}\right) \tag{4}$$
where $z$ is the gray value, $\mu$ represents the average value of $z$, and $\sigma$ is the standard deviation of $z$. The effect of the noise is shown in Fig. 5. In our dataset, we used additive Gaussian noise. The mean of the distribution was 0, and its variance was set to 0.005.
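A minimal sketch of the additive Gaussian noise step is shown below. It assumes that gray values are normalized to [0, 1], which is the scale on which a variance of 0.005 is meaningful; the function name `add_gaussian_noise` and the `seed` parameter are hypothetical.

```python
import numpy as np

def add_gaussian_noise(img, mean=0.0, var=0.005, seed=None):
    """Add zero-mean Gaussian noise to an image with values in [0, 1]."""
    rng = np.random.default_rng(seed)
    # The standard deviation is the square root of the variance.
    noise = rng.normal(loc=mean, scale=np.sqrt(var), size=img.shape)
    # Clip so that the result stays a valid normalized gray image.
    return np.clip(img + noise, 0.0, 1.0)

# A completely black background no longer stays black after the noise step.
background = np.zeros((240, 320), dtype=np.float64)
noisy = add_gaussian_noise(background, seed=0)
```

Because negative values are clipped to zero, roughly half of the background pixels remain black while the rest receive small positive gray values.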

Blur
After the noise operation, the image texture appears excessively grainy, and the image is still far from the actual situation. Therefore, Gaussian blur was added to bring the image closer to the real situation. The blur radius of the Gaussian blur was 0.05.

Geometric transformation
Geometric transformation processing is also required to expand the dataset to include additional feasible images and to counteract overfitting that occurs during training. This processing does not change the pixel value of the image, and it only maps the position coordinates of one image to another. Our geometric transformations include translation, rotation, scaling, and random combinations of these three essential operators, which are illustrated in Fig. 6. Eventually, synthetic degradation, which includes these operations, was carried out, as shown in Fig. 7. As a result, our final Spacecraft Feature Component Dataset (SFCD) consists of 27,240 on-orbit images.
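The requirement that a geometric transformation be applied identically to an image and its label mask can be sketched as below. This is an illustrative NumPy implementation with nearest-neighbor sampling, not the procedure used to build the dataset; the function name `warp_pair` and its parameters are hypothetical.

```python
import numpy as np

def warp_pair(image, mask, angle_deg=0.0, scale=1.0, shift=(0.0, 0.0)):
    """Rotate, scale, and translate an image and its mask with one mapping."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    ys, xs = np.indices((h, w))
    # Inverse mapping: for every output pixel, locate its source pixel.
    xo = xs - cx - shift[0]
    yo = ys - cy - shift[1]
    x_src = (cos_t * xo + sin_t * yo) / scale + cx
    y_src = (-sin_t * xo + cos_t * yo) / scale + cy
    xi = np.rint(x_src).astype(int)
    yi = np.rint(y_src).astype(int)
    valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)

    def sample(src):
        out = np.zeros_like(src)           # pixels mapped from outside stay 0
        out[valid] = src[yi[valid], xi[valid]]
        return out

    # Pixel values are only moved, never altered, and the mask moves with them.
    return sample(image), sample(mask)

image = np.zeros((240, 320), dtype=np.float64)
image[100:140, 150:170] = 100.0
mask = (image > 0).astype(np.uint8)
shifted_img, shifted_mask = warp_pair(image, mask, shift=(5, 3))
rotated_img, rotated_mask = warp_pair(image, mask, angle_deg=30.0, scale=0.8)
```

Because both arrays are sampled with the same index arrays, the transformed mask stays aligned with the transformed image, whereas noise, blur, and illumination would be applied to the image alone.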

Our models
In Section 2, we described the process we followed to construct the new SFCD, which consists of images of on-orbit spacecraft and their corresponding labels. This section is concerned with the problem of building an effective model that recognizes the feature components of an on-orbit spacecraft. The degradation algorithm (especially the operations responsible for adding noise, blur, and illumination) increases the difficulty of semantic segmentation, particularly for object details and boundary regions. To address this problem, we developed the EASDN based on the DeepLab V3+ model, which we modified to take into consideration the degraded image boundary information.

DeepLab V3+
DeepLab V3+ [23], which forms the basis of our intelligent spacecraft feature component recognition network, offers the advantages of portability and high precision. It is the latest version of the DeepLab algorithm series [29][30][31]. To fuse multi-scale information, DeepLab V3+ introduces the encoder-decoder architecture commonly used in semantic segmentation, and balances precision against time cost by using atrous convolution to control the resolution of the encoded features. DeepLab V3+ employs the entire previous DeepLab V3 model [31] as an encoder, followed by a simple but effective decoder. Atrous spatial pyramid pooling (ASPP) is the core of the DeepLab algorithm series. Multiple effective receptive fields can be realized via convolution and pooling operations with different dilation rates to obtain multiresolution features and discover multi-scale context information. DeepLab V3+ also utilizes depthwise separable convolution to reduce the computational overhead. DeepLab V3+ has advanced semantic segmentation capabilities and achieves high segmentation accuracy on multiple benchmarks. DeepLab V3+ extracts deep features with an Xception structure and performs feature fusion. Other common feature extraction networks, such as ResNet [32], DRN [33], and MobileNet [34], can also be applied as the backbone of DeepLab V3+. According to our experiments and the ablation study of Chen et al. [23], the model based on Xception is the most accurate. Although the MobileNet-based model is less accurate than the Xception-based model, it has fewer parameters and is computationally less complex. Hence, MobileNet can be employed as a backbone that meets the requirements of a light model. Based on these factors, Xception and MobileNet were selected as the backbones for feature extraction.

Edge Auxiliary Supervision DeepLab Network
To simulate the on-orbit environment, the degradation algorithm is applied to the base images. Unfortunately, this causes spatial information about the target to be lost in the image, especially in the boundary area of the spacecraft, thus reducing the precision of semantic segmentation. Inspired by previous work [35,36], we developed the EASDN, to which we added an edge supervision guide for the target, enabling the neural network to predict the edge of the target more precisely and improve the segmentation accuracy simultaneously. The structure of the model is illustrated in Fig. 8. As shown in Fig. 8, the predicted image $P$ is obtained by the model. The predicted edge map $P_E$ is the output of the Edge Extract Module with input $P$. The true edge map $G_E$ is produced in the same manner from the ground truth $G$. Then, the true edge map $G_E$ is passed through a smooth module to obtain the smoothed true edge map $G_s$. Finally, the edge auxiliary loss is calculated between $P_E$ and $G_s$.

Edge Extract and Smooth Module
The function of the Edge Extract Module is to extract the edge of the object in an image by using an edge detection filter. The inputs are the predicted and real masks. Edge detection filters can generally be described as 3 × 3 convolutions. The Sobel filter [37] and Laplace filter [28] are well-known edge detection filters. The Laplace filter is isotropic and delivered superior performance in our experiments; therefore, we chose it as our edge detection filter. Its basic form is given as Eq. (5), where $x(u, v)$ is the input image, $y(u, v)$ is the output image, and $c$ is a coefficient.
The Smooth Module is applied to the true edge map $G_E$, which facilitates training and alleviates the overfitting of our model. Here, we leverage the Gaussian filter. Alternatively, the median filter or mean filter would also be acceptable because, according to our experiments, their effects are almost the same in this specific problem.
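The two modules can be sketched with plain 3 × 3 filtering in NumPy. The kernels below are assumptions for illustration: a 4-neighbor Laplace kernel (whose output magnitude is taken as the edge response) and a 3 × 3 binomial approximation of a Gaussian kernel. The exact coefficients used in the network are not given here, and the helper names are hypothetical.

```python
import numpy as np

# Assumed kernels: 4-neighbor Laplacian and a 3 x 3 binomial Gaussian.
LAPLACE = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=np.float64)
GAUSS = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=np.float64) / 16.0

def filter3x3(img, kernel):
    """Apply a symmetric 3 x 3 kernel with edge-replicating padding."""
    h, w = img.shape
    padded = np.pad(img.astype(np.float64), 1, mode="edge")
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def edge_extract(mask):
    """Edge map of a mask: magnitude of the Laplace response."""
    return np.abs(filter3x3(mask, LAPLACE))

def smooth(edge_map):
    """Smooth Module: soften the true edge map before computing the loss."""
    return filter3x3(edge_map, GAUSS)

# Ground-truth mask with one square component.
G = np.zeros((24, 24))
G[8:16, 8:16] = 1.0
G_E = edge_extract(G)   # nonzero only along the component boundary
G_s = smooth(G_E)       # spreads the edge response over neighboring pixels
```

Smoothing widens the thin edge band, so a predicted edge that is off by one pixel still overlaps the target, which is one plausible reading of why the module eases training.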

Loss function
The total loss $L_{Total}$ consists of the original pixel-wise cross entropy loss $L_{CE}$ and the new Edge Auxiliary Loss $L_{Edge}$:
$$L_{Total} = L_{CE} + \alpha L_{Edge} \tag{6}$$
where $\alpha$ is a weight parameter that balances $L_{CE}$ and $L_{Edge}$, and $L_{CE}$ is a common loss function for semantic segmentation [38]. It is defined as
$$L_{CE} = -\sum y \log y' \tag{7}$$
where $y'$ is the predicted value and $y$ is the true label value. In addition, the formula for $L_{Edge}$ is given as Eq. (8):
$$L_{Edge} = \frac{1}{m} \sum_{x} \left\| P_E(x) - G_s(x) \right\|_2^2 \tag{8}$$
where $x$ is a sample and $m$ is the number of samples in a batch. $L_{Edge}$ is the mean square error between $P_E$ and $G_s$.
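How the two terms combine can be sketched in NumPy. This is a forward-pass illustration only, assuming one-hot labels, softmax-style class probabilities, and a total loss of the form cross entropy plus a weighted edge term; the value of the weight `alpha` is a placeholder, and all function names are hypothetical.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-7):
    """Pixel-wise cross entropy, averaged over pixels (last axis = classes)."""
    y_pred = np.clip(y_pred, eps, 1.0)          # avoid log(0)
    return float(-(y_true * np.log(y_pred)).sum(axis=-1).mean())

def edge_loss(pred_edges, smooth_true_edges):
    """Mean square error between predicted and smoothed true edge maps."""
    return float(np.mean((pred_edges - smooth_true_edges) ** 2))

def total_loss(y_true, y_pred, pred_edges, smooth_true_edges, alpha=1.0):
    """Cross entropy plus the weighted edge auxiliary term (alpha is assumed)."""
    return (cross_entropy(y_true, y_pred)
            + alpha * edge_loss(pred_edges, smooth_true_edges))

# Toy batch: 4 x 4 pixels, two classes (background and panel).
labels = np.zeros((4, 4, 2))
labels[..., 0] = 1.0                            # every pixel is background
uniform_pred = np.full((4, 4, 2), 0.5)          # maximally uncertain prediction
loss_value = total_loss(labels, uniform_pred,
                        pred_edges=np.zeros((4, 4)),
                        smooth_true_edges=np.ones((4, 4)),
                        alpha=0.5)
```

In this toy case the cross entropy term equals ln 2 and the edge term equals 1, so the total is ln 2 + 0.5; a perfect prediction with matching edge maps gives zero.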

Training setting
The model was tested on a Linux Ubuntu 16.04 system based on the TensorFlow framework. Three NVIDIA GeForce GTX 1080Ti 11 GB GPUs, Intel Core i7 CPU, and 16 GB of memory were used for the training process.
The training settings are listed in Table 1. The learning rate scheduler is a method for adjusting the learning rate, which has a certain influence on the final training effect of the model. The output stride is a specific setting that measures the size of the feature map output by the encoder. It represents the ratio between the input size and the size of the feature map. When the output stride is 16, the feature map from the encoder is 1/16 of the input size, and the corresponding atrous convolution rates are [6, 12, 18]. When the output stride is 8, the rates are [12, 24, 36], which enables the segmentation precision to be improved. However, in this setting, the calculation cost and training time increase. We also applied models pre-trained on ImageNet for training [39].
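The relationship between the output stride, the encoder feature map size, and the spatial extent covered by an atrous convolution can be checked with a short calculation. The sketch assumes a 320 × 240 input (the base image size of the dataset, which may differ from the training crop size) and uses the standard relation that a kernel of size k with dilation rate r spans k + (k − 1)(r − 1) pixels; the function names are hypothetical.

```python
import math

def encoder_feature_size(input_hw, output_stride):
    """Feature map height and width for a given output stride (rounded up)."""
    h, w = input_hw
    return math.ceil(h / output_stride), math.ceil(w / output_stride)

def atrous_extent(kernel_size, rate):
    """Number of pixels spanned along one axis by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (rate - 1)

# Assumed input: height 240, width 320.
size_os16 = encoder_feature_size((240, 320), 16)
size_os8 = encoder_feature_size((240, 320), 8)

# Extents of 3 x 3 atrous kernels for the two rate triples named in the text.
extents_small = [atrous_extent(3, r) for r in (6, 12, 18)]
extents_large = [atrous_extent(3, r) for r in (12, 24, 36)]
```

Halving the output stride quadruples the number of feature map positions, which is consistent with the higher computation cost and training time noted for the smaller stride.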

Dataset
We utilized the aforementioned Spacecraft Feature Component Dataset for training and testing. The training, validation, and test sets were divided randomly. There were 1603 images in the test set.

Evaluation metrics
To measure the function and contribution of the segmentation system, its performance needs to be evaluated. Standard and widely acknowledged metrics should be employed to ensure fairness. Several aspects of the system need to be tested to assess its effectiveness, including the execution time, memory occupation, and accuracy [40].
Many criteria are available for measuring the accuracy of algorithms in image segmentation. These standards are usually variations of pixel accuracy and intersection over union (IoU). The mean intersection over union (mIoU) is the most common metric because of its simplicity and representativeness, and has been employed by most researchers to report their results [18,19,21,22]. Therefore, mIoU was adopted as the criterion for accuracy in this study.
IoU is the ratio of the intersection and union of two sets. For semantic segmentation, the two sets are the ground truth and the predicted values. This ratio can be viewed as the ratio between the intersection (i.e., true positive) and the sum of false negative, false positive, and true positive. mIoU is computed by calculating the intersection over union for each class and then taking the average [18]:
$$\mathrm{mIoU} = \frac{1}{k + 1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \tag{9}$$
where there are $k + 1$ classes ($k = 1$ in this study), and $p_{ij}$ denotes the number of pixels with label $i$ that are predicted to be in class $j$. Therefore, $p_{ii}$ is the true positive. The denominator of Eq. (9) represents the sum of the false negatives, false positives, and true positives for one class.
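The metric can be computed from a confusion matrix, as in the following NumPy sketch. Entry `p[i, j]` counts pixels with label `i` predicted as class `j`, matching the definition above; classes that appear in neither the labels nor the predictions are skipped to avoid division by zero, which is a design choice of this sketch. The function names are hypothetical.

```python
import numpy as np

def confusion_matrix(label, pred, num_classes):
    """p[i, j] = number of pixels with true class i predicted as class j."""
    idx = label.ravel().astype(np.int64) * num_classes + pred.ravel().astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou(label, pred, num_classes=2):
    """Average of per-class IoU = TP / (TP + FN + FP)."""
    p = confusion_matrix(label, pred, num_classes)
    tp = np.diag(p).astype(np.float64)
    union = p.sum(axis=1) + p.sum(axis=0) - tp   # row sum + column sum - TP
    present = union > 0                          # skip classes that never occur
    return float(np.mean(tp[present] / union[present]))

# Two classes (k = 1): background = 0, spacecraft component = 1.
label = np.array([[0, 0],
                  [1, 1]])
pred = np.array([[0, 1],
                 [1, 1]])
score = mean_iou(label, pred)
```

In this example the background IoU is 1/2 and the component IoU is 2/3, so the mIoU is 7/12.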

Quantitative evaluation
Because no similar datasets or models have been proposed before, we evaluate only the models that we trained ourselves. Table 2 shows the segmentation accuracy of DeepLab V3+ with different backbone networks and our proposed EASDN model. In addition to displaying the segmentation accuracy of the model, Table 3 provides the training and testing speeds of the models, as well as the model size. Considering the different requirements of the model accuracy and speed on different platforms and hardware, our models satisfy the demands of engineering applications. The results in Table 2 indicate that the DeepLab V3+ model based on Xception has higher segmentation accuracy. In addition, EASDN further improves the segmentation accuracy, with an increase of 1.57% mIoU on the test set. At the same time, it can be seen from Table 3 that although the accuracy of the DeepLab V3+ model based on MobileNet is not competitive with that of the other networks, the training and testing speeds are significantly improved, and the model requires little memory. Moreover, our EASDN model did not reduce the speed while improving the segmentation accuracy of the algorithm.

Qualitative results
The qualitative results of our models can be seen in Fig. 9. The figure depicts the segmentation results of the test images in the presence of noise, blur, and sunlight in a simulated on-orbit environment. The visualization results indicate that our proposed model recognizes and segments the spacecraft panels acceptably. Figure 10 shows the results of our test on real on-orbit spacecraft images from the Internet. Even under poor conditions, the spacecraft panels can be correctly identified and recognized, demonstrating the robustness of the model. In other words, the results indicate that our dataset is close to the real environment and that our degradation algorithm is suitable.

Conclusions
A variety of internal degradation factors (noise, blur, and distortion from an on-orbit camera) and external disturbances (sunlight, space dust, and the plume from the space environment) make it challenging to recognize spacecraft feature components. Moreover, the lack of sufficient on-orbit spacecraft images has hindered research. We attempted to address this issue and constructed the SFCD with realistic degradations. The EASDN model was proposed to perform efficient and effective recognition of images in our dataset and real on-orbit images from the Internet. Our extensive experiments validated that the models trained on our SFCD benchmark generalize well to real on-orbit conditions. In the future, we aim to enlarge the SFCD benchmark by considering a satellite against the Earth (or other planets) as the background and plan to add labels for additional feature components (such as aerials, cameras, and engine nozzles).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.