1 Introduction

Infrared small target detection is a research hotspot in military reconnaissance and is widely used in early-warning systems, precision-guided weapons, missile tracking systems, and maritime surveillance systems. An infrared target monitoring system detects the target by exploiting the difference in infrared radiation between the background and the target. Compared with a radar detection system, an infrared target detection system offers strong concealment and easy portability, and it can cover radar blind areas. However, owing to atmospheric disturbance, optical scattering and diffraction, the target has low radiation intensity and a low signal-to-clutter ratio (SCR), and it lacks shape and texture information. It is thus easily submerged in the background [1]. Besides, complex natural scenes such as cloud edges and waves usually introduce false alarms in detection. Therefore, infrared small target detection is a valuable but highly challenging task.

In this paper, the goals of infrared small target detection are to improve the detection rate and to reduce the false alarm rate. In general, infrared small target detection methods fall into two categories: track-before-detect (TBD) methods based on image sequences, and detect-before-track (DBT) methods based on a single-frame image.

TBD methods detect targets by processing several adjacent frames, which suits situations where the background changes slowly and the target track is continuous in the infrared sequence. The classical methods are the three-dimensional matched filter [2], the dynamic-programming method [3], and the multilevel-hypothesis-testing method [4]. However, multi-frame detection is hard to realize in hardware because of its computational complexity and memory consumption. By contrast, the DBT method runs in real time, has low complexity and is easy to implement. It preprocesses the single-frame image, extracts suspicious targets by threshold segmentation, and confirms the targets according to their moving track in the image sequence. Over the past few decades, many methods based on single-frame detection have been proposed.

Filter-based methods use different filter templates to suppress the background, such as Max–median/Max–mean filters [5], morphology filtering [6], the Top-Hat transform [7], and the two-dimensional least mean square (TDLMS) filter [8]. It is hard to obtain universal template parameters and structuring elements without prior information, so adaptability and robustness are hard to guarantee.

The Laplacian of Gaussian (LoG) filter-based method [9] has led to a wide range of research based on the human visual system (HVS). Chen [10] proposed the first method based on the local contrast measure (LCM). The improved LCM (ILCM) [11] and the novel local contrast method (NLCM) [12] were then proposed to further improve the detection result. The multiscale patch-based contrast measure (MPCM) [13] and the multiscale local homogeneity measure (MLHM) [14] detect small targets by enhancing the contrast between the target and its surrounding background. However, HVS-based methods generate a large number of false alarms when the background is complex. Furthermore, Zhang [15] and Liu [16] both utilize the differences between the gradient characteristics of the target and the background. These methods can effectively suppress sharp edges, but calculating the gradient information is time-consuming.

Most traditional methods detect targets with hand-designed features, which are usually effective only for a certain type of background. However, real infrared small target detection scenes include sky, cloud, sea and various building backgrounds. A traditional method that uses only hand-designed features lacks robustness in such complex and changeable scenes, whereas a deep-learning-based method can extract features with a convolutional neural network (CNN), which compensates for the scarcity of features and the difficulty of describing them in infrared images.

Deep learning has recently made great progress in the field of target detection. There are mainly two types of detection pipelines: two-stage region-based methods [17,18,19] and one-stage methods [20,21,22]. Wu [23] draws on the former and applies it to infrared target detection: potential target regions are extracted by the sliding window method, and the real target is identified by a classification network. It is worth noting, however, that the sliding window method is effectively an exhaustive search, which is extremely inefficient. Owing to the lack of a public infrared dataset, [24] and [25] propose to simulate infrared image characteristics using visible-light image datasets.

Based on the above observations, a two-stage region-based method is proposed in this paper. It uses local intensity enhancement and corner detection to find as many real targets as possible, and it reduces the false alarm rate with a CNN classifier. Our contributions can be summarized as follows:

  1. An improved region proposal method is proposed. Instead of sliding window search, it improves the efficiency of candidate region search.

  2. A lightweight CNN classifier is designed to classify the infrared images with few features and small sizes.

  3. A public and diverse infrared image set is established for the study of infrared small target detection.

2 Methodology

Our infrared small target detection method is composed of two modules. The first module takes an infrared image of any size as input. First, the local intensity of the small target is enhanced; second, the coordinates of the suspected target points are found by corner detection; finally, image patches centered on these coordinates are selected as the potential target regions.

The inputs of the second module are the potential target regions. Each potential target region is fed into the CNN classifier to obtain its category probabilities, and the category of the region is the one with the maximum probability. If the region is classified as a target, the central pixel of the region is set to the probability value and the remaining pixels are set to zero.
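The second module's output step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `build_response_map`, the candidate format, the index of the target class and the optional `threshold` argument are all assumptions made for the example.

```python
def build_response_map(height, width, candidates, target_class=0, threshold=0.0):
    """Write the target probability at the center of each region classified as a target.

    candidates: list of ((row, col), probabilities) pairs, where (row, col) is the
    center of a potential target region and probabilities is the classifier output.
    target_class and threshold are hypothetical parameters for this sketch.
    """
    response = [[0.0] * width for _ in range(height)]
    for (row, col), probs in candidates:
        # The category of a region is the one with the maximum probability.
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == target_class and probs[target_class] > threshold:
            response[row][col] = probs[target_class]
    return response
```

All pixels other than the accepted region centers stay at zero, so the result can be viewed directly as a response map of the input image.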

The pipeline of our framework is shown in Fig. 1.

Fig. 1 Framework overview

2.1 Region proposal method

2.1.1 Local intensity enhancement

An infrared image is generated from the difference in infrared radiation between the target and the background. Infrared small targets, such as aircraft and cruise missiles, are thermal radiation sources. Hence, infrared small targets are usually considered to be brighter than the surrounding background [15]. In other words, the gray value of a small target is larger than that of its surrounding area, so we perform local intensity enhancement according to Formula (1). For an infrared small target, the gray value after enhancement is greater than 0, whereas background pixels with gray values lower than the local mean are set to 0. This widens the distinction between the small target and the background.

$$ I(x,y) = \max (0,f(x,y) - \mu ), $$
(1)

where \(I(x,y)\) is the gray value after enhancement, \(f(x,y)\) is the gray value at \((x,y)\), and the local mean value \(\mu\) is the average value of the image patch. Since the image patch should be larger than the target, we adjust the patch size between 7 × 7 and 15 × 15 to fit most cases.
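Formula (1) can be illustrated with a short dependency-free sketch. It assumes that the patch is a sliding window centered on each pixel and clipped at the image border; the text does not specify either detail, and the function name and default patch size are hypothetical.

```python
def enhance_local_intensity(image, patch_size=9):
    """Apply I(x, y) = max(0, f(x, y) - mu) with a sliding local mean.

    image: list of rows of gray values. The window is centered on each pixel
    and clipped at the border (assumptions not stated in the text).
    """
    if patch_size % 2 == 0:
        raise ValueError("patch_size must be odd")
    height, width = len(image), len(image[0])
    half = patch_size // 2
    enhanced = [[0.0] * width for _ in range(height)]
    for y in range(height):
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        for x in range(width):
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            window = [image[r][c] for r in range(y0, y1) for c in range(x0, x1)]
            mu = sum(window) / len(window)  # local mean of the patch
            enhanced[y][x] = max(0.0, image[y][x] - mu)
    return enhanced
```

A bright point on a uniform background keeps most of its intensity, while every background pixel at or below its local mean is mapped to zero.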

2.1.2 Corner detection

A corner is a point around which the pixels vary greatly in gradient amplitude and direction. Qi et al. proposed that the shape of a small infrared target resembles an isotropic Gaussian intensity function, owing to the optical point spread function (PSF) of the thermal imaging system at a long distance [29]. Figure 2 illustrates five typical examples and their gradient vector graphs. Hence, an infrared small target can also be regarded as a kind of corner.

Fig. 2 Gradient vector graphs of infrared small targets. The length and direction of the arrow indicate the amplitude and direction of the gradient

The eigenvalues of the covariance matrix can be used to distinguish targets from other regions [26]. The eigenvalues depend on the distribution of gradients, so the gradient of the image is analyzed first. Besides small targets, an infrared image also contains flat regions such as sky background and edge regions such as cloud clutter. The horizontal and vertical gradient distributions of different regions are shown in Fig. 3.

Fig. 3 Horizontal and vertical gradient distribution of different regions. Each ‘*’ represents the gradient of a pixel, the abscissa represents the horizontal gradient, and the ordinate represents the vertical gradient (A: target region; B, C: edge region; D: flat region)

As Fig. 3 shows, the gradients of the pixels in the flat regions are distributed near the origin, with different directions and small amplitudes. The gradients of the edge regions are mainly distributed along a certain direction, which depends on the direction of the edge. The gradients of the corners, however, are scattered and not limited to a certain direction or amplitude. Based on this analysis, let \(\lambda_{1}\) and \(\lambda_{2}\) denote the two eigenvalues of the covariance matrix. In a flat region, \(\lambda_{1}\) and \(\lambda_{2}\) are both small; in an edge region, \(\lambda_{1} \gg \lambda_{2}\) or \(\lambda_{2} \gg \lambda_{1}\); and at a corner, both \(\lambda_{1}\) and \(\lambda_{2}\) are large. Because computing the eigenvalues directly is costly, the determinant and trace of the covariance matrix are used to measure them instead. The region proposal method is summarized in Algorithm 1.

Algorithm 1 Region proposal method
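The determinant-and-trace test can be sketched with a Harris-style response \(R = \det(M) - k\,\mathrm{trace}(M)^{2}\), using \(\det(M) = \lambda_{1}\lambda_{2}\) and \(\mathrm{trace}(M) = \lambda_{1} + \lambda_{2}\). The Harris form is an assumption here, since the text only states that the determinant and trace are used. The central-difference gradients, the `window` size, the border handling and the relative threshold are likewise illustrative choices and not the authors' Algorithm 1.

```python
def corner_response(image, k=0.06, window=3):
    """Harris-style response R = det(M) - k * trace(M)^2 for each pixel (assumed form)."""
    height, width = len(image), len(image[0])

    def pixel(y, x):
        # Replicate border pixels so that gradients are defined everywhere.
        return image[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    # Central-difference gradients in the horizontal and vertical directions.
    gx = [[(pixel(y, x + 1) - pixel(y, x - 1)) / 2.0 for x in range(width)]
          for y in range(height)]
    gy = [[(pixel(y + 1, x) - pixel(y - 1, x)) / 2.0 for x in range(width)]
          for y in range(height)]

    half = window // 2
    response = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            a = b = c = 0.0  # entries of the gradient covariance matrix [[a, b], [b, c]]
            for r in range(max(0, y - half), min(height, y + half + 1)):
                for q in range(max(0, x - half), min(width, x + half + 1)):
                    a += gx[r][q] * gx[r][q]
                    b += gx[r][q] * gy[r][q]
                    c += gy[r][q] * gy[r][q]
            det, trace = a * c - b * b, a + c
            response[y][x] = det - k * trace * trace
    return response


def propose_points(response, rel_threshold=0.01):
    """Return (row, col) of pixels whose response exceeds rel_threshold * max response."""
    peak = max(max(row) for row in response)
    if peak <= 0:
        return []
    return [(y, x) for y, row in enumerate(response)
            for x, value in enumerate(row) if value > rel_threshold * peak]
```

On a flat region the response is zero and on a straight edge it is non-positive, because one eigenvalue vanishes. A point-like blob has two large eigenvalues and therefore a large positive response, which matches the three cases described above.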

2.2 CNN classifier

Small infrared targets can be detected by the region proposal method; however, the background may also contain corners. To eliminate false targets, a lightweight CNN classifier is designed to determine whether a candidate region contains a real target.

The proposed classification network model is shown in Fig. 4. The model consists of an image preprocessing layer, a feature extraction module, a global average pooling layer, and a softmax layer.

Fig. 4 CNN classifier for infrared images: a feature extraction module followed by a global average pooling layer; the final category probability is output by softmax

Even with a large number of collected infrared images, overfitting still needs to be avoided. Therefore, before an image is fed into the convolution layers, one of the following preprocessing methods is applied on the fly: (a) horizontal flipping; (b) contrast adjustment in the range [0.5, 1.5]; (c) brightness adjustment in the range [−0.12, 0.12]. In this way, the model sees slightly different images in different epochs.
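The three preprocessing options can be sketched in plain Python as below. The sketch assumes gray values scaled to [0, 1], contrast adjusted about the patch mean, no clipping after adjustment, and a uniformly random choice among the three operations; none of these details is given in the text, and all function names are hypothetical.

```python
import random


def flip_horizontal(patch):
    """Mirror each row of the patch left to right."""
    return [list(reversed(row)) for row in patch]


def adjust_contrast(patch, factor):
    """Scale deviations from the patch mean by factor (assumed definition)."""
    mean = sum(map(sum, patch)) / (len(patch) * len(patch[0]))
    return [[(v - mean) * factor + mean for v in row] for row in patch]


def adjust_brightness(patch, delta):
    """Add a constant offset to every pixel; values are not clipped here."""
    return [[v + delta for v in row] for row in patch]


def augment(patch, rng=random):
    """Apply exactly one randomly chosen operation with the ranges from the text."""
    choice = rng.choice(("flip", "contrast", "brightness"))
    if choice == "flip":
        return flip_horizontal(patch)
    if choice == "contrast":
        return adjust_contrast(patch, rng.uniform(0.5, 1.5))
    return adjust_brightness(patch, rng.uniform(-0.12, 0.12))
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible between runs.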

The feature extraction module is composed of convolution layers and pooling layers. A stack of two 3 × 3 convolution layers is used, which has the same effective receptive field as a 5 × 5 filter kernel [27]. Feature extraction starts with two 8-channel convolutional layers. After each down-sampling, the number of feature map channels is doubled. Max pooling is selected because it helps retain the gray-level features of the infrared small target. The kernel size and stride of each pooling layer are both two. Because the number of categories is 4, a bottleneck layer Conv1 with 4 channels is used to reduce the number of feature map channels to 4. Thus, the output of the feature extraction module is a feature map of 8 × 8 pixels and 4 channels.
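The receptive-field claim and the output size can be checked with a little arithmetic. The sketch below assumes a 32 × 32 input patch (the patch size used for training), convolutions that preserve spatial size, and exactly two pooling stages, which is what an 8 × 8 output implies under those assumptions. The stage labels and function names are hypothetical.

```python
def stacked_receptive_field(kernel_sizes, strides=None):
    """Receptive field of a stack of layers, given kernel sizes and strides."""
    strides = strides or [1] * len(kernel_sizes)
    field, jump = 1, 1
    for kernel, stride in zip(kernel_sizes, strides):
        field += (kernel - 1) * jump  # each layer widens the field by (k - 1) * jump
        jump *= stride
    return field


def trace_feature_shapes(input_size=32, base_channels=8, num_poolings=2, num_classes=4):
    """Trace (stage, spatial size, channels) through the assumed layer layout."""
    size, channels = input_size, base_channels
    shapes = [("conv block", size, channels)]
    for _ in range(num_poolings):
        size //= 2      # 2 x 2 max pooling with stride 2 halves the spatial size
        channels *= 2   # the channel count doubles after each down-sampling
        shapes.append(("pool + conv block", size, channels))
    shapes.append(("bottleneck", size, num_classes))
    return shapes
```

With these assumptions, two stacked 3 × 3 convolutions give a receptive field of 5, and the traced shapes run from 32 × 32 × 8 through 16 × 16 × 16 and 8 × 8 × 32 to the 8 × 8 × 4 bottleneck output.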

Rather than the classical fully connected layers after feature extraction, we use global average pooling, for two reasons. First, fully connected layers have a large number of trainable parameters and are prone to overfitting, whereas global average pooling has no parameters to optimize. Second, global average pooling sums out the spatial information, which improves the robustness against disrupted inputs. The global average pooling layer makes each feature map correspond to a category and outputs the average value of each feature map as a response. Finally, all the responses are fed into the softmax layer for classification. A region is identified as a potential target area when the probability of the target category is greater than a certain threshold.
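The classification head described above reduces to a few lines. In this minimal sketch the feature maps are plain nested lists, the target class is assumed to be index 0, and the default threshold of 0.8 is taken from the experimental setting reported later in the paper; the function names are hypothetical.

```python
import math


def global_average_pool(feature_maps):
    """Average each 2-D feature map into a single response (one per category)."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def softmax(logits):
    """Numerically stable softmax over a list of responses."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_patch(feature_maps, target_class=0, threshold=0.8):
    """Return (is_target, probabilities) for one candidate region."""
    probs = softmax(global_average_pool(feature_maps))
    return probs[target_class] > threshold, probs
```

Because the pooling step has no weights, the only trainable parameters of the classifier are in the convolution layers that produce `feature_maps`.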

3 Experiment

3.1 Dataset

Infrared images are collected by an infrared camera in both summer and winter, covering a variety of weather and light conditions as well as various sizes of detected objects. This infrared dataset will be made publicly available and will continue to be expanded and enriched.

The size of the collected infrared images is 640 × 480, which does not meet the input requirements of the classifier. To obtain the training images, the region proposal method is used to extract potential target regions by cropping 32 × 32 patches. Based on our observation, these regions inevitably include non-target regions. We therefore lower the threshold appropriately so that balanced positive and negative samples are obtained at the same time. Besides, negative samples are divided into three categories because the characteristics of different backgrounds differ markedly. The final dataset contains 28,000 images, the ratio of target, flat background, complex background and building background is 1:1:1:1, and the images are split between the training set and the validation set at a ratio of 7:3. Figure 5 shows some examples of our dataset.
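The dataset figures above imply the following per-class counts. The short calculation assumes the 7:3 split is applied within each class, which the text does not state; the function name is hypothetical.

```python
def split_counts(total=28000, num_classes=4, train_parts=7, val_parts=3):
    """Per-class image counts for a balanced dataset with a train:val split."""
    per_class = total // num_classes
    train = per_class * train_parts // (train_parts + val_parts)
    return {
        "per_class": per_class,
        "train_per_class": train,
        "val_per_class": per_class - train,
    }
```

Under this assumption, each of the four classes has 7,000 images, of which 4,900 are used for training and 2,100 for validation.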

Fig. 5 Some examples of our dataset: (a) target, (b) flat background, (c) complex background and (d) building background

3.2 Proposed CNN classifier training experiment

The CNN classifier is implemented in Python with the TensorFlow package. Training experiments are performed on a computer with an Intel Core i7-7700 CPU, an NVIDIA GPU and 8 GB of memory. The cross-entropy loss is used as the loss function, and RMSprop is used as the optimization algorithm. The other parameters are listed in Table 1.

Table 1 Training parameter setting

The loss function curve is shown in Fig. 6. The curve fluctuates slightly owing to the stochastic nature of the learning process, and the loss converges after approximately 8000 iterations. After training, all the images in the validation set are fed into the model, and the accuracy on the validation set reaches 98.11%.

Fig. 6 Loss function curve (obtained through TensorBoard with smoothing coefficient = 0.8)

3.3 Comparison experiment and analysis

3.3.1 Evaluation metrics

To objectively evaluate the detection performance of different methods, several widely used evaluation metrics are introduced: the signal-to-clutter ratio (SCR), the background suppression factor (BSF), and the receiver operating characteristic (ROC) curve.

The SCR represents the contrast between the target and the background, and the BSF describes the extent of background suppression. They are defined as follows:

$$ {\text{SCR}} = \frac{{f_{{\text{T}}} - f_{{\text{B}}} }}{{\sigma_{{\text{B}}} }} $$
(2)
$$ {\text{BSF}} = \frac{{\sigma_{{{\text{in}}}} }}{{\sigma_{{{\text{out}}}} }} $$
(3)

where \({f}_{\mathrm{T}}\) is the maximum gray value of the target, \({f}_{\mathrm{B}}\) and \({\sigma }_{\mathrm{B}}\) are the mean value and the standard deviation of the local background, respectively, and \({\sigma }_{\mathrm{in}}\) and \({\sigma }_{\mathrm{out}}\) denote the standard deviations of the original image and the output image, respectively.
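Formulas (2) and (3) translate directly into code. The sketch below works on flat lists of gray values and uses the population standard deviation; whether the population or sample form is intended, and how the local background region is selected, are assumptions left open by the text.

```python
from statistics import mean, pstdev


def scr(target_pixels, background_pixels):
    """Signal-to-clutter ratio: (f_T - f_B) / sigma_B, with f_T the target maximum."""
    sigma_b = pstdev(background_pixels)
    if sigma_b == 0:
        raise ValueError("background standard deviation is zero")
    return (max(target_pixels) - mean(background_pixels)) / sigma_b


def bsf(input_pixels, output_pixels):
    """Background suppression factor: sigma_in / sigma_out."""
    sigma_out = pstdev(output_pixels)
    if sigma_out == 0:
        raise ValueError("output standard deviation is zero")
    return pstdev(input_pixels) / sigma_out
```

Both functions raise an error on a perfectly uniform region instead of returning infinity, which is a design choice for this example and not part of the metric definitions.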

To further evaluate the target detection performance of different methods, the ROC curve is also used in this paper. The horizontal axis of the ROC curve is the false alarm rate \({F}_{\mathrm{a}}\), and the vertical axis is the detection probability \({P}_{\mathrm{d}}\). The curve reflects the relationship between the two metrics. \({P}_{\mathrm{d}}\) and \({F}_{\mathrm{a}}\) are defined as follows:

$$ P_{{\text{d}}} = \frac{{{\text{number}}\,{\text{of}}\,{\text{true}}\,{\text{targets}}}}{{{\text{number}}\,{\text{of}}\,{\text{all}}\,{\text{real}}\,{\text{targets}}}} $$
(4)
$$ F_{{\text{a}}} = \frac{{{\text{number}}\,{\text{of}}\,{\text{false}}\,{\text{targets}}}}{{{\text{number}}\,{\text{of}}\,{\text{all}}\,{\text{detected}}\,{\text{targets}}}} $$
(5)
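Formulas (4) and (5) can be computed from three counts, as in the minimal sketch below. The function name and the convention of returning 0.0 when a denominator is zero are assumptions for the example.

```python
def detection_metrics(true_detections, false_detections, real_targets):
    """Return (P_d, F_a) following Formulas (4) and (5).

    P_d = true detections / all real targets
    F_a = false detections / all detected targets
    """
    detected = true_detections + false_detections
    p_d = true_detections / real_targets if real_targets else 0.0
    f_a = false_detections / detected if detected else 0.0
    return p_d, f_a
```

For example, 9 correct detections and 1 false detection against 10 real targets give \(P_{\text{d}} = 0.9\) and \(F_{\text{a}} = 0.1\).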

3.3.2 Comparison with state-of-the-art methods

To evaluate the target detection performance of the proposed method, we choose nine public infrared images covering various scenes and noisy cases. The target is either submerged in a complex cloud background or located in a complex sea background. Besides, the backgrounds include not only thick clouds and sheet clouds but also buildings and hills, which are prone to cause a high false alarm rate. In the experiments, the parameters in Algorithm 1 are set to k = 0.06, t = 0.01 and n = 15, and the threshold of the CNN classifier is set to 0.8; these values are fine-tuned according to the specific situation. As shown in Fig. 7, although the region proposal method recalls the real targets, many false targets remain in backgrounds with sharp structures. After the regions are classified, however, the false alarms are greatly reduced. Additionally, all the real targets are found with different response values, and only a few false alarms appear in (d) and (e). The experimental results demonstrate that the proposed method can effectively suppress the background.

Fig. 7 Nine groups of target detection results of the proposed method. The first image of each group is the original image. In the second image of each group, the suspicious target points, in other words, the center points of the candidate regions, are marked with red circles. The third image of each group shows the three-dimensional result of the proposed method

To further prove the superiority of the proposed method, six state-of-the-art methods are selected as baselines: multiscale local contrast method (MLCM) [10], multiscale patch-based contrast measure (MPCM) [13], average absolute gray difference (AAGD) [28], Laplacian of Gaussian (LoG) filter-based method [9], local intensity and gradient properties (LIG) [15], and flux density and direction diversity in gradient vector field (IMG) [16]. Interference from complex background clutter is the main difficulty in infrared small target detection, so the ability to suppress the background is an important criterion for judging a detection method. Figure 8 shows the results of the nine images processed by the six baseline methods and the proposed method, which allows an intuitive comparison of their advantages and disadvantages. The MLCM method significantly enlarges each target, but it only roughly enhances the original image and does not noticeably suppress the background. The MPCM and LIG methods suppress the background to a certain extent, as shown in (b–d) and (h), but they are still sensitive to strong cloud edges in (f–g). The AAGD and LoG methods have the same problems; in particular, they are vulnerable to clutter around the sea-sky line and the boundaries of hills. The IMG method uses multiscale flux density to enhance small targets and suppress background clutter simultaneously, and it further suppresses heavy clutter by means of gradient direction diversity; it is indeed superior to the other baseline methods in background suppression. Nevertheless, as with the other baseline methods, a large number of false alarms remain at the building signboard in (i). The proposed method gives satisfactory results whether the background is thick cloud, line clutter or a building signboard. Accordingly, our method performs robustly across the various scenes.

Fig. 8 Results and the 3D grayscale images of different methods

The above experiments demonstrate the performance of the proposed method qualitatively. Next, we compare it with the baseline methods quantitatively using the evaluation metrics. The SCR and BSF results are shown in Fig. 9. In general, higher values of these two metrics indicate better detection performance. The SCR and BSF of MLCM are not ideal because of its simple calculation, whereas those of the IMG method are relatively high. Interestingly, the metric values of MPCM vary greatly under different conditions, which shows that its robustness is limited. As can be seen in Fig. 9, the proposed method achieves the highest metric values in the different scenarios, indicating that it outperforms the other state-of-the-art methods.

Fig. 9 SCR and BSF of different methods

To further evaluate the performance of the proposed method, we compute ROC curves on different sequences. The probability of detection rises gradually as the false alarm rate increases. According to the ROC curves in Fig. 10, the probabilities of detection of the baseline methods approach 100% once the false alarm rate reaches a certain value. However, at the same probability of detection, the proposed method has the lowest false alarm rate, indicating that it performs best in background suppression and target enhancement simultaneously.

Fig. 10 ROC curves of three real infrared sequences. (a) The results of all baseline methods and the proposed method. To show the probability of detection with a smaller false alarm rate, the details of ROC curves are shown in (b) and (c)

4 Conclusion

In this paper, an infrared small target detection method based on candidate region proposal and a CNN classifier is proposed. It separates the real target from the background and greatly reduces the false alarm rate caused by complex background clutter. In particular, after off-line training, the CNN classifier can eliminate special backgrounds such as buildings. Extensive experiments demonstrate that the proposed method has better detection performance than state-of-the-art methods in different scenarios. In future work, we plan to focus on reducing the computation time.