Object tracking in infrared images using a deep learning model and a target-attention mechanism

Small object tracking in infrared images is widely used in fields such as video surveillance, infrared guidance, and unmanned aerial vehicle monitoring. Existing small target detection strategies for infrared images tend to lose the target in heavily cluttered infrared (IR) maritime images. To overcome this issue, we feed our model both the original image and a corresponding encoded image. The encoded image is produced by the local directional number patterns algorithm, which represents more distinctive details of the original image. Our model is thus able to learn more informative and distinctive features from the original and encoded images for visual tracking. In this study, we search for the convolutional filters that give the best possible visual tracking results, namely those that are inactive on the background while active in the target region. To this end, we investigate an attention mechanism for the feature extraction framework, comprising a scale-sensitive feature generation component and a discriminative feature generation module based on the gradients of the regression and scoring losses. Comprehensive experiments demonstrate that our pipeline obtains competitive results compared to recently published methods.


Introduction
Visual tracking can be regarded as the ability to look at something and follow its movement. Visual tracking in videos, which learns to estimate the location of a target object, has been broadly employed in applications such as infrared search and track (IRST) systems (also called infra-red sighting and tracking), video surveillance, autonomous driving, and human motion analysis [1,2]. However, because of the long observation distance in infrared imaging, the target has a low signal-to-noise ratio (SNR) and a small size, so only limited information is available for tracking [3,4]. Moreover, a range of environmental conditions, such as intense noise, low contrast, competing background clutter, and camera ego-motion, makes infrared small target tracking even more difficult. For instance, camera ego-motion produces an impulsive motion of the target between two sequential frames, which easily causes the tracker to miss the target. Also, as small objects in infrared images can easily be submerged in a complex background with a low signal-to-clutter ratio (SCR), the tracker may drift to the background. Intense noise and low contrast lead to a drop in target SNR. Besides, because of the long imaging distance, small targets have no concrete texture or shape. Hence, robust and accurate small object tracking in infrared images remains a challenging task in crowded scenes [5,6]. Several models that try to track small targets in infrared images efficiently have been proposed in the literature. Although much research has been conducted with visible-light cameras, these cameras depend heavily on the illumination conditions and are therefore not good options for night-time environments. To overcome this issue, we employ an infrared imaging system, which is more robust to illumination variations and works well in both night-time and day-time [7].
In the last few years, deep learning (DL) pipelines have surpassed the previous state of the art in classification and prediction across different computer vision tasks [8-13]. However, there are only a few DL strategies for tracking objects in infrared images, and their efficiency is not as competitive as that of algorithms based on hand-crafted features. Moreover, they are unable to effectively detect and track objects that vary in both size and shape. In this paper, we therefore propose a small object tracking approach for infrared images that uses a deep learning model able to track even small objects in the presence of size and shape variations. As textures carry much information that is crucial when dealing with a real scene, we apply a texture descriptor to extract key features. The employed texture descriptor is illumination-invariant, which is very beneficial for tracking tasks. Moreover, we propose a deep learning model that accepts both the original image and the image obtained by the texture descriptor (the encoded image). Our DL model includes a target attention mechanism and a size attention mechanism. An attention mechanism reflects the idea that some features are more important than others and deserve more attention.
The remaining parts of this paper are organized as follows. Firstly, related works are discussed in "Literature review". The characteristics and architecture of the suggested model are presented in "Materials and methods". "Experiments" describes the implementation details of the suggested model. "Conclusions" provides conclusions.

Literature review
A small target tracking technique for infrared images based on the Kalman filter and multiple-cue fusion was proposed in [1] to overcome the problem of complex environmental conditions. In the first step, the Kalman filter estimates the preliminary target position, which is taken as the center of the region of interest (ROI). Next, the motion, contrast, and gray-intensity cues in the ROI are combined to produce a confidence map that locates the small target. Finally, the target models and the fusion weights are updated, and the predicted target position serves as a measurement for the Kalman filter. A robust maritime dim small target detection scheme was introduced in [14] to overcome the problem of weak targets being submerged in heavily cluttered infrared (IR) maritime images. The authors enhanced the quality of the input images with a multidirectional improved top-hat filter. They also established directional morphological filtering (DMF) by incorporating morphological operations, and constructed multidirectional structuring elements (MSEs) to explore the multidirectional differences between the target area and nearby local objects.
A discriminative prediction model learning approach that is capable of fully exploiting the background and target appearance information was proposed in [15]. Firstly, a steepest descent-based technique is employed that calculates an optimal step length in each iteration. Then, a module that efficiently initializes the target model is integrated. Zhang et al. [16] suggested an RGB-infrared fusion tracking strategy using visible and infrared images. To this end, a fully convolutional network based on Siamese networks (SiamFT) was suggested. In the first step, infrared and visible images are processed by an infrared network and a visible network, respectively. Next, the convolutional features of the infrared and visible images extracted by the two Siamese networks are merged to form a fused template image. A modality weight calculation technique using the response values of the Siamese networks is employed to estimate the reliability of the different images. Finally, a cross-correlation approach is used to create the final response map.
An improved fully convolutional neural network (FCNN) for estimating object location was proposed in [17]. The strategy uses a comprehensive sampling technique as well as a better scoring scheme. Possible object positions are computed using a two-stage sampling that combines clustered foreground contour information and stochastically distributed samples. The best sample is chosen based on a combined score of model reliability, predicted location, and appearance similarity. Yang et al. [18] proposed a tracking system based on a correlation filter (CF) tracker that uses a Gabor filter (GF) feature extractor in the frequency domain (GF-KCF). By constructing a set of frequency-domain GFs, the method tries to suppress background noise effectively and highlight target texture information. Yao et al. [19] suggested a Siamese network for tracking that uses a dilated convolution module to enhance the scale adaptability of the network. To diminish the dependence of the model on the initially given exemplar, they used a target template library update technique based on the tracking outcomes of historical frames.

Materials and methods
As infrared images are taken from a long distance, the target signals contain insufficient texture information. Besides, complex background clutter such as sea clutter, the sea-sky line, forest, mountains, islands, and cloud clutter is usually changeable, which diminishes the efficiency of a tracking model. In this section, we therefore describe the importance of textural features when a small object needs to be tracked against a complex background. Moreover, we propose a new deep learning pipeline that uses two attention mechanisms for size-invariant, target-focused tracking. The proposed strategy to detect and estimate an object in infrared images is displayed in Fig. 4.

Texture descriptor
Texture analysis of any kind of image seeks to extract key informative details and characteristics of a surface texture, such as entropy, contrast, shape, correlation, smoothness, energy, homogeneity, roughness, and color [13,20,21]. As introduced in many works [22,23], several kinds of local descriptors are employed to transform an image into an encoded image based on a code-book of visual patterns or on pre-defined coding rules.
These strategies are used in many fields of research, such as object tracking [24], image segmentation [13,25-27], and aerial image analysis [28,29]. Generally, in texture segmentation and classification, the main aim is to split the image into a set of homogeneous textured segments [30].
Local binary pattern (LBP), local directional pattern (LDP), and local ternary pattern (LTP) feature descriptors are easy to implement. They compare the intensities of the pixels in a local neighborhood (rectangular, circular, etc.), traversed clockwise or counter-clockwise, to encode low-level structures such as curves, lines, edges, and spots inside an image, and they output the result as a binary value [31-33]. In encoding applications, however, the gradient value is more robust than the gray-level intensity. Gradient-based strategies such as local directional number patterns (LDN) and local word directional pattern (LWDP) have therefore attracted much attention [34]. The LDN operates in the gradient domain to generate an illumination-invariant representation of the image. It uses directional information to locate all edges whose magnitudes are insensitive to lighting variations. This is implemented by applying the 8 directional Kirsch kernels (filters), which are rotated by 45° into the 8 main compass directions (Fig. 1). Each kernel generates a feature map, and only the maximum value at each location is kept to obtain the final edge map [35,36]. An example of applying the non-linear Kirsch kernels to an infrared image is shown in Fig. 2.
Fig. 1 Non-linear Kirsch kernels in 8 rotations [37]
In Eq. 1, pixel (cpx, cpy) denotes the central pixel of a neighborhood, nr is defined as the minimum negative response, and pr as the maximum positive response [38]. The result of applying the LDN approach to some images is shown in Fig. 3.
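The encoding step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the commonly used LDN code of the form 8 × pr + nr, with pr and nr taken as the indices (0 to 7) of the directions that give the maximum positive and the minimum negative Kirsch response. The function names `kirsch_masks`, `ldn_code`, and `ldn_encode` are hypothetical, masks are applied as plain correlation, and border pixels are skipped.

```python
# Border positions of a 3x3 mask, clockwise from the top-left corner.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
# Border values of the east-facing Kirsch mask, listed in RING order.
EAST = [-3, -3, 5, 5, 5, -3, -3, -3]


def kirsch_masks():
    """Return 8 Kirsch masks; mask k is the east mask rotated counter-clockwise by k * 45 degrees."""
    masks = []
    for k in range(8):
        mask = [[0] * 3 for _ in range(3)]
        for idx, (r, c) in enumerate(RING):
            mask[r][c] = EAST[(idx + k) % 8]
        masks.append(mask)
    return masks


def ldn_code(image, r, c, masks):
    """LDN code of the interior pixel (r, c), assuming the form 8 * pr + nr."""
    responses = []
    for mask in masks:
        total = 0
        for dr in range(3):
            for dc in range(3):
                total += mask[dr][dc] * image[r - 1 + dr][c - 1 + dc]
        responses.append(total)
    pr = max(range(8), key=lambda k: responses[k])  # direction of the maximum positive response
    nr = min(range(8), key=lambda k: responses[k])  # direction of the minimum negative response
    return 8 * pr + nr


def ldn_encode(image):
    """Encode every interior pixel of a 2D list of gray levels; the one-pixel border is dropped."""
    masks = kirsch_masks()
    height, width = len(image), len(image[0])
    return [[ldn_code(image, r, c, masks) for c in range(1, width - 1)]
            for r in range(1, height - 1)]
```

Because each mask sums to zero, a constant brightness offset added to the whole neighborhood leaves every response unchanged, which is one way to see why a code built from response directions is insensitive to uniform illumination shifts.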

Our deep learning model
In this part, we explain how our model learns more informative and distinctive features from the original and encoded images for visual tracking. First, we introduce the gap between the features obtained from a pre-trained convolutional neural network (CNN) and the feature representations that are effective for visual tracking. Then, we investigate the attention mechanism for the feature extraction framework, comprising a scale-sensitive feature generation component and a discriminative feature generation module based on the gradients of the regression and scoring losses. Our pipeline is displayed in Fig. 4.

Target attention mechanism
Features extracted for tracking a predefined object serve a different purpose from features extracted for the visual recognition of general targets. Firstly, most of the features extracted by a pre-trained CNN are uninformative for the tracked target and do not cover all key details of general objects. In a task with predefined objects, the class labels of the testing and training samples are consistent and predefined, whereas an online (general-purpose) object tracking system faces a countless number of classes. Secondly, the weights and biases of a pre-trained CNN are trained to increase inter-class differences and cannot deal properly with intra-class variation. As a result, many of the features are insignificant for predicting scale variations and for distinguishing the target from very similar objects. Lastly, as inter-class differences are principally related to a few feature maps, the features extracted by a trained deep learning model are sparsely activated for each class label. Moreover, a significant share of the convolution kernels (filters) detects uninformative or redundant details, leading to overfitting and a high computational load. Accordingly, only some convolutional kernels are able to detect patterns related to the target object.
Many image processing strategies that use convolution kernels point to the significant role of these kernels in recognizing hidden patterns inside the image. This group-level object information is calculated through the corresponding gradients [2,39-44].
Recently, a gradient-weighted class activation mapping (Grad-CAM) model was proposed in [2] to produce a highlighted feature map by calculating a weighted sum of neurons along the feature channels. This strategy calculates the gradient at each input pixel, which indicates the importance of that pixel for the given class label. In other words, the weight of a feature channel is obtained by mean pooling all the gradients in that channel. Unlike gradient-based models that employ classification losses, our study uses a ranking loss and a regression loss. Our strategy is specifically designed for the tracking task: it identifies the convolutional kernels that contribute most to detecting the pattern of the target and that are sensitive to scale variations.
Using the gradient-based strategy, this study implements a target attention mechanism with losses designed for visual tracking. Given the output feature map of the CNN used for feature extraction, a subspace ζ is computed from the channel importance using ψ1 and ψ2, the selection functions that choose the key channels for image1 and image2, respectively. An importance score is calculated for the i-th channel of each of the two images.
In this study, we explore the best convolutional filters to obtain the best possible visual tracking results by finding those that are inactive on the background while active in the target region. That is, the training process uses a loss function to find the best possible values for the weights and biases, which thereby learn how to respond to the background and to the target region. A regression approach is employed to map all pixels (i, j) inside the image patch aligned with the target center to a Gaussian label map, where (i, j) denotes the offset from the target and σ stands for the kernel width. Moreover, to limit the computing time, the problem is formulated as a ridge regression loss, where W indicates the weight of the regressor and * denotes the convolution operation. The importance of each kernel is calculated from the derivative of the regression loss with respect to the input feature pixels, that is, from its contribution to fitting the label map. By considering Eq. 4 and the chain rule, the gradient of the regression loss can be calculated. According to Eq. 3 and the gradient of the regression loss, the target-active kernels, which are able to distinguish between the background and the target, can be defined. The features produced by this gradient strategy rely on only some of the kernels, which yields more discriminative deep features that focus on the specified target.
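The quantities described in this subsection can be written compactly. The block below is a sketch of one standard gradient-based formulation and is not a verbatim reproduction of the numbered equations. The symbols $z_i$ (the i-th feature channel), $G_{AP}$ (global average pooling), $Y$ (the label map), $X_{\mathrm{in}}$ and $X_{o}$ (input and output features of the regressor), and $\lambda$ (the regularization weight) are notational assumptions.

```latex
\begin{align*}
\Delta_i &= G_{AP}\!\left(\frac{\partial\, \mathrm{Loss}}{\partial z_i}\right)
  && \text{importance score of channel } i \\
Y(i,j) &= \exp\!\left(-\frac{i^{2}+j^{2}}{2\sigma^{2}}\right)
  && \text{Gaussian label map} \\
\mathrm{Loss}_{\mathrm{regression}} &= \sum_{i,j} \bigl\lVert Y(i,j) - W * X_{i,j} \bigr\rVert^{2} + \lambda \lVert W \rVert^{2}
  && \text{ridge regression loss} \\
\frac{\partial\, \mathrm{Loss}_{\mathrm{regression}}}{\partial X_{\mathrm{in}}}
  &= \sum_{i,j} \frac{\partial\, \mathrm{Loss}_{\mathrm{regression}}}{\partial X_{o}(i,j)} \times \frac{\partial X_{o}(i,j)}{\partial X_{\mathrm{in}}(i,j)}
   = \sum_{i,j} 2\bigl(X_{o}(i,j) - Y(i,j)\bigr) \times W
  && \text{chain rule}
\end{align*}
```

The regularization term does not depend on the input features, so only the data term contributes to the last line.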
This strategy eliminates many uninformative parts of the image and mitigates over-fitting. In other words, when we remove the uninformative parts of the image, we eliminate many uninformative features from the training feature vectors. The degree of data imbalance therefore decreases dramatically, which helps to overcome the problem of over-fitting.
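The kernel selection described in this subsection can be sketched in a few lines. The code below is an illustration under simplifying assumptions, not the authors' implementation. The regressor is reduced to a 1×1 kernel (one scalar weight per channel), the label map is centered on the patch, and the function names `gaussian_label_map`, `regression_gradients`, `channel_importance`, and `select_top_channels` are hypothetical.

```python
import math


def gaussian_label_map(height, width, sigma):
    """Gaussian label map that peaks at the patch center (assumed target position)."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    return [[math.exp(-((r - cy) ** 2 + (c - cx) ** 2) / (2.0 * sigma ** 2))
             for c in range(width)] for r in range(height)]


def regression_gradients(features, weights, labels):
    """Gradient of the squared regression error with respect to each input feature pixel.

    features: [C][H][W] feature maps; weights: one scalar per channel (assumed
    1x1 regressor kernel); labels: [H][W] label map.
    """
    channels = len(features)
    height, width = len(labels), len(labels[0])
    grads = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for r in range(height):
        for c in range(width):
            pred = sum(weights[k] * features[k][r][c] for k in range(channels))
            residual = 2.0 * (pred - labels[r][c])  # derivative of (pred - label)^2 w.r.t. pred
            for k in range(channels):
                grads[k][r][c] = residual * weights[k]  # chain rule through the regressor
    return grads


def channel_importance(grads):
    """Global average pooling of the gradients: one importance score per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in grads]


def select_top_channels(scores, k):
    """Indices of the k channels with the highest scores, returned in ascending index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])
```

In this sketch the channels are ranked by the signed pooled gradient; whether to rank by signed value or by magnitude is a design choice that the text does not settle.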

Size attention mechanism
To increase the robustness of target detection against strong occlusion and noise, it is essential to find robust kernels that are able to detect variations in the size of the target. Because the size of the target does not change at a continuous rate, it is not easy to determine the size of the object precisely in each frame. However, by using the proposed network to find a paired sample, we can estimate the closest size variation. By formulating the problem as a scoring model that scores all possible target sizes, we can select the size with the highest score as the target size. The gradients obtained from the scoring loss indicate which kernels are more sensitive to size variations.
Inspired by [45], we adopt a smooth approximation of the scoring loss function, in which $\mathrm{sample}_i$ and $\mathrm{sample}_j$ are pair-wise samples for the training phase and φ denotes the set of training pairs. As suggested in [45], we compute the derivative of the size scoring loss with respect to $f(\mathrm{sample})$, where $h_{i,j} = h_i - h_j$ and $h_i$ denotes a one-hot vector whose i-th entry is 1 and whose other entries are 0. By employing backpropagation, the gradient of the scoring loss with respect to the input sample, $\partial\, \mathrm{Loss}_{\text{size score}} / \partial\, \mathrm{sample}_i$, can be calculated (Eq. 9), where $\mathrm{sample}_{\text{predicted}}$ indicates the output estimation and W denotes the filter weights of a Conv layer. According to Eq. 3 and the gradient of the scoring loss, the size-sensitive kernels can be defined. By combining the scoring and regression losses, we are able to detect the kernels that are both sensitive to size variation and active on the target.
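A smooth pairwise scoring loss of this kind can be illustrated as follows. The sketch assumes the log-sum-exp form log(1 + Σ exp(f(sample_i) − f(sample_j))) over the training pairs, with each pair (i, j) ordered so that sample_j should receive the higher score. Both the form and the ordering convention are assumptions of this example, and the function names are hypothetical.

```python
import math


def smooth_ranking_loss(scores, pairs):
    """Smooth pairwise loss; `pairs` holds (i, j) with sample j expected to outscore sample i."""
    return math.log(1.0 + sum(math.exp(scores[i] - scores[j]) for i, j in pairs))


def ranking_loss_gradient(scores, pairs):
    """Analytic gradient of smooth_ranking_loss with respect to every score."""
    denom = 1.0 + sum(math.exp(scores[i] - scores[j]) for i, j in pairs)
    grad = [0.0] * len(scores)
    for i, j in pairs:
        term = math.exp(scores[i] - scores[j]) / denom
        grad[i] += term  # raising the lower-ranked score increases the loss
        grad[j] -= term  # raising the higher-ranked score decreases the loss
    return grad
```

Each pair contributes equal and opposite terms to the two scores it involves, which mirrors the role of the difference of one-hot vectors in the derivative described above. Backpropagating this gradient through the scoring layer then gives the per-kernel sensitivities.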

Tracking process
The overall pipeline of our suggested tracker is shown in Fig. 4. There are two main reasons for integrating the target attention mechanism with the feature extraction routes. Firstly, the feature extraction routes combine significant features extracted from both the original and the encoded image, which strongly highlights the key details of the target. Secondly, by reducing the search area inside the image, the proposed model is able to perform the tracking task efficiently.
Our tracking pipeline includes a target attention mechanism, a pre-trained feature extractor, and a matching block. The feature extractor is only pre-trained offline, on a classification task. The target attention mechanism, in contrast, can be trained on the first frame.
In the initial training (offline step), the scoring and regression loss functions are trained independently. Once the models have converged, the gradients of each loss are computed. Based on these gradients from the pre-trained networks, only the kernels with the highest importance scores are chosen to obtain the best possible outcomes.
When processing an input video (sequential frames) online, the likelihood scores between the initial target and the search area in the current frame are computed directly using the target attention mechanism. This step can be conducted by applying a convolution layer to the extracted features to obtain a response map. Each value in the response map indicates how likely the corresponding location is to be the real target. Given the search area $h_t$ in the current frame and the initial target $\mathrm{sample}_1$, the position of the target in frame t is predicted from this response map, where * denotes the convolution operation.
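The matching step can be illustrated with a single-channel sketch. It assumes that the target position is read off as the location of the largest response, and that the operation is the sliding-window correlation that convolution layers compute (no kernel flip). The function names `response_map` and `locate_target` are hypothetical, and real features would have many channels.

```python
def response_map(search, template):
    """Slide the template over the search area and sum the element-wise products."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    return [[sum(template[i][j] * search[r + i][c + j]
                 for i in range(th) for j in range(tw))
             for c in range(sw - tw + 1)]
            for r in range(sh - th + 1)]


def locate_target(search, template):
    """Return (row, col) of the top-left corner of the window with the largest response."""
    resp = response_map(search, template)
    best_pos, best_val = (0, 0), resp[0][0]
    for r, row in enumerate(resp):
        for c, val in enumerate(row):
            if val > best_val:
                best_pos, best_val = (r, c), val
    return best_pos
```

With multi-channel features, the same sum would simply run over the selected channels as well, so restricting it to the target-active and size-sensitive kernels reduces the cost of every frame.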

Dataset and implementation details
In this study, training, validation, and testing of the suggested strategy were carried out on the Dim-small target detection and tracking dataset [46]. This dataset, created by the ATR laboratory of the National University of Defense Technology, comprises 22 image sequences, 30 trajectories, 16,944 targets, and 16,177 frames. The dataset is intended for detecting and tracking low-altitude flying targets, and the data acquisition scenarios cover complex field backgrounds and sky backgrounds. Figure 5 shows some samples from the dataset. Our pipeline is implemented in Matlab with the MatConvNet toolbox [47] on a PC with a GTX-1080 GPU, a Core i7 3.6 GHz CPU, and 16 GB of memory, using CUDA 9.0 and cuDNN 5.1.

Assessment metrics
The effectiveness of the suggested pipeline is evaluated using three criteria, namely Sensitivity, Accuracy, and Specificity. Specificity is the proportion of non-targets that have been estimated correctly (true negative rate). Sensitivity is the proportion of targets that have been recognized correctly (true positive rate, or recall). Accuracy is employed as the assessment metric for computing the overlap between the ground truths and the estimated targets [9,13,48]. These three criteria are computed by:

$$\text{Sensitivity} = 100 \times \frac{TP}{TP + FN} \tag{11}$$

$$\text{Accuracy} = 100 \times \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$

$$\text{Specificity} = 100 \times \frac{TN}{TN + FP} \tag{13}$$

where False Positive (FP) denotes the objects that do not cover the target but are classified as the target, and False Negative (FN) denotes the targets that the suggested tracking pipeline fails to classify as targets. True Positive (TP) represents the number of targets over all frames that are correctly classified as targets by the proposed technique, and True Negative (TN) represents the number of non-targets that are correctly classified as non-targets. In many cases, a higher sensitivity comes with a lower specificity. The higher the values of specificity and sensitivity, the better the performance of the pipeline [12,49,50].
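As a small worked example of Eqs. 11 to 13, the three criteria can be computed directly from the four counts. The function name `tracking_metrics` and the counts in the usage note are hypothetical and are not taken from Table 1; the sketch assumes that every denominator is non-zero.

```python
def tracking_metrics(tp, tn, fp, fn):
    """Return (sensitivity, accuracy, specificity) in percent from the four counts."""
    sensitivity = 100.0 * tp / (tp + fn)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return sensitivity, accuracy, specificity
```

For instance, with 90 true positives, 80 true negatives, 20 false positives, and 10 false negatives, `tracking_metrics(90, 80, 20, 10)` gives a sensitivity of 90.0, an accuracy of 85.0, and a specificity of 80.0.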

Experimental results and discussions
We use the VGG-16 model [51] as the base network because its larger number of layers with smaller kernels increases non-linearity (a benefit in deep learning). Adaptive Moment Estimation (Adam) is used to train the model, with an initial learning rate of $10^{-4}$, a batch size of 70, a maximum of 70 iterations, and a weight decay of $10^{-5}$.
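The training settings above can be collected in a small configuration, together with one Adam update step for reference. This is a sketch only. The moment coefficients `beta1` and `beta2` and the constant `epsilon` are the usual Adam defaults and are assumptions, as is the L2-style handling of weight decay; the names `CONFIG` and `adam_step` are hypothetical and do not correspond to the MatConvNet interface.

```python
import math

CONFIG = {
    "learning_rate": 1e-4,  # from the text
    "batch_size": 70,       # from the text
    "max_iterations": 70,   # from the text
    "weight_decay": 1e-5,   # from the text
    "beta1": 0.9,           # assumed default
    "beta2": 0.999,         # assumed default
    "epsilon": 1e-8,        # assumed default
}


def adam_step(params, grads, state, cfg=CONFIG):
    """Apply one Adam update to a flat list of parameters.

    `state` holds the step counter "t" and the moment estimates "m" and "v",
    and is updated in place.
    """
    state["t"] += 1
    t = state["t"]
    updated = []
    for k, (p, g) in enumerate(zip(params, grads)):
        g = g + cfg["weight_decay"] * p  # L2-style weight decay (assumption)
        state["m"][k] = cfg["beta1"] * state["m"][k] + (1 - cfg["beta1"]) * g
        state["v"][k] = cfg["beta2"] * state["v"][k] + (1 - cfg["beta2"]) * g * g
        m_hat = state["m"][k] / (1 - cfg["beta1"] ** t)  # bias correction
        v_hat = state["v"][k] / (1 - cfg["beta2"] ** t)
        updated.append(p - cfg["learning_rate"] * m_hat / (math.sqrt(v_hat) + cfg["epsilon"]))
    return updated
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly the learning rate regardless of the gradient scale.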
To obtain more robust and accurate spatial information, the outputs of the Conv4-1 and Conv4-3 layers are employed as the base deep features. The top 80 most significant kernels of the Conv4-1 layer are selected to learn the score-sensitive features, and the top 250 most significant kernels of the Conv4-3 layer are selected to learn the target-active features.
For qualitative and quantitative comparison, we also implemented eight other pipelines to evaluate the infrared target searching and tracking performance of the suggested method: Single Shot MultiBox Detector (SSD) [52], Target-aware [3], Discriminative Model [15], directional morphological filtering (DMF) [14], Kalman filter [1], GF-KCF [18], Siamese network [19], and Grad-CAM [2]. The SSD [52] strategy uses an Adaptive Pipeline Filter (APF) based on motion information and temporal correlation. The DMF [14] algorithm is based on multidirectional morphological filtering and spatiotemporal cues. The Kalman filter [1] strategy estimates the preliminary target position, which is taken as the center of the region of interest (ROI). The Discriminative Model [15] is capable of fully exploiting the background and target appearance information; it employs a steepest descent-based technique that calculates an optimal step length in each iteration and integrates a module that efficiently initializes the target model. The Accuracy, Specificity, and Sensitivity values obtained over all frames with all of these frameworks are reported in Table 1, where the highest value of each measure is highlighted in bold. Notice that DMF [14] and Target-aware [3] achieve better accuracy values than the other compared networks, but the sensitivity values of the Siamese network [19] and SSD [52] are still higher. Moreover, there is only a minimal difference between the Specificity values of DMF [14] and the Kalman filter [1], and between the Sensitivity values of DMF [14] and Target-aware [3]. Grad-CAM [2] yields the worst outcomes for all three measures.
Moreover, the DMF [14], SSD [52], GF-KCF [18], Siamese network [19], and Target-aware [3] models are clearly more stable than Grad-CAM [2], the Discriminative Model [15], and the Kalman filter [1]. For Grad-CAM [2], all metrics are lower than those of the other models, and it suffers from overfitting. The gap between the accuracies of the DMF [14], SSD [52], and Target-aware [3] models for tracking tasks equals zero, which is smaller than the corresponding gap for Grad-CAM [2] and the Discriminative Model [15]. The specificity of SSD [52], 0.89, is better than that of all other techniques. Also, our model obtains an unacceptable result when it uses only the original image, although its performance is still higher than that of the Discriminative Model [15]. Moreover, using only an encoded image as the network input yields the worst results among all compared methods.
From Table 1, it can be seen that the suggested pipeline obtains higher criterion values for recognizing and tracking targets than all eight other models. This enhancement has three causes. Firstly, the suggested pipeline pays special attention to finding the important parts of the image rather than investigating all areas inside the image. Secondly, our framework anticipates the change in target size before it happens in the next frame. Lastly, by encoding the original image into a new image, we can find more informative details. Moreover, our strategy can analyze all frames more rapidly than the other approaches. Also, there is only a minimal difference between the evaluation times of some videos for Grad-CAM [2], SSD [52], and the Discriminative Model [15]. Figure 6 gives a visual demonstration of the good outcomes attained by the proposed framework on the Dim-small target detection and tracking dataset. As indicated, owing to the target attention mechanism, the difference between the values of the target and the background inside the images is increased, and the border between them is recognized with high accuracy. Also, the size attention mechanism makes our pipeline more robust in tracking the target when its size varies. However, this does not hold when several targets vary in size at the same time. Moreover, owing to the LDN encoding approach, the suggested tracking framework can extract more distinctive contextual information from the target and background, which leads to better tracking outcomes. The analysis of our attention-based CNN model is shown in Fig. 7. Although our technique provides outstanding outcomes compared to the other recently published frameworks, it still has limitations when the sizes of multiple targets change at the same time. This is due to an increase in the size of the target's expected region, which decreases the performance of feature exploration.
Fig. 6 The results of target tracking in infrared images using our framework in different frames in two streams

Conclusions
In this paper, a novel target detection and tracking approach for infrared images has been developed that benefits from the characteristics of both an original image and an encoded image. Each image contributes unique and informative characteristics that aid the framework efficiently, even when size variations are present. We introduced a target attention mechanism that highlights only the significant part of the image for further processing. Moreover, we have described that working only on the part of the image that includes the potential target area allows our network to reach performance close to that of human observers. This decreases the computational burden of the model and lets it make estimations faster, as it eliminates uninformative parts of the image. Comprehensive experiments have been conducted, which indicate the effectiveness of the suggested framework in comparison with state-of-the-art models.

Author contributions
The specific contributions made by each author are as follows: MP: conceptualization, methodology, implementation, writing-original draft, writing-review and editing. GK: conceptualization, methodology, implementation, writing-original draft, writing-review and editing. BAR: conceptualization, methodology, implementation, writing-original draft, writing-review and editing.

Funding
None.

Availability of data and materials
The dataset used in this study can be obtained from the corresponding author on reasonable request.

Conflict of interest
The authors declare that they have no competing interests.

Financial interests
The authors declare they have no financial interests.

Non-financial interests
The authors declare they have no non-financial interests.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.