Multi-modal Visual Tracking: Review and Experimental Comparison

Visual object tracking, as a fundamental task in computer vision, has drawn much attention in recent years. To extend trackers to a wider range of applications, researchers have introduced information from multiple modalities to handle specific scenes, a promising research direction with emerging methods and benchmarks. To provide a thorough review of multi-modal tracking, we first summarize the multi-modal tracking algorithms, especially visible-depth (RGB-D) tracking and visible-thermal (RGB-T) tracking, in a unified taxonomy from different aspects. Second, we provide a detailed description of the related benchmarks and challenges. Furthermore, we conduct extensive experiments to analyze the effectiveness of trackers on five datasets: PTB, VOT19-RGBD, GTOT, RGBT234, and VOT19-RGBT. Finally, we discuss various future directions from different perspectives, including model design and dataset construction, for further research.


Introduction
Visual object tracking is a fundamental task in computer vision, which is widely applied in many areas, such as smart surveillance, autonomous driving, and human-computer interaction. Traditional tracking methods are mainly based on visible (RGB) images captured by a monocular camera. When the target suffers long-term occlusion or appears in low-illumination scenes, an RGB tracker can hardly work well and may fail. With easy-access binocular cameras, tracking with multi-modal information (e.g., visible-depth, visible-thermal, visible-radar, and visible-laser) is a prospective research direction that has become popular in recent years. Many datasets and challenges have been presented [1,2,3,4,5,6]. Motivated by these developments, trackers with multi-modal cues have been proposed, offering potentially higher accuracy and robustness in extreme tracking scenarios [7,8,9,10,11].
Despite the emergence of multi-modal trackers, a comprehensive and in-depth survey has not yet been conducted. To this end, we revisit existing methods from a unified view and evaluate them on popular datasets. The contributions of this work can be summarized as follows.
• A substantial review of multi-modal tracking methods from various aspects in a unified view. We exploit the similarity of RGB-D and RGB-T tracking and classify them in a unified framework. We categorize the existing 56 multi-modal tracking methods by auxiliary modality purpose and tracking framework, and relate them to the corresponding datasets and metrics. This taxonomy with detailed analysis covers the main knowledge in this field and provides an in-depth introduction to multi-modal tracking models.
• A comprehensive and fair evaluation of popular trackers on several datasets. We collect 29 methods, consisting of 14 RGB-D and 15 RGB-T trackers, and evaluate them on 5 datasets in terms of accuracy and speed for various applications. We further analyze the advantages and drawbacks of different frameworks through qualitative and quantitative experiments.
• A prospective discussion of multi-modal tracking. We present potential directions for multi-modal tracking in model design and dataset construction, which can provide prospective guidance for researchers.
The rest of the paper is organized as follows. Section 2 introduces related basic concepts and previous related surveys. Section 3 provides a taxonomical review of multi-modal tracking. Section 4 introduces existing datasets, challenges, and the corresponding evaluation metrics. Section 5 reports the experimental results on several datasets and challenges. Finally, Section 6 discusses future directions of multi-modal tracking. All collected materials and analyses will be released at https://github.com/zhang-pengyu/Multimodal_tracking_survey.

Visual Object Tracking
Visual object tracking aims to estimate the coordinates and scale of a specific target throughout a given video. In general, tracking methods can be divided into two types according to the information used: (1) single-modal tracking and (2) multi-modal tracking. Single-modal tracking locates a target captured by a single sensor, such as a laser, visible, or infrared camera. In recent years, tracking with RGB images, which are computationally efficient to process, easily accessible, and of high quality, has become increasingly popular, and numerous methods have been proposed to improve tracking accuracy and speed.
In RGB tracking, several frameworks have been employed to improve tracking accuracy and speed, including the Kalman filter (KF) [12,13], particle filter (PF) [14,15], sparse learning (SL) [16,17], correlation filter (CF) [18,19], and CNN [20,21]. In 2010, Bolme et al. [18] proposed a CF-based method called MOSSE, which achieves high-speed tracking with reasonable performance. Thereafter, many researchers have developed the CF framework further to achieve state-of-the-art performance. Li et al. [19] add scale estimation and multiple feature integration to the CF framework. Martin et al. [22] eliminate the boundary effect by adding spatial regularization to the learned filter at the cost of speed. Galoogahi et al. [23] provide another efficient solution to the boundary effect, thereby maintaining real-time speed. Another popular framework is the Siamese network, first introduced by Bertinetto et al. [20]; deeper and wider networks are then utilized to improve target representation. Zhang et al. [21] find that the padding operation in deeper networks induces a position bias, suppressing the capability of the powerful network. They address the position bias problem and improve tracking performance significantly. Some methods achieve better scale estimation by predicting segmentation masks rather than bounding boxes [24,25]. In summary, much effort has been devoted to this field. However, target appearance, the main cue from visible images, is not reliable for tracking when the target encounters extreme scenarios, including low illumination, out-of-view motion, and heavy occlusion. To this end, complementary cues are added to handle these challenges: the visible camera is assisted by other sensors, such as laser [26], depth [7], thermal [10], radar [27], and audio [28], to satisfy different requirements.
Since 2005, a series of methods have been proposed using various multi-modal information. Song et al. [26] conduct multiple object tracking using visible and laser data. Kim et al. [27] exploit the traditional Kalman filter for multiple object tracking with radar and visible images. Megherbi et al. [28] propose a tracking method combining vision and audio information using belief theory. In particular, tracking with RGB-D and RGB-T data has become the focus of attention thanks to portable and affordable binocular cameras. Thermal data provide a powerful supplement to RGB images in challenging scenes such as night, fog, and rain. Besides, a depth map can provide an additional constraint to avoid tracking failures caused by heavy occlusion and model drift. Lan et al. [29] apply the sparse learning method to RGB-T tracking, thereby removing the cross-modality discrepancy. Li et al. [11] extend an RGB tracker to the RGB-T domain, which achieves promising results.
Zhang et al. [10] jointly model motion and appearance information to achieve accurate and robust tracking. Kart et al. [7] introduce an effective constraint using a depth map to guide model learning. Liu et al. [30] transform the target position into 3D coordinates using RGB and depth images, and then perform tracking using the mean shift method.

Previous Surveys and Reviews
As shown in Table 1, we introduce existing surveys related to multi-modal processing, such as image fusion, object tracking, and multi-modal machine learning; for example, the authors of [35] provide a detailed review of machine learning methods using multi-modal information.
Various differences and developments can be observed with respect to the most related works [32,37]. First, we aim to conduct a general survey on how multi-modal information, especially RGB-D and RGB-T data, is utilized for visual object tracking in a unified view. Furthermore, different from [32], we pay close attention to recent deep-learning-based methods, which have not been covered by previous surveys.


Multi-modal Visual Tracking
This section provides an overview of multi-modal tracking from two aspects: (1) auxiliary modality purpose, i.e., how the information of the auxiliary modality is utilized to improve tracking performance; and (2) tracking framework, i.e., the type of framework a tracker belongs to. Note that, in this study, we mainly focus on visible-thermal (RGB-T) and visible-depth (RGB-D) tracking, and we consider the visible modality as the main modality and the other sources (i.e., thermal and depth) as auxiliary modalities. The taxonomic structure is shown in Figure 1.

Auxiliary Modality Purpose
We first discuss the auxiliary modality purpose in multi-modal tracking.
There are three main categories: (a) feature learning, where the feature representations of auxiliary modality image are extracted to help locate the target; (b) pre-processing, where the information from auxiliary modality is used before the target modeling; and (c) post-processing, where the information from auxiliary modality aims to improve the model or refine the bounding box.

Feature Learning
Methods based on feature learning extract information from the auxiliary modality through various feature extraction methods and then adopt modality fusion to combine the data from different sources. Feature learning is an explicit way to utilize multi-modal information, and most corresponding methods consider the image of the auxiliary modality as an extra channel of the model. According to the fusion method, as shown in Figure 2, these approaches can be further categorized into early fusion (EF) and late fusion (LF) [31,90]. EF-based methods combine multi-modal information at the feature level using concatenation or summation, while LF-based methods model each modality individually and obtain the final result by combining the decisions of both modalities.

Early Fusion (EF).
In EF-based methods, the features extracted from both modalities are first aggregated into a larger feature vector and then sent to the model to locate the target. The workflow of EF-based trackers is shown in the left part of Figure 2. For most trackers, EF is the primary choice in the multi-modal tracking task, with the visible and auxiliary modalities treated alike using the same feature extraction methods. Camplani et al. [43] utilize HOG features for both visible and depth maps. Kart et al. [47] extract multiple features to build a robust RGB-D tracker. Similar methods exist in [44,48,49,42,54,56,58,2,60,3]. However, the auxiliary modality often conveys information different from the visible image; for example, thermal and depth images contain temperature and depth data, respectively. The aforementioned trackers apply feature fusion while ignoring this modality discrepancy, which decreases tracking accuracy and causes the tracker to drift easily. To this end, some trackers differentiate the heterogeneous modalities by applying different feature extraction methods. In [45], a gradient feature is extracted from the depth map, while an average color feature represents the target in the visible modality. Meshgi et al. [52] use the raw depth information and several feature types (HOG, LBP, and LoG) for RGB images. In [29,57,64], HOG and intensity features are used for the visible and thermal modalities, respectively.
Due to the increasing cost involved in feature concatenation and the misalignment of multi-modal data, some methods tune the feature representation after feature extraction by the pruning [67] or re-weighting operation [50,72], which can compress the feature space and exploit the cross-modal correlation.
In DAFNet [67], a feature pruning module is proposed to eliminate noisy and redundant information.Liu et al. [50] introduce a spatial weight to highlight the foreground area.Zhu et al. [72] exploit modality importance using the proposed multi-modal aggregation network.
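As a concrete illustration, the early-fusion pipeline above can be sketched as per-modality feature extraction followed by concatenation into one vector. The sketch below uses a toy gradient-orientation histogram as a stand-in for real HOG features; the function names and the optional per-modality weights are illustrative, not taken from any cited tracker.

```python
import numpy as np

def orientation_hist(img, bins=8):
    """Toy gradient-orientation histogram standing in for a real HOG
    extractor (illustration only)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)

def early_fusion(rgb_patch, aux_patch, w=(1.0, 1.0)):
    """EF: extract a feature per modality, optionally re-weight each
    modality, and concatenate into one vector fed to a single model."""
    return np.concatenate([w[0] * orientation_hist(rgb_patch),
                           w[1] * orientation_hist(aux_patch)])

rng = np.random.default_rng(0)
feat = early_fusion(rng.random((32, 32)), rng.random((32, 32)))
```

The re-weighting argument `w` mirrors the modality-importance weighting used by methods such as [50,72], in a much-simplified form.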

Late fusion (LF). LF-based methods process both modalities simultaneously, building an independent model for each modality to make its own decision. The decisions are then combined by weighted summation [78,74,4,76], by calculating a joint distribution function [73,8,77], or by multi-step localization [75]. Conaire et al. [73] assume independence between the multi-modal data and obtain the result by multiplying the target's likelihoods in both modalities; a similar method is adopted in [77]. Xiao et al. [4] fuse two single-modal trackers via an adaptive weight map. In MCBT [75], data from multiple sources are used stepwise to locate the target: a rough target position is first estimated by optical flow in the visible domain, and the final result is determined by a part-based matching method with RGB-D data.
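A minimal sketch of decision-level (late) fusion by weighted summation: each modality produces its own response map, and the peak of the fused map gives the target position. The response maps and weights below are synthetic; a real tracker would obtain them from its per-modality models.

```python
import numpy as np

def late_fusion(resp_rgb, resp_aux, w_rgb=0.6, w_aux=0.4):
    """LF by weighted summation: combine per-modality response maps and
    return the fused map plus the predicted position (row, col)."""
    fused = w_rgb * resp_rgb + w_aux * resp_aux
    pos = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, pos

# Synthetic per-modality responses: two slightly offset Gaussian peaks.
yy, xx = np.mgrid[0:40, 0:40]
g = lambda cy, cx, s: np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * s * s))
resp_rgb = g(18, 20, 3)        # appearance (RGB) response
resp_aux = 0.9 * g(19, 20, 3)  # e.g. thermal response
fused, pos = late_fusion(resp_rgb, resp_aux)
```

With these weights the fused peak stays at the stronger RGB mode, (18, 20); an adaptive weight map as in [4] would make `w_rgb`/`w_aux` vary per pixel.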

Pre-Processing
When a depth map is available, a second purpose of the auxiliary modality is to transform the target into 3D space before target modeling via RGB-D data. Instead of tracking in the image plane, these methods model the target in world coordinates, and 3D trackers are designed accordingly [38,39,7,30,40,41]. Liu et al. [30] extend the classical mean shift tracker to 3D. In OTR [7], a dynamic spatial constraint generated from the 3D target model enhances the discrimination of DCF trackers in dealing with out-of-plane rotation and heavy occlusion. Although significant performance is achieved, the computation cost of 3D reconstruction cannot be neglected; furthermore, the performance is highly subject to the quality of the depth data and the availability of mapping functions between the 2D and 3D spaces.
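The 2D-to-3D mapping these methods rely on can be sketched with standard pinhole back-projection, assuming known camera intrinsics (fx, fy, cx, cy); this is the generic formulation, not the specific mapping of any cited tracker.

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection: pixel (u, v) with depth z (metres) to
    camera-space 3-D coordinates, given intrinsics (fx, fy, cx, cy)."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# A pixel at the principal point maps onto the optical axis.
p = backproject(320, 240, 2.0, 525.0, 525.0, 320.0, 240.0)
```

Applying this to every target pixel yields the point cloud from which 3D trackers build their target models.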

Post-processing
Compared with the RGB image, which conveys more detailed content, the depth image highlights object contours, making it possible to segment the target from its surroundings via depth variance. Inspired by this nature of the depth map, many RGB-D trackers utilize depth information to determine whether occlusion occurs and to estimate the target scale [43,46,49,79].

Occlusion Reasoning (OR).
Occlusion is a long-standing challenge in the tracking task, because the dramatic appearance variation it causes leads to model drift.
The depth cue is a powerful feature for detecting target occlusion; once occlusion is detected, the tracker can apply a global search strategy or a model updating mechanism to avoid learning from the occluded target. In [43], occlusion is detected when the depth variance is large; the tracker then enlarges the search region to detect the re-appearing target. Ding et al. [44] propose an occlusion recovery method in which a depth histogram is maintained to examine whether occlusion occurs; if so, the tracker locates the occluder and searches for candidates around it.
In [10], Zhang et al. propose a tracker switcher that detects occlusion based on template matching and tracking reliability. Their tracker dynamically selects between appearance and motion cues for tracking, thereby significantly improving robustness.
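A minimal depth-based occlusion test in the spirit of the methods above: flag occlusion when a large fraction of the pixels inside the target box is significantly closer to the camera than the tracked target depth. The margin and ratio thresholds are illustrative, not values from any cited tracker.

```python
import numpy as np

def occlusion_detected(depth_patch, target_depth, margin=0.3, ratio=0.4):
    """Flag occlusion when more than `ratio` of the pixels in the target
    box lie at least `margin` metres in front of the target depth."""
    closer = depth_patch < (target_depth - margin)
    return bool(closer.mean() > ratio)

bg = np.full((20, 20), 2.0)      # target plane at 2.0 m: no occluder
occ = bg.copy()
occ[:, :12] = 1.2                # an object at 1.2 m covers 60% of the box
```

On detection, a tracker would typically freeze model updates and widen its search region, as in [43,44].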

Scale Estimation (SE).
SE is an important module in the tracking task, as it yields a tight bounding box and helps avoid drift. CF-based trackers estimate the target scale by sampling the search region at multiple resolutions [91] or by learning a dedicated scale filter [92], which cannot effectively adapt to the target's scale change [49]. Both thermal and depth maps provide clear contour information and a coarse pixel-wise target segmentation, from which the target shape can be estimated effectively. In [46], the number of scales is adaptively changed to fit the scale variation. SEOH [49] uses the spatial continuity of depth information to achieve accurate scale estimation at minor time cost: the pixels belonging to the target are clustered by the K-means method in the depth map, and the sizes of the target and search regions are determined by the clustering result.
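An SEOH-style scale estimate can be sketched as follows: cluster the depths inside the search region with a tiny 1-D K-means (k = 2), assume the nearer cluster is the target, and take the tight extent of that cluster as the new scale. The nearer-cluster assumption and the hand-rolled K-means are simplifications for illustration.

```python
import numpy as np

def kmeans_1d(values, k=2, iters=20):
    """Tiny 1-D K-means over depth values (illustration only)."""
    centers = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels, centers

def estimate_scale(depth_patch):
    """Return the tight (h, w) extent of the nearer depth cluster,
    assumed to be the target."""
    labels, centers = kmeans_1d(depth_patch.ravel())
    mask = (labels == np.argmin(centers)).reshape(depth_patch.shape)
    ys, xs = np.nonzero(mask)
    return int(ys.max() - ys.min() + 1), int(xs.max() - xs.min() + 1)

depth = np.full((40, 40), 3.0)   # background at 3.0 m
depth[15:25, 13:27] = 1.0        # a 10 x 14 target at 1.0 m
```

Here `estimate_scale(depth)` recovers the 10 x 14 extent of the near cluster.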

Tracking Framework
In this section, multi-modal trackers are categorized by the method used for target modeling: generative or discriminative. The generative framework focuses on directly modeling the representation of the target; during tracking, the target is captured by matching the data distribution in the incoming frame. However, generative methods learn representations only for the foreground while ignoring the influence of the surroundings, and thus suffer from background clutter and distractors [93]. In comparison, discriminative models construct an effective classifier to distinguish the object from its surroundings. The tracker outputs a confidence score for each sampled candidate and chooses the best-matching patch as the target. Various sampling manners are exploited, e.g., sliding window [50], particle filter [38,45], and Gaussian sampling [11]. Furthermore, a crucial task is utilizing powerful features to represent the target; thanks to emerging convolutional networks, more and more trackers are built on efficient CNNs. We introduce the various frameworks in the following paragraphs.

Generative Methods
Sparse Learning (SL). SL has been popular in many tasks, including image recognition [94], classification [95], and object tracking [96]. In SL-based RGB-T trackers, the tracking task is formulated as a minimization problem for the reconstruction error with a learned sparse dictionary [57,29,56,58,60,63,64,1]. Lan et al. [29] propose a unified learning paradigm that collaboratively learns the target representation, modality-wise reliability, and classifier. Similar methods are also applied to the RGB-D tracking task. Ma et al. [51] construct an augmented dictionary consisting of target and occlusion templates, which achieves accurate tracking even under heavy occlusion. SL-based trackers achieve promising results at the expense of computation cost and cannot meet the requirements of real-time tracking.

Mean Shift (MS).
MS-based methods maximize the similarity between the histograms of candidates and the target template, and conduct a fast local search using the mean shift technique. These methods usually assume that the object overlaps itself in consecutive frames [77]. In [39,30], the authors extend the 2D MS method to 3D with RGB-D data. Conaire et al. [77] propose an MS tracker using a spatiogram instead of a histogram. Compared with discriminative methods, MS-based trackers directly regress the offset of the target, which omits dense sampling. These methods with lightweight features can achieve real-time performance, although their accuracy advantage is not obvious.
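The local mode-seeking step can be sketched as follows, using a precomputed weight map (playing the role of the histogram back-projection) and a fixed square window; the window size and stopping threshold are illustrative.

```python
import numpy as np

def mean_shift(weight, start, win=8, iters=30, eps=0.3):
    """Move a (2*win+1)^2 window to the weighted centroid of `weight`
    repeatedly until the shift becomes tiny; returns the final center."""
    cy, cx = float(start[0]), float(start[1])
    for _ in range(iters):
        y0 = max(int(round(cy)) - win, 0)
        x0 = max(int(round(cx)) - win, 0)
        patch = weight[y0:y0 + 2 * win + 1, x0:x0 + 2 * win + 1]
        ys, xs = np.mgrid[y0:y0 + patch.shape[0], x0:x0 + patch.shape[1]]
        mass = patch.sum()
        if mass == 0:
            break
        ny, nx = (ys * patch).sum() / mass, (xs * patch).sum() / mass
        shift = np.hypot(ny - cy, nx - cx)
        cy, cx = ny, nx
        if shift < eps:
            break
    return cy, cx

# Synthetic back-projection with a mode at (30, 25); start nearby.
yy, xx = np.mgrid[0:60, 0:60]
weight = np.exp(-((yy - 30) ** 2 + (xx - 25) ** 2) / (2 * 4.0 ** 2))
cy, cx = mean_shift(weight, (24, 20))
```

The iterations climb to the mode at roughly (30, 25), which is the frame-to-frame offset regression described above.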

Particle Filter (PF).
The PF framework is a Bayesian sequential importance sampling technique [97]. It consists of two steps, i.e., prediction and updating. In the prediction step, given the state observations z_{1:t} = {z_1, z_2, ..., z_t} over the previous t frames, the posterior distribution of the state x_t is predicted with the Bayesian rule as

p(x_t | z_{1:t-1}) = ∫ p(x_t | x_{t-1}) p(x_{t-1} | z_{1:t-1}) dx_{t-1},   (1)

where p(x_t | z_{1:t-1}) is approximated by a set of N particles, each carrying a weight w_t^i. In the updating step, w_t^i is updated according to the observation likelihood:

w_t^i ∝ w_{t-1}^i p(z_t | x_t^i).   (2)

In the PF framework, the linearity and Gaussianity restrictions imposed by the Kalman filter are relaxed, leading to accurate and robust tracking [8]. Several works improve the PF method for the multi-modal tracking task. Bibi et al. [38] formulate the PF framework in 3D, considering both representation and motion models, and propose a particle pruning method to boost tracking speed. Meshgi et al. [52] account for occlusion in the approximation step to improve occlusion handling. Liu et al. [64] propose a new likelihood function that measures the goodness of particles, thereby promoting performance.
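A bootstrap particle filter implementing the predict/re-weight/resample cycle above can be sketched as follows; the random-walk motion model, Gaussian observation likelihood, and all noise parameters are illustrative choices, not those of any cited tracker.

```python
import numpy as np

rng = np.random.default_rng(0)

def pf_step(particles, weights, z, motion_std=2.0, obs_std=3.0):
    """One predict/update cycle of a bootstrap particle filter over a 2-D
    target position; weights follow w_t^i ∝ w_{t-1}^i · p(z_t | x_t^i)."""
    # Prediction: propagate particles through a random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Update: re-weight by a Gaussian likelihood around the measurement z.
    d2 = np.sum((particles - z) ** 2, axis=1)
    weights = weights * np.exp(-d2 / (2.0 * obs_std ** 2))
    weights = weights / weights.sum()
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < len(weights) / 2:
        idx = rng.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights

particles = rng.uniform(0.0, 100.0, size=(500, 2))
weights = np.full(500, 1.0 / 500)
for z in [(50.0, 50.0), (52.0, 51.0), (54.0, 53.0)]:
    particles, weights = pf_step(particles, weights, np.array(z))
estimate = (weights[:, None] * particles).sum(axis=0)
```

After a few observations the weighted mean of the particles settles near the latest measurement, illustrating how the relaxed (non-Gaussian, non-linear) posterior is carried by the particle set.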

Correlation Filter (CF). A CF-based tracker learns a discriminative template, the correlation filter, to represent the target; the online-learned filter is then used to detect the object in the next frame. As circular convolution can be accelerated in the Fourier domain, these trackers maintain satisfactory accuracy at high speed. In recent years, many CF-based variants have been proposed to increase tracking performance, such as adding spatial regularization [98], introducing temporal constraints [99], and equipping discriminative features [100].
Due to the advantages of CF-based trackers, many researchers build multi-modal trackers on the CF framework. Kart et al. propose a long-term RGB-D tracker, OTR [7], which is designed based on CSR-DCF [101] and applies online 3D target reconstruction to facilitate learning robust filters.
The spatial constraint is learned from the 3D model of the target.When the target is occluded, view-specific DCFs are used to robustly localize the target.
Camplani et al. [43] improve the CF method in scale estimation and occlusion handling, while maintaining a real-time speed.
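The core of a CF tracker can be sketched with the single-sample MOSSE solution: the filter is solved in closed form in the Fourier domain, and detection is a single element-wise multiplication there. The regularizer `lam` and the Gaussian target response are standard but illustrative choices.

```python
import numpy as np

def train_filter(patch, response, lam=1e-3):
    """Single-sample MOSSE: closed-form filter in the Fourier domain,
    H* = (G · conj(F)) / (F · conj(F) + lam)."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(response)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def detect(H, patch):
    """Correlate a search patch with the learned filter; the response
    peak is the predicted target location."""
    resp = np.real(np.fft.ifft2(np.fft.fft2(patch) * H))
    return np.unravel_index(np.argmax(resp), resp.shape)

# Desired Gaussian response centred on the target at (16, 16).
yy, xx = np.mgrid[0:32, 0:32]
desired = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / (2 * 2.0 ** 2))
patch = np.random.default_rng(0).random((32, 32))
H = train_filter(patch, desired)
peak = detect(H, patch)
```

Re-running `detect` on the training patch reproduces the Gaussian peak at (16, 16); multi-modal CF trackers extend this by stacking per-modality feature channels (EF) or by fusing per-modality response maps (LF).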

Deep Learning (DL).
Due to its discriminative ability in feature representation, the CNN is widely used in the tracking task. Using various networks as a powerful alternative to traditional hand-crafted features is the simplest way to utilize a CNN. Liu et al. [50] combine deep features from VGGNet [102] with hand-crafted features to learn a robust representation. Li et al. [68] concatenate deep features from visible and thermal images and then adaptively fuse them using the proposed FusionNet to achieve a robust feature representation. Furthermore, some methods aim to learn an end-to-end network for multi-modal tracking. In [11,67,69], a framework borrowed from MDNet [103] is applied, with different structures for fusing the cross-modal data. These trackers achieve an obvious performance gain, but their speed is poor. Zhang et al. [71] propose an end-to-end RGB-T tracking framework with real-time speed and balanced accuracy: they apply ResNet [104] as the feature extractor and fuse RGB and thermal information at the feature level for target localization and box estimation.
Other Frameworks. Some methods use an explicit template matching method to localize the object, finding the candidate that best matches the target captured in previous frames through a pre-defined matching function [75,41].

Datasets
With the emergence of multi-modal tracking methods, several datasets and challenges for RGB-D and RGB-T tracking have been released. We summarize the available datasets in Table 2.

RGB-D Dataset
The PTB dataset ranks trackers by the success rate (SR), which measures the intersection over union (IoU) over all frames and is defined as

SR = (1/N) Σ_{i=1}^{N} u_i,  with u_i = 1 if IoU(bb_i, gt_i) > t_sr and u_i = 0 otherwise,   (3)

where IoU(·,·) denotes the IoU between the bounding box bb_i and the ground truth gt_i in the i-th frame. If the IoU is larger than the threshold t_sr, the target is considered successfully tracked. The final rank of a tracker is determined by the Avg. Rank, defined as the average ranking of its SR over all attributes. The STC dataset [4] consists of 36 RGB-D sequences and covers some extreme tracking circumstances, such as outdoor and night scenes. It is captured by still and moving ASUS Xtion RGB-D cameras to evaluate tracking under arbitrary camera motion, and a total of 10 attributes are labeled to thoroughly analyze the dataset bias; a detailed introduction of each attribute is given in the supplementary file. Trackers on STC are measured using both the SR and VOT protocols. The VOT protocol evaluates tracking performance in two aspects: accuracy and failure. Accuracy (Acc.) considers the IoU between the ground truth and the bounding box, while failure (Fail.) counts how many times the overlap drops to zero, upon which the tracker is re-initialized with the ground truth and continues to track. CDTB [87] is the latest RGB-D tracking dataset, containing 80 short-term and long-term videos. The target frequently goes out of view and is occluded, requiring the tracker to handle both tracking and re-detection. The metrics are precision (Pr.), recall (Re.), and the overall F-score [106]. Precision and recall are defined as

Pr = (1/N_p) Σ u_i,   Re = (1/N_g) Σ u_i,   (4)

where u_i is defined in Eq. 3, N_p is the number of frames in which the tracker reports the target, and N_g is the number of frames in which the target is visible. The F-score combines both precision and recall through

F = 2 · Pr · Re / (Pr + Re).   (5)
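The overlap-based metrics above (SR and the F-score) can be computed directly from their definitions; in the sketch below, boxes are (x, y, w, h) tuples and the helper names are illustrative.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x, y, w, h)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred, gt, t_sr=0.5):
    """SR: fraction of frames whose IoU exceeds the threshold t_sr."""
    u = [1.0 if iou(p, g) > t_sr else 0.0 for p, g in zip(pred, gt)]
    return sum(u) / len(u)

def f_score(pr, re):
    """F-score combining precision and recall (CDTB-style)."""
    return 2 * pr * re / (pr + re) if (pr + re) else 0.0
```

For example, a prediction shifted by half the target width yields IoU = 1/3, which counts as a failure at the common threshold t_sr = 0.5.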

RGB-T Dataset
In earlier years, two RGB-T people detection datasets were used for tracking. The OTCBVS dataset [88] has six grayscale-thermal video clips captured from two outdoor scenes. The LITIV dataset [89] contains nine sequences captured indoors, considering illumination influence. These datasets, with limited sequences and low diversity, have been deprecated. In 2016, Li et al. constructed the GTOT dataset for RGB-T tracking, which consists of 50 grayscale-thermal sequences under different scenarios and conditions. A new attribute for RGB-T tracking, thermal crossover (TC), is labeled, indicating that the target has a temperature similar to the background. Inspired by [107,108], GTOT adopts the success rate (SR) and precision rate (PR) for evaluation. PR denotes the percentage of frames whose center pixel error (CPE) is smaller than a threshold t_pr, which is set to 5 in GTOT to accommodate small targets. Li et al. [2] propose a large-scale RGB-T tracking dataset, RGBT210, which contains 210 videos with 104.7k image pairs and extends the number of attributes to 12; a detailed description of the attributes can be found in the supplementary file. The metrics are the same as GTOT, except that t_pr is normally set to 20. In 2019, the researchers enlarged RGBT210 into RGBT234 [3], which provides individual ground truth for each modality. Furthermore, besides SR and PR, the expected average overlap (EAO) is used for evaluation, combining accuracy and failures in a principled manner.
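The precision rate can be sketched directly from its definition: the fraction of frames whose center pixel error falls below t_pr. Centers are (x, y) pairs; the function name is illustrative.

```python
import numpy as np

def precision_rate(pred_centers, gt_centers, t_pr=20):
    """PR: fraction of frames whose center pixel error (Euclidean
    distance between predicted and ground-truth centers) is below t_pr."""
    d = np.linalg.norm(np.asarray(pred_centers, dtype=float)
                       - np.asarray(gt_centers, dtype=float), axis=1)
    return float((d < t_pr).mean())

pred = [(0, 0), (10, 0), (30, 0)]
gt = [(0, 0)] * 3
pr = precision_rate(pred, gt)  # two of three errors are below 20 px
```

Sweeping t_pr from 0 upward produces the precision plot used by GTOT (t_pr = 5) and RGBT234 (t_pr = 20).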

Challenges for Multi-modal Tracking
Since 2019, both RGB-D and RGB-T challenges have been held by the VOT Committee [6,5]. For the RGB-D challenge, trackers are evaluated on the CDTB dataset [87] with the same evaluation metrics. All sequences are annotated with 5 attributes, namely, occlusion, dynamics change, motion change, size change, and camera motion. The RGB-T challenge constructs its dataset as a subset of RGBT234 with slight changes to the ground truth, consisting of 60 public and 60 sequestered RGB-T videos. Compared with RGBT234, VOT-RGBT utilizes a different evaluation metric, EAO, to measure trackers. In VOT2019-RGBT, a tracker is re-initialized when tracking failure is detected (the overlap between the bounding box and the ground truth is zero). VOT2020-RGBT replaces this re-initialization mechanism with a new anchor mechanism that avoids a causal correlation between the first reset and later ones [5].

Experiments
In this section, we analyze both public datasets and challenges in terms of overall comparison, attribute-based comparison, and speed. For a fair comparison of speed, we report the device used (CPU or GPU), the platform (M: Matlab, MCN: MatConvNet, P: Python, PT: PyTorch), and the hardware setting (detailed CPU and GPU information). The available codes and detailed descriptions of the trackers are collected and listed in the supplementary files.

Experimental Comparison on RGB-D Datasets
Overall Comparison. PTB provides a website for comprehensively evaluating RGB and RGB-D methods in an online manner. We collect the results of 14 RGB-D trackers from the website and sort them by rank. The results are shown in Table 3, where we list the Avg. Rank, the SR, and the corresponding rank for each attribute; the Avg. Rank is calculated by averaging the rankings over all attributes. According to Table 3, OTR achieves the best performance among all competitors, and it is based on the CF framework without deep features.
The reason for this promising result is that the 3D reconstruction provides a useful constraint for filter learning. The same conclusion can be drawn for CA3DMS and 3DT, which construct a 3D model to locate the target via mean-shift and sparse learning methods, respectively. These trackers with traditional features are competitive with the deep trackers. DL-based trackers (WCO, TACF, and CSR-RGBD) achieve substantial performance, which indicates the discriminative power of deep features. CF-based trackers achieve varied results and constitute the most widely applied framework. Trackers based on the original CF method (DMKCF, DSKCF, and DSOH) perform significantly worse than those built on improved CF formulations (OTR, WCO, and TACF); since their models are updated regularly, these trackers with online learning drift easily. When the target is small, CF can provide precise tracking results. The occlusion handling mechanism contributes greatly on videos with target occlusion. The 3D mean shift method shows an obvious advantage in tracking rigid targets without occlusion, whereas OTOD, based on the point cloud, does not exploit the appearance cue sufficiently. OAPF obtains above-average performance in tracking small objects, indicating the effectiveness of its scale estimation strategy.
Speed Analysis. The speed report of RGB-D trackers is listed in Table 4.

Experimental Comparison of RGB-T Datasets
We select 14 trackers as our baseline to perform an overall comparison on the GTOT and RGBT234 datasets. As only some of the trackers (JMMAC, MANet, mfDiMP) release their code, we run these trackers on the two datasets and take the performance of the other trackers from their original papers. The overall results are shown in Table 5.

Challenge Results on VOT2019-RGBD
We list the challenge results in Table 6. Both original RGB trackers that do not utilize depth information and RGB-D trackers are merged for evaluation. Some trackers do not perform well, which may stem from online updating with occluded patches that degrades the discriminability of the model.

Challenge Results on VOT2019-RGBT
For the VOT2019-RGBT dataset, shown in Table 7, JMMAC, exploiting both appearance and motion cues, shows accurate and robust performance and obtains the highest EAO by a large margin. Early fusion is the primary manner of RGB-T fusion, while the late fusion method (JMMAC) has great potential for improving tracking accuracy and robustness that has not been fully exploited. All top six trackers are equipped with a CNN feature extractor, indicating the powerful ability of CNNs. SiamDW, using a Siamese network, is a general method that performs well in both RGB-D and RGB-T tasks.

Specific Network for Auxiliary Modality.
As a gap between different modalities exists and the semantic information is also heterogeneous, traditional methods use different features to extract more useful data [57,64,45]. Although sufficient work on network structures for visible image analysis has been conducted, specific architectures for depth and thermal maps have not been deeply explored. Thus, DL-based methods [11,66,67,71] treat the data of the auxiliary modality as an additional dimension of the RGB image, using the same network architecture, e.g., VGGNet or ResNet, and extracting features at the same level (layer). A crucial task is to design a network dedicated to processing multi-modal data. Since 2017, AutoML methods, especially neural architecture search (NAS), have become popular; they design architectures automatically and obtain highly competitive results in many areas, such as image classification [125] and recognition [126]. However, researchers have paid little attention to NAS for multi-modal tracking, which is a promising direction to explore.
Multi-modal Tracking with Real-time Speed.
The additional modality multiplies the computation, which makes it difficult for existing tracking frameworks to meet real-time requirements. A speed-up mechanism needs to be designed, such as feature selection [67] or knowledge distillation. Furthermore, Huang et al. [127] propose a trade-off method in which an agent decides which layer is more suitable for accurate localization, providing a 100-fold speed boost.

Metrics for Robustness Evaluation.
In some extreme scenes and weather conditions, such as rain, low illumination, and hot sunny days, visible or thermal sensors cannot provide meaningful data. The depth camera cannot obtain precise distance estimation when the object is far from the sensor. Therefore, a robust tracker needs to avoid tracking failure when any modality is unavailable for a certain period. To handle this case, both complementary and discriminative features have to be applied in localization. However, none of the existing datasets measures tracking robustness with missing data; thus, a new evaluation metric for tracking robustness needs to be considered.

Conclusion
In this study, we provide an in-depth review of multi-modal tracking. First, we organize multi-modal trackers in a unified framework and analyze them from different perspectives, including auxiliary modality purpose and tracking framework. Then, we present a detailed introduction to the datasets for multi-modal tracking and their corresponding metrics. Furthermore, a comprehensive comparison on five popular datasets is conducted, and the effectiveness of trackers of various types is analyzed in terms of overall performance, attribute-based performance, and speed. Finally, as an emerging field, promising future directions in model design and dataset construction are discussed.

Figure 1: Structure of three classification methods and algorithms in each category.

Figure 2: Workflows of early fusion (EF) and late fusion (LF). EF-based methods conduct feature fusion and model the modalities jointly, while LF-based methods model each modality individually and then combine their decisions.

Figure 3: Framework of OAPF [52]. The particle filter method is applied with occlusion handling, in which an occlusion model is constructed alongside the template model. When the target is occluded, the occlusion model is used to predict the position without updating the template model.

Figure 4: Workflow of JMMAC [10]. The CF-based tracker models the appearance cue, while both camera motion and target motion are considered, thereby achieving substantial performance.

the 2D MS method to 3D with RGB-D data. Conaire et al. [77] propose an MS tracker using a spatiogram instead of a histogram. Compared with discriminative methods, MS-based trackers directly regress the offset of the target, which avoids dense sampling. With lightweight features, these methods can achieve real-time performance, although their accuracy advantage is not obvious.
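The update shared by these trackers can be sketched generically (a 2D Gaussian-kernel version; none of the cited methods is reproduced exactly). Each iteration moves the estimate to the kernel-weighted mean of the samples, climbing toward the local density mode without any dense candidate search:

```python
import numpy as np

def mean_shift(points, start, bandwidth=1.0, n_iter=30):
    """Iterate a point toward the mode of a sample density.

    points -- (N, 2) sample locations, e.g. pixels weighted by how well
              their color/depth matches the target model
    start  -- initial position, typically the previous frame's estimate
    """
    y = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        d2 = np.sum((points - y) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))   # Gaussian kernel weights
        y = (w[:, None] * points).sum(axis=0) / w.sum()
    return y
```

The extension to 3D amounts to using (N, 3) points from the depth map and a 3D start position; the update rule is unchanged.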

Figure 5: Framework of MANet [11]. The generic adapter (GA) extracts information shared across the RGB and thermal images. The modality adapter (MA) exploits the different properties of the heterogeneous modalities. Finally, the instance adapter (IA) models the appearance properties and temporal variations of a specific object.

[65] introduce a low-rank constraint to learn the filters of both modalities collaboratively, thereby exploiting the relationship between RGB and thermal data. Hannuna et al. [46] effectively handle scale change with the guidance of the depth map. Kart et al.
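The building block these works extend is the ridge-regression correlation filter, which has a closed-form solution in the Fourier domain. A single-channel 1D sketch is shown below (the multi-modal variants learn one filter per modality under joint constraints such as low rank):

```python
import numpy as np

def train_cf(x, y, lam=1e-2):
    """Closed-form filter minimizing ||w (*) x - y||^2 + lam*||w||^2,
    where (*) is circular correlation; all operations are element-wise
    in the Fourier domain, so training costs O(n log n)."""
    X, Y = np.fft.fft(x), np.fft.fft(y)
    return Y * np.conj(X) / (X * np.conj(X) + lam)

def detect(w_hat, z):
    """Response map over a search patch z; the peak gives the shift."""
    return np.real(np.fft.ifft(w_hat * np.fft.fft(z)))
```

With a desired response y peaked at index 0, a patch shifted by s produces a response peaked at index s, which is how CF trackers localize the target between frames.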
consists of 36 RGB-D sequences and covers some extreme tracking circumstances, such as outdoor and night scenes. The dataset is captured by still and moving ASUS Xtion RGB-D cameras to evaluate tracking performance under arbitrary camera motion. A total of 10 attributes are labeled to thoroughly analyze dataset bias; a detailed introduction to each attribute is given in the supplementary file.

Figure 7: Examples and corresponding attributes in the GTOT and RGBT234 tracking datasets.
and 3DT, which construct a 3D model to locate the target via mean-shift and sparse learning methods, respectively. These trackers with traditional features are competitive with the deep trackers. DL-based trackers (WCO, TACF, and CSR-RGBD) achieve substantial performance, which indicates the discriminative power of deep features. CF-based trackers achieve varied results and constitute the most widely applied framework. Trackers based on the original CF formulation (DMKCF, DSKCF, and DSOH) perform significantly worse than those built on improved CF (OTR, WCO, and TACF). OTOD, based on point clouds, does not exploit the

Figure 8: Attribute-based comparison on RGBT234.

Most evaluated trackers perform poorly on fast-moving targets. This may result from CF-based trackers having a fixed search region: when the target moves outside the region, it cannot be detected, causing tracking failure. CMPP, which exploits inter-modal and cross-modal correlation, brings large gains on low illumination, low resolution, and thermal crossover. In sequences with these attributes, one modality is often unreliable, and CMPP can bridge the gap between the heterogeneous modalities. A detailed figure for the attribute-based comparison can be found in the supplementary file.
Trackers that obtain the top three ranks in F-score, precision, and recall are designed with the same components and framework. Unlike on the PTB dataset, DL-based methods show strong performance on VOT-RGBD19, which results from these trackers utilizing large-scale visual datasets for offline training and

Figure 9: Unregistration examples in the RGBT234 dataset. We show the ground truth of the visible modality in both images. The coarse bounding box degrades the discriminability of the model.

Registration in both spatial and temporal aspects is essential. As shown in Figure 9, the target is out of the box, and the model is degraded by learning meaningless background information. In the VOT-RGBT challenge, the dataset ensures precise annotation in the infrared modality, and the misalignment of the RGB image must be handled by the tracker. We argue that an image pre-registration process is necessary during dataset construction, i.e., cropping the shared field of view and applying an image registration method.
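As a minimal sketch of such a pre-registration step (translation-only phase correlation; a real RGB-T rig may additionally require a homography to model parallax and differing optics), the offset between the two views can be estimated and then used to crop the shared field:

```python
import numpy as np

def phase_corr_shift(a, b):
    """Estimate the integer translation taking image b onto image a."""
    A, B = np.fft.fft2(a), np.fft.fft2(b)
    R = A * np.conj(B)
    R /= np.abs(R) + 1e-12               # keep phase only
    corr = np.real(np.fft.ifft2(R))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = a.shape                        # map wrap-around to signed shifts
    return (dy - h if dy > h // 2 else dy,
            dx - w if dx > w // 2 else dx)
```

Phase correlation works on single-channel inputs, so for RGB-T pairs it would be applied to, e.g., a gradient-magnitude representation of each modality, since raw intensities are not comparable across sensors.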

Table 1: Summary of existing surveys in related fields.
multi-modal machine learning. Some of them focus on specific multi-modal information or a single task. Cai et al. [33] collect datasets captured by RGB-D sensors, which are used in many applications, such as object recognition, scene classification, hand gesture recognition, 3D simultaneous localization and mapping, and pose estimation. Camplani et al. [34] focus on multiple human tracking with RGB-D data and conduct an in-depth review of its different aspects. A comprehensive and detailed survey by Ma et al. [36] summarizes methods for RGB-T image fusion. Recently, a survey on

Table 2: Summary of multi-modal tracking datasets.
score as the target location, which can reduce model drift. In [83], a structured SVM [105] is learned by maximizing a classification score, which prevents labeling ambiguity in the training process.

Table 3: Experimental results on the PTB dataset. Numbers in parentheses indicate ranks. The top three results are shown in red, blue, and green.

Table 4: Speed analysis of RGB-D trackers.
Attribute-based Comparison. PTB provides 11 attributes from five aspects for comparison. CF-based trackers, including OTR, WCO, TACF, CSR-RGBD, and CCF, perform poorly when tracking animals. As animals move fast and ir-

Table 5: Experimental results on the GTOT and RGBT234 datasets.
equipping deeper networks. For instance, the original RGB tracker with a DL framework also achieves excellent performance. Furthermore, occlusion handling is another necessary component of a high-performance tracker: VOT2019-RGBD focuses on long-term tracking with frequent target reappearance and out-of-view periods, and most of the trackers are equipped with a re-detection mechanism. The CF framework (FuCoLoT, OTR, CSR-RGBD, and ECO) does

Table 7: Challenge results on the VOT2019-RGBT dataset.
Multi-modal fusion. Compared with tracking on unimodal data, multi-modal tracking can readily exploit powerful data fusion mechanisms. Existing methods mainly focus on feature fusion, whereas the effectiveness of other fusion types has not been explored. Compared with early fusion, late fusion avoids the bias that may arise when heterogeneous features are learned from different modalities. Another advantage of late fusion is that each modality can be modeled independently with different methods. A hybrid fusion method combining the early and late fusion strategies has been used in image segmentation
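Such a hybrid scheme can be sketched as blending the two levels (a toy linear example, not a published tracker design; in practice the blend weight `alpha` would be learned or adapted per frame):

```python
import numpy as np

def hybrid_score(f_rgb, f_t, w_joint, w_rgb, w_t, alpha=0.5):
    """Blend an early-fusion score (one model over concatenated features)
    with a late-fusion score (combined per-modality decisions)."""
    ef = float(np.concatenate([f_rgb, f_t]) @ w_joint)           # early fusion
    lf = 0.5 * float(f_rgb @ w_rgb) + 0.5 * float(f_t @ w_t)     # late fusion
    return alpha * ef + (1 - alpha) * lf
```

The appeal for tracking is that the joint term can capture cross-modal feature correlations while the decision term retains per-modality fallbacks when one sensor degrades.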