
Fusion object detection of satellite imagery with arbitrary-oriented region convolutional neural network

  • Ying Ya
  • Han Pan
  • Zhongliang Jing
  • Xuanguang Ren
  • Lingfeng Qiao
Original Paper

Abstract

Object detection on multi-source images from satellite platforms is difficult due to the characteristics of the imaging sensors. Multi-modal image fusion provides a possibility to improve the performance of object detection. This paper proposes a fusion object detection framework with an arbitrary-oriented region convolutional neural network. First, nine kinds of pansharpening methods are utilized to fuse multi-source images. Second, a novel object detection framework based on the Faster Region-based Convolutional Neural Network (Faster R-CNN) structure is used, which is suitable for large-scale satellite images. A Region Proposal Network is adopted to generate axis-aligned bounding boxes enclosing objects in different orientations, and features are then extracted by pooling layers with different sizes. These features are used to classify the proposals, adjust the bounding boxes, and predict the inclined boxes and the objectness/non-objectness score. Smaller anchors for small objects are considered. Finally, an inclined non-maximum suppression method is utilized to obtain the detection results. Experimental results show that the proposed method performs better than some state-of-the-art object detection techniques, such as YOLO-v2 and YOLO-v3. Additional numerical tests validate the efficiency and effectiveness of the proposed method.

Keywords

Object detection · Image fusion · Satellite imagery · CNN

1 Introduction

Object detection is an important and challenging research hotspot in the fields of computer vision and digital image processing. It is widely used in many application fields, such as robot navigation, intelligent video surveillance, industrial inspection, aerospace, etc. At the same time, object detection is a basic building block of object recognition and plays a vital role in subsequent recognition tasks. Because of the extensive application of deep learning, object detection algorithms have developed rapidly. The field has made great strides in the past few years, since a convolutional neural network (CNN) [1] was used and won the ImageNet competition [2] in 2012.

Satellite images are an important information resource of great significance to economic and social development. They are widely used in disaster monitoring, environmental monitoring, resource investigation, land-use assessment, agricultural output estimation, and urban construction planning because of their practicability and timeliness. However, deep learning methods cannot be applied to satellite imagery directly. There are four main problems that an algorithm needs to consider: first, the high resolution of optical satellite imagery poses serious problems for object detection; second, objects such as ships, small vehicles, and planes may be extremely small (less than 15 × 15 pixels) and dense in satellite images, in contrast to common data sets such as PASCAL VOC [3] and ImageNet; third, public training data sets are relatively scarce; fourth, complete rotation invariance is required, since many target objects such as cars, ships, and planes can appear at arbitrary orientations when viewed from overhead. Thus, there is a strong need to investigate the effect of different deep network structures and of the different modalities of multi-source satellite images.

In satellite missions, different modalities of multi-source images can be obtained by different sensors, such as panchromatic, infrared, hyper-spectral, multispectral, and SAR images. In recent years, with the rapid development of earth observation technology, many optical remote-sensing imaging satellites with high spatial resolution have emerged. Panchromatic images with sub-meter resolution provide very rich local features for visual tasks. However, multi-view imaging, long shooting distances, clouds, and shadows may lead to false alarms and missed detections. How to detect and extract objects accurately, quickly, and robustly from multi-source satellite imagery, and thereby gain more response and processing time, has attracted increasing attention.

To deal with the previous problems, we propose a fusion object detection framework based on an arbitrary-oriented region convolutional neural network. Multi-modal image fusion provides a possibility to improve the performance of object detection. The proposed method includes two stages. First, nine kinds of image fusion methods are used to fuse multi-source images. Second, a novel object detection framework based on the Faster R-CNN structure is utilized, which is suitable for large-scale satellite imagery. To detect objects in any orientation, axis-aligned bounding boxes enclosing objects in different orientations are generated, and features are then extracted by pooling layers with different sizes. The features are used to classify the proposals, adjust the bounding boxes, and predict the objectness/non-objectness score. Smaller anchors for small objects are added. Finally, an inclined non-maximum suppression method is utilized to obtain the detection results. Results from real-world applications show that the proposed method outperforms state-of-the-art methods.

The remainder of this paper is organized as follows. Section 2 introduces related work on object detection and image fusion. Section 3 describes the pansharpening methods and details our object detection method for satellite imagery. Section 4 describes the data sets used in our experiments. Finally, in Sect. 5, the evaluation indicators are introduced, and the experimental results of our algorithm are presented and discussed in detail.

2 Related work

Detecting small objects over large areas of satellite imagery remains a challenge of great significance. Object detection algorithms have made great strides in the past few years with the advent of deep learning methods and the application of convolutional neural networks (CNN), yet the performance of state-of-the-art object detection networks on satellite imagery is still unsatisfactory. Pansharpening methods can fuse multi-modal images to improve the performance of object detection on satellite imagery.

2.1 Deep learning method

Deep learning uses multi-layer computational models to progressively extract higher-level features from the raw input image and to learn abstract representations from complex structures in large amounts of data. It has attracted the attention of academia since Hinton et al. first proposed the Deep Neural Network [4] in 2006; follow-up research by Bengio, LeCun, and others sparked the surge of deep learning research. This technology has been successfully applied to many pattern recognition problems, including computer vision. In deep learning, a CNN is a class of deep neural networks with convolution structure, sparse connections, and weight sharing. These characteristics reduce the number of network parameters and the complexity of model training, and avoid cumbersome manual feature extraction and data reconstruction. At the same time, the convolution operation preserves the spatial information of image pixels, with a degree of invariance to translation, rotation, and scaling. When multi-dimensional images are directly input into the network, these advantages are more obvious. In 1989, LeCun et al. proposed an early CNN model for handwritten character recognition [5], which later evolved into LeNet-5; the method achieved satisfactory results. In 2012, AlexNet [6], a CNN model constructed by Krizhevsky et al., greatly reduced the error rate in the ImageNet large-scale visual recognition challenge, set a new record for image classification, and established the position of deep learning in computer vision.

2.2 Object detection

Object detection is a widely used computer vision technique for detecting and locating instances of semantic objects of a certain class in digital images or videos [7]. It is a very challenging subject with great potential. Object detection includes two sub-tasks: object location, which determines where objects are located in a given image, and object classification, which determines which category the objects belong to. Traditional methods may not perform well when constructing complex ensembles that combine multiple low-level image features with high-level contexts from object detectors and scene classifiers [8]. Traditional object detection methods usually use shallow trainable architectures and manually designed features [9], while deep learning methods have the capacity to learn more complex features and informative object representations. With the rapid development of deep learning, an obvious gain has been achieved.

Deep learning-based approaches such as Faster R-CNN [10], SSD [11], and YOLO [12, 13, 14] perform end-to-end object detection automatically. Faster R-CNN uses 1000 × 600 pixel input images, SSD runs on 300 × 300 or 512 × 512 pixel inputs, and YOLO ingests 416 × 416 or 544 × 544 pixel images. These frameworks are widely used in many computer vision tasks and perform well, but small objects in groups, such as flocks of ships or small, densely packed vehicles in satellite imagery, present a challenge, so they cannot be directly applied to satellite imagery [15]. In this paper, a framework based on the Faster R-CNN model is used to improve the performance on detecting small objects [16].

Faster R-CNN is a typical and effective object detection model. The original model behind Faster R-CNN is R-CNN [17], which is intuitive but has high computational complexity and long computation time. An immediate descendant of R-CNN is Fast R-CNN [18], which performs much better in terms of speed. However, the selective search algorithm for generating region proposals was still a big bottleneck. The main insight of Faster R-CNN is to replace the slow selective search algorithm with a fast neural network, the Region Proposal Network (RPN).

Most object detection methods detect horizontal bounding boxes. The Rotation Region Proposal Network (RRPN) [19], which is based on Faster R-CNN, was proposed by Ma et al. to detect arbitrary-oriented scene text. An algorithm called Rotational Region CNN [16], based on RRPN, can detect arbitrary-oriented text in natural scene images. To detect small objects in any orientation in satellite imagery, we use the Rotational Region CNN framework and add smaller anchors. However, differences between images obtained by different sensors may lead to false alarms and missed detections. The fusion of PAN and MS images provides the possibility to obtain images with higher resolution and richer local features in both the spatial and spectral domains [20].

2.3 Pansharpening methods of satellite image

Multi-band image fusion [21, 22, 23, 24, 25] methods aim to combine spatial and spectral information from one or multiple observations and other image sources, such as panchromatic images, multispectral images, or hyper-spectral images. Fusing multi-band images has become a thriving area of research in a number of different fields, such as space robotics, remote sensing, etc. Pansharpening aims at fusing a multispectral image and a panchromatic image, featuring the result of the processing with the spectral resolution of the former and the spatial resolution of the latter [20].

Panchromatic (PAN) and multispectral (MS) image fusion methods originated in the 1980s [26, 27]. Since the SPOT-1 satellite began providing panchromatic and multispectral images simultaneously in 1986, fusion methods have developed rapidly. In general, fusion methods can be classified into three categories [28]: component replacement fusion methods, multi-resolution analysis fusion methods, and model-based fusion methods. Among them, the component replacement methods are the simplest and most popular; they have been widely used in professional remote-sensing software such as The Environment for Visualizing Images (ENVI) and the Earth Resources Data Analysis System (ERDAS). First, a luminance component is obtained through a spectral transformation, and then the spatial information of the multispectral image is enhanced by replacing the luminance component with the panchromatic image. Typical methods include Principal Component Analysis (PCA) fusion [29], Gram-Schmidt (GS) fusion [30], Intensity-Hue-Saturation (IHS) fusion [31], etc. Multi-resolution analysis fusion methods extract the high spatial-frequency structure of the panchromatic image with a wavelet transform or a Laplacian pyramid and inject the extracted spatial structure into the multispectral image using a certain injection model to obtain a high spatial resolution fusion image [32]; examples include the à trous ("with holes") wavelet fusion method [33], the Laplacian pyramid-based method [34], and the Contourlet-based method [35]. Tu et al. [36] further unified component substitution and multi-resolution analysis fusion methods under a common framework, which greatly promoted the development of panchromatic/multispectral fusion methods.

3 Proposed method

To deal with the limitations discussed in Sect. 1, an object detection framework with an arbitrary-oriented region CNN is proposed. The object detection framework is described in Sect. 3.1 and the pansharpening methods are detailed in Sect. 3.2.

3.1 Object detection framework

Our object detection framework is based on Faster R-CNN, a canonical model in deep learning-based object detection. The original model behind Faster R-CNN is R-CNN, which demands significant computational resources. Fast R-CNN improved the detection speed by performing feature extraction over the whole image before proposing regions: instead of running 2000 CNNs over 2000 overlapping regions, it runs only one CNN over the entire image and uses a softmax layer instead of an SVM to classify the objects. However, the selective search algorithm for generating region proposals remained a bottleneck. The main contribution of Faster R-CNN was to replace the slow selective search algorithm with a fast neural network; specifically, it introduced the Region Proposal Network (RPN). Its main workflow is summarized as follows:
  • At the last layer of an initial CNN, a 3 × 3 sliding window moves across the feature map and maps it to a lower dimension.

  • For each sliding-window location, it generates multiple candidate regions based on k fixed-ratio anchor boxes (a minimal sketch of this step follows the list).

  • Each region proposal consists of an “objectness” score for that region and four coordinates representing the bounding box of the region.
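
As a minimal sketch of the anchor-generation step above (with illustrative scales and aspect ratios rather than the exact values used in this paper), the k anchors attached to one sliding-window position can be enumerated as follows:

```python
import numpy as np

def anchors_at(cx, cy, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Return the k = len(scales) * len(ratios) axis-aligned anchors
    (x1, y1, x2, y2) centred on one sliding-window position (cx, cy)."""
    boxes = []
    for s in scales:
        for r in ratios:
            # keep the anchor area close to s^2 while varying the aspect ratio w/h = r
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

print(anchors_at(400, 300).shape)  # (9, 4): nine candidate regions per location
```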

In a sense, Faster R-CNN = RPN + Fast R-CNN [37]. Many object detection frameworks, such as SSD [11] and YOLO [12, 13, 14], do not rely on region proposals but estimate object candidates directly. The object detection architecture used in this paper is the Rotational Region CNN [16], which improves on Faster R-CNN by utilizing the object candidates proposed by the RPN to predict orientation information; its network architecture is shown in Fig. 1. The RPN is used to propose axis-aligned bounding boxes that enclose the arbitrary-oriented objects. For each box generated by the RPN, ROI poolings with three different pooled sizes are performed, and the pooled features are concatenated to predict the objectness score, the axis-aligned box, and the inclined minimum-area box. Then, an inclined non-maximum suppression is conducted on the inclined boxes to obtain the final results.
Fig. 1

Network architecture of Rotational Region CNN [16]

The objects in satellite imagery are very small; therefore, smaller anchors are considered in the RPN. The anchor aspect ratios and other settings of the RPN are the same as in Faster R-CNN. The pooled sizes of the three ROI poolings are 7 × 7, 3 × 11, and 11 × 3, which capture more horizontal and vertical features of small-scale objects. We estimate both the axis-aligned bounding box and the inclined bounding box; therefore, we perform normal NMS on the axis-aligned bounding boxes and inclined NMS on the inclined bounding boxes [16], where the Intersection-over-Union (IoU) between inclined boxes is computed as in [19]. The loss function used during training is the same as in Faster R-CNN, but the loss defined on each proposal is different; it includes the objectness/non-objectness classification loss and the box regression losses:
$$\begin{aligned} L\left( {p,b,v,v^{*} ,u,u^{*} } \right) & = L_{\text{cls}} \left( {p,b} \right) \\ &+ \lambda_{1} b\mathop \sum \limits_{{i \in \left\{ {x,y,w,h} \right\}}} L_{\text{reg}} \left( {v_{i} ,v_{i}^{*} } \right) \\ &+ \lambda_{2} b\mathop \sum \limits_{{i \in \left\{ {x1,y1,x2,y2,h} \right\}}} L_{\text{reg}} \left( {u_{i} ,u_{i}^{*} } \right), \\ \end{aligned}$$
(1)
in which the parameters \(\lambda_{1}\) and \(\lambda_{2}\) balance the trade-off between the three terms and b is the indicator of the class label (object: \(b = 1\), background: \(b = 0\)). \(p = \left( {p_{0} ,p_{1} } \right)\) is the probability over object and background computed by the softmax function. \(L_{\text{cls}} \left( {p,b} \right) = - \log p_{b}\) is the log loss for the true class b. v and u are the tuples of the true axis-aligned and true inclined bounding box regression targets, and \(v^{*}\) and \(u^{*}\) are the predicted tuples for the object label. The parameterization of v and \(v^{*}\) is given in [38]; they specify a scale-invariant translation and a log-space height/width shift relative to an object proposal. Using \(\left( {a,a^{*} } \right)\) to denote \(\left( {v_{i} ,v_{i}^{*} } \right)\) or \(\left( {u_{i} ,u_{i}^{*} } \right)\), \(L_{\text{reg}} \left( {a,a^{*} } \right)\) is defined as:
$$L_{\text{reg}} \left( {a,a^{*} } \right) = {\text{smooth}}_{{{\text{L}}1}} \left( {a - a^{*} } \right)$$
(2)
$${\text{smooth}}_{L1} \left( x \right) = \left\{ {\begin{array}{*{20}{l}} {0.5x^{2} } &\quad {{\text{if}}\,\left| x \right| < 1} \\ {\left| x \right| - 0.5 } &\quad {\text{otherwise}} \\ \end{array} } \right..$$
(3)
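
As a minimal numeric sketch of the per-proposal loss (1)–(3), with \(\lambda_1 = \lambda_2 = 1\) as illustrative defaults (the actual training hyperparameters are not restated here):

```python
import numpy as np

def smooth_l1(x):
    """Eq. (3): 0.5 x^2 where |x| < 1, |x| - 0.5 otherwise (element-wise)."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * ax ** 2, ax - 0.5)

def proposal_loss(p, b, v, v_star, u, u_star, lam1=1.0, lam2=1.0):
    """Eq. (1): log loss for the true class b plus the axis-aligned (v, v*)
    and inclined (u, u*) box regression losses, the latter two only for
    object proposals (b = 1)."""
    cls_loss = -np.log(p[b])                        # L_cls(p, b) = -log p_b
    reg_axis = smooth_l1(np.asarray(v) - np.asarray(v_star)).sum()
    reg_incl = smooth_l1(np.asarray(u) - np.asarray(u_star)).sum()
    return cls_loss + lam1 * b * reg_axis + lam2 * b * reg_incl
```

The inclined NMS applied to the final inclined boxes can be sketched as below; the rotated-box IoU is computed here with shapely polygons in the spirit of [19], and the 0.3 threshold is an illustrative choice rather than the value used in the original implementation:

```python
from shapely.geometry import Polygon

def inclined_iou(box_a, box_b):
    """IoU of two inclined boxes, each given as four (x, y) corner points."""
    pa, pb = Polygon(box_a), Polygon(box_b)
    inter = pa.intersection(pb).area
    union = pa.area + pb.area - inter
    return inter / union if union > 0 else 0.0

def inclined_nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression on inclined boxes ranked by score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if inclined_iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```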

3.2 Pansharpening algorithms

The object detection framework described in Sect. 3.1 achieves good results. However, multi-view imaging, long shooting distances, clouds, and shadows may lead to false alarms and missed detections. Multi-modal image fusion can improve the performance of object detection on satellite imagery. Pansharpening algorithms aim at fusing a panchromatic (PAN) image and a multispectral (MS) image simultaneously acquired over the same area. The MS image has fewer spatial details, while the PAN image has only a single band. Pansharpening combines the spatial details resolved by the PAN image and the several spectral bands of the MS image into a unique product, which improves object detection performance by providing images with the highest resolutions in both the spatial and spectral domains.

Pansharpening methods can usually be divided into two main classes: component substitution (CS) methods and multi-resolution analysis (MRA) methods. First, the notation used in this paper is summarized in Table 1.
Table 1

List of the main symbols

Symbol | Description
MS | Multispectral image
\(\widetilde{\text{MS}}\) | MS image interpolated at the scale of the PAN
P | PAN image
\(\widehat{\text{MS}}\) | Pansharpened image
R | Spatial resolution ratio between MS and PAN
N | Number of MS bands

Vectors in this paper are expressed in bold lowercase (e.g., \({\mathbf{a}}\)), and \(a_{i}\) indicates the ith element. Two-dimensional and three-dimensional arrays are indicated in bold uppercase (e.g., \({\mathbf{A}}\)). We use a 3-D array \({\mathbf{A}} = \left\{ {{\mathbf{A}}_{k} } \right\}_{k = 1, \ldots ,N}\) to indicate an MS image composed of N bands indexed by the subscript \(k = 1, \ldots , N\), and \({\mathbf{A}}_{k}\) indicates the kth band of \({\mathbf{A}}\). A PAN image is a 2-D matrix and is expressed as \({\mathbf{P}}\).

A typical formulation of CS fusion method is given by:
$$\widehat{{{\text{MS}}_{k} }} = \widetilde{{{\text{MS}}_{k} }} + g_{k} \left( {{\mathbf{P}} - {\mathbf{M}}_{L} } \right),\quad k = 1, \ldots ,N,$$
(4)
in which k indicates the kth spectral band, \({\mathbf{g}} = \left[ {g_{1} , \ldots , g_{k} , \ldots , g_{N} } \right]\) is the vector of injection gains, and \({\mathbf{M}}_{L}\) is defined as:
$${\mathbf{M}}_{L} = \mathop \sum \limits_{i = 1}^{N} w_{i} \widetilde{{{\text{MS}}_{i} }},$$
(5)
in which the weight vector \({\mathbf{w}} = \left[ {w_{1} , \ldots , w_{i} , \ldots , w_{N} } \right]\) is the first row of the forward transformation matrix and measures the degree of spectral overlap between the MS channels and the PAN [39, 40].

The pipeline of the CS approach is: (1) interpolate the MS image to match the scale of the PAN; (2) calculate the intensity component using (5) and match the histogram of the PAN to that of the intensity component; (3) inject the extracted details using (4).
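
The CS pipeline can be sketched as below. The equal weights and unit gains are GIHS-like placeholders, and mean/std matching stands in for full histogram matching, so this illustrates Eqs. (4)–(5) generically rather than any specific method of Table 2:

```python
import numpy as np
from scipy.ndimage import zoom

def cs_pansharpen(ms, pan, w=None, g=None, ratio=4):
    """Generic component-substitution fusion, Eqs. (4)-(5).
    ms: (N, h, w) multispectral bands, pan: (ratio*h, ratio*w) panchromatic band,
    w, g: spectral weights and injection gains (method-dependent, see Table 2)."""
    n = ms.shape[0]
    w = np.full(n, 1.0 / n) if w is None else np.asarray(w)
    g = np.ones(n) if g is None else np.asarray(g)
    ms_up = np.stack([zoom(band, ratio, order=3) for band in ms])  # MS interpolated to PAN scale
    intensity = np.tensordot(w, ms_up, axes=1)                     # M_L, Eq. (5)
    # crude histogram matching: align mean and standard deviation of PAN to M_L
    pan_m = (pan - pan.mean()) / (pan.std() + 1e-12) * intensity.std() + intensity.mean()
    return ms_up + g[:, None, None] * (pan_m - intensity)          # Eq. (4)
```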

The CS family includes many methods, which are described in more detail below. Table 2 summarizes their spectral weights and injection gains used in (5) and (4). In \(w_{k,i}\), the subscripts k and i refer to the output and input bands, respectively.
Table 2

Spectral weights in (5) and injection gains in (4) for several CS-based methods

Method | \(w_{k,i}\) | \(g_{k}\)
BT [41] | \(1/N\) | \(\widetilde{{\text{MS}}_{k}}/{\mathbf{M}}_{L}\)
PCA [42] | \({\mathbf{A}}_{1,i}\) | \({\mathbf{A}}_{1,k}\)
GS [42] | \(1/N\) | \({\text{cov}}({\mathbf{M}}_{L},\widetilde{{\text{MS}}_{k}})/{\text{var}}({\mathbf{M}}_{L})\)
GSA [43] | \(\widehat{w}_{i}\) (Eq. 6) | \({\text{cov}}({\mathbf{M}}_{L},\widetilde{{\text{MS}}_{k}})/{\text{var}}({\mathbf{M}}_{L})\)
BDSD [44] | \(\widehat{w}_{k,i}\) (Eqs. 8–9) | \(\widehat{g}_{k}\) (Eqs. 8–9)
PRACS [45] | \(\widehat{w}_{i}\) (Eq. 6) | Eqs. 11–12
IHS | \(1/N\) (\(N = 3\)) | 1
GIHS [46] | any \(w_{i} \ge 0\) | \(\left(\sum_{i = 1}^{N} w_{i}\right)^{-1}\)

(1) PCA

PCA is a linear transformation of the data, achieved through a multi-dimensional rotation of the original coordinate system of the N-dimensional vector space. Principal components (PCs) are a set of scalar images produced by projecting the original spectral vectors onto the new axes, which are the eigenvectors of the covariance matrix along the spectral direction. They are uncorrelated with each other. PCs are usually sorted by decreasing variance, which quantifies their information content.
(2) GS

The GS transformation is a common technique used to orthogonalize a set of vectors in linear algebra and multivariate statistics. GS orthogonalization processes one MS vector at a time by finding its projection on the (hyper)plane defined by the previously found orthogonal vectors and its orthogonal component, so that the sum of the orthogonal and projection components equals the zero-mean version of the original vectorized band. Pansharpening replaces \({\mathbf{M}}_{L}\) with the histogram-matched \({\mathbf{P}}\) before the inverse transformation [20]. Aiazzi et al. proposed the adaptive GS (GSA) [43], in which \({\mathbf{M}}_{L}\) is generated as a weighted average of the MS bands, with weights estimated by minimizing the MSE with respect to a low-pass-filtered version of the PAN:
$${\mathbf{P}}_{L} = \mathop \sum \limits_{k = 1}^{N} w_{k} \widetilde{{{\text{MS}}_{k} }}.$$
(6)
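
As a hedged sketch of how the GSA weights of Eq. (6) and the covariance-based injection gains of Table 2 could be estimated (assuming the interpolated MS bands and a low-pass version of the PAN are already available):

```python
import numpy as np

def gsa_weights(ms_up, pan_low):
    """Least-squares estimate of the spectral weights in Eq. (6):
    regress the low-pass PAN onto the interpolated MS bands."""
    A = ms_up.reshape(ms_up.shape[0], -1).T          # (pixels, N)
    y = pan_low.reshape(-1)
    w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w_hat

def gs_gains(ms_up, intensity):
    """GS/GSA injection gains of Table 2: cov(M_L, MS_k) / var(M_L)."""
    gains = []
    for band in ms_up:
        c = np.cov(intensity.reshape(-1), band.reshape(-1))
        gains.append(c[0, 1] / c[0, 0])
    return np.array(gains)
```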
(3) Band-Dependent Spatial Detail (BDSD) algorithm

The Band-Dependent Spatial Detail (BDSD) algorithm [44] starts from an extended version of the generic formulation (4) as follows:
$$\widehat{{{\text{MS}}_{k} }} = \widetilde{{{\text{MS}}_{k} }} + g_{k} \left( {{\mathbf{P}} - \mathop \sum \limits_{i = 1}^{N} w_{k,i} \widetilde{{{\text{MS}}_{i} }}} \right), \quad k = 1, \ldots ,N.$$
(7)
The coefficients are defined as:
$$\gamma_{k,i} = \left\{{\begin{array}{*{20}{l}} {g_{k} } &\quad {{\text{if}}\,i = N + 1} \\ { - g_{k} \cdot w_{k,i} } &\quad {\text{otherwise}} \\ \end{array} } \right.;$$
(8)
Equation (7) can be rewritten in compact matrix form as:
$$\widehat{{{\text{MS}}_{k} }} = \widetilde{{{\text{MS}}_{k} }} + L\gamma_{k} ,$$
(9)
in which \({\mathbf{L}} = \left[ {\widetilde{{{\text{MS}}_{1} }}, \ldots ,\widetilde{{{\text{MS}}_{N} }},{\mathbf{P}}} \right]\) and \(\gamma_{k} = \left[ {\gamma_{k,1} , \ldots ,\gamma_{k, N + 1} } \right]^{T}\).
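
The compact form (9) can be applied per band as in the sketch below, assuming the coefficient matrix gamma (whose MMSE estimation at reduced scale follows Eqs. (8)–(9) of [44] and is not shown) is already available:

```python
import numpy as np

def bdsd_inject(ms_up, pan, gamma):
    """Eq. (9): MS_hat_k = MS_tilde_k + L gamma_k, with
    L = [MS_tilde_1, ..., MS_tilde_N, P] and gamma of shape (N, N + 1)."""
    n, h, w = ms_up.shape
    L = np.vstack([ms_up.reshape(n, -1), pan.reshape(1, -1)]).T   # (pixels, N + 1)
    details = (L @ gamma.T).T.reshape(n, h, w)                    # one detail map per band
    return ms_up + details
```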
(4) PRACS

The concept of partial replacement of the intensity component is described in [45] and named Partial Replacement Adaptive CS (PRACS). This method utilizes \({\mathbf{P}}^{\left( k \right)}\), a weighted sum of the PAN and of the kth MS band, to calculate the kth sharpened band in (4). For \(k = 1, \ldots ,N\), the band-dependent high-resolution sharpening image is calculated as:
$$\begin{array}{*{20}c} {{\mathbf{P}}^{\left( k \right)} = CC\left( {{\mathbf{M}}_{L} , \widetilde{{{\text{MS}}_{k} }}} \right) \cdot P + \left( {1 - {\text{CC}}\left( {{\mathbf{M}}_{L} , \widetilde{{{\text{MS}}_{k} }}} \right)} \right) \cdot \widetilde{{{\text{MS}}_{k} }}^{\prime } } \\ \end{array} .$$
(10)
The injection gains {\(g_{k}\)} are obtained by:
$$\begin{array}{*{20}c} {g_{k} = \beta \cdot CC\left( {{\mathbf{P}}_{L}^{\left( k \right)} ,\widetilde{{{\text{MS}}_{k} }}} \right)\frac{{{\text{std}}\left( {\widetilde{{{\text{MS}}_{k} }}} \right)}}{{\frac{1}{N}\mathop \sum \nolimits_{i = 1}^{N} {\text{std}}\left( {\widetilde{{{\text{MS}}_{i} }}} \right)}}L_{k} } \\ \end{array} .$$
(11)
\(L_{k}\) is defined as:
$$\begin{array}{*{20}c} {L_{k} = 1 - \left| {1 - CC\left( {{\mathbf{M}}_{L} , \widetilde{{{\text{MS}}_{k} }}} \right)\frac{{\widetilde{{{\text{MS}}_{k} }}}}{{{\mathbf{P}}_{L}^{\left( k \right)} }}} \right|} \\ \end{array} .$$
(12)
The pipeline of the MRA approach is: (1) interpolate the MS image to reach the PAN scale; (2) calculate the low-pass version \({\mathbf{P}}_{L}\) of the PAN by means of the equivalent filter for a scale ratio equal to R, and compute the band-dependent injection gains \(\left\{ {g_{k} } \right\}_{k = 1, \ldots ,N}\); (3) inject the extracted details using (13). A brief summary of the MRA-based approaches is shown in Table 3.
$$\begin{array}{*{20}c} {\widehat{{{\text{MS}}_{k} }} = \widetilde{{{\text{MS}}_{k} }} + g_{k} \left( {{\mathbf{P}} - {\mathbf{P}}_{L} } \right),\quad k = 1, \ldots ,N} \\ \end{array} .$$
(13)
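
A sketch of the generic MRA injection (13); the Gaussian low-pass filter is a placeholder for the method-specific filters of Table 3, and the gains can be unit gains (HPF/MTF-GLP style) or band ratios (HPM/SFIM style):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mra_pansharpen(ms_up, pan, mode="unit", sigma=2.0, eps=1e-12):
    """Generic MRA fusion, Eq. (13): MS_hat_k = MS_tilde_k + g_k (P - P_L).
    ms_up: (N, H, W) interpolated MS, pan: (H, W) PAN; the Gaussian filter
    stands in for the analysis filters listed in Table 3."""
    pan_low = gaussian_filter(pan, sigma=sigma)          # P_L
    details = pan - pan_low                              # P - P_L
    if mode == "unit":                                   # HPF / MTF-GLP style gains
        gains = np.ones_like(ms_up)
    else:                                                # HPM / SFIM style ratio gains
        gains = ms_up / (pan_low + eps)
    return ms_up + gains * details
```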
Table 3

MRA-based pansharpening methods and related MRA schemes with filters and injection gains

Method | Type of MRA and filter | \(g_{k}\)
HPF [29] | ATWT w/ box filter | 1
HPM [47]/SFIM [48, 49] | ATWT w/ box filter | \(\widetilde{{\text{MS}}_{k}}/{\mathbf{P}}_{L}\)
Indusion [50] | DWT w/ CDF bior. filter | 1
MTF-GLP [34] | GLP w/ MTF filter | 1
MTF-GLP-CBD | GLP w/ MTF filter | \({\text{cov}}({\mathbf{P}}_{L},\widetilde{{\text{MS}}_{k}})/{\text{var}}({\mathbf{P}}_{L})\)
MTF-GLP-HPM [51] | GLP w/ MTF filter | \(\widetilde{{\text{MS}}_{k}}/{\mathbf{P}}_{L}\)
MTF-GLP-HPM-PP | GLP w/ MTF filter | \(\widetilde{{\text{MS}}_{k}}/{\mathbf{P}}_{L}\)

(5) Low-Pass Filtering (LPF)

A common implementation applies a single linear time-invariant low-pass filter \(h_{\text{LP}}\) to the PAN image \({\mathbf{P}}\) to obtain \({\mathbf{P}}_{L}\), as given by (14), where * denotes the convolution operator:
$$\widehat{{{\text{MS}}_{k} }} = \widetilde{{{\text{MS}}_{k} }} + g_{k} \left( {{\mathbf{P}} - {\mathbf{P}} *h_{LP} } \right),\quad k = 1, \ldots ,N.$$
(14)
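
A minimal realization of (14) with a box filter of width R (the HPF-style choice in Table 3), used here as an illustrative stand-in for \(h_{\text{LP}}\):

```python
from scipy.ndimage import uniform_filter

def lpf_pansharpen(ms_up, pan, g, R=4):
    """Eq. (14): low-pass the PAN with a single LPF (a box filter of size R,
    approximating P * h_LP) and inject the extracted details with gains g.
    ms_up: (N, H, W) interpolated MS, pan: (H, W) PAN, g: (N,) NumPy array."""
    pan_low = uniform_filter(pan, size=R)        # P_L = P * h_LP
    return ms_up + g[:, None, None] * (pan - pan_low)
```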
(6) Pyramidal Decompositions

This approach is commonly referred to as pyramidal decomposition and utilizes Gaussian low-pass filters to carry out the analysis steps. The Gaussian filters can be tuned to closely match the sensor MTF, allowing the extraction from the PAN image of those details that are not seen by the MS sensor because of its coarser spatial resolution [20].

4 Data sets

4.1 Object detection training data

The training data set utilized for our object detection algorithm is DOTA [52], a large-scale data set for object detection in aerial images. DOTA contains 2806 images from different sensors and platforms. There are 15 categories in the DOTA data set: large vehicle, small vehicle, plane, helicopter, ship, harbour, bridge, baseball diamond, basketball court, soccer-ball field, tennis court, ground-track field, roundabout, swimming pool, and storage tank. The size of each image ranges from about 800 × 800 to 4000 × 4000 pixels. The objects in DOTA have multiple scales, orientations, and shapes.

The fully annotated DOTA images contain 188,282 instances, each annotated with an arbitrary quadrilateral. Figure 2 shows some examples of annotated DOTA images.
Fig. 2

Examples of annotated DOTA images [52]

4.2 Fusion data

For image fusion, we use three data sets:
(1) Pléiades data set

The Pléiades data set was used for the 2006 data fusion contest [53]; it was collected by an aerial platform and provided by CNES, the French Space Agency. The images cover an urban area of Toulouse (France) and have a size of 1024 × 1024 pixels. The resolution of the four MS bands is 0.6 m. The high-resolution PAN data were simulated by the following procedure: the red and green channels were averaged, and the result was filtered with a system characterized by the nominal MTF of the PAN sensor; after resampling to 0.8 m, thermal noise was added; finally, inverse filtering and wavelet denoising were used to obtain the simulated image [20].
(2) Kaggle Dstl Satellite Imagery Feature Detection Competition Data

Defence Science and Technology Laboratory (Dstl) provides 1000 m × 1000 m satellite images in both 3-band and 16-band formats in this competition [54]. The 3-band images are the traditional RGB natural colour images. The 16-band images contain spectral information by capturing wider wavelength channels. This multi-band imagery is taken from the multispectral (400–1040 nm) and short-wave infrared (SWIR) (1195–2365 nm) range.
(3) 2019 IEEE GRSS Data Fusion Contest Data

This contest [55] provides Urban Semantic 3D (US3D) data, a large-scale public data set including multi-view, multi-band satellite imagery and ground truth geometric and semantic labels for two large cities. The US3D data set includes incidental satellite imagery, Airborne LiDAR, and semantic labels covering approximately 100 square kilometres over Jacksonville, Florida and Omaha, Nebraska, United States. WorldView-3 panchromatic and 8-band visible and near infrared (VNIR) images are provided courtesy of Digital Globe. Source data consist of 26 images collected between 2014 and 2016 over Jacksonville, Florida, and 43 images collected between 2014 and 2015 over Omaha, Nebraska, United States. Ground sampling distance (GSD) is approximately 35 cm and 1.3 m for panchromatic and VNIR images, respectively. VNIR images are all pan sharpened. Satellite imagery is provided in geographically non-overlapping tiles, where Airborne LiDAR data and semantic labels are projected into the same plane.

5 Experiments

The proposed fusion object detection method is comprehensively evaluated on the data sets described in Sect. 4. First, the object detection framework is evaluated and compared with state-of-the-art methods on the DOTA dataset. Second, fusion results of PAN and MS images are shown, and the performance of our fusion object detection framework is evaluated.

5.1 Performance evaluation of object detection

First, we evaluated our object detection approach on the DOTA dataset and compared it with the state-of-the-art YOLO-v2 and YOLO-v3 frameworks, since YOLO is a typical one-stage object detection method with good performance. The evaluation images are clipped to 800 × 800 pixels. The evaluation indicators are Precision, Recall, F1-Measure, Average Precision (AP), and mean Average Precision (mAP), defined in terms of:
  • TP: True Positives

  • FP: False Positives

  • FN: False Negatives

  • TN: True Negatives.

$$\begin{array}{*{20}c} {{\text{Precision}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FP}}}}} \\ \end{array}$$
(15)
$$\begin{array}{*{20}c} {{\text{Recall}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}}} \\ \end{array}$$
(16)
$$\begin{array}{*{20}{l}} {{\hbox{F1-Measure}} = 2 \times \frac{{{\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}}} \\ \end{array} .$$
(17)
AP is the area enclosed by the P–R curve, and mAP is the mean value of AP over all classes. Tables 4, 5, 6, and 7 report the Precision, Recall, F1-Measure, and AP/mAP of the YOLO-v2 and YOLO-v3 frameworks and of our method in the horizontal and rotational settings. The test speed of our experiment is provided in Table 8, assessed on an NVIDIA GeForce GTX 1080Ti GPU.
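
A sketch of how these indicators and AP could be computed from per-class counts and a sampled P–R curve; this mirrors the definitions (15)–(17) rather than the exact evaluation code used in the experiments:

```python
import numpy as np

def detection_metrics(tp, fp, fn):
    """Eqs. (15)-(17) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(precisions, recalls):
    """AP as the area enclosed by the P-R curve (trapezoidal integration over
    recall); mAP is then the mean of the per-class AP values."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order]))
```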
Table 4

Precision of YOLO-v2, YOLO-v3, and our method

Classes | YOLO-v2 | YOLO-v3 | Horizon of our method | Rotation of our method
Tennis court | 0.2454 | 0.2686 | 0.9447 | 0.9446
Harbour | 0.1293 | 0.0845 | 0.7694 | 0.6977
Bridge | 0.0803 | 0.0122 | 0.5876 | 0.5140
Plane | 0.3453 | 0.1752 | 0.8857 | 0.9101
Ship | 0.2253 | 0.2054 | 0.7928 | 0.6353
Ground-track field | 0.2184 | 0.0105 | 0.6447 | 0.6527
Large vehicle | 0.2003 | 0.1391 | 0.7871 | 0.6765
Helicopter | 0.0283 | 0.0049 | 0.5000 | 0.4891
Basketball court | 0.0977 | 0.0164 | 0.6193 | 0.6444
Roundabout | 0.1117 | 0.0433 | 0.6498 | 0.6752
Small vehicle | 0.2601 | 0.1001 | 0.5795 | 0.5390
Storage tank | 0.2631 | 0.2137 | 0.8213 | 0.8293
Soccer-ball field | 0.1588 | 0.0172 | 0.6000 | 0.6063
Swimming pool | 0.0062 | 0.0201 | 0.5849 | 0.5654
Baseball diamond | 0.2694 | 0.0179 | 0.6964 | 0.7143

Bold values indicate the best result among these methods

Table 5

Recall of YOLO-v2, YOLO-v3, and our method

Classes | YOLO-v2 | YOLO-v3 | Horizon of our method | Rotation of our method
Tennis court | 0.7422 | 0.7427 | 0.7603 | 0.8669
Harbour | 0.5761 | 0.4508 | 0.6221 | 0.6166
Bridge | 0.3545 | 0.1377 | 0.3716 | 0.3284
Plane | 0.6585 | 0.4256 | 0.7577 | 0.8249
Ship | 0.6707 | 0.7492 | 0.5240 | 0.5247
Ground-track field | 0.2387 | 0.1080 | 0.5653 | 0.5854
Large vehicle | 0.4116 | 0.6943 | 0.4237 | 0.5052
Helicopter | 0.1888 | 0.0153 | 0.4745 | 0.4592
Basketball court | 0.4278 | 0.1813 | 0.3824 | 0.4363
Roundabout | 0.6571 | 0.1918 | 0.4940 | 0.5084
Small vehicle | 0.3509 | 0.8190 | 0.4669 | 0.4719
Storage tank | 0.5073 | 0.2969 | 0.4593 | 0.4639
Soccer-ball field | 0.2482 | 0.1106 | 0.4054 | 0.4275
Swimming pool | 0.0091 | 0.0751 | 0.5659 | 0.5527
Baseball diamond | 0.4566 | 0.1030 | 0.6626 | 0.6768

Bold values indicate the best result among these methods

Table 6

F1-Measure of YOLO-v2, YOLO-v3, and our method

Classes | YOLO-v2 | YOLO-v3 | Horizon of our method | Rotation of our method
Tennis court | 0.3688 | 0.3945 | 0.8425 | 0.9041
Harbour | 0.2112 | 0.1424 | 0.6879 | 0.6547
Bridge | 0.1309 | 0.0223 | 0.4553 | 0.4008
Plane | 0.4531 | 0.2482 | 0.8167 | 0.8654
Ship | 0.3373 | 0.3224 | 0.6310 | 0.5747
Ground-track field | 0.2281 | 0.0192 | 0.6024 | 0.6172
Large vehicle | 0.2694 | 0.2317 | 0.5509 | 0.5784
Helicopter | 0.0492 | 0.0074 | 0.4869 | 0.4737
Basketball court | 0.1591 | 0.0300 | 0.4729 | 0.5203
Roundabout | 0.1909 | 0.0706 | 0.5613 | 0.5800
Small vehicle | 0.2987 | 0.1783 | 0.5171 | 0.5032
Storage tank | 0.3465 | 0.2485 | 0.5891 | 0.5949
Soccer-ball field | 0.1937 | 0.0297 | 0.4839 | 0.5014
Swimming pool | 0.0074 | 0.0317 | 0.5753 | 0.5590
Baseball diamond | 0.3388 | 0.0305 | 0.6791 | 0.6950

Bold values indicate the best result among these methods

Table 7

AP and mAP of YOLO-v2, YOLO-v3, and our method

Classes | YOLO-v2 | YOLO-v3 | Horizon of our method | Rotation of our method
Tennis court | 0.5498 | 0.6996 | 0.7591 | 0.8590
Harbour | 0.3259 | 0.2297 | 0.5779 | 0.5310
Bridge | 0.1840 | 0.1019 | 0.2874 | 0.2345
Plane | 0.5841 | 0.3443 | 0.7502 | 0.8151
Ship | 0.4905 | 0.5619 | 0.4918 | 0.4252
Ground-track field | 0.2004 | 0.0694 | 0.4870 | 0.5051
Large vehicle | 0.2241 | 0.4181 | 0.3941 | 0.3985
Helicopter | 0.1591 | 0.0182 | 0.4208 | 0.3757
Basketball court | 0.2817 | 0.1483 | 0.3416 | 0.3830
Roundabout | 0.4416 | 0.0492 | 0.4446 | 0.4586
Small vehicle | 0.2235 | 0.3535 | 0.3643 | 0.3466
Storage tank | 0.4113 | 0.2002 | 0.4453 | 0.4511
Soccer-ball field | 0.2150 | 0.0937 | 0.3534 | 0.3751
Swimming pool | 0.0007 | 0.0027 | 0.4551 | 0.4416
Baseball diamond | 0.3250 | 0.0081 | 0.5943 | 0.6199
mAP | 0.3078 | 0.2199 | 0.4778 | 0.4813

Bold values indicate the best result among these methods

Table 8

Speed of YOLO-v2, YOLO-v3, and our method

Method | Speed (fps)
YOLO-v2 | 25.8771
YOLO-v3 | 5.9878
Our method | 17.5131

From the tables above, we can conclude that our object detection framework performs better on satellite imagery than YOLO-v2 and YOLO-v3, achieving competitive results in Precision, Recall, F1-Measure, and AP. The mAP of our method is 17.4 percentage points higher than that of YOLO-v2 and 26.1 percentage points higher than that of YOLO-v3. The model performed particularly well on helicopters, harbours, and swimming pools.

Figure 3 demonstrates that the proposed approach can detect arbitrary-oriented small objects on the DOTA dataset. Even when the image resolution is as high as 3000 × 4000 pixels, small objects of only about 15 pixels can be detected, which validates the efficiency of the object detection method.
Fig. 3

Some detection results of the proposed framework on DOTA dataset

5.2 Pansharpening results

The three datasets described in Sect. 4.2 are utilized to show the results of the image fusion methods presented in Sect. 3.2. The results on the Kaggle Dstl Satellite Imagery Feature Detection Competition data are shown in Fig. 4: Fig. 4a is the PAN image and Fig. 4b is the 16-band MS image.
Fig. 4

Fusion results of Kaggle Dstl satellite imagery feature detection competition data

From the results of our image fusion experiment, it is worth underlining that the CS methods have higher spectral distortion but their final products have a better visual appearance, while the MRA methods have higher spatial distortion but better spectral consistency.

As a whole, the fused images contain more local features and content, and their resolution is significantly improved.

5.3 Evaluation of fusion object detection

To evaluate the fusion object detection framework, we use the PAN image, the MS image, and the images fused by the aforementioned methods as inputs to the object detection framework, and collect statistics on the three datasets. The number of detected objects is used to evaluate the performance of the experiment, as shown in Table 9.
Table 9

Number of detected objects

Method | Number of detected objects
PAN | 13
MS | 5
BDSD | 85
GS | 21
PCA | 37
PRACS | 145
HPF | 85
MTF | 114
MTF_GLP_HPM | 111
MTF_GLP_HPM_PP | 98
MTF_GLP_CBD | 108
Indusion | 120
SFIM | 82

We utilize 15 images from the three datasets. The number of objects detected in the fused images is clearly much larger than in the PAN and MS images. Moreover, some objects detected in the low-resolution MS images and the single-band PAN images are not true objects, as shown in Fig. 5.
Fig. 5

Detection performance on MS, PAN, and MTF_GLP_HPM_PP images

From a statistical point of view, no object is detected in the MS image, three objects are detected in the PAN image, and ten objects are detected in the image fused by the MTF_GLP_HPM_PP method. The performance of the object detection task on satellite imagery is thus clearly improved by the proposed fusion object detection method.

6 Conclusion

This paper proposes a fusion object detection method for satellite imagery with an arbitrary-oriented region convolutional neural network. The proposed method provides a simple and effective fusion scheme that not only faithfully represents the scene through pansharpening but also reduces the processing time. Moreover, the arbitrary-oriented region convolutional neural network is introduced to learn local feature patterns and is utilized to locate objects of interest. Experimental results showed that the proposed object detection method performs better than other state-of-the-art methods, e.g., YOLO-v2 and YOLO-v3. The experiments on real-world datasets demonstrated that the proposed fusion method yields a significant improvement over object detection on the unfused images.


Acknowledgements

This work is jointly supported by National Natural Science Foundation of China (Grant Nos. 61673262, 61603249), and key project of Science and Technology Commission of Shanghai Municipality (Grant No. 16JC1401100).

References

  1. Krizhevsky A, Sutskever I, Hinton G (2012) ImageNet classification with deep convolutional neural networks. In: NIPS. Curran Associates Inc
  2. Russakovsky O, Deng J, Su H et al (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
  3. Everingham M, Winn J (2006) The PASCAL visual object classes challenge 2007 (VOC2007) development kit. Int J Comput Vis 111(1):98–136
  4. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
  5. LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Backpropagation applied to handwritten zip code recognition. Neural Comput 1(4):541–551
  6. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th international conference on neural information processing systems. MIT Press, Lake Tahoe, NV, pp 1097–1105
  7. Vishwakarma S, Agrawal A (2013) A survey on activity recognition and behavior understanding in video surveillance. Vis Comput 29(10):983–1009
  8. Zhao ZQ, Zheng P, Xu ST et al (2018) Object detection with deep learning: a review. arXiv:1807.05511
  9. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
  10. Ren S, He K, Girshick R et al (2015) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149
  11. Liu W, Anguelov D, Erhan D et al (2016) SSD: single shot multibox detector. In: Leibe B, Matas J, Sebe N, Welling M (eds) Computer vision – ECCV 2016. Lecture notes in computer science, vol 9905. Springer, Cham
  12. Redmon J, Divvala S, Girshick R et al (2015) You only look once: unified, real-time object detection. arXiv:1506.02640
  13. Redmon J, Farhadi A (2017) YOLO9000: better, faster, stronger. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), Honolulu, HI, pp 6517–6525
  14. Redmon J, Farhadi A (2018) YOLOv3: an incremental improvement. arXiv:1804.02767
  15. Van Etten A (2018) You only look twice: rapid multi-scale object detection in satellite imagery. arXiv:1805.09512
  16. Jiang Y, Zhu X, Wang X et al (2017) R2CNN: rotational region CNN for orientation robust scene text detection. arXiv:1706.09579
  17. Girshick R, Donahue J, Darrell T et al (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE conference on computer vision and pattern recognition
  18. Girshick R (2015) Fast R-CNN. In: IEEE international conference on computer vision (ICCV)
  19. Ma J, Shao W, Ye H, Wang L, Wang H, Zheng Y, Xue X (2017) Arbitrary-oriented scene text detection via rotation proposals. arXiv:1703.01086
  20. Vivone G, Alparone L, Chanussot J et al (2014) A critical comparison among pansharpening algorithms. IEEE Trans Geosci Remote Sens 53(5):2565–2586
  21. Chaudhuri S, Kotwal K (2013) Hyperspectral image fusion. Springer, Berlin
  22. Middleton EM, Ungar SG, Mandl DJ, Ong L, Frye SW, Campbell PE, Landis DR, Young JP, Pollack NH (2013) The Earth Observing One (EO-1) satellite mission: over a decade in space. IEEE J Sel Top Appl Earth Obs Remote Sens 6(2):243–256
  23. Jing Z, Pan H, Xiao G (2015) Application to environmental surveillance: dynamic image estimation fusion and optimal remote sensing with fuzzy integral. Springer, Cham, pp 159–189. https://doi.org/10.1007/978-3-319-12892-4_7
  24. Zhongliang J, Han P, Yuankai L, Peng D (2018) Non-cooperative target tracking, fusion and control: algorithms and advances. Springer, Berlin
  25. Pan H, Jing Z, Qiao L, Li M (2018) Visible and infrared image fusion using l0-generalized total variation model. Sci China Inf Sci 61(4):049103
  26. Shen HF, Meng XC, Zhang LP (2016) An integrated framework for the spatio-temporal-spectral fusion of remote sensing images. IEEE Trans Geosci Remote Sens 54(12):7135–7148
  27. Zhang LP, Shen HF (2016) Progress and future of remote sensing data fusion. J Remote Sens 20(5):1050–1061
  28. Aiazzi B, Alparone L, Baronti S et al (2012) Twenty-five years of pansharpening: a critical review and new developments. In: Chen CH (ed) Signal and image processing for remote sensing, 2nd edn. CRC Press, Boca Raton, FL, pp 533–548
  29. Chavez PS Jr, Sides SC, Anderson JA (1991) Comparison of three different methods to merge multiresolution and multispectral data: Landsat TM and SPOT panchromatic. Photogramm Eng Remote Sens 57(3):295–303
  30. Laben CA, Brower BV (2000) Process for enhancing the spatial resolution of multispectral imagery using pan-sharpening. US Patent 6,011,875, 4 Jan 2000
  31. Carper W, Lillesand T, Kiefer R (1990) The use of intensity-hue-saturation transformations for merging SPOT panchromatic and multispectral image data. Photogramm Eng Remote Sens 56(4):459–467
  32. Meng XC, Li J, Shen HF et al (2016) Pansharpening with a guided filter based on three-layer decomposition. Sensors 16(7):1068
  33. Ranchin T, Wald L (2000) Fusion of high spatial and spectral resolution images: the ARSIS concept and its implementation. Photogramm Eng Remote Sens 66(1):49–61
  34. Aiazzi B, Alparone L, Baronti S, Garzelli A (2006) MTF-tailored multiscale fusion of high-resolution MS and Pan imagery. Photogramm Eng Remote Sens 72(5):591–596
  35. Li WJ, Wen WP, Wang QH (2015) A study of remote sensing image fusion method based on Contourlet transform. Remote Sens Land Resour 27(2):44–50. https://doi.org/10.6046/gtzyyg.2015.02.07
  36. Tu TM, Su SC, Shyu HC et al (2001) A new look at IHS-like image fusion methods. Inf Fusion 2(3):177–186
  37. Xu J (2017) Deep learning for object detection: a comprehensive review [EB/OL]. https://towardsdatascience.com/deep-learning-for-object-detection-a-comprehensive-review-73930816d8d9. 12 Sept 2017/28 May 2019
  38. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR 2014
  39. Thomas C, Ranchin T, Wald L, Chanussot J (2008) Synthesis of multispectral images to high spatial resolution: a critical review of fusion methods based on remote sensing physics. IEEE Trans Geosci Remote Sens 46(5):1301–1312
  40. Tu T-M, Huang PS, Hung C-L, Chang C-P (2004) A fast intensity-hue-saturation fusion technique with spectral adjustment for IKONOS imagery. IEEE Geosci Remote Sens Lett 1(4):309–312
  41. Gillespie R, Kahle AB, Walker RE (1987) Color enhancement of highly correlated images—II. Channel ratio and "chromaticity" transform techniques. Remote Sens Environ 22(3):343–365
  42. Chavez PS Jr, Kwarteng AW (1989) Extracting spectral contrast in Landsat thematic mapper image data using selective principal component analysis. Photogramm Eng Remote Sens 55(3):339–348
  43. Aiazzi B, Baronti S, Selva M (2007) Improving component substitution pansharpening through multivariate regression of MS + Pan data. IEEE Trans Geosci Remote Sens 45(10):3230–3239
  44. Garzelli A, Nencini F, Capobianco L (2008) Optimal MMSE pan sharpening of very high resolution multispectral images. IEEE Trans Geosci Remote Sens 46(1):228–236
  45. Choi J, Yu K, Kim Y (2011) A new adaptive component-substitution based satellite image fusion by using partial replacement. IEEE Trans Geosci Remote Sens 49(1):295–309
  46. Dou W, Chen Y, Li X, Sui D (2007) A general framework for component substitution image fusion: an implementation using fast image fusion method. Comput Geosci 33(2):219–228
  47. Schowengerdt RA (1997) Remote sensing: models and methods for image processing, 2nd edn. Academic Press, Orlando, FL
  48. Liu JG (2000) Smoothing filter based intensity modulation: a spectral preserve image fusion technique for improving spatial details. Int J Remote Sens 21(18):3461–3472
  49. Wald L, Ranchin T (2002) Comment: Liu 'Smoothing filter-based intensity modulation: a spectral preserve image fusion technique for improving spatial details'. Int J Remote Sens 23(3):593–597
  50. Khan MM, Chanussot J, Condat L, Montavert A (2008) Indusion: fusion of multispectral and panchromatic images using the induction scaling technique. IEEE Geosci Remote Sens Lett 5(1):98–102
  51. Aiazzi B, Alparone L, Baronti S, Garzelli A, Selva M (2003) An MTF-based spectral distortion minimizing model for pan-sharpening of very high resolution multispectral images of urban areas. In: Proceedings of the 2nd GRSS/ISPRS joint workshop on remote sensing and data fusion over urban areas, pp 90–94
  52. Xia GS, Bai X, Ding J et al (2017) DOTA: a large-scale dataset for object detection in aerial images. arXiv:1711.10398
  53. Alparone L et al (2007) Comparison of pansharpening algorithms: outcome of the 2006 GRS-S data fusion contest. IEEE Trans Geosci Remote Sens 45(10):3012–3021
  54. Dstl Satellite Imagery Feature Detection [EB/OL]. https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection. 08 March 2017/28 May 2019
  55.

Copyright information

© Shanghai Jiao Tong University 2019

Authors and Affiliations

  • Ying Ya (1)
  • Han Pan (1, corresponding author)
  • Zhongliang Jing (1)
  • Xuanguang Ren (1)
  • Lingfeng Qiao (1)

  1. School of Aeronautics and Astronautics, Shanghai Jiao Tong University, Shanghai, China
