Multimodal medical image fusion with convolution sparse representation and mutual information correlation in NSST domain

Multimodal medical image is an effective method to solve a series of clinical problems, such as clinical diagnosis and postoperative treatment. In this study, a medical image fusion method based on convolutional sparse representation (CSR) and mutual information correlation is proposed. In this method, the source image is decomposed into one high-frequency and one low-frequency sub-band by non-subsampled shearlet transform. For the high-frequency sub-band, CSR is used for high-frequency coefficient fusion. For the low-frequency sub-band, different fusion strategies are used for different regions by mutual information correlation analysis. Analysis of two kinds of medical image fusion problems, namely, CT–MRI and MRI–SPECT, reveals that the performance of this method is robust in terms of five common objective metrics. Compared with the other six advanced medical image fusion methods, the experimental results show that the proposed method achieves better results in subjective vision and objective evaluation metrics.


Introduction
In recent years, medical imaging has become an indispensable means in clinical diagnosis, surgery, and radiotherapy. However, single-modality medical images only focus on a certain type of morphological features. For example, computed tomography (CT) images reflect structural information about bone but are insensitive to soft tissues with a similar density. Although magnetic resonance imaging (MRI) has a strong ability to display soft tissues, it is poor in showing bone lesions and calcified lesions. Therefore, image fusion can obtain complementary information from different modalities of medical images and help clinicians perform postoperative detection and tumor and bone growth monitoring [1,2]. Among many medical image fusion methods, a kind of method based on multiscale transform has attracted the B Guoqi Xie 15002441@qq.com 1 School of Computer and Communication, Hunan Institute of Engineering, Xiangtan 411104, People's Republic of China 2 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, People's Republic of China attention of researchers, because it adopts a similar multiresolution processing mechanism to the human visual system. These methods include pyramid-based decomposition [3], wavelet-based decomposition [4] and multiscale geometric analysis decomposition [5]. These methods exist two common problems: first, it is difficult to determine the decomposition level; Second, the fusion strategy is difficult to choose.
Decomposition level is a key problem to be solved. When the decomposition level is low, it cannot extract enough spatial details from the source map, whereas when the decomposition level is high, the fusion of high-frequency sub-band is more sensitive to noise and registration [6][7][8]. The general solution is simply to decompose an image into one high-frequency sub-band and one low-frequency subband. The high-frequency sub-bands contain more details and edge information, whereas the low-frequency sub-band contains the contour and structure information of images. In recent years, multiscale decomposition methods based on NSCT and NSST are popular because of their multiscale, multidirectional, and shift invariant. In particular, NSST has attracted more attention because of its superior computational efficiency to NSCT. Compared with pyramidbased methods, such as Gaussian pyramid decomposition, Laplacian pyramid decomposition [9], and gradient pyramid transformation [10], the method based on NSST can be decomposed from multiple directions, thus obtaining more image details. Compared with wavelet methods, such as discrete wave and dual tree complex wavelet, the method based on NSST can represent the curve and edge details of image well. Compared with multiscale geometric transformations, such as contourlet transform (COT) [11] and shearlet transform (ST) [12], the method based on NSST does not produce pseudo Gibbs phenomenon due to frequency aliasing. However, most of the existing NSST decomposition methods have a higher decomposition levels, which not only increases the amount of calculation, but also makes high-frequency sub-bands susceptible to noise. To preserve the structural information of the image as much as possible and to extract additional salient details, a new multiscale decomposition method is proposed in this study. Unaffected by the scale parameters of the general multiscale decomposition, this method uses NSST to decompose the image only in two scales, namely, one high-frequency sub-band and one low-frequency sub-band. In addition to using convolution sparse representation to enhance the detailed information of high-frequency sub-band, correlation analysis should be used to extract the detailed information of low-frequency sub-band due to the rich detailed information contained in low-frequency sub-band. Sparse representation seeks to represent image features with as few sparse vectors as possible, which is widely used in image reconstruction and denoising. The improvement of convolution sparse representation is that the sparse coefficients of local image blocks are replaced by global sparse coefficients.
The fusion strategy is important for the quality of fused image. In multiscale decomposition, a common strategy is to measure the activity of the decomposition coefficients first and then fuse them in accordance with the mean or maximum value. For example, in [13,14], high-and the lowfrequency sub-bands adopt the maximum scheme for fusion. However, low-frequency sub-bands provides structure information similar to the source image, whereas high-frequency sub-bands contain important details, thus, the same fusion scheme cannot consider the similarity and importance of the image simultaneously. In [15], a weighted average fusion strategy is adopted for similar regions of images, in this strategy, weight is calculated using the Siamese network. However, the definition of similar regions by this method directly affects the effect of the final image fusion. Recently, principal component analysis (PCA) [16], sparse representation [17,18], smallest univalue segment assimilating nucleus (SUSAN) [19], and pulse coupled neural network (PCNN) [20,21] have been used to enhance the salient information of fused images and measure the activity of decomposition coefficients. However, these methods have their own problems either in the selection of sparse dictionaries or in the training time. To obtain better fusion effect, different fusion strategies are selected according to different sub-bands in this study, namely, maximum fusion is used for the high-frequency sub-band and details of the low-frequency sub-band, and weighted average fusion is used for the similar structural information in the low-frequency sub-band.
This study focuses on the determination of decomposition scale and the fusion strategy of different frequency bands in NSST decomposition. To avoid the influence of high noise and registration on the fusion of high-frequency sub-band when the NSST decomposition scale is too high, this study only carries out one-level decomposition of NSST, that is, one high-frequency sub-band and one low-frequency subband. How to use mutual information correlation analysis to mine the detailed information in the low-frequency subband is one of the research objectives of this study. It has been explained above that it is inappropriate for all sub-bands to adopt the same fusion strategy. Another objective of this study is to study which fusion strategy should be adopted for high-frequency sub-band, and similar and dissimilar regions of low-frequency sub-band.
The main innovation points of this study include the following three aspects: 1. The convolutional sparse representation (CSR) model is used to process the high-frequency sub-band, which increases the detailed features and reduces the block effect caused by NSST decomposition, as well as the redundant information of different source graphs. 2. Mutual information correlation is used to extract detail information of low-frequency sub-bands. Given that only two-scale decomposition is conducted, the lowfrequency sub-band contains abundant details. The mutual information correlation analysis can find the regions containing detailed information from the lowfrequency sub-band. 3. Two different fusion strategies are used for lowfrequency sub-band. The structural information of similarity is fused using the weighted average scheme, where the weight takes the product of the correlation analysis coefficient and the regional energy sum. The Laplacian energy gradient was used to measure the activity of the dissimilar regions to reflect the contrast changes of the regions.
The remaining sections of this paper are organized as follows: the next section describes related work about NSST and CSR. The following section explains the methods in detail. In the next section, a comparative experiment is simulated, and the corresponding results are analyzed. The last section summarizes the study.

Related work
NSST Non-subsampled contourlet and non-subsampled shearlet waves are two popular multiscale geometric decomposition methods, because they are multiscale, multidirectional, and shift invariant. Given that the non-subsampled shearlet wave does not limit its direction and does not need to reverse the directional filter bank, its computational efficiency is higher than that of NSCT. NSST consists of two processes: non-subsampled pyramid scale decomposition (NSPFs) and shift-invariant shearlet filter banks (SFBS). Figure 1 shows the framework of two-level NSST decomposition. The input image is decomposed into a highfrequency sub-band and a low-frequency sub-band after the first-level scale decomposition by NSPF, and then the low-frequency sub-band is decomposed into the secondlevel high-frequency sub-band and low-frequency sub-band. Therefore, the input image decomposed by L-level NSST will be transformed into L high-frequency sub-bands and one low-frequency sub-band. At each scale, multiple directions of sub-bands can be obtained by SFBs. Moreover, given that the traditional subsampling decomposition may bring frequency overlap, the pseudo-Gibbs phenomenon easily occurs. Thus, non-subsampled decomposition is adopted in NSST [22].

Convolutional sparse representation
The idea of SR comes from the learning process of image structures by the receptor field of simple cells in the visual cortex V1 area. Given its simple representation, SR has been widely used in image denoising [23], feature extraction [24], and super-resolution [25,26]. Yang [27] and Yin [28] Here, y ∈ R n is the stacked vector representation of the image patch √ n × √ n. D ∈ R n×m is an overcomplete dictionary, and x ∈ R m is the sparse coefficient to be solved. However, the disadvantage of this model is that the sparse coefficient is obtained through the calculation of overlapping patches, thus, the global sparse coefficient of the whole image cannot be obtained. To improve fusion performance, Wohlberg [29] proposed the convolution form of SR. Liu [30] integrated morphological component analysis (MCA) and CSR into a unified optimization framework, which could realize multicomponent and global SR of source images simultaneously. CSR is given by the following equation: Here, Y is the whole image, which is modeled as the sum of the convolution map between M local dictionary filters and global coefficients. the global single value and shift invariance features of CSR are conducive to extracting more detailed information and enhancing its robustness.
Another difficulty in SR is the learning of overcomplete dictionaries. In SR, the more fully the dictionary represents all the details of the image, the more likely the result of the reconstruction can restore the source image. The design of dictionaries usually adopts two methods: one is based on known transformation basis, such as discrete cosine transforms and wavelet basis. However, as data and application range change, the performance of such fixed dictionaries degrades considerably. The second is a learningbased approach. K-SVD and its improved dictionary learning method are widely used in medical image fusion [31,32]. The adaptive K-SVD dictionary is constantly updated through iterative training, and it is updated alternately with sparse coding. The disadvantage is that the dictionary training time is long. In multimodal medical image fusion, the structure of medical images from different sensing devices is more complex, and data are more redundant. Therefore, dictionary learning based on joint block clustering is a better choice. By clustering similar patches of all source images, a complete and compact dictionary can be formed.

Proposed framework
As shown in Fig. 2, based on the general framework of image fusion in multiscale transformation domain, the proposed method mainly consists of three stages: multiscale decomposition, low-frequency and high-frequency sub-band fusion, and NSST reconstruction. For simplicity, two source diagrams are used for illustration. First, NSST is applied to source image I a and I b . After first-level decomposition, a high-frequency sub-band with significant details and a low-frequency sub-band with structure information can be obtained. Second, a CSR-based maximum fusion strategy is used for the fusion of high-frequency coefficients. For the low-frequency sub-band, mutual information is used for the correlation analysis, and then local energy coefficients are used for the weighted summation of similar parts, whereas domain energy gradient and maximum fusion are used for dissimilar parts. Finally, the image is reconstructed by NSST inverse solution.

High-frequency sub-band fusion
Since the high-frequency sub-band contains the details of the image, the fusion of the high-frequency sub-band is mainly to fuse the salient features of the high-frequency sub-band. The advantage of convolution sparse representation is that it can describe these features with fewer sparse coefficients.
The sparse coefficients of the high-frequency sub-band of each source image are obtained by Eq. (2). Set X k m indicates the sparse coefficient of the high-frequency sub-band of the k image, X k m, 1:N (x, y) indicates the content of the location (x, y); it is a N dimensional vector. Here, the norm L 1 of X k m, 1:N (x, y) is used to measure the activity level of the source image. Thus, the sparse coefficient fusion rule of high-frequency sub-band is defined as The fused sparse coefficients are reconstructed by Eq. (4). The fused high-frequency sub-band is defined as

Low-frequency sub-band fusion
Given the low decomposition scale, the low-frequency subband still has abundant important information. To extract such information further, normalized mutual information (NMI) is used for correlation analysis. For the area with high correlation, the fusion strategy of local energy weighted summation is adopted to preserve energy as much as possible. For the area with low correlation, the neighborhood energy gradient (NEG) is adopted to highlight the contrast edge information of the source map as much as possible. The basis of pixel-level image fusion is that the input images are linear and complementary. By correlation analysis of low-frequency sub-band, the salient features of the source maps can be preserved. Mutual information is often used in multimodal image registration, which is a statistical correlation method based on gray value. The greater the mutual information between two images, the higher the correlation between the two images. The mutual information quantity of an image can be calculated by Kullback-Leibler Divergence, and the mathematical form is as follows: H (y) − y P(y) log 2 P(y).
P(x) and P(y) represent the probability distribution of random variables X and Y , P(x, y) represents the joint distribution, H (x, y) represents the joint entropy of X and Y , and the joint entropy reflects the correlation of random variables X and Y . Here, X represents the image I a and Y represents the image I b . The joint entropy of the two can be calculated using the joint histogram.
The mutual information result is easily affected by the clustering result with numerous clusters, thus, the NMI maps the mutual information to the interval range of [0, 1] and is defined as After a 3 × 3 sliding window partitioning, the correlation of the local region of the low-frequency sub-band can be obtained. Here, the mutual information T of the whole image is regarded as the threshold of correlation. When the local region NMI is greater than the threshold T , the image blocks are correlated. Therefore, weighted average scheme is used. In medical image fusion, the intensities of different source images at the same location may vary magnificently, because the source images are captured with different imaging mechanisms. Therefore, the weight matrix cannot be evaluated with a simple average. Here, we use the center pixel energy to calculate the matrix, the center pixel energy is an adaptive weight calculate method based on region energy, and the mathematical form is defined by the following equation: Here, N is the radius of the local region (2N + 1, 2N + 1), L m is the low-frequency coefficients of m image, and W L indicates the weight of the local pixel. Given that the lowfrequency sub-band image is relatively smooth, the weight can be directly represented in 2 2N −d , where d is the distance from the field pixel to the center point, if N 1, then normalized W L is defined as For the low-frequency sub-band with high correlation, the coefficient blocks are fused by the strategy of the weighted sum of energy of the center pixel and are defined as For the low-frequency sub-band with low correlation, that is, when the NMI of the local region is less than the threshold T , the fusion of the coefficient blocks adopts the strategy of maximum energy gradient. NEG is essentially the Sum of Laplace energy, which is a parameter characterizing image edge features. NEG reflects the contrast change of the neighbor window and the edge information of the image block and is defined as Here, Ω indicates a neighbor window, after the activity is measured by NEG, the maximum value is taken for the fusion of low-frequency coefficients, and the equation is as follows: The detailed description of the algorithm is shown in Algorithm 1.

Comparison algorithms
Nine medical image fusion methods, which have been proposed in recent years, are compared with the proposed method. These methods are based on SR or multiscale transformation and include LP-ASR [33], SR-NSCT [34], parameter-adaptive and pulse-coupled neural network (PA-PCNN) [13], PC-LLE [14], IOI-LLF [35], CNN-CP [15], CoF-MLE-NSST [36], PSO-NSST [37] and PCN-N-NSST [38]. The LP-ASR method is based on Laplacian Pyramid decomposition and adaptive SR, and the sparse coefficient fusion scheme is used to reduce the noise of the high-frequency components. The SR-NSC method incorporates NSCT into the SR fusion framework, and different fusion strategies are used for low-and high-frequency coefficients. The PA-PCNN method first performs NSST decomposition on the source images, and then a PA-PCNN model is used in the high-frequency sub-band fusion. After NSCT decomposition in the PC-LLE method, the highfrequency sub-bands are fused by the phase consistency rule. The image of Interest-Laplacian filter (IOI-LLF) method uses local LLF to decompose the source image into residual and Ground images and further decomposes the residual image based on IOI. The CNN-CP method uses a trained Siamese convolution network to fuse the pixel activity information of the source image and generate a weight map. The first two methods are based on SR, whereas the last four methods are all multiscale decomposition methods.

Objective evaluation metrics
To evaluate the performance of the various methods, five widely recognized object metrics, namely, entropy (EN) [39], structural similarity (Q e ) [40], mutual information (MI) [41], gradient (Q AB / F ) [42] and the human eye visual perception (VIF) [43], are used in the experiment. EN can reflect the amount of information contained in the fused image; Q e represents the degree of similarity between the fused and source images; MI is a mutual information indicator used to measure the information contained in the fused image; Q AB / F is a quality metric based on gradient, which is mainly used to measure the edge information of fused images; VIF is the information ratio between the fused image and the source image and is used to evaluate the human visualization performance of the fused image.

Experimental settings
In this experiment, 10 groups of CT-MRI images and 10 groups of MRI-SPECT images are used in fusion performance tests. As shown in Fig. 3, the first row (a) shows two sets of CT-MRI images. The second row (b) shows two sets of MRI-SPECT images. these images are from the Whole Brain Atlas provided by Harvard Medical School; each image has a resolution size of 256 × 256. All experiments are programmed by MATLAB 2014a, and the simulation experiment environment is Intel(R) Core(TM) I7-8565U CPU @ 1.80 GHz and 8.00 GB RAM.  To evaluate the performance of each fusion method objectively, Tables 1 and 2, respectively, show the average scores of the CT-MRI and MRI-SPCET fusion results. The higher the index value, the better the fusion performance, where the highest score is indicated in bold, and the lowest score is indicated by subscript. In addition, the performance of the proposed method is compared with that of several recent NSST methods. Among them, COF-MLE-NSST method uses co-occurrence filter to measure the activity of lowfrequency sub-band coefficient, PSO-NSST method uses particle swarm optimization algorithm to optimize the membership function of low-frequency sub-band fuzzy logic system, and PCNN-NSST method uses PCNN to fuse highfrequency sub-band. Compared with the other nine methods, the proposed method ranks first in Q e , MI, and Q AB / F for CT-MRI and MRI-SPECT images, indicating that it preserves most of the structure information in the source images and keeps the edge of the source image and structure well. At the same time, because the method based on transformation domain is accompanied by the loss of a certain amount of information, the ranking of EN and VIF is not the highest, but the ranking is still relatively high, indicating that the proposed method has good robustness. The proposed method is inferior to the PA-PCNN method in VIF, because the latter adopts a neuron perceptron similar to that of humans.

Experimental results
To compared the computational costs of different fusion methods, the total time of 10 groups of CT-MRI fusion images is first calculated and then divided by 10 to obtain the average running time. The calculation was repeated 10 times, and the results and standard deviations are shown in Table 3. The proposed method is inferior to the LP-ASR and PC-LLE methods but superior to the other four methods. Particularly, the performance of the proposed method is similar to that of PA-PCNN, but the computational efficiency is higher, because the iteration process of PCNN is time consuming. Here, the IOI-LLF method has the lowest calculation efficiency, because IOI takes too much time.
The proposed algorithm's performance was also evaluated by changing the value of parameters used in the proposed method, such as NSST decomposition level, and the directions number. These values are obtained over 20 pairs of multi-modality medical images, and the average outcomes are shown in Table 4. From Table 4, it can be analyzed that as the decomposition level and directions are increasing, the values of En and MI are also increased. The values of Q e , Q AB / F and VIF are optimal when Level 3. In general, with the increase of level, the value of each metrics increases slightly.

Conclusions
In this study, we propose a multimodal medical image fusion method based on NSST and mutual information correlation analysis. Based on NSST scale decomposition, this method uses CSR to enhance the high-frequency detail information and uses mutual information correlation to mine the detail information of low-frequency sub-band. Then, different fusion strategies are adopted for different areas of   images of the LP-ASR,  SR-NSCT, IOI-LLF, PC-LLE,  CNN-CP, PA-PCNN,  CoF-MLE-NSST, PSO-NSST,  PCNN-NSST and the proposed  algorithms low-frequency sub-band according to correlation. To achieve this goal, two new activity level measurement methods based on the domain energy gradient and central pixel energy sum are designed. By comparing with other advanced methods and numerous experiments, the effectiveness of the proposed method is proven. However, the method still has the following limitations: first, the setting of the threshold of the low-frequency sub-band correlation analysis has a certain influence on the final fusion effect. If the threshold is set too small, then the extraction of detail information is insufficient; if the threshold is set too large, then meaningless details in the MRI image are introduced into the fused image, causing artifacts. In this study, the mutual information of the whole source image is used as the threshold value for the correlation analysis of low-frequency sub-bands; this strategy is not an optimal scheme. In addition, Table 3 shows that this method is not as fast as some fusion methods, because the local mutual information correlation is calculated by sliding window, resulting in low calculation efficiency. In the future, we will devote ourselves to the research of a more effective threshold determination scheme by combining the prior information of source images.