1 Introduction

1.1 Background

The concept of multi-modal systems arises from the study of information representation in human–computer interaction. The term "mode" refers to the representation and exchange of information on a specific physical medium. With the development of medical technology and science, medical image fusion has gained wide attention in the field of image processing. Medical imaging methods can be broadly divided into two types: anatomical imaging and functional imaging. Single-mode medical images provide only a particular aspect of health information and cannot fully reflect all the information in a given part of the body [1]. In clinical practice, doctors often need to comprehensively analyze different types of medical images of the same region to diagnose the patient's condition, which increases the difficulty of diagnosis. Fusion processing of multi-modal medical images can therefore be used to comprehensively analyze different medical image information in a single fused image and provide doctors with a more adequate basis for clinical diagnosis and treatment decisions [2].

Deep learning provides scientific methods for processing large-scale medical images and screening big data, as well as for the diagnosis and efficacy evaluation of various major diseases in clinical medicine. As a key cutting-edge medical imaging technology, it addresses a major scientific problem in medical image analysis that urgently needs to be solved [1]. Multi-modal medical image fusion based on deep learning can effectively extract and integrate the feature information of different modes, improve the clinical applicability of medical images in the diagnosis and evaluation of medical problems, and provide quantitative analysis, real-time monitoring, and treatment planning for doctors and researchers. Multi-modal fusion combines the information of multiple modes for target prediction (classification or regression) and was previously understood as multi-source information fusion [3]. For example, a video, as a type of multimedia, can be subdivided into multiple single modes, such as dynamic text, dynamic images, and dynamic voice [4]. Research shows that information processing methods based on the multi-modal concept often perform better than traditional single-modal methods [5].

Multi-modal medical image fusion is the process of fusing multiple images obtained with one or several imaging methods to improve image quality while preserving specific features [6]. Medical image fusion draws on several fields, such as image processing, computer vision, pattern recognition, machine learning, and artificial intelligence, and has therefore been widely used in clinical practice. Through medical image fusion, doctors can understand lesions from multiple perspectives.

1.2 Motivation for this Paper

Multi-modal deep learning is the fusion of various types of information via deep-learning techniques. Imaging technology plays an important role in medical diagnosis, but the information provided by a single-mode medical image is limited, whereas large amounts of information must be processed in clinical diagnosis. In multi-modal technology, one imaging mode can compensate for the weaknesses of another, so that the medical condition can be evaluated accurately and diagnostic information obtained through the fusion of information from multiple modes. Furthermore, multi-modal deep learning can jointly learn the potentially shared information of each mode through the complementary fusion of different feature sets, which improves the effectiveness and accuracy of data tasks [7]. Additionally, multi-modal medical image fusion based on deep learning can effectively extract and integrate the feature information of different modes, thereby improving the clinical applicability of medical images in the diagnosis and evaluation of medical problems. Considering these advantages, the application of multi-modal deep learning in medicine (MMDLM) has rapidly attracted wide attention.

This study was conducted to address the following gaps in the existing literature:

  1. a lack of effective literature on the multi-modal data fusion mechanisms of medical images to sort and summarize research points.

  2. the limited knowledge of data analysis algorithms, as well as a lack of clinical experts and data scientists.

  3. insufficient information on the bibliometric analysis of the application of multi-modal deep learning in medical imaging.

Based on the above-mentioned points, we conducted a comprehensive review and bibliometric analysis of this field to explore potential models or scientific development paths of multi-modal deep learning in medical image applications.

The contributions of our study are as follows:

  1. We macroscopically analyze the scope of medical multi-modal applications and mainstream pre-training methods, and discuss the adaptability of each method.

  2. We discuss various topics in medical multi-modal fusion methods, ranging from micro-scale methods (such as algorithms based on convolutional neural networks and deep iteration, as well as fully supervised, weakly supervised, and unsupervised learning) to process-based statistics (such as methods for intelligent diagnosis, efficacy assessment, and prognostic applications) and detailed organ-based comparative studies, covering diseases of the brain, eye, breast, lung, bone, and skin.

  3. Based on the above in-depth analysis, we summarize the challenges faced by MMDLM applications and present directions for future scholars.

  4. We analyze the bibliometrics of the field, including cited journals, cited literature, and keywords, to stimulate the interest of researchers in multi-modal deep-learning applications in the medical field. Furthermore, journal distributions are discussed to help scholars search for related research topics effectively. Co-authorship, co-occurrence, co-citation, and literature coupling analyses are also performed.

The scope of the discussion in this review is shown in Fig. 1.

Fig. 1 Scope of discussion in this review

1.3 Structure of this Paper

The remainder of the paper is organized as follows: Sect. 1 describes the development of the multi-modal concept and some of the related problems encountered thus far. Section 2 reviews the progress of multi-modal deep learning in medical applications, including multi-modal pre-training algorithms, medical multi-modal fusion methods, their performance, and a comparative study of their key features. Section 3 discusses current models, fusion strategies, and the main challenges in this field. Section 4 describes the use of the VOSviewer software as well as the co-authorship, co-citation, co-occurrence, and literature coupling analyses performed in this study. Section 5 presents the remaining challenges and our perspectives, and the conclusions are given in Sect. 6.

2 Literature Review

2.1 Multi-modal Medical Applications

A modality is a particular mode wherein something exists, is experienced, or is expressed. When a research problem comprises several modes, it is characterized as a multi-modal research problem. Simultaneously, modes can also be defined in a very broad manner. For example, data regarding two different languages, or datasets collected under two different circumstances, can be regarded as two modes [4]. To better understand the world around us, we should be able to interpret multiple signals simultaneously. For example, images are often associated with labels and text explanations, and text often contains images for the clear expression of the central idea of an article. Considering that different modes have different statistical properties, MMDLM can be used for processing and understanding multi-source modal information using deep-learning techniques.

At present, multi-modal learning for image, video, audio, and semantics is being deeply investigated. In particular, research is being conducted on deep neural networks that learn multi-layer representations and abstractions, converting raw data into high-level abstract features. Image analysis has made important research progress in various medical fields such as classification, segmentation, detection, and localization [4]. Deep convolutional networks have been actively used in medical image analysis in areas such as segmentation, anomaly detection, disease classification, computer-assisted diagnosis, and retrieval. Medical imaging has long been a core diagnostic method in clinical practice, and recent advances in hardware design, security programs, computing resources, and data storage capabilities have significantly benefited the field [2]. Currently, the main application areas of medical image analysis include the segmentation, classification, and anomaly detection of images generated using a wide range of clinical imaging modes. For example, in 2021, Qian's team from the University of Southern California proposed a deep-learning system based on multi-modal and multi-angle medical ultrasound images and verified the accuracy, robustness, and effectiveness of the system in a prospective clinical setting across several hospitals. The results showed that the interpretable artificial intelligence (AI)-assisted diagnosis system can significantly improve the diagnostic performance of human doctors, enhance the clinical applicability of its auxiliary diagnosis, and provide new ideas for subsequent clinical translational research [8]. With the rapid development of medical information technology and the modernization of medical equipment, an enormous amount and variety of medical data have emerged. Medical data can be broadly classified into three main categories according to the information and forms they present:

  1. Clinical text data, which mainly include structured test data, such as hemoglobin and routine urinalysis results, as well as unstructured text data, such as patient complaints and pathology reports recorded by doctors.

  2. Image and waveform data, including imaging data such as ultrasound, CT, and MRI images, and signal data such as ECG and EEG recordings.

  3. Biomics data, which can be subdivided into genomic, transcriptomic, proteomic, and other categories according to the molecular level considered.

Each type of patient-related data is a data modality, and each modality provides information about the patient's diagnosis and treatment from a specific perspective. The modalities contain overlapping and complementary information, so combining multiple types of medical information can further improve the accuracy of diagnosis and treatment.

2.2 Multi-modal Pretraining

Multi-modal recognition exploits the complementarity between different modes, for example to assist physicians in diagnosis; its core lies in the fusion of medical images and texts (electronic medical records, laboratory reports, etc.). Multi-modal matching focuses on how to align the features of two modes, such as images and text. Table 1 summarizes the main studies, key application areas, and methods of common multi-modal pre-training for comparative analysis. To compare medical multi-modal fusion methods and their performance, we conducted a detailed comparison and survey of the literature. There are currently two main approaches to multi-modal tasks: light fusion and heavy fusion. The light fusion approach is usually lightweight, for example a vector inner product, as represented by CLIP [9] and ALIGN [10]; these models use a two-tower structure focusing on multi-modal alignment to support text matching, retrieval, and other downstream tasks. The heavy fusion approach is based on pre-trained Transformers [11], as represented by OSCAR [12], UNITER [13], and VINVL [14]. These methods can be regarded as a single-tower structure that uses an attention mechanism to incorporate multi-modal information and perform additional tasks. Heavy fusion can handle VQA [15], captioning, and other downstream tasks that require information fusion and understanding, which the ALIGN algorithm [16] cannot perform; however, it is not as efficient as CLIP [9] in retrieval. Ultimately, the choice of approach depends on the task. At present, there is a trend toward unifying the two approaches: a two-tower model is used as the base, a single-tower model is incorporated in the upper layers, and alignment is performed before fusion.

Multi-modal fusion refers to incorporating the information of multiple modes for classification or regression tasks. Its benefits are as follows: (1) more robust inference results can be generated from different modal representations of the same phenomenon, (2) auxiliary information that is not visible in a single mode can be retrieved at multiple scales, and (3) a multi-modal system can still operate normally even when one mode is missing. The modes and optimization methods used by a neural network for modal fusion may differ; however, the concept of information fusion through collaborative hidden layers is the same. Neural networks are also used for sequential multi-modal fusion, usually based on RNNs (recurrent neural networks) and LSTMs (long short-term memory networks); typical applications include audio-visual emotion classification and electronic medical records. The advantages of deep neural networks for modal fusion are that they can (1) learn from large amounts of data, (2) perform end-to-end learning of multi-modal feature representation and fusion, and (3) outperform non-deep-learning methods and learn complex decision boundaries [4].

Table 1 shows that multi-modal pre-trained models and their variants are updated and iterated very quickly, covering keywords including contrastive learning, text matching, feature-space alignment, understanding and generation, Chinese image generation, and transfer learning. These key technologies can be widely used to assist clinical applications, such as wearable-device-based multi-source health monitoring, automatic human–machine health assessment, and machine-based surgical safety assessment.
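To make the light-fusion (two-tower) idea concrete, the following is a minimal sketch, assuming pre-extracted image and text features and illustrative dimensions, of aligning two modality embeddings with a simple inner product and a CLIP-style contrastive loss. It is only an illustration of the technique, not the original CLIP or ALIGN implementation.

```python
# Minimal sketch (not the original CLIP/ALIGN code) of light fusion via a
# two-tower model: each modality is embedded separately and aligned with a
# simple inner product, as used for image-text matching and retrieval.
# Encoder architectures and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerAligner(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, emb_dim=256):
        super().__init__()
        # Projection heads standing in for full image/text encoders.
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, img_feats, txt_feats):
        # L2-normalize so the inner product becomes cosine similarity.
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        # Pairwise similarity matrix: rows = images, columns = texts.
        return self.logit_scale.exp() * img @ txt.t()

# Contrastive training step: matched image-text pairs lie on the diagonal.
model = TwoTowerAligner()
img_feats = torch.randn(8, 2048)   # e.g., radiograph encoder outputs
txt_feats = torch.randn(8, 768)    # e.g., report encoder outputs
logits = model(img_feats, txt_feats)
labels = torch.arange(8)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```

Because the two towers interact only through this inner product, image and text embeddings can be pre-computed and indexed, which is why this approach is efficient for retrieval.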
To develop lightweight models and products that can adapt to complex, multi-source medical data and assist clinical applications as soon as possible, research institutions (e.g., universities) and leading companies (e.g., Google) are conducting relevant research. Recently, multi-modal pre-trained models based on the Transformer structure have become very popular. Pre-training can be carried out on a large amount of unlabeled data, after which fine-tuning is performed with a small amount of labeled data. For example, as one of the simplest single-stream models, ViLT [17] processes visual and textual inputs quickly, which provides a good foundation for embedded devices used in settings such as clinical consultations and surgery. The approach used for multi-modal fusion depends on the task and data, and existing work often proposes various fusion methods without unified theoretical support. To efficiently and automatically select a fusion strategy according to the task or data, neural architecture search (NAS) [18] is highly effective. There are, of course, remaining challenges in multi-modal fusion; for example, the sequential information of different modes may not be temporally aligned.

Table 1 Mainstream multi-modal pre-training models

2.3 Medical Imaging and Non-imaging Models

Table 2 shows the research objects and tasks of medical imaging and non-imaging modalities. El-Sappagh et al.'s (2020) research has achieved excellent clinical results. They proposed a deep-learning fusion model based on bidirectional long short-term memory (BiLSTM) networks and convolutional neural networks (CNNs). The multi-modal multitask model, based on five modalities [i.e., magnetic resonance imaging (MRI), positron emission tomography (PET), neuropsychological data, cognitive score data, and assessment data], jointly predicts variables such as multistage progression of Alzheimer's disease (AD) and four key cognitive scores [19]. Table 2 also shows that The Cancer Genome Atlas (TCGA), the Alzheimer's Disease Neuroimaging Initiative (ADNI), and other open-source datasets remain the first choice of most researchers. However, open-source data for medical multi-modal applications remain comparatively uncommon, which may in part reflect medical ethics constraints. MMDLM is an important research direction for researchers in cancer prevention and therapy, reflecting the difficulty of cancer prevention and treatment [7, 20,21,22,23,24,25].

Table 2 Number of objects and tasks studied in medical imaging and non-imaging models
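As an illustration of the kind of architecture described above, the following is a hypothetical sketch (in PyTorch) of a CNN + BiLSTM multi-modal multitask model. All layer sizes, input shapes, and task heads are assumptions chosen for illustration and are not taken from El-Sappagh et al. [19].

```python
# Hypothetical sketch of a CNN + BiLSTM multi-modal multitask model: a CNN
# branch summarizes imaging-derived features and a BiLSTM branch summarizes
# longitudinal clinical/cognitive data; the fused representation feeds two
# task heads (disease stage and cognitive-score regression).
import torch
import torch.nn as nn

class MultiModalMultiTask(nn.Module):
    def __init__(self, img_channels=1, seq_dim=32, hidden=64, n_stages=3, n_scores=4):
        super().__init__()
        self.cnn = nn.Sequential(                      # imaging branch
            nn.Conv2d(img_channels, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())     # -> (B, 16)
        self.bilstm = nn.LSTM(seq_dim, hidden, batch_first=True,
                              bidirectional=True)      # longitudinal branch
        fused_dim = 16 + 2 * hidden
        self.stage_head = nn.Linear(fused_dim, n_stages)   # classification
        self.score_head = nn.Linear(fused_dim, n_scores)   # regression

    def forward(self, image, sequence):
        img_feat = self.cnn(image)                     # (B, 16)
        seq_out, _ = self.bilstm(sequence)             # (B, T, 2*hidden)
        seq_feat = seq_out[:, -1]                      # last time step
        fused = torch.cat([img_feat, seq_feat], dim=1) # simple feature fusion
        return self.stage_head(fused), self.score_head(fused)

model = MultiModalMultiTask()
stage_logits, scores = model(torch.randn(4, 1, 64, 64), torch.randn(4, 5, 32))
```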

2.4 Deep Multi-modal Fusion Methods and Performance

Multi-modal fusion is a key research point in multi-modal research. It integrates information extracted from different modes into a stable multi-modal representation. Multi-modal fusion is closely related to representation: a process that uses some architecture to merge the representations of different single modes is classified as fusion. Fusion methods can be divided into early and late fusion according to where the fusion occurs in the processing pipeline. Because early and late fusion inhibit inter-modal interactions, current research focuses on intermediate fusion methods, which allow fusion operations to be placed in multiple layers of a deep-learning model. There are three main methods for fusing text and images: simple operation-based, attention-based, and tensor-based approaches. Figure 2 shows the schematic structure of the three multi-modal fusion methods.

Fig. 2 Schematic diagram of the structure of the three multi-modal fusion methods

Simple operation-based fusion integrates feature vectors from different modes using elementary operations, such as vector concatenation and weighted summation. For multi-modal tasks, this is usually the first approach that comes to mind [19,20,21, 26,27,28,29,30], and it has achieved the desired results in medical multi-modal applications. Holste et al. studied 10,185 dynamic contrast-enhanced breast MRI (DCE-MRI) examinations from 5248 women. Using a multi-modal model, they reduced the MRI data to 2D maximum-intensity projections, extracted clinical indications and breast density information from mammograms, and linked the images to 18 non-image features, achieving an area under the curve (AUC) of 0.849 [20]. This shows that simple operation-based methods can produce promising results when combined with excellent datasets. However, this approach provides little interaction between the two modes, so the coupling between them is insufficient.
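The following minimal sketch illustrates simple operation-based fusion, namely concatenation and a learnable weighted sum of two hypothetical feature vectors; the dimensions and the downstream classifier are assumptions made for illustration only.

```python
# Minimal sketch of simple operation-based fusion: per-modality feature
# vectors are combined by concatenation or by a (learnable) weighted sum
# before a shared classifier. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

img_feat = torch.randn(4, 128)      # e.g., features from an image encoder
tab_feat = torch.randn(4, 128)      # e.g., features from non-image data

# (a) Concatenation fusion
concat_fused = torch.cat([img_feat, tab_feat], dim=1)          # (4, 256)

# (b) Weighted-sum fusion with a learnable scalar weight per modality
weights = nn.Parameter(torch.tensor([0.5, 0.5]))
sum_fused = weights[0] * img_feat + weights[1] * tab_feat      # (4, 128)

classifier = nn.Linear(concat_fused.shape[1], 2)
logits = classifier(concat_fused)
```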

The attention-based fusion approach, built on an attention mechanism, has been widely used in multi-modal medical applications. To make the model attend to text, such as text in medical images and medical records, the mechanism assigns different weights to different parts of the image feature vector according to the characteristics of the image and text features. This enables the model to extract effective features from multi-source data and leverage the advantages of multi-modal fusion [13, 25, 31, 32]. For example, Silva et al. proposed an end-to-end multi-modal deep-learning generalized prognostic model that predicted survival for all 33 cancers studied in the TCGA program. The model also used more input data modes than previous studies, including clinical information, histopathological section images, and different types of genomic data [32].

Tensor-based multi-modal fusion, also known as bilinear pooling fusion, is predominantly used to fuse visual and textual feature vectors into a joint representation space [33]. Through stepwise decomposition of the weight tensor, an efficient multi-modal fusion model can be achieved. In recent years, the most important multi-modal fusion methods have combined attention-based and tensor-based (bilinear pooling) fusion [13, 34]. Clinical decision-making in oncology involves multi-modal data such as radiological scans, molecular analyses, histological sections, and clinical factors. Braman et al. used a deep orthogonal fusion (DOF) model to predict the overall survival of patients with glioma from such multi-modal data. The model combines multi-parametric MRI, biopsy-derived modalities [such as DNA-sequencing data and/or hematoxylin–eosin (H&E)-stained slide images], and clinical variables into an integrated multi-modal risk model, and introduces a multi-modal orthogonalization loss that encourages complementary embeddings to improve performance. When the DOF model predicted the overall survival of glioma patients, the median C-index was 0.788 ± 0.067, significantly better than the best-performing single-modal model (median C-index: 0.718 ± 0.064; P = 0.023) [34]. Faisal Mahmood's team used multi-modal deep learning to integrate and analyze whole-slide images and molecular profiling data from 14 cancer types. The algorithm distinguishes good and poor prognostic outcomes across cancer types, predicts patient prognosis from morphological and molecular information at both the disease and patient levels, and takes into account the spatial distribution of tumor, stromal, and immune cells in the tumor microenvironment; an open database has also been established for further exploration, as well as for biomarker discovery and characterization [35]. Figure 3 compares the performance reported by several research groups that applied multi-modal deep learning with the three different fusion approaches.

Fig. 3 Comparison of multi-modal fusion performance (AUC/C-index/Accuracy)
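For concreteness, the sketch below illustrates the remaining two fusion families, attention-based fusion via a small learned gate and tensor-based (bilinear pooling) fusion. The gating scheme, shapes, and use of a full bilinear layer are illustrative assumptions rather than the specific architectures of the cited studies.

```python
# Minimal sketches of attention-based and tensor-based fusion. Shapes are
# illustrative assumptions, not taken from the cited studies.
import torch
import torch.nn as nn

img_feat = torch.randn(4, 128)   # image-branch features
txt_feat = torch.randn(4, 128)   # text/clinical-branch features

# (a) Attention-based fusion: a small gate derives modality weights from
# both features, so the text can re-weight the image features and vice versa.
attn_gate = nn.Sequential(nn.Linear(256, 2), nn.Softmax(dim=-1))
alpha = attn_gate(torch.cat([img_feat, txt_feat], dim=1))       # (4, 2)
attn_fused = alpha[:, :1] * img_feat + alpha[:, 1:] * txt_feat  # (4, 128)

# (b) Tensor-based (bilinear pooling) fusion: every pairwise interaction
# between the two feature vectors is modeled through a bilinear form, then
# projected to a joint representation. Factorized (low-rank) variants reduce
# the parameter count of this full bilinear layer.
bilinear = nn.Bilinear(128, 128, 64)
tensor_fused = bilinear(img_feat, txt_feat)                     # (4, 64)
```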

Through an overall comparative analysis, the three-stage deep feature learning and fusion diagnosis framework proposed by Zhou et al. [46] stands out among the studies we collected. The framework is mainly designed to identify Alzheimer's disease (AD) and its precursor states, and it progressively integrates multi-modal imaging and genetic data at each stage, effectively alleviating the problem of multi-modal data heterogeneity. By designing a staged deep-learning strategy, the framework also partially solves the problem of incomplete multi-modal data. Overall, the framework achieves an all-round balance between the data and the fusion process. Based on our analysis, we believe that a balanced treatment of data and the fusion process is crucial for the success of multi-modal deep learning in medicine. We therefore recommend that future studies gradually integrate or extract the features of different modalities in a specific order during fusion, while also prioritizing data quality and progressively completing the data.

In multi-modal deep learning, the process of collecting effective features from different modes is called "multi-modal fusion". In this process, the individual modes are not simply fed to the model separately; instead, the data modes can be fused at different stages of the pipeline. For example, the simplest early fusion technique concatenates the input modes or features before the processing stage; however, this technique cannot be applied to complex data modes. A more sophisticated approach is intermediate fusion, wherein the representations of the different modes are combined and co-learned during training. It allows modality-specific preprocessing while capturing interactions between data modes to achieve joint representations. Late fusion is also a simple method, wherein a separate model is trained for each mode and their output probabilities are combined into a joint prediction; however, such a fusion method misses the opportunity to extract information from the interactions between the modes.

Over the past few years, deep learning has transitioned from mode-specific architectures, such as CNNs for images or RNNs for text, to the Transformer. This architecture performs well across various input and output modes and tasks. One promising aspect of the Transformer is its ability to learn meaningful representations from unlabeled data, because high-quality annotations in the medical field are scarce and expensive to obtain. At the same time, because of restrictions associated with privacy protection, medical ethics, and other rules, medical data are difficult to obtain. One possible solution is to use the available data from one mode to aid learning in another through a multi-modal learning task called "co-learning." For example, some studies have suggested that a Transformer pre-trained on unlabeled language data may generalize well to other tasks. In medicine, a model architecture called CycleGAN, trained on unpaired non-contrast and contrast-enhanced computed tomography (CT) scans, has been used to generate non-contrast or contrast-enhanced CT images [36].
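The following minimal sketch contrasts the three fusion stages described above (early, intermediate, and late); the encoders, dimensions, and two-class output are placeholders chosen for illustration.

```python
# Minimal sketch contrasting early, intermediate, and late fusion.
# Encoders and dimensions are placeholders/assumptions.
import torch
import torch.nn as nn

enc_a = nn.Linear(32, 16)   # stand-in encoder for modality A
enc_b = nn.Linear(48, 16)   # stand-in encoder for modality B
x_a, x_b = torch.randn(4, 32), torch.randn(4, 48)

# Early fusion: concatenate raw inputs/features before any joint processing.
early = nn.Linear(32 + 48, 2)(torch.cat([x_a, x_b], dim=1))

# Intermediate fusion: encode each modality, then co-learn a joint representation.
joint = nn.Linear(16 + 16, 2)(torch.cat([enc_a(x_a), enc_b(x_b)], dim=1))

# Late fusion: train a separate model per modality and combine output probabilities.
head_a, head_b = nn.Linear(16, 2), nn.Linear(16, 2)
p_a = torch.softmax(head_a(enc_a(x_a)), dim=1)
p_b = torch.softmax(head_b(enc_b(x_b)), dim=1)
late = (p_a + p_b) / 2
```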

3 Discussion

In the previous section, we reviewed the current models used to research disease prognosis and diagnosis with deep-learning approaches that merge image and non-image data. Our literature analysis found that, as of 2021, Transformers have driven the rapid emergence of multitask and multi-modal AI. OpenAI built the first such model, called DALL·E [38], with a Transformer architecture that can process image and text data simultaneously; it is known as the image version of GPT-3 [49]. The CLIP model, which can pair text and images, was likewise introduced [9]. Facebook released a series of improved Transformer models, one of which is UniT [41], which can simultaneously handle two data modes and seven tasks, including natural language processing and understanding, image recognition, and object detection. While most of these multitask, multi-modal AI systems are in the research and experimental stage, some have already achieved good results in practical applications. For instance, electronic health records (EHRs) have a complex multi-modal structure. Xu et al. used neural architecture search (NAS) and Multimodal Fusion Architecture Search (MUFASA) to select both single-mode and cross-mode network architectures; this approach outperformed single-mode NAS on publicly accessible EHR datasets [37].

Feature-level fusion methods can be further divided into operation-, tensor-, attention-, subspace-, and graph-based methods [17]. While the operation-based approach is intuitive and effective, it may perform poorly when learning complex interactions between different modes, and the tensor-based approach carries a risk of overfitting. The attention-based method is an effective approach to multi-modal feature fusion that can compute inter-modal and intra-modal importance, and it is therefore widely used in multi-modal fusion applications. Furthermore, with self-attention, the modal fusion steps do not need to be designed carefully: one can simply splice the multi-modal information into a sequence and use a Transformer encoder to learn the pairwise relations, merging information within and between modes simultaneously, or process multiple modes in parallel and use a Transformer decoder to perform the cross-modal fusion. The Transformer can also easily handle graph inputs, which are a more universal input structure; both images and sequences can be converted into graphs, making the Transformer a more universal network architecture. For example, Stanford University created a set of open-source Transformer models called ConVIRT [37], which can automatically annotate X-rays with text. The self-attention mechanism does not require the user to explicitly specify a prior adjacency matrix; one simply inputs sufficient data (if available) and lets the model learn the edge weights on its own.

Compared with CNNs and RNNs, the Transformer has more parameters and stronger expressive ability, and it requires more training data. Accordingly, to effectively assist physicians in diagnosis and therapy, scientists should conduct comparative studies on multi-modal learning in the medical field and develop and share more benchmark datasets. At the same time, although multi-modal learning benefits model performance, modal selection should take into account model capacity, data quality, and the specific task.
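As a concrete illustration of splicing multi-modal information into one sequence for a Transformer encoder, the sketch below fuses hypothetical image-patch and text-token embeddings with self-attention. The token counts, modality-type embeddings, and pooling head are assumptions for illustration, not a specific published model.

```python
# Minimal sketch of attention-based fusion with a Transformer encoder: token
# embeddings from different modalities are spliced into one sequence, and
# self-attention mixes information within and between modes. Token counts
# and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

d_model = 64
img_tokens = torch.randn(4, 49, d_model)   # e.g., projected image patches
txt_tokens = torch.randn(4, 20, d_model)   # e.g., projected report tokens

# Learnable modality-type embeddings tell the encoder where each token came from.
type_emb = nn.Embedding(2, d_model)
img_tokens = img_tokens + type_emb(torch.zeros(49, dtype=torch.long))
txt_tokens = txt_tokens + type_emb(torch.ones(20, dtype=torch.long))

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2)
fused = encoder(torch.cat([img_tokens, txt_tokens], dim=1))  # (4, 69, d_model)

# Pool the fused sequence for a downstream prediction (e.g., a diagnosis head).
logits = nn.Linear(d_model, 2)(fused.mean(dim=1))
```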

The successful advancement of multi-modal deep learning in medicine requires a large amount of data, and challenges arise at the level of the data, the models, and the complex tasks involved. We list the main challenges below.

  1. The diversity and uncertainty of medical datasets, including image and non-image data, sample size, depth of phenotypic analysis, heterogeneity and diversity of participants, degree of data standardization and harmonization, and degree of correlation among data sources, together constitute a greater challenge than that posed by a single-mode deep-learning model. The challenge of handling the highly variable data found in real-world clinical databases must therefore be considered to ensure effective application.

  2. In the development of multi-modal medical research and clinical applications, the collection, linking, and cost-effective annotation of multi-dimensional medical data also lead to challenges in terms of cost and speed.

  3. In multi-modal medical fusion, all the data types in a dataset must be properly associated and effective feature sets extracted. However, small and incomplete datasets as well as non-standardized data structures, which are prevalent in this field, pose significant challenges.

  4. For some modes (e.g., 3D imaging and genomics), processing even a single time point or individual instance of data requires a large amount of computing power. Hence, building models that simultaneously and rapidly process large-scale tumor pathology slides, genomics data, or medical text is an important fundamental challenge.

  5. When health and clinical data are collected for research, patients and doctors may raise privacy concerns. Establishing a trusted mechanism to monitor and mitigate these issues is critical and requires researchers to propose and explore more solutions.

  6. Multi-modal medical data fusion analysis is a multidisciplinary field that requires repeated interaction among clinicians, statistical analysts, algorithm engineers, bioinformatics engineers, and professionals from other disciplines to determine research schemes, and such collaboration poses significant challenges of its own.

To meaningfully process and integrate the information in different medical data and increase the participation of AI in the assisted diagnosis and treatment process, joint efforts between the medical community and AI researchers will be required to construct and validate new models and ultimately demonstrate their ability to improve diagnosis and treatment.

4 Bibliometrics

Bibliometrics is an effective quantitative method for examining research activities in a specific field. To describe, evaluate, and monitor MMDLM-related research results, we conducted a comprehensive analysis based on annual trends, countries, institutions, journals, highly cited papers, co-citations, coauthors, co-occurrence analyses, and literature coupling [17].

4.1 Literature Resources

To investigate this topic, we selected literature that adheres to the principles of representativeness and universality. The Web of Science (WoS) Core Collection database was employed as the data source for the retrieval of the literature related to multi-modal fusion and deep learning based on medical images. For the query, we used TS = (Multi-modal deep learning (OR (Multi-modal deep learning)) AND medicine). A total of 920 records were retrieved on this subject. Simultaneously, we used the refining function of the WoS to extract the key information of the remaining publications, eliminate irrelevant or weakly related publications, and select a final total of 879 publications as the basic target data of this study. The literature was collected from January 2010 to April 2022. All the selected publications met the following criteria: (1) the research content (in part/in whole) presented a multi-modal deep-learning fusion approach for medical applications and (2) the publication focused on the improvement of the deep-learning algorithm and its application in one or more detailed medical fields.

4.2 Analytical Methods

Considering the large number of publications identified in this study, it would have been challenging to manually extract the information individually and further explore the relationship between them. Therefore, it was necessary to sort them using a bibliometric analysis, which aims to study the distribution structure, quantitative relationships, and variation in the literature using measurement methods such as mathematics and statistics [50]. Common bibliometric mapping software tools include HistCite [51], CiteSpace [52], and VOSviewer [53, 54]. By comparing the features of these software tools, we found that VOSviewer is easier to use than the other software. By constructing the relationship and visually analyzing of network data (mainly literature knowledge units), the software can render a scientific knowledge graph and show the structure, evolution, cooperation, and other relationships in the knowledge domain. Its outstanding feature is its strong graphical display ability, which is suitable for large-scale data. We used VOSviewer to explore the collaborative publications and networks of multi-modal fusion in medical image research, as well as the collaboration and distribution among countries, research institutions and authors on the subject. We set the node types to “country”, “agency”, and “author” in VOSviewer software. At the same time, considering the close relationship between countries and institutions, we added nodes for countries and institutions in the same graph. By setting the node type to “category” and selecting the timeline view of the software, we visualized the evolution of the objects studied in this domain. When the node type is set to “Reference”, co-citation analysis can be carried out to analyze representative research articles. In addition, when the node type is set to “Terms”, cluster analysis is conducted according to the nodes in the co-occurrence graph. Ultimately, cluster analysis and term co-occurrence are used to distinguish the frontiers in research and hotspots in the different development stages of this research field.

4.3 Literature Analysis

4.3.1 Trends in the Number of Published Papers

We investigated 879 papers that were published between 2010 and 2022 (Fig. 4). The growth in the number of publications can be divided into three stages, which we call preparation, rise, and prosperity.

Fig. 4 Changes in the number of MMDLM publications from 2010 to 2022

4.3.2 Annual Trends and Possible Explanations

By analyzing the collaboration networks between different countries and institutions, we can clearly see the achievements of countries and research institutions in the field of multi-modal fusion techniques and further identify the collaborations between them from the relevant publications. Figure 4 shows the annual change in the number of multi-modal deep-learning medical publications.

With the widespread application of artificial intelligence in medicine, the number of MMDLM publications continues to expand. It saw particularly strong growth in 2017. There are two possible reasons for this rise:

  1. Many researchers have begun to use deep learning to solve medical problems.

  2. The pioneering research achievements globally have piqued the interest and increased the confidence of researchers in this approach.

4.3.3 Countries

The analysis of the literature data shows that scholars from 35 countries/regions have published on MMDLM, but over 85% of the publications were contributed by scholars from five highly active countries, as shown in Fig. 5. The United States was the largest contributor of MMDLM publications, with its scholars contributing 298 publications (33.9%). Chinese scholars ranked second, accounting for 32.65% of the publications in this field. After the United States and China, Germany, the United Kingdom, and Canada ranked third, fourth, and fifth with 10.92%, 9.88%, and 6.37%, respectively.

Fig. 5 Author distribution by country

4.3.4 Institutions

After identifying the outstanding contributors at the national level, we further analyzed the outstanding contributors at the institutional level. According to the collected data, the League of European Research Universities (LERU) is a prominent contributor, with 79 publications, accounting for 8.957% of the global publication volume on this topic. From China, an important contributing country, the University of Chinese Academy of Sciences and Shanghai Jiao Tong University ranked second and third, with 36 and 33 publications, respectively. Table 3 shows the details of the top-16 most productive organizations actively generating MMDLM-related publications. These institutions are affiliated with the United States (6), China (4), the United Kingdom (2), LERU (1), Germany (1), and France (1).

Table 3 Author distribution by institution

4.3.5 Highly Cited Papers

To identify the most influential ideas and scientific thinking in MMDLM research, we selected 12 highly cited papers (HCPs) and 1 "hot" paper from the WoS and ranked them according to their total citation frequency. Table 4 lists the authors, journal names, regions, titles, and citation counts of these HCPs. The MMDLM-related HCP by Arbabshirani et al. [55], which has been cited 381 times, can be regarded as a pioneering work on single-subject prediction of brain disorders from neuroimaging. Emerging trends such as multi-modal brain imaging, survival prediction, disease subtype classification, and multi-modal attention mechanisms are also discussed in the HCPs, while emerging imaging digitalization technologies and data-intensive computational methods such as reinforcement learning meet the needs of post-operative brain tumor prediction and evaluation applications. The paper, published in NeuroImage (a top journal), has been cited more than any other paper in this analysis. The HCPs have proposed different types of multi-modal deep-learning methods for solving medical problems, such as fusion models for medical image segmentation [6], brain tumor segmentation [24], multi-organ detection [56], anatomical education [57], the diagnosis of Alzheimer's disease [58], the prognosis of rectal cancer [59], breast mass detection [60], accurate diagnosis [61], and the treatment of tumors [62]. These theories and techniques are considered an indispensable part of MMDLM research.

Table 4 Highly cited papers

4.3.6 Research Landscape

MMDLM research is not restricted to the "Medical" and "Computer Science" domains but covers 93 WoS categories, reflecting the widespread application of MMDLM theories and approaches in various domains. "Radiology Nuclear Medicine Medical Imaging", "Engineering Biomedical", "Imaging Science Photographic Technology", and "Artificial Intelligence" are the largest categories, containing approximately 50% of the related documents. Figure 6 shows the primary WoS categories to which MMDLM-related documents belong. In addition to the "Computer Science" and "Medical" categories, MMDLM-related documents also appear in the "Neurosciences", "Mathematical Computational Biology", "Multidisciplinary Sciences", "Optics", "Telecommunications", "Clinical Neurology", "Neuroimaging", "Oncology", "Instruments Instrumentation", and "Biochemical Research Methods" categories. This is a profound demonstration of the wide application of MMDLM.

Fig. 6 MMDLM research categories

4.3.7 Keyword Co-occurrence Analysis for MMDLM

Keyword co-occurrence can effectively reveal the research hotspots in a field. To explore the research hotspots of MMDLM, the VOSviewer literature analysis software was used to perform a bibliometric analysis of keyword co-occurrences in the 879 analyzed studies. We obtained a total of 186 keywords from MMDLM-related publications and used VOSviewer to visualize the MMDLM keyword co-occurrence network (Fig. 7). In the density visualization, each keyword node is colored according to the link density at that node, which depends on how closely it is related to its neighboring nodes. Keywords shown in red occur with high frequency, whereas keywords that appear less frequently are shown in amber. Density visualization is very effective for understanding the overall structure and focusing on the most important components. Table 5 lists the most frequent keywords of MMDLM-related publications. From Fig. 7, we can intuitively identify the following research hotspots of MMDLM:

  1. Deep learning plays a core role in medical multi-modal applications; most of the literature is related to computer science and technology disciplines.

  2. At present, researchers mainly focus on the classification and segmentation of physiological data, which are traditional deep-learning applications.

  3. Tumors and brain science are important areas of concern for researchers.

Table 5 Top-10 keywords of MMDLM-related publications
Fig. 7 Keyword co-occurrence network of MMDLM-related publications

The node and phrase font sizes in Fig. 7 represent weights: the larger the node (keyword) font, the larger the corresponding weight. The distance between two nodes indicates the strength of their coupling, and a direct link between two keywords indicates that they co-occur; the denser the connecting lines, the more frequently they co-occur. VOSviewer split the keywords of MMDLM-related publications into eight clusters according to their coupling relationships, with each cluster shown in a single color. The keywords "deep learning" and "classification" appeared most frequently; other high-frequency keywords include "segmentation" (86), "MRI" (63), and "magnetic resonance imaging" (34). The link strength between two nodes represents their co-occurrence frequency and quantifies the coupling relationship between them, and the total link strength of a node is the sum of the strengths of its links to all associated nodes. By observing this strength, the closeness of related studies can be assessed intuitively, which may direct researchers toward subtopics of interest and future research directions.

4.3.8 Co-citation Analysis for MMDLM-Related Publications

When two (or more) articles are cited together by one or more later articles, the two articles are said to have a co-citation relationship. In co-citation analysis, the more representative literature is usually selected as the analysis object, and a network analysis method is adopted to perform cluster analysis on this literature, so that the knowledge graph of a research area can be visualized intuitively. Co-citation analysis is also widely used to reveal the coupling of authors, literature, and journals in a research field. In this section, we examine the co-citation of authors, literature, and journals in the medical applications of multi-modal deep learning. Figure 8 shows the journal co-authorship network for MMDLM-related publications. By analyzing and summarizing the co-citation relationships between authors and publications in the research field, the authors' citation network can be displayed, which reveals their research interests. The VOSviewer software was used to draw the author atlas of MMDLM scholars, shown in Fig. 9. Unsurprisingly, the node coupling degree indicates that the nodes of Km, Kamnitsas, and Bengio are the largest, showing that these authors actively participate in this field.

Fig. 8 Journal co-authorship network of MMDLM-related publications

Fig. 9 Co-authorship network of MMDLM-related publications

4.3.9 Bibliographic Coupling Analysis

In a citation network, the bibliographic coupling of two articles refers to the number of references they share, that is, the number of other articles cited by both. The undirected document coupling network corresponding to a directed citation network can be defined as follows: if two articles have at least one identical reference, there is an edge between the corresponding two nodes. Bibliographic coupling reflects the relationship between two citing articles, whereas co-citation reflects the relationship between two cited articles. Coupling analysis can show which domains the publications are more concerned with, and the coupling weight depends on the number of referenced publications they share. Consistent with the co-authorship and co-citation analyses above, we focus on discussing and showing the bibliographic coupling networks of the 879 articles by author (Fig. 10), journal (Fig. 11), institute (Fig. 12), and country (Fig. 13).

Fig. 10 Author bibliographic coupling network of MMDLM-related publications

Fig. 11 Journal bibliographic coupling network of MMDLM-related publications

Fig. 12 Institute bibliographic coupling network of MMDLM-related publications

Fig. 13 Country bibliographic coupling network of MMDLM-related publications

5 Challenges and Perspectives

As noted in Sect. 3, the successful advancement of multi-modal deep learning in medicine requires a large amount of data, and further challenges arise at the level of the data, the models, and the complex tasks involved. We summarize the main remaining challenges below.

  1. Lack of standardized data collection and annotation protocols, which can lead to bias and limit the generalizability of models.

  2. Interpretability of multi-modal deep-learning models is still a challenge, which may hinder their application in clinical practice.

  3. Heterogeneity of data from different modalities, which may require different preprocessing and integration techniques to be effectively combined.

  4. Availability and accessibility of multi-modal data, as collaboration between different institutions and data-sharing agreements may be required to collect and integrate data from multiple sources.

  5. Model overfitting, which can lead to poor generalization performance on new data and is particularly problematic in medical applications.

  6. Ethical implications of using deep-learning models in medical decision-making, such as potential biases and errors in the models, as well as the impact on patient privacy and autonomy.

We also offer a few visions for the future of the field:

  1. Multi-modal deep learning holds great promise for improving medical diagnosis and treatment, with the ability to generate more accurate and personalized predictions.

  2. Integration of multi-modal data can improve the accuracy of diagnosis and prognosis of various diseases, such as cancer and Alzheimer's disease, and help clinicians predict disease risk and personalize treatment plans.

  3. Multi-modal deep learning can provide more accurate and timely diagnoses and treatment recommendations, aiding clinical decision-making.

  4. Approaches being explored include developing more standardized protocols for data collection and annotation, developing more interpretable and transparent models, and facilitating the sharing and integration of multi-modal data.

In summary, the application of multi-modal deep learning in medical research and clinical practice holds great promise for improving healthcare outcomes and personalized treatment. However, it is important to address the challenges associated with these models, including issues such as data heterogeneity, interpretability, and ethical considerations. By working together to address these challenges, we can unlock the full potential of multi-modal deep learning in healthcare, and improve the diagnosis, treatment, and outcomes for patients.

6 Conclusion

In this report, we comprehensively discussed the performance of medical multi-modal deep learning in terms of pre-trained networks, fusion approaches, and models in clinical and application studies, and we compared their key characteristics. In addition, we set out six significant challenges for multi-modal healthcare convergence. To obtain a clearer understanding of this new research trend, we also measured and evaluated influential researchers, institutions, and research directions using bibliometric methods. From the results, we conclude that the research can be approximately divided into three periods: preparation (2010–2016), rise (2017–2019), and prosperity (2020 to the present). To address the current lack of multi-modal data fusion in medical deep-learning applications and the needs of human–machine collaborative, cross-modal computer-aided diagnosis and treatment, we reviewed research that uses deep learning to fuse multi-modal medical data with medical knowledge and to establish retrieval and matching mechanisms for heterogeneous, multi-dimensional medical data. To this end, we reviewed network methods that improve multi-layer semantic feature matching of heterogeneous medical data, advance fusion mechanisms for multi-modal heterogeneous data, and apply multi-modal deep learning, thereby providing a research and application basis for assisting doctors in diagnosis and treatment. Through the analysis of the collected literature, we found that researchers are particularly interested in cancer research based on artificial intelligence technology and, in terms of methodology, focus on diverse approaches to model integration. In addition, it is hoped that deep-learning models can be applied to long-term health monitoring and disease prevention to ensure sufficient data support for the in-depth development of AI in the medical field and to further improve multi-modal systems. We believe that the findings of this report will help guide the development and adoption of AI in clinical medicine and research.