Background

Pathogenomics is an innovative image-analysis approach that leverages invasive (tissue-based) techniques to draw correlations between genomic profiles and pathological image features. This approach provides a more profound comprehension of tumor biology and allows the inherent heterogeneity of tumors to be captured. The ultimate objective of pathogenomics is to develop specific imaging biomarkers that combine genotypic and phenotypic metrics.

With an estimated 19.3 million new cancer cases and nearly 10 million cancer deaths worldwide in 2020 [1], innovation in cancer diagnosis and treatment is urgently needed. Cancer diagnosis and prediction of treatment response and prognosis often harness heterogeneous data resources, including whole slide images (WSI), molecular profiles, and clinical data such as patient age and comorbidities [2, 3]. Several recent studies have illustrated that patterns found in high-dimensional, multimodal data can improve prognostication of disease invasiveness, stratification, and patient outcomes compared with unimodal data [2, 4,5,6,7].

As anatomical pathology moves from glass slides to digitized whole slide images and next-generation sequencing (NGS) becomes routine, rapid progress has been made in each individual modality through deep learning and other advanced machine learning methods. Yet major unsolved questions remain about how to integrate multimodal data and mine useful information from it. This presents a critical opportunity to develop joint histopathology-genomics analysis based on artificial intelligence (AI) approaches that leverage phenotypic and genotypic information in an integrated manner, that is, pathogenomics.

At present, histopathological image analysis is still performed mainly by manual review [2, 8,9,10], supplemented by computer-based quantitative analysis. Although quantitative analysis has made great progress at this stage, it mainly studies a few common features, such as morphological, color and texture, and shape features; no single feature can cover the complexity and variability of tumors, which remains an urgent problem in quantitative computational pathology [11]. Subjective, qualitative histopathology-based assessment of the tumor microenvironment (TME), alongside quantitative omic analysis, especially genomics, is the standard of care for most cancers in modern clinical practice [3, 5, 6, 12,13,14,15]. The tumor microenvironment is the complex cellular environment surrounding the tumor, composed mainly of blood vessels, extracellular matrix (ECM), immune cells, fibroblasts, inflammatory cells, and various signaling molecules [16]. Although histological tissue analysis provides significant morphological and spatial information on the TME, it is examined qualitatively by experienced pathologists and is subject to inter- and intra-observer variability; interpretation at the histological level alone can hardly take advantage of the abundant phenotypic information that has been shown to correlate with prognosis.

Genomic analysis focuses on monitoring cellular activity at the molecular level, whereas pathology quantifies disease phenotypes. Modern sequencing technologies, such as single-cell sequencing, can resolve the genomic information of individual cells in tumor specimens, while spatially resolved transcriptomics and multiplex immunofluorescence technologies can simultaneously resolve histopathological morphology and gene expression in space [17,18,19,20,21]. Bulk sequencing can reveal the presence and quantity of all genes and TME components in a tumor at a given time, helping us understand the molecular discrepancies between disease phenotypes and responses to treatment. Genomic analysis of tissue biopsies can provide quantitative information on genome expression and alterations, including gene mutations, copy number variations (CNV), and DNA methylation, but it is challenging to distinguish tumor-derived genotype alterations from those contributed by non-tumor entities such as normal cells.

Therefore, the emergence of pathogenomics provides an exciting opportunity to combine morphological information from histopathology and molecular information from genomic profiles to better quantify the tumor microenvironment, and to harness advanced machine learning algorithms for the discovery of potential histopathology-genomics-based biomarkers and for precision oncology in accurate diagnosis, treatment, and prognosis prediction. To exemplify this premise of pathogenomics, we will focus on three major modalities in cancer data: histopathology, genomic profiles, and clinical information (Fig. 1).

Fig. 1

Example data modalities for multimodal integration include clinical, pathological, and genomic profiles. Submodels extract unimodal features from each data modality. A multimodal integration step then generates intermodal features, for example via a tensor fusion network. Final sub-models perform multi-task learning for patient outcomes and clinical prognostication, including patient stratification and molecular subtyping, survival prediction and treatment response, and discovery of key features and novel biomarkers. (Created with BioRender.com.) WSI whole slide image, IHC immunohistochemistry

The integration of histopathological phenotypes and genotypes could help us:

(1) Understand context-aware linkage between tissue architectures and molecular properties;

(2) Capture more “understandable” spatial features through systematic and quantitative analysis;

(3) Discover novel diagnostic and prognostic image-omics-based biomarkers;

(4) Gain complementary information for visualizing the pathological and molecular context of cancer;

(5) Develop multimodal fusion models for prediction of survival, gene mutation signatures, patient stratification, and treatment response.

In this work, we provide a brief review of representative works that integrate pathomics and genomics for cancer diagnosis, treatment, and prognosis, covering both the correlation and the fusion of pathomics and genomics. We also present challenges, potential opportunities, and perspectives for future work. An overview of the fusion of pathomics and genomics analysis is shown in Fig. 1.

Correlation between pathological and genomic profile of cancer

Correlating pathological morphology with large-scale genomic analysis has become a burgeoning field of research in recent years [8, 22]. Given that most WSIs currently lack pixel-level annotations, correlating image features with molecular data can verify, to a certain extent, whether those features are consistent with known biological mechanisms: for instance, whether image-derived immune features are related to immune regulatory genes, and hence whether image features obtained by machine learning algorithms could reliably replace manual estimation by physicians in the future. Moreover, association analysis of image features and molecular expression patterns can bring new inspiration to cancer biology research and help to find potential new biomarkers [23].

Whole slide images and computational pathology

Whole slide images offer a wealth of pathological information, including details about nucleus shape, texture, global and local structure, collagen patterns, and tumor-infiltrating lymphocyte (TIL) patterns. However, WSIs are highly complex: their large size (a resolution of 100k × 100k pixels is common), color information (hematoxylin and eosin and immunohistochemistry stains), lack of the apparent anatomical orientation found in radiology, availability of information at multiple scales (e.g., 4×, 20×), and multiple z-stack levels [11] make it challenging for human readers to extract such visual information precisely. Fortunately, the advent of artificial intelligence and machine learning tools in digital pathology enables mining of histopathological morphometric phenotypes and might, ultimately, improve precision oncology.

Stained tissue samples, observed under an optical microscope, provide detailed insights into the morphological distribution of cells in different biological states, as well as the composition of the tumor microenvironment (TME). Whole slide imaging allows clinicians to analyze and evaluate these varied aspects of the TME, such as the tissues and cells of cancer patients. This analysis can help determine whether a cancer is benign or malignant and classify tumor grade, extent of invasion, and prognosis, leading to a qualitative or semi-quantitative diagnosis. As a result, WSI is currently considered the “gold standard” for clinical diagnosis of cancer. However, given the high heterogeneity of tumors, manual estimation from images can be subjective, and its accuracy is affected by each pathologist’s clinical experience and working conditions. This can lead to inevitable human bias, resulting in diagnostic disagreement or even misdiagnosis. In recent years, with the rapid development of machine learning, a growing number of studies have applied advanced algorithms to WSIs to automatically identify and quantitatively analyze important tissues and cells in images, thereby assisting clinical evaluation and related computational pathology studies. Colorectal [24,25,26], breast [27, 28], gastrointestinal [29, 30], prostate [31, 32], and lung cancers [33,34,35] have all been addressed by automatic classification or quantitative analysis using advanced machine learning algorithms on multicenter, large-cohort WSI data.

Computational pathology enables prediction of gene mutations, cancer subtypes, stratification, and prognosis

The causal and inferential relationships between gene expression and pathology are indeed crucial in the discovery of biomarkers. Hematoxylin–eosin (H&E)-stained WSIs and immunohistochemistry (IHC) data have been leveraged to predict molecular features of tumors and to discover new prognostic associations with clinical outcomes. We refer readers to several excellent reviews in these areas [2, 8, 11, 36].

One notable multicenter pan-cancer example showed that deep residual learning (ResNet-18) can predict microsatellite instability (MSI) status directly from H&E histology [37, 38], suggesting a convenient and effective way to identify a biomarker of response to immune checkpoint blockade (ICB). Similarly, deep-learning models can assist pathologists in detecting cancer subtypes or gene mutations [24, 33]. However, these deep learning methods rely heavily on large training cohorts and suffer from poor interpretability; thousands of hand-crafted features or manual pixel-level annotations are often needed to achieve excellent, generalizable performance, depending on task and data complexity. Harnessing different data modalities at a large, clinical-grade scale often requires reducing this burden of time-consuming annotation, especially in multimodal tasks.

Interpretable quantitative analysis of histological images can also be performed using weakly supervised learning without tissue annotation, identifying biological features such as TILs and other properties of the TME and their correlation with molecular features. A recent study found that HE2RNA [39], a model based on the integration of multiple data modes, can be trained to systematically predict RNA-Seq profiles from whole-slide images alone, without expert annotation; the model can be applied for clinical diagnostic purposes such as the identification of tumors with MSI. Other studies [33, 40,41,42] have linked biologically interpretable features with clinical outcomes, revealing gene mutations, tumor composition, and prognosis. Table 1 summarizes representative research works on pathomics-genomics correlation.

Table 1 Overview of research works on correlating pathomics with genomics

Molecular signatures are the most intuitive measurements of response to therapeutic interventions, and survival analysis for cancer prognosis prediction is a standard approach for biomarker discovery, stratification of patients into different treatment groups, and therapeutic response prediction [43]. Many studies have explored the relationship between pathological phenotypic characteristics and cancer prognosis: histopathological boundary morphology and spatial distribution characteristics extracted by different machine learning models are predictive of cancer grade and prognosis. Recent work has incorporated deep learning into survival analysis, with common covariates including CNV, mutation status, and RNA sequencing (RNA-Seq) expression [44, 45], to examine the relationship between gene signatures and survival outcomes. Nevertheless, such survival analyses for cancer outcome prediction are based mainly on genomic profiles and do not leverage heterogeneous information from the inherent phenotypic data sources, including diagnostic and IHC slides, which have known and significant prognostic value.
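To make the link between deep learning and survival analysis concrete, the sketch below shows a negative log Cox partial-likelihood loss computed over a mini-batch of predicted risk scores; any network producing one risk score per patient, whether from genomic, pathomic, or fused features, could in principle be trained with it. The function name and tensor layout are illustrative assumptions, not the formulation of any specific cited study.

```python
import torch

def cox_partial_likelihood_loss(risk_scores, event_times, event_observed):
    """Negative log Cox partial likelihood for a mini-batch.

    risk_scores:    (N,) predicted log-hazard ratios from any network
    event_times:    (N,) follow-up times
    event_observed: (N,) 1 if the event (e.g., death) occurred, 0 if censored
    """
    # Sort by descending time so the risk set of sample i is samples [0..i]
    order = torch.argsort(event_times, descending=True)
    risk = risk_scores[order]
    observed = event_observed[order]

    # log-sum-exp of risk over each sample's risk set (cumulative in sorted order)
    log_risk_set = torch.logcumsumexp(risk, dim=0)

    # The partial likelihood sums only over uncensored (observed) samples
    diff = risk - log_risk_set
    n_events = observed.sum().clamp(min=1.0)
    return -(diff * observed).sum() / n_events
```

Minimizing this loss corresponds to maximizing the Cox partial likelihood, so the network output can be interpreted as a relative risk score for downstream stratification.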

Therefore, some studies further combine gene expression data at the molecular level with histopathological features to improve the accuracy of prognosis prediction. For instance, Savage et al. [46] found that combining gene expression, copy number variation, and pathological image features could improve prediction of chemotherapy efficacy and prognosis in breast cancer. Cheng et al. [47] found that combining morphological features of cells in pathological images with feature genes from functional genomic data can lead to better prognosis prediction in renal cancer than using image data or genetic data alone. Mobadersany et al. [48] first proposed a deep learning-based framework to predict prognosis from pathological image features, and then extended this model to unify image features with genomic biomarker data to predict glioma prognosis.

Studies correlating histopathological and genomic profiles have found improved prognosis prediction [30, 49,50,51], but further refinement is needed to better delineate clinically meaningful subgroups. Emerging spatial genomics techniques [7, 15, 18, 52] and complementary clinical and imaging modalities offer opportunities to enrich these data and refine prognostication.

Fusion of histology and genomic profile of cancer

The complementary information gained by combining these modalities, including the morphological and spatial information from digital pathology, the molecular changes underpinning them, and the corresponding structured pathology reports, is already accelerating biological discovery and applied multimodal tool research in cancer. We suggest that the unimodal models across histopathology, molecular, and clinical domains discussed above will become the building blocks of integrated multimodal models for pathogenomics. The design choices for multimodal models in pathogenomics are shown in Fig. 2.

Fig. 2

Fusion strategies for multimodal models with genomic profiles, histopathological images, and clinical information. Solid arrows denote stages with learnable parameters (linear or otherwise), dashed arrows denote stages without learnable parameters, and dashed and dotted arrows denote the options for learnable parameters, depending on the model architecture

Understanding the histological context of genomic data is indeed essential for a comprehensive understanding of a tumor’s clinical behavior. Given intra-tumor heterogeneity, the expression level of certain genes may vary significantly across different regions within the same tumor. In addition, the diagnostic slide of a tissue sample provides a global view of tumor morphology, and thus pathomic analysis could alleviate the sampling issues that arise in genomic analysis. However, relying solely on pathological features or genomic expression may not provide biological explanations for some clinical behaviors. Therefore, many researchers have attempted to combine these two data modalities to create more accurate and reasonable companion diagnostic tools.

Multimodal fusion approaches

The increasing availability of heterogeneous biomedical data modalities, such as electronic health records, medical images, and multi-omics sequences, has paved the way for the development of multimodal intelligence solutions that capture the complexity of human health and disease [56, 57].

These solutions utilize model-agnostic fusion methods, which means that multimodal fusion does not directly rely on specific machine learning methods [27]. Model-agnostic approaches can be divided into early fusion, late fusion, and hybrid fusion [58].

Early fusion

Early fusion refers to combining the original data of different modalities, or the learned private features of each modality’s representation, before they are fed into a machine learning model. Since early fusion is feature-based, it can ensure that low-dimensional information, such as the spatial features of each modality, is preserved during fusion. This method is suitable when no single modality contains enough information to complete the target task on its own. In other words, in early fusion the decision is made directly from the original data or the learned private features of each modality. This fusion method can be expressed as Eq. (1):

$$y = D\left( F\left( x_{1}, x_{2}, \ldots, x_{n} \right) \right),$$
(1)

where \(x_{i}\) denotes the i-th data source, \(f\left( x_{i} \right)\) represents the feature extracted by the model from \(x_{i}\), \(F\) represents the feature fusion method, and \(D\) represents the decision made by the model based on the fused features.
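A minimal sketch of early (feature-level) fusion in the sense of Eq. (1): per-modality feature matrices are concatenated (the fusion operator \(F\)) before a single downstream decision model \(D\). The synthetic features and the logistic-regression classifier are placeholder assumptions, not a prescribed pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def early_fusion_predict(features_per_modality, labels):
    """Early fusion: F = concatenation, D = a single classifier (Eq. 1)."""
    # features_per_modality: list of (N, d_i) arrays, one per modality
    fused = np.concatenate(features_per_modality, axis=1)            # F(x_1, ..., x_n)
    decision_model = LogisticRegression(max_iter=1000).fit(fused, labels)  # D
    return decision_model.predict_proba(fused)[:, 1]

# Toy example with synthetic pathomic and genomic feature vectors
rng = np.random.default_rng(0)
pathomic = rng.normal(size=(100, 32))   # e.g., WSI patch features
genomic = rng.normal(size=(100, 64))    # e.g., RNA-Seq expression features
labels = rng.integers(0, 2, size=100)
scores = early_fusion_predict([pathomic, genomic], labels)
```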

Late fusion

In late fusion, each modality’s data is processed separately and the results are then combined to make a final decision. This approach allows each modality to focus on the features that are most relevant to it, which can improve the accuracy of the final decision. However, it requires that each individual model be highly accurate, as the final prediction is dependent on the accuracy of all individual models. This fusion method can be expressed as Eqs. (2) and (3):

$$y = D\left( d\left( x_{1} \right), d\left( x_{2} \right), \ldots, d\left( x_{n} \right) \right),$$
(2)
$$d\left( x_{i} \right) = M_{i}\left( f\left( x_{i} \right) \right),$$
(3)

where \(x_{i}\) denotes the i-th data source, \(f\left( x_{i} \right)\) represents the feature extracted by the model, \(d(x_{i})\) represents the decision made by model \(M_{i}\), and \(D\) represents the comprehensive decision method.

The commonly used comprehensive decision-making methods in late fusion include: (1) average fusion, in which the final prediction is obtained by averaging the confidence scores output by the unimodal models; and (2) weighted fusion, in which the final decision is obtained from a weighted combination of the output confidences of the unimodal models [59].
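The average and weighted schemes above can be sketched as follows, assuming each unimodal model \(M_{i}\) has already produced a class-probability (confidence) matrix for the same set of samples; the function and argument names are illustrative.

```python
import numpy as np

def late_fusion(unimodal_probs, weights=None):
    """Combine per-modality confidence scores into a final decision (Eqs. 2-3).

    unimodal_probs: list of (N, C) probability arrays, one per modality model M_i
    weights:        optional per-modality weights; defaults to simple averaging
    """
    probs = np.stack(unimodal_probs, axis=0)               # (n_modalities, N, C)
    if weights is None:
        fused = probs.mean(axis=0)                         # average fusion
    else:
        w = np.asarray(weights, dtype=float)
        fused = np.tensordot(w / w.sum(), probs, axes=1)   # weighted fusion
    return fused.argmax(axis=1), fused                     # predicted class, fused confidence
```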

Hybrid fusion

The hybrid fusion method can leverage the strengths of both early and late fusion, potentially leading to improved performance. First, Model1 takes the \(n\) modal entity objects \((x_{1}, x_{2}, \ldots, x_{n})\) as input to obtain a preliminary decision; this decision is then integrated with the decisions of other unimodal models covering the remaining \(k - n\) modalities to obtain the final prediction output. This fusion method can be expressed as Eqs. (4) and (5):

$$y = D\left( d\left( x_{1}, x_{2}, \ldots, x_{n} \right), d\left( x_{n + 1} \right), \ldots, d\left( x_{k} \right) \right),$$
(4)
$$d\left( x_{i} \right) = M_{i}\left( f\left( x_{i} \right) \right),$$
(5)

where \(x_{i}\) denotes the i-th data source, \(f\left( x_{i} \right)\) represents the feature extracted by the model, \(d(x_{i})\) represents the decision made by model \(M_{i}\), and \(D\) represents the comprehensive decision method.
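A minimal sketch of the hybrid scheme of Eqs. (4) and (5): one model early-fuses the first group of modalities, and its decision is then combined with the decisions of the remaining unimodal models through a simple averaging operator \(D\). The logistic-regression models are placeholders for whatever unimodal or fused models a study actually uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def hybrid_fusion_predict(early_modalities, late_modalities, labels):
    """Hybrid fusion: early-fuse one group of modalities in Model1, then
    average its decision with the decisions of the remaining unimodal models."""
    # Stage 1: early fusion of x_1..x_n inside a single model (Model1)
    fused_input = np.concatenate(early_modalities, axis=1)
    model1 = LogisticRegression(max_iter=1000).fit(fused_input, labels)
    decisions = [model1.predict_proba(fused_input)]

    # Stage 2: independent unimodal models for x_{n+1}..x_k
    for x in late_modalities:
        m = LogisticRegression(max_iter=1000).fit(x, labels)
        decisions.append(m.predict_proba(x))

    # Comprehensive decision D: simple averaging of all confidence scores
    return np.mean(decisions, axis=0)
```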

The frontier of methodology and application for pathogenomics integration

Prognosis refers to the prospect of recovery anticipated from the usual course of the disease or the peculiarities of the case, such as the probability of tumor recurrence, distant metastasis, or death. Risk stratification can be accomplished using the tumor, node, metastasis (TNM) staging system, molecular features, or clinical variables. However, improving prognostic insight is an active area of research, with frontiers in survival modeling that include multimodal and pan-cancer approaches. Table 2 gives an overview of research works that harness pathomics and genomics for multimodal fusion in clinical prediction tasks.

Table 2 Overview of representative multimodal fusion works that combine pathomics and genomics for clinical prediction tasks

The main goal of multimodal fusion technology is to narrow the distribution gap in the semantic subspace while maintaining the integrity of modality-specific semantics. Meanwhile, a key element of multimodal fusion architectures is feature concatenation. One prevailing multimodal learning approach is based on autoencoder (AE) models [60], giving DL models the ability to integrate different data modalities into a single end-to-end optimized model [61]. An AE architecture, comprising an encoder and a decoder working in tandem [62], usually starts by encoding each input modality into a lower-dimensional representation vector, followed by a feature combination step that aggregates these vectors. For instance, Tan et al. [15] developed SpaCell, based on AE models, to integrate millions of pixel intensity values with thousands of gene expression measurements from spatially barcoded spots in pathological tissue; this approach performed better than either unimodal method alone.
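As an illustration of the autoencoder pattern described above, the following PyTorch sketch encodes each modality separately, fuses the latent codes into one vector, and reconstructs both inputs; the layer sizes are arbitrary and this is a generic sketch, not the SpaCell architecture itself.

```python
import torch
import torch.nn as nn

class MultimodalAutoencoder(nn.Module):
    """Encode each modality, fuse the latent codes, and reconstruct both inputs."""
    def __init__(self, dim_img=256, dim_gene=1000, dim_latent=64):
        super().__init__()
        self.enc_img = nn.Sequential(nn.Linear(dim_img, 128), nn.ReLU(), nn.Linear(128, dim_latent))
        self.enc_gene = nn.Sequential(nn.Linear(dim_gene, 256), nn.ReLU(), nn.Linear(256, dim_latent))
        self.fuse = nn.Linear(2 * dim_latent, dim_latent)   # feature combination step
        self.dec_img = nn.Linear(dim_latent, dim_img)
        self.dec_gene = nn.Linear(dim_latent, dim_gene)

    def forward(self, x_img, x_gene):
        z = self.fuse(torch.cat([self.enc_img(x_img), self.enc_gene(x_gene)], dim=1))
        return self.dec_img(z), self.dec_gene(z), z          # reconstructions + fused code

# Reconstruction loss over both modalities drives the shared (fused) representation
model = MultimodalAutoencoder()
x_img, x_gene = torch.randn(8, 256), torch.randn(8, 1000)
rec_img, rec_gene, z = model(x_img, x_gene)
loss = nn.functional.mse_loss(rec_img, x_img) + nn.functional.mse_loss(rec_gene, x_gene)
```

The fused code `z` can then serve as the per-sample feature vector for a downstream survival or classification head.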

Feature vector concatenation is also a common, straightforward strategy for integrating pathomics and genomics [48, 63,64,65]. Shao et al. [66] proposed an ordinal multi-modal feature selection (OMMFS) method that identifies important features from each modality while considering the intrinsic relationships between modalities. Chen et al. [5] introduced a sophisticated end-to-end late fusion framework that fuses deep features learned from histology images, at patch level and cell-graph level, with genomic features learned from genomic profiles; pairwise feature interactions across modalities, obtained by taking the Kronecker product of unimodal feature representations together with a gating attention mechanism, were used to construct prognostic models for glioma and clear cell renal cell carcinoma (CCRCC). Cheerla et al. [63] constructed a deep learning-based pan-cancer model with autoencoders that compresses four data modalities (gene expression, miRNA data, clinical data, and WSI) into a single feature vector per patient, handling missing data through a resilient multimodal dropout method, to predict patient survival. These studies showed that combining WSI and genomic data outperforms unimodal models, and that survival models are also suitable for computational modeling at the pan-cancer level. Wulczyn et al. [67] trained survival models for 10 cancer types and evaluated the predictive performance in each cancer type. Vale-Silva et al. [51] trained pan-cancer, multimodal models across 33 cancer types. The current consensus seems to be that histopathology-based features can complement survival modeling based on genomic or clinical variables. However, the ultimate application in clinical settings may depend heavily on the selected image features, model type, and well-curated datasets, among other factors.
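The pairwise-interaction idea can be illustrated with a short sketch: unimodal embeddings are gated, augmented with a constant 1 so that unimodal terms survive, and combined through an outer (Kronecker) product before a linear projection. This is a generic illustration of the technique with arbitrary layer sizes, not the released implementation of Chen et al. [5].

```python
import torch
import torch.nn as nn

class KroneckerFusion(nn.Module):
    """Bilinear (Kronecker-product) fusion of two unimodal embeddings with gating."""
    def __init__(self, dim_path=32, dim_gene=32, dim_out=64):
        super().__init__()
        self.gate_path = nn.Sequential(nn.Linear(dim_path, dim_path), nn.Sigmoid())
        self.gate_gene = nn.Sequential(nn.Linear(dim_gene, dim_gene), nn.Sigmoid())
        self.project = nn.Linear((dim_path + 1) * (dim_gene + 1), dim_out)

    def forward(self, h_path, h_gene):
        # Gating attention: down-weight uninformative dimensions of each modality
        h_path = h_path * self.gate_path(h_path)
        h_gene = h_gene * self.gate_gene(h_gene)
        # Append a constant 1 so the outer product keeps unimodal as well as pairwise terms
        ones = torch.ones(h_path.size(0), 1, device=h_path.device)
        hp = torch.cat([h_path, ones], dim=1).unsqueeze(2)   # (B, dp + 1, 1)
        hg = torch.cat([h_gene, ones], dim=1).unsqueeze(1)   # (B, 1, dg + 1)
        fused = torch.bmm(hp, hg).flatten(start_dim=1)       # (B, (dp+1) * (dg+1))
        return self.project(fused)
```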

Deep learning approaches can capture different perspectives of tumor morphology, but for a successful model to translate into new insights, it is critical to disambiguate tissue types in order to understand model predictions. Tissue type area ratio [30] and connectivity [68] affect the final prediction results. The morphology of the intratumoral stroma may be a stronger prognostic indicator than that of the tumor itself [69, 70]. Furthermore, a loss function adapted to the censored nature of survival data outperforms a single binary classifier; however, multi-task approaches that combine multiple binary classifiers, or a survival loss with binary classifiers, may yield better risk stratification.

In general, the studies discussed above demonstrate that multimodal fusion strategies combining histopathological images and genomic profiles improve clinical prediction and patient stratification over digital pathology or molecular methods alone. However, the improvements observed in these early studies need to be confirmed by adequate statistical analysis and external validation, and further experiments are indispensable to demonstrate generalizability and robustness before these approaches can be applied in real clinical settings. As for transcriptomic studies, much larger sample sizes are needed to draw broader conclusions from the experimental reports. Furthermore, large language models (LLMs) [71, 72] have exhibited exceptional performance in natural language processing, and various general models similar to ChatGPT [71] have been developed for multi-omics tasks, such as scBERT [73], Geneformer [74], SIMBA [75], and scDesign3 [76], which address generic genomics tasks such as cell type annotation, network dynamics prediction, gene regulation, and single-cell simulation. There is also work on general tools based on the segment anything model (SAM) [77] for segmenting medical images [78] in clinical diagnosis. The latest research includes MI-Zero [79], developed by the Mahmood lab to realize general classification of pathological images, ClinicalGPT [80], developed by BUPT, and unified multimodal transformer-based models [81] for general medical diagnosis. However, there is currently no overarching LLM for pathogenomics, indicating great opportunities for future growth in this field.

Challenges and opportunities in clinical adoption

In summary, the application of artificial intelligence in pathogenomics has demonstrated exceptional capabilities in the prediction of gene mutations, cancer subtypes, stratification, and prognosis. This has significantly contributed to the theoretical foundations for precise diagnosis, treatment, and prognosis of tumors. However, the integration of quantitative measurements from multi-modality data for clinical prognostication remains a formidable challenge. This is largely due to the high dimensionality and heterogeneity of the data.

Multi-source, complex and unstandardized datasets in data availability

It is increasingly recognized that many cancer characteristics have an impact on the prognosis of cancer patients, including genomics, proteomics, clinical parameters, and invasive biomarkers of tumors. Perhaps the greatest challenge in multimodal machine learning is data scarcity in multi-source, complex and unstandardized datasets [12, 87].

The performance of any AI-based approach depends primarily on the quantity and quality of the input data. Data used to train an AI algorithm need to be clean, carefully collected and curated, have a maximal signal-to-noise ratio, and be as accurate and comprehensive as possible to achieve maximum predictive performance. When harnessing complex, unstandardized datasets from multicenter sources, data availability plays a crucial role in every downstream step. Stained tissue specimens are often manually located and scanned, with limited clinical annotations and tremendous storage requirements. AI algorithms [88,89,90] have been developed to standardize such data, including staining and color normalization techniques, and recent studies [91, 92] have also been devoted to establishing comprehensive quality control and standardization tools, providing useful insights for preprocessing heterogeneous, multicenter datasets.
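One simple, widely used member of this family is Reinhard-style color normalization, sketched below: the per-channel mean and standard deviation of a source tile are matched to a reference tile in LAB color space. This is a generic illustration, not the specific stain-deconvolution method of the cited works [88,89,90].

```python
import numpy as np
from skimage import color

def reinhard_normalize(source_rgb, reference_rgb):
    """Match LAB-space channel statistics of a source tile to a reference tile.

    Both inputs are uint8 RGB arrays of shape (H, W, 3); output is uint8 RGB.
    """
    src = color.rgb2lab(source_rgb)
    ref = color.rgb2lab(reference_rgb)
    src_mean, src_std = src.mean(axis=(0, 1)), src.std(axis=(0, 1)) + 1e-8
    ref_mean, ref_std = ref.mean(axis=(0, 1)), ref.std(axis=(0, 1))
    normalized = (src - src_mean) / src_std * ref_std + ref_mean
    rgb = color.lab2rgb(normalized)                 # float image in [0, 1]
    return (np.clip(rgb, 0, 1) * 255).astype(np.uint8)
```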

Developing and implementing multimodal fusion models requires access to matched pathology, genomic, and clinical data, and such cross-institutional data sharing is essential to promote and test model generalizability. However, most medical datasets are still too sparse to be useful for training advanced machine learning techniques, and overcoming these challenges is urgent. Leading platforms include the database of Genotypes and Phenotypes (dbGaP), the European Genome-phenome Archive (EGA), The Cancer Imaging Archive (TCIA), the Genomic Data Commons (GDC), and other resources in the National Cancer Institute (NCI) Cancer Research Data Commons. Beyond the matched genomic data and H&E WSIs of TCGA and the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC), public resources contain only small patient cohorts with multiple data modalities (e.g., IvyGAP [17]). Encouragingly, Genomics England and the UK National Pathology Imaging Co-operative (NPIC [3]) have announced a new initiative, the Genomics Pathology Imaging Collection (GPIC), combining digital pathology and genomic data to create a unique multi-omic resource for cancer research at scale. This collaboration builds on the rich data of the 100,000 Genomes Project, adding over 250,000 additional high-resolution WSIs alongside matched structured pathology reports, somatic and germline sequence data, radiology data, and longitudinal clinical data in the Genomics England Trusted Research Environment. As a pathomic dataset of world-leading scale and quality, GPIC will enable the next generation of AI for clinical cancer diagnostics, treatment, and prognosis, greatly alleviating the challenge of public data scarcity.

On the other hand, federated learning [93] is a potential solution to the logistical challenges of anonymizing data and institutional privacy policies, especially via decentralized dataset distillation in resource-constrained edge environments [94]. Depending on the choice of model, federated learning may require novel training methods, but it enables training on multi-institutional cohorts without data leaving the local network [95, 96].
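The canonical algorithm in this setting is federated averaging (FedAvg), sketched below: each site trains locally, and only model parameters, never patient data, are sent to a coordinating server for sample-size-weighted averaging. The sketch assumes PyTorch-style `state_dict` parameter dictionaries and omits the communication layer.

```python
import copy

def federated_average(site_state_dicts, site_sample_counts):
    """FedAvg-style aggregation: average model parameters across sites,
    weighted by local cohort size; only weights, never patient data, are shared.

    site_state_dicts:   list of model.state_dict() objects from participating sites
    site_sample_counts: number of training samples contributed by each site
    """
    total = float(sum(site_sample_counts))
    averaged = copy.deepcopy(site_state_dicts[0])
    for key in averaged:
        averaged[key] = sum(
            sd[key].float() * (n / total)
            for sd, n in zip(site_state_dicts, site_sample_counts)
        )
    return averaged  # server broadcasts this back; sites call model.load_state_dict(averaged)
```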

Lack of repeatability and reproducibility in clinical settings

Repeatability and benchmarking are among the major challenges for AI, and many published biomedical AI studies fail to provide source code, test data, or both. A key reason to validate independently with separate test sets is to ensure that these methods are resilient to pre-analytical sources of variation, including WSI preparation, scanner models, and protocols.

To foster transparency, scientific repeatability, and measurable progress, researchers should be encouraged to place new intermodal architectures and preprocessing solutions in standard repositories [97, 98] such as ModelHub.ai and github.com, or in commercial and open-source storage products such as Amazon S3 and Delta Lake.

Meanwhile, due to variability in model convolution kernels, overfitting or underfitting of the training data, and other factors, imaging biomarkers related to prognosis identified in research may not be reproducible. Multimodal machine learning (ML) is more prone to overfitting [99, 100] because, in most cases, the multimodal dataset is smaller while the multimodal model needs to fit more parameters. Traditional ML models enable investigators to calculate the dataset size required for a tolerable generalization error prior to the analytical workflow. In addition to center-specific confounders, the actual clinical setting has unpredictable effects on model performance, often resulting in substantial performance degradation.

Multimodal ML models should be used judiciously, for tasks with large statistical sample sizes and with strategies to combat overfitting, such as early stopping, data augmentation, gradient blending, weight decay, and hard and soft sharing of hidden-layer parameters. Investigators must be wary of spurious results due to institutional biases and small sample sizes, with cross-validation, retrospective external validation, prospective validation, and clinical trials serving as key measures to assess algorithm effectiveness.
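Two of the simplest safeguards listed above, weight decay and early stopping on a validation set, can be wired into an ordinary training loop as in the sketch below; the model, data loaders, learning rate, and patience value are placeholders.

```python
import torch

def train_with_early_stopping(model, train_loader, val_loader, loss_fn,
                              max_epochs=100, patience=10, weight_decay=1e-4):
    """Weight decay (L2 regularization) plus early stopping on validation loss."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=weight_decay)
    best_val, best_state, epochs_without_improvement = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader)

        if val_loss < best_val:
            best_val = val_loss
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop before the model overfits the small multimodal cohort

    model.load_state_dict(best_state)
    return model
```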

Interpretability of models for trustworthy AI

Understanding how abstracted features from different modalities affect a model's inference remains another significant problem; exploring the interpretability of ML models can make deep learning models, often treated as a “black box”, more trustworthy [101, 102].

Despite their high accuracy and ease of applicability, the lack of interpretability of deep neural networks, in contrast to the domain-inspired intuition behind handcrafted approaches, is a potential obstacle to their clinical application. Hand-crafted, feature-based AI approaches for histological images can provide better interpretability because they are usually developed in conjunction with domain experts and the features are predefined, either in a domain-agnostic [101, 103, 104] or domain-inspired [105] manner. However, creating bespoke hand-crafted features is often a challenging and laborious task because of the considerable time and domain knowledge that pathologists or oncologists must invest in developing them. This can critically impact the trustworthiness of model performance.

The difficulty of interpreting extracted features hinders the development of multimodal fusion studies to a certain degree. Computational fusion methods require consideration not only of the discriminative power of the extracted features for the task but also of their interpretability. Focused efforts to clarify the concept within medicine have shown that clinicians generally view interpretability as transparency in model reasoning, adjustable features, and limitations [106].

The field of multimodal biomedicine particularly stands to benefit from interpretable AI, both in terms of imaging patterns of clinical features and molecular expression profiles of disease. Toward this end, we summarize various interpretability techniques, organizing them according to two fundamental characteristics: ante-hoc explanatory methods, in which the target model incorporates an explanation module into its architecture so that it can explain its own predictions; and post-hoc explanatory methods, which aim to explain already trained and fixed target models. In this pathogenomics review, we focus mainly on feature-based explanations. Among feature-based post-hoc explanatory methods, LRP [107, 108], LIME [109], DeepLIFT [110], SHAP [111], Integrated Gradients [109], L2X [112], Anchors [113], Grad-CAM [114], and LS-Tree [115] are currently the most widespread; self-explanatory (ante-hoc) methods with feature-based explanations include RCNN [116], CAR [117], INVASE [118], and the influential class of attention models [119], which are also commonly applied to features such as image super-pixels. Figure 3 shows the classification of interpretability techniques and a structured representation of the varied categories of interpretability methods covered in this review.

Fig. 3

Classification of feature-based interpretability methods. Generally, they can be divided into ante-hoc explanatory methods and post-hoc explanatory methods
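As a concrete example of the post-hoc, feature-based attribution methods listed above, the following sketch implements plain Integrated Gradients for any differentiable model; it is a generic illustration (a Riemann approximation of the path integral from a baseline to the input), not the reference implementation of any cited technique.

```python
import torch

def integrated_gradients(model, x, target_class, baseline=None, steps=50):
    """Approximate Integrated Gradients attributions for a single input tensor x.

    model:        a differentiable torch.nn.Module returning class logits
    x:            input tensor of shape (1, ...) for one sample
    target_class: index of the output logit to explain
    """
    if baseline is None:
        baseline = torch.zeros_like(x)           # common choice: an all-zero baseline
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * (x.dim() - 1)))

    # Interpolate between baseline and input, then average the gradients along the path
    interpolated = baseline + alphas * (x - baseline)          # (steps, ...)
    interpolated.requires_grad_(True)
    logits = model(interpolated)[:, target_class].sum()
    grads = torch.autograd.grad(logits, interpolated)[0]
    avg_grads = grads.mean(dim=0, keepdim=True)
    return (x - baseline) * avg_grads            # attribution per input feature
```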

In the realm of ante-hoc interpretability, the primary advantages lie in intrinsic transparency and credibility. This approach integrates explicability into the design and training process of the model, enabling concurrent predictions and explanations. However, its limitations include potentially increased complexity in model design and the possibility that an excessive focus on interpretability might compromise performance. Balancing model accuracy with interpretability poses a significant challenge in ante-hoc interpretability research and represents a crucial direction for future advancements in explainable AI [120, 121]. Regarding post-hoc interpretability, its strengths are flexibility and wide applicability: the approach is adaptable to models of various types and complexities without requiring architectural modifications. A notable downside, however, is the potential for misconception; the evaluative challenge lies in devising methods that faithfully represent the decision-making model and mitigate inconsistencies between the explanations and the model's actual behavior, ensuring the reliability and safety of the interpretative outcomes [120, 121]. In summary, within pathogenomics research, ante-hoc explanations provide profound insights, especially for relatively straightforward models or those requiring high levels of explicability from inception, particularly in multimodal fusion analyses. Conversely, post-hoc interpretations provide flexibility and practicality for pathogenomics correlation and fusion models, especially complex, already-developed ones.

We believe that researchers should comprehend learning models from biological and clinical perspectives to facilitate reasonable multimodal implementation. Understanding a model is as crucial as enhancing its predictive power and can lead to greater mechanistic insight and testable hypotheses.

Conclusions

We summarize the current state of pathogenomics, combining complementary data modalities with emerging multimodal AI-driven approaches for better comprehension of diagnostic, prognostic, and predictive decision-making in oncology. One future direction of pathogenomics is the exploration of more “omics” techniques (transcriptomics, proteomics, metabolomics, etc.) combined with functional imaging data (such as perfusion and diffusion imaging and spectroscopy), opening up new avenues for multidimensional pathogenomics via LLM techniques. In conclusion, with further in-depth research, pathogenomics will play an increasingly active role in the medical field, especially in cancer research. By taking advantage of complementary information in an intuitive manner, it is likely to revolutionize the diagnosis, treatment, and prognosis of cancer patients, ultimately opening novel perspectives for precision oncology and empowering healthy and productive lives in the coming decade.