1 Introduction

In 1957, Francis Crick proposed the central dogma of molecular biology, which describes the flow of genetic information in a living organism as a pathway (Fig. 1) from DNA (deoxyribonucleic acid) to RNA (ribonucleic acid) and from RNA to protein, the functional product of the genetic code [12]. DNA consists of double-helical strands containing four basic units called nucleotides: adenine (A), thymine (T), cytosine (C), and guanine (G). The two strands of DNA are linked by chemical bonds between bases: A pairs with T, and C with G. The DNA base sequence contains all the biological information needed to produce protein products. In living cells, DNA is organized into chromosomes, which contain segments of DNA called genes that encode proteins (see Fig. 2). The sum of all genes or genetic material that an organism possesses is known as its genome [15]. The field of life science that studies the genome or genomic sequences of organisms is called genomics. The human genome comprises approximately 3 billion DNA base pairs, and the field of human genomics aims to link the genome with molecular and physical characteristics [2]. It is a data-driven science built on high-throughput next-generation sequencing (NGS) technologies that generate data on the whole genome of an organism. These sequencing techniques include whole exome sequencing (WES), whole genome sequencing (WGS), as well as transcriptomic, chromatin, and epigenetic profiling.
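As a toy illustration of the base-pairing and transcription rules just described (the sequence is arbitrary and chosen only for the example), the following Python sketch derives a complementary DNA strand and an mRNA transcript:

```python
# Toy illustration of base pairing (A-T, C-G) and transcription (T -> U).
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement_strand(seq):
    """Return the complementary DNA strand (A<->T, C<->G)."""
    return "".join(DNA_COMPLEMENT[base] for base in seq)

def transcribe(coding_strand):
    """Transcribe a DNA coding strand into mRNA (T is replaced by U)."""
    return coding_strand.replace("T", "U")

print(complement_strand("ATGCGT"))  # TACGCA
print(transcribe("ATGCGT"))         # AUGCGU
```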

Fig. 1

The Central Dogma of Molecular Biology: genetic information is transferred from DNA to RNA in the process of transcription. RNA is then translated into the final protein product, which can have a variety of functions

Fig. 2

Nucleotide bases form chemical bonds to create the double-helical structure of DNA. Genetic material is made up of DNA that is tightly packed into chromosomes. Only certain regions of DNA contain genes that code for proteins

In 2001, the Human Genome Project (HGP) published a reference covering most of the human genome, a landmark scientific development in the field of genomics. Recent technological advances in long-read sequencing have since improved the human genome reference by resolving the remaining 8% of the genome [30]. Sequencing whole genomes has allowed a better understanding of genetic variation among organisms, and even among the different cells, tissues, and disease states of a single organism. The findings of the HGP suggested that all humans are 99.9% genetically identical, and that the remaining 0.1% of variation in the human genome accounts for phenotypic differences such as disease susceptibility, drug response, and physical traits (hair colour, eye colour, height, intelligence, etc.) [14]. One major aim of genomics is to identify the underlying changes or mutations in DNA sequences that alter cellular processes and cause disease states, usually through genome-wide association studies (GWAS) (see Fig. 3). It is worth noting that not all mutations are disease-causing; for example, not all single nucleotide polymorphisms (SNPs, single base-pair changes) or indels (insertions or deletions of small pieces of DNA) alter the protein-coding DNA sequence (synonymous mutations) or the expression of genes [18]. If we can identify which variation is linked to a specific disease, we will be able to design better treatments, drugs, or even cures. McGuire et al. [26] reported that investigating genetic variation could improve our understanding of why certain people respond differently to the same medications. This is the basis of personalized medicine, in which pharmacogenomics is used to develop and prescribe treatments tailored to an individual's genetic makeup.

Fig. 3

A mutation has been detected in the gene sequence that is responsible for the disease state. For complete analysis, a genome-wide association study is required

Cancer is one of the most prevalent chronic diseases caused by genome alterations, including base substitutions, deletions, rearrangements, and amplifications. The mechanisms of sequence alteration vary between different cancer types.

Furthermore, genomics plays an important role in managing and understanding infectious diseases at both the population and individual levels [24]. In particular, it helps researchers identify and track the emergence of drug resistance in pathogenic organisms. For example, during the COVID-19 pandemic, genomics helped scientists track transmission of the virus and understand how the strain was evolving, aiding the development of effective vaccines. Genomics is also enabling more targeted tests, such as those for rare disorders, tumour genome analysis, and non-invasive prenatal screening. Genomics has likewise revolutionized the field of agriculture by helping scientists understand the genetic makeup of livestock and crops, allowing them to develop genetically modified organisms (GMOs) that are pest-resistant, tolerate harsh environmental conditions, and increase yields. This is needed to meet the challenges of a growing world population and to ensure food security [32]. Biodiversity can also be understood by comparing the genomes of various species and examining the underlying principles of the evolutionary history of organisms and their adaptation to different environmental conditions.

2 Cancer genomics

Cancer develops because of alterations or mutations in the DNA sequence of genes that regulate cell survival, division, or other hallmarks of the transformed phenotype, resulting in uncontrollable cell growth and the spread of abnormal cells. These cells can acquire genetic mutations that disrupt normal cell growth mechanisms and lead to the formation of tumours. For several years, researchers have been trying to understand the biological basis of the variable clinical outcomes seen across cancers. Genomics can provide insights into the underlying principles of this heterogeneous and complex disease. Genomic studies help researchers identify the mutated genes that drive cancer, including oncogenes and tumour suppressors; a common example is the TP53 gene, which is mutated in many different cancers [42]. NGS technologies can provide whole genome profiling of a cancer patient, which helps in identifying and understanding clinically relevant genetic variations that can be targeted for potential therapies. In 1998, trastuzumab became the first molecularly targeted drug introduced on the basis of comprehensive genomic profiling (CGP), for patients with ERBB2-overexpressing breast cancer. Since then, several novel targeted therapies have been discovered, including BRAF inhibitors in melanoma, BCR-ABL inhibitors in chronic myeloid leukaemia, and epidermal growth factor receptor (EGFR) tyrosine kinase inhibitors in non-small cell lung cancer, which have provided robust therapeutic responses [29].

Most large-scale genome projects, such as the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), have mainly focused on characterizing cancer genomes through the genetic variations that occur in the coding part of the genome, and have identified numerous novel mutations. However, only 2% of the human genome is coding; the remaining 98% is non-coding, and there is very limited information on how variations in the non-coding part of the genome can affect the development of cancer [42]. Elliott and Larsson [16] reported that mutations in the non-coding part of the genome are abundant, but their effects are so far poorly understood. Recent studies have shown that variations in the non-coding regulatory regions of the human genome are highly associated with disease conditions. Identifying gene regulatory regions and their mechanisms is important not only for understanding the function of the genome, but also for widening our understanding of disease causation and providing a better overview of the disease state. For example, a mutation in the regulatory region of the RB1 gene has been found to be a major source of the brain cancer glioblastoma [4]. Genes have multiple non-coding regulatory regions, including promoters, enhancers, and silencers. This paper focuses on enhancer regions and on why identifying and understanding the mechanisms of enhancers is important in cancer.

3 Enhancers and cancer

Enhancers are non-coding regulatory elements responsible for controlling the transcription of one or more genes (Fig. 4). The characterisation and identification of these regions is important in the context of human disease, to determine disease causation and to aid in the development of novel drugs. Enhancers can activate the expression of proximal or distal genes; the latter function involves chromosome looping and topologically associating domains (TADs) [10]. Enhancers are cell-type specific and are located in different regions of the genome [6]. Recent studies have shown that enhancers are associated with several epigenetic markers, including histone modification signals such as H3K4me1/2/3 [11], H3K27ac [13], and H3K9ac [21], cofactors (e.g. cohesin and the Mediator complex), and chromatin-modifying molecules (e.g. p300). Such histone modification data can provide significant evidence for predicting active enhancers. In general, there are two types of enhancers: (1) signal-dependent or inducible enhancers and (2) cell-type-specific enhancers; the latter make up the majority of all enhancers present in the genome.

Fig. 4

Enhancers are non-coding regulatory elements of DNA. They can make 3D contact with other regulatory elements of DNA and are bound by transcription factor proteins to control gene expression [31]

As all human cell types possess the same genome, cell-type-specific enhancers are an important factor in determining cell-type-specific gene expression programmes. The mammalian genome contains millions of enhancers, but only a small number are active in each cell type, and the activity of each enhancer is specific to its target gene [22]. The term super-enhancer (Fig. 5) is used to describe clusters of active enhancers present in high abundance in a specific genomic region [48]. These enhancers mainly regulate genes that are important for determining cell identity. Enhancers therefore provide the basis of cell identity, and mutations in enhancer regions can cause abnormal cell growth and several diseases. Fluctuations in DNA methylation are also a cause of cancer development and can directly affect the activity of enhancers [40]. Furthermore, mutations in the regulatory regions of oncogenes can have a major impact in causing brain tumours [4]. In glioblastoma, EGFR amplification is linked with a remodelled enhancer landscape shaped by the transcription factors FOXG1 and SOX9. H3K27ac is a histone modification that plays a key role in the epigenetic regulation of gene transcription, enhancer activity, and chromatin structure, and this EGFR-amplified signature is highly sensitive to small molecules that disrupt H3K27ac and the associated oncogenic gene expression program, resulting in the activation of repetitive element expression, including an endogenous retroviral element [37]. Furthermore, SMARCB1, a core subunit responsible for maintaining the SWI/SNF chromatin remodelling complexes, is lost in some cancers, which may disrupt the enhancer-mediated regulation of genes necessary for cell differentiation [].

Fig. 5

Schematic representation of typical enhancers and super-enhancers

Thandapani [43] reported that the H3K4me1 histone modification at enhancers is catalyzed by MLL3/MLL4. In various cancer types, MLL3 and MLL4 are mutated, which can reduce the amount of H3K4me1 on enhancers and prevent binding of the Mediator complex to those enhancers. In lung cancers, MLL4 loss impaired the super-enhancer of the tumour suppressor gene PER2. Mutations in MLL3 and MLL4 can also lead to therapeutic resistance and dysregulation of enhancers in various cancers. Therefore, a deep understanding of enhancer patterns can help reveal novel mechanisms of oncogene activation in cancers.

4 Challenges for enhancer prediction

Enhancers are regions of DNA responsible for regulating the transcription of one or more genes [46]. The position of enhancers relative to their target gene is variable; they can occur downstream, upstream, or within the introns of a gene. Enhancers may make 3D contact with promoters to regulate distal genes [23]. Furthermore, there is no specific motif or code for enhancers, and they may only be active under specific temporal, environmental, and spatial conditions [3]. These characteristics make enhancers difficult to identify and annotate. Experimental approaches for identifying enhancers fail to provide a complete list of active enhancers and do not help researchers understand why certain DNA regions act as enhancers while others do not [5].

5 Explainable Artificial Intelligence (XAI)

Explainable Artificial Intelligence (XAI) is an emerging and necessary field of artificial intelligence, particularly in healthcare. XAI is designed to enhance human trust in artificial intelligence models by explaining how a specific model reaches its outputs and what those outputs mean, allowing a better understanding of the problem at hand. In doing so, it helps users improve model performance and broadens the range of application domains. Hagras [19] describes the main features of XAI: (1) Transparency: people have a right to know how a decision that affects their lives was made, and the explanation should be in a human-understandable format and language. (2) Causality: does the model provide a complete explanation of the underlying phenomena while drawing correct inferences from the data? (3) Bias: AI models are trained on datasets that come from the real world, so how can we be sure that the models have not also incorporated real-world biases? (4) Fairness: can we make sure that the decisions made by AI systems are fair? And (5) Safety: without an in-depth understanding of the data, how can we rely on AI models? A complete XAI system should incorporate all of these features to provide full transparency to the user.

Hagras [19] also describes the three main existing approaches to creating an XAI system. The first is deep explanation, which modifies deep learning techniques so that they learn explainable structures; examples include DeepLIFT [39] and layer-wise relevance propagation [7]. The second approach is interpretable models: interpreting causal models or learned structures, which can be applied to graphical models such as the Hidden Markov Model (HMM) and to statistical models such as naïve Bayes, logistic regression, or random forests. However, the output of these techniques is only understandable by experts, not by laypeople. The last approach is model induction, which infers an interpretable model from any opaque-box model. Hagras [19] also argues that the best way to provide explainability to users is to give them IF–THEN rules with linguistic labels that explain the model output.

Fuzzy logic systems (FLSs) are an AI technique that provides exactly such IF–THEN rules and linguistic labels; the model architecture is shown in Fig. 6. An FLS has four main components: (1) the fuzzifier, which converts crisp inputs into fuzzy sets; (2) the inference engine, which maps input fuzzy sets onto output fuzzy sets using the rules fired for the respective inputs; (3) the rule base, which contains the membership functions and the rules, stored as IF–THEN conditions, that govern the decision-making process; and (4) defuzzification, which transforms the fuzzy set outputs back into crisp outputs. An FLS directly converts real-number measurements into linguistic labels, which may take forms such as good/fair/bad or high/medium/low, or various combinations of descriptive variables. These linguistic labels are then used to define the IF–THEN rule base, describing the situation in a form that is explainable and understandable. An example of a fuzzy rule is "IF the tumour size is large AND the homogeneity between the cells is high THEN the patient has a malignant tumour"; here the linguistic labels are large, high, and malignant. It is very simple for any individual, regardless of expertise, to understand what is being measured and what the output will be.

There are two main types of fuzzy logic systems: Type-1 FLS and Type-2 FLS. The main distinction is that a Type-1 FLS cannot directly handle uncertainties because of the crisp nature of its membership functions. A Type-1 FLS takes real-number inputs (crisp inputs, in fuzzy logic terminology) and fuzzifies these values into fuzzy sets in the fuzzifier block. After fuzzification, the input fuzzy sets are mapped onto output fuzzy sets using the fuzzy rules fired in the inference block. Figure 7 shows the Type-1 fuzzy logic membership functions for the decision-making process of early breast cancer detection. Sizilio et al. [40] used two input features, (1) tumour area (range 185–4255) and (2) homogeneity (range 0.01–0.45); the output is either benign (non-cancerous tumour, range 0–0.5), undefined (range 0.5–0.6), or malignant (cancerous tumour, range 0.6–1). In a Type-2 fuzzy logic system, instead of crisp membership functions, the fuzzy set includes an additional representation layer in the form of a footprint of uncertainty (FOU) around the membership functions, which provides an additional degree of freedom to handle uncertainties [1, 34]. The structure of a Type-2 fuzzy inference system is shown in Fig. 8.
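To make the Type-1 pipeline concrete (fuzzify, fire IF–THEN rules, aggregate, defuzzify), here is a minimal Python sketch of the breast-cancer example. Only the input and output ranges follow the text above; the trapezoidal membership breakpoints and the two rules are illustrative assumptions, not the exact values used by Sizilio et al.:

```python
# A minimal Type-1 Mamdani-style fuzzy classifier for the breast-cancer
# example above. Input/output ranges follow the text; the trapezoid
# breakpoints and the two rules are illustrative assumptions only.

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises over [a, b], flat over [b, c], falls over [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Fuzzifier: linguistic labels over the crisp input ranges (assumed shapes).
area_small = lambda x: trapezoid(x, 184, 185, 1000, 2200)     # area range 185-4255
area_large = lambda x: trapezoid(x, 1000, 2200, 4255, 4256)
homog_low  = lambda x: trapezoid(x, 0.009, 0.01, 0.15, 0.30)  # homogeneity 0.01-0.45
homog_high = lambda x: trapezoid(x, 0.15, 0.30, 0.45, 0.46)

# Output labels on [0, 1]: benign 0-0.5, undefined 0.5-0.6, malignant 0.6-1.
out_benign    = lambda y: trapezoid(y, -0.01, 0.0, 0.4, 0.5)
out_malignant = lambda y: trapezoid(y, 0.5, 0.6, 1.0, 1.01)

def classify(area, homogeneity):
    """Fuzzify -> fire IF-THEN rules -> aggregate -> defuzzify (centroid)."""
    # Rule firing strengths, with min() as the fuzzy AND.
    fire_malignant = min(area_large(area), homog_high(homogeneity))  # IF large AND high THEN malignant
    fire_benign    = min(area_small(area), homog_low(homogeneity))   # IF small AND low THEN benign

    # Aggregate the clipped output sets and take the centroid as the crisp score.
    ys = [i / 1000 for i in range(1001)]
    agg = [max(min(fire_malignant, out_malignant(y)),
               min(fire_benign, out_benign(y))) for y in ys]
    total = sum(agg)
    score = sum(y * m for y, m in zip(ys, agg)) / total if total else 0.55
    label = "benign" if score < 0.5 else "undefined" if score < 0.6 else "malignant"
    return score, label

print(classify(3500, 0.40))  # large area, high homogeneity -> malignant
print(classify(400, 0.05))   # small area, low homogeneity  -> benign
```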

Fig. 6

Schematic architecture of a Fuzzy Logic System (FLS)

Fig. 7

Representation of the tumour area (smaller and larger) and tumour homogeneity (more and less) input membership functions, and the benign, malignant, and undefined output membership functions [36]

Fig. 8

An overview of the Type-2 Fuzzy Logic System Operation

6 Opaque box AI vs. explainable rule-based models

It is challenging to understand the predictions of an opaque-box model (e.g., deep learning), where cumulative model complexity is what allows these models to achieve high prediction accuracy. Interpretable models, by contrast, can provide a better understanding of how predictions have been made. To achieve transparency, the concept of explainable AI has been proposed: the whole process of the model, i.e. its methods, procedures, and output, should be explained in a way that is understandable by any human [28, 35]. A comparison of the opaque-box model and the XAI model is shown in Fig. 9.

Fig. 9

Comparison between opaque box models and Explainable artificial intelligence

A rule-based explainable AI (XAI) model is a classification algorithm, here based on Type-2 fuzzy logic, that generates natural-language IF–THEN rules and then integrates and tests those rules for accuracy and validity. Such a model helps the user understand which rules the classifier used in making a prediction. Rule-based XAI is a class of artificial intelligence that exposes its rules and gives insight into how an AI-based system makes predictions and decisions. XAI can explore the reasoning behind the decision-making process and provide details on how the system will behave in the future, along with its advantages and drawbacks [19]. XAI thus allows researchers to understand the basis of the predicted results. Opaque-box models such as neural networks, random forests, and deep learning always raise questions like "How does the system predict the result?", "How does the model work?", "Are the results correct?", "How do we overcome the errors?", and "Is the result trustworthy?". XAI systems can resolve this confusion and provide clear, transparent predictions with explainable rules [44].
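As a minimal sketch of this traceability, reusing the tumour example from Sect. 5, a rule-based classifier can return the human-readable rule that produced each prediction. For brevity the rules here are crisp rather than fuzzy, and the thresholds are illustrative, not taken from any real model:

```python
# A sketch of rule traceability: the classifier returns not just a label but
# also the human-readable rule that fired. Rules/thresholds are illustrative.
RULES = [
    ("IF tumour area is large AND homogeneity is high THEN malignant",
     lambda f: f["area"] > 2200 and f["homogeneity"] > 0.30, "malignant"),
    ("IF tumour area is small AND homogeneity is low THEN benign",
     lambda f: f["area"] < 1000 and f["homogeneity"] < 0.15, "benign"),
]

def predict(features):
    """Return (label, rule) so every prediction can be traced to its rule."""
    for rule_text, condition, label in RULES:
        if condition(features):
            return label, rule_text
    return "undefined", "no rule fired (default)"

label, why = predict({"area": 3500, "homogeneity": 0.40})
print(label)  # malignant
print(why)    # IF tumour area is large AND homogeneity is high THEN malignant
```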

7 Computational methods used for the prediction of cancer

Cancer is a multifaceted and complex disease that continues to be a major healthcare challenge worldwide. Accurate prediction and early detection of cancer are critical for reducing the burden of this disease. In recent years, advances in machine learning and artificial intelligence have shown promising results in the early detection, diagnosis, and treatment of various cancers. Ström [41] reported that AI, in combination with cancer screening methods including biopsy examination, can increase the success rate of breast cancer treatment. Computational radiology uses AI techniques such as computer vision, pattern recognition, and lesion detection to classify lesions according to the Breast Imaging Reporting and Data System (BIRADS) and to support systematic diagnostic reporting. Mavaddat et al. [25] reported a genetic variant model that calculates a polygenic risk score to estimate a patient's breast cancer risk. Bakas et al. [8] proposed a deep convolutional neural network model that takes magnetic resonance imaging (MRI) data as input and generates rapid 3D segmentations of glioblastoma; however, the MRI data failed to produce accurate results. Zhou et al. [50] proposed a support vector machine risk model for predicting ovarian cancer that uses both clinical and genetic data. Mehrotra et al. [27] reported a deep learning-based model, trained on an MRI dataset, that classifies brain tumours as malignant or benign; the model achieved an accuracy of 99.04%. Wankhede and Selvarani [47] published MLL-CNN (a multilevel layer R-CNN model), based on a relative description model and a feature-weight-factor-based feature selection strategy, for the classification of brain cancer; trained on MRI images to predict glioblastoma, the model achieved an accuracy of 89%, a specificity of 97%, and a sensitivity of 98%. Toumazis et al. [45] used a Bayesian model to detect lung cancer based on risk factors such as smoking exposure, genetics, and age.
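As an aside on the polygenic risk score mentioned above, such a score is typically computed as a weighted sum of risk-allele counts. The following hypothetical sketch illustrates the arithmetic; the variant IDs and effect weights are made up and are not from Mavaddat et al.:

```python
# Hypothetical polygenic risk score: a weighted sum of risk-allele counts.
# The variant IDs and log-odds weights below are made up for illustration.
EFFECTS = {"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.30}

def polygenic_risk_score(genotype):
    """genotype maps variant ID -> risk-allele count (0, 1, or 2)."""
    return sum(weight * genotype.get(variant, 0)
               for variant, weight in EFFECTS.items())

print(polygenic_risk_score({"rs0001": 2, "rs0002": 1, "rs0003": 0}))  # ~0.19
```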

However, while machine learning and deep learning models achieve high accuracy, these techniques cannot explain how a particular result was classified by a specific model. To address this issue, Gaur et al. [17] proposed a model that uses explainable AI techniques for the prediction of brain tumours. XAI techniques allow the model to make decisions based on explicit rules, which ultimately helps researchers and scientists trace the results. The study used MRI image data for prediction and achieved an accuracy of 94.64%. However, gaps remain, and there is a need to develop molecular-level feature sets to identify the real causes of diseases and predict them accurately.

8 Enhancer prediction methods

Numerous experimental techniques are used for the identification of enhancers. The first is mapping transcription factor binding sites onto the genome using ChIP-seq data [38]. The second is the use of epigenetic markers (i.e., H3K27ac and H3K4me1) to identify active enhancers. The third is the identification of binding sites of the histone acetyltransferase EP300, a transcriptional coactivator that acetylates nucleosomes and is recruited by other transcription factors (Lee & Young [23]); in this approach, histone modification data are used to differentiate active from non-active enhancers [13]. Another approach for genome-wide identification of enhancers is STARR-seq (self-transcribing active regulatory region sequencing), a massively parallel reporter assay that identifies enhancers based on genome-wide activity and provides a quantitative measure of each region's ability to act as an enhancer [5].

Computational tools, specifically machine learning tools, are taking the lead in genome-wide enhancer identification [36]. These tools use histone modification and high-throughput sequencing assay data as training sets and predict enhancers in genomes based on the extracted features. Machine learning methods suffer from biases and tend to predict promoters as enhancers [20]; promoters are regions upstream of the transcription start site that define where RNA polymerase begins gene transcription [24]. Machine learning methods such as neural networks achieve high enhancer prediction accuracy; however, they fail to explain the rules and insights through which the algorithm makes its predictions [9]. Additionally, neural networks require large amounts of training data [33].
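The supervised setup described here can be sketched as follows, using synthetic data in place of real signals. Everything in the sketch is an assumption for illustration: real pipelines would derive the feature matrix from ChIP-seq signal over genomic bins and the labels from assays such as STARR-seq, rather than from a toy rule:

```python
# A sketch of the supervised setup described above: histone-mark signals as
# features, enhancer labels for training. All data here are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_bins = 2000
# Columns: mean signal per genomic bin for H3K27ac, H3K4me1, H3K4me3 (assumed).
X = rng.random((n_bins, 3))
# Toy labelling rule just to make the sketch runnable: active enhancers tend
# to show high H3K27ac and H3K4me1 but low (promoter-like) H3K4me3 signal.
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5) & (X[:, 2] < 0.5)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test),
                            target_names=["non-enhancer", "enhancer"]))
```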

Accurate prediction of enhancers is necessary to understand the role of non-coding regulatory genome regions in the context of disease. Wolfe et al. [49] developed a rule-based explainable AI (XAI) model for the identification of enhancers in Drosophila melanogaster cell lines. The model was trained on histone modification ChIP-seq data and STARR-seq data. To evaluate performance, the XAI model was compared with traditional machine learning models for enhancer prediction and annotation, trained on the same histone modification data; the explainable model accurately predicts enhancer locations and generalises to other cell lines without adjustment. The project was based on the following aims: (1) training the XAI model on histone modification ChIP-seq data; (2) defining, interpreting, and implementing the rules for the XAI model's predictions; and (3) using the model to predict changes in enhancers in other developmental, physiological, or disease contexts. A comparison of the opaque box model and the XAI model used for enhancer prediction is shown in Fig. 10.

Fig. 10

Comparison of the opaque box model and the XAI model. Both models are trained on genomic data; the opaque models give non-traceable predictions, whereas the XAI model provides predictions along with an IF–THEN rule base that is understandable to a layman

9 Why explainability is needed

The scientific community is working with a large amount of genomic data, and the focus has shifted to understanding it fully and making it usable for the healthcare sector. Alterations in the genomic regions of living organisms can cause numerous diseases. Multiple machine learning tools based on neural networks, deep learning, and random forests have been developed and have achieved high accuracy and efficiency, but these tools lack explainability in their prediction results. The genomics community therefore needs explainable, generalizable models that help researchers understand predictions and replicate them clinically, speeding up the traditional experimental route to new drugs, personalised therapies, and proposed treatments and cures for diseases. Models that guarantee explainability and transparency in their predictions, in a form understandable to a layman, can pave the way to rapid predictions that improve disease outcomes, as in cancer, through personalized medicine.