Leveraging big data with deep learning in RNA biology

Over the last decade, deep learning (DL) has proven to be a versatile tool in biology, aiding in multiple breakthroughs in structural biology, genomics, and transcriptomics1. The power of DL lies in its unique ability to harness the potential of big data2. Recently, big data have been rapidly accumulating in multiple domains of biology3. In particular, high-throughput experiments based on RNA sequencing (RNA-seq) have led to the generation of massive amounts of RNA biology (RB) data4. Analyzing these big biological data with DL has led to novel scientific discoveries about RNA and related biological processes. Therefore, it would be beneficial to review the current progress of DL in RB, focusing on the role of big data.

DL models are multilayered artificial neural networks that learn to generate representations of input data. These models can perform downstream tasks such as regression, classification, and generation. They have higher degrees of freedom than do conventional machine learning algorithms and thus can effectively learn representations from high-dimensional data5. This property has allowed DL models to achieve groundbreaking success in various fields, including computer vision6, natural language processing7, and structural biology8. Constructing such effective DL models requires sufficiently large datasets. However, the availability of such datasets is often a major bottleneck. Fortunately, the amount of biological data has exploded due to the widespread use of high-throughput experiments in RB.

RB is an integrative field of biology in which biological processes involving diverse types of RNA are studied. The utilization of DL in this field has been driven by high-throughput experiments using RNA-seq. These experiments have led to the generation of large-scale biological data for systematically examining RNA-related phenomena. Among the representative examples are cross-linking and immunoprecipitation (CLIP) for protein binding analysis9,10, N6-methyladenosine-sequencing (m6A-seq) for RNA modification analysis11, DMS-seq for RNA structure analysis12, and single-cell RNA sequencing13.

Although DL is a promising approach for revealing the mechanisms underlying RNA-related biological processes, this approach is not without challenges. First, most of the popular DL architectures and algorithms have not been optimized for biological data and tasks. Second, obtaining a training dataset of sufficient size and quality is often difficult. Third, the difficulty of interpreting the predictions of DL models often impedes the conception of scientific hypotheses, which requires understanding the causal relationship between the input and the output14. Nevertheless, successful examples of employing DL in various domains of RB demonstrate the feasibility of acquiring biological insights from big biological data. This goal can be achieved by selecting and optimizing DL strategies and techniques suited to the characteristics of transcriptomic data.

This review provides an introductory guide to employing DL for novel discoveries in RB. First, we review widely used large public databases for RB, focusing on their utility in building DL datasets. Next, we describe how popular DL methods can be employed to exploit and complement the characteristics of RB datasets. We then introduce the methods for encoding various types of RB data into input features and popular deep neural network architectures suitable for processing these features. The primary goal of these sections is to provide a foundational understanding for designing and training DL models that can learn robust representations of big RB data. Subsequently, we review successful applications of DL in various domains of RB. Finally, we discuss the desiderata and open challenges in applying DL to RB.

Public sources of large-scale RNA biology data

Training a DL model for RB applications starts with obtaining a suitable training dataset. Multiple de facto standard datasets for training and benchmarking exist for conventional DL applications, such as computer vision and natural language processing. However, such datasets are scarce in RB. Therefore, researchers often have to construct new training datasets by collecting data from existing public databases. When building training datasets from public biological databases, filtering, labeling, and normalizing the experimental data using metadata are essential. Metadata are collections of sample- or experiment-associated information, often including experiment type, experimental group, sample type, organism, health status, and sequencing platform15,16. Most public biological databases provide metadata, but their format, stringency, and completeness vary widely. In this section, we review such public databases for RB, focusing on experimental data types and metadata (Table 1).

Table 1 Public databases for large-scale RB data.

GEO and SRA

The Gene Expression Omnibus (GEO) is a public repository of gene expression data, including RNA-seq and microarray data, managed by the NCBI17. As of January 2024, GEO contains expression data for more than 6.97 million biosamples, including more than 1.86 million RNA samples. One representative example of public data archived in the database is the National Institutes of Health (NIH) Roadmap Epigenomics Project, which provides 111 reference human epigenomes with gene expression profiles18. Since GEO does not store raw data from high-throughput sequencing experiments, the data are archived in the Sequence Read Archive (SRA). SRA was established to provide a public archive of high-throughput sequencing data in conjunction with GEO19. As of January 2024, SRA has hosted more than 10.9 million publicly available RNA experiments.

While GEO serves as a unique and irreplaceable data source for RB researchers, one of its limitations is its lenient control over metadata items and terms, resulting in the incoherence and fragmentation of metadata20,21. Aliases and missing metadata fields often hamper the automated filtering and labeling of experimental data. While there have been several efforts to mitigate this issue21,22, improving the integrity of GEO metadata remains a challenge. Moreover, ensuring the quality of the data and analysis pipeline is primarily the responsibility of the submitters, introducing a potential source of data inconsistencies and imperfections when compiling a training dataset from GEO. Therefore, while GEO is an unparalleled source of RB data, cautious filtering, validation, and normalization are required to assemble a training dataset from the database.

ENCODE

The Encyclopedia of DNA Elements (ENCODE) project, driven by the ENCODE consortium, which is organized and funded by the National Human Genome Research Institute (NHGRI), provides a wide range of functional genomics data23. In the third phase of ENCODE (ENCODE 3), the quality and quantity of public functional genomics data were improved by adding 5992 experiments, including 170 eCLIP experiments, 78 RBNS experiments, 155 RAMPAGE experiments, 198 miRNA/small RNA-seq experiments, and 340 total/poly(A) RNA-seq experiments24. Moreover, ENCODE contains tissue expression data (EN-TEx), which include gene expression profiles and data from 15 functional genomics assays across 30 human tissues25. One of the most prominent features of the database is that it enforces standardized experiment-specific quality control methods for both the data and the metadata26. The experimental data are audited, and audit flags are placed on a variety of potential imperfections in the data. Moreover, the use of controlled fields and terminologies for each metadata entry is enforced. ENCODE also provides unified analysis pipelines for multiple types of RNA experiments, improving the reproducibility of the analyses. Another notable feature of ENCODE is that it supports a powerful filtering functionality, allowing researchers to select data based on the assay type, target gene, target organism, cell line, sequencing platform, and many other features. Therefore, ENCODE is an essential source of quality-controlled data for constructing training datasets for functional genomics DL models.

ArrayExpress & ENA

ArrayExpress is a public archive for functional genomic data managed by the European Bioinformatics Institute (EBI)27. This archive includes 46 types of functional genomics data, including 13,988 RNA-seq studies and 51,250 array studies. Submitted data are required to meet minimal metadata requirements and are manually curated by bioinformaticians, but the database does not provide detailed audit information. The raw data from high-throughput sequencing experiments in ArrayExpress are archived in the European Nucleotide Archive (ENA). Submissions to the ENA must meet quality standards, and the ENA manages the metadata using controlled fields and vocabulary to ensure that the minimal standards are met28.

FANTOM

The Functional Annotation of the Mammalian Genome (FANTOM) consortium aims to improve the understanding of the gene regulation network29. The FANTOM database provides atlases of gene regulatory elements and noncoding RNAs30,31. The database primarily hosts RNA-seq and cap analysis of gene expression (CAGE) data. CAGE accurately maps transcription start sites (TSSs) by pulling down 5’ caps32. The current phase of the project, FANTOM6, is focused on characterizing the global regulatory effect of human lncRNAs33.

GTEx, TCGA, and ICGC

Genotype–Tissue Expression (GTEx) is a project initiated by the NIH to provide RNA-seq data from various human tissues paired with genomic sequences and to map tissue-specific and global expression quantitative trait loci (eQTLs)34. The completed GTEx v8 project encompasses 54 tissue types from adult humans, and the developmental GTEx (dGTEx) project, which is ongoing, includes tissue samples from infants and juveniles. The Cancer Genome Atlas (TCGA), managed by the National Cancer Institute (NCI), also provides RNA-seq data from various human tissues, both normal and cancerous35. Along with standard RNA-seq data, TCGA provides miRNA-seq data from more than 1800 samples. Therefore, TCGA is a potential source of training data for investigating the miRNA landscape and regulatory network. Statistical data, such as read counts and normalized expression levels, are publicly available in both the GTEx and TCGA databases. In contrast, raw sequence data are available only to researchers authorized through the NIH Database of Genotypes and Phenotypes (dbGaP). A collaboration between the TCGA consortium and the International Cancer Genome Consortium (ICGC) led to the Pan-Cancer Analysis of Whole Genomes (PCAWG) project. The sequencing effort included 2793 cancer patients across 20 primary sites, and RNA-seq data were obtained from 1222 patients36. These databases serve as important data sources when constructing training datasets for tasks involving differential gene expression regulation, eQTL, or cancer-associated gene expression regulation.

Characteristics of RNA biology data and related deep learning methods

It is ideal to have a perfectly curated and sufficiently large dataset to train a DL model. However, obtaining such a dataset is not always possible, and it is especially challenging in biology due to the relatively high cost of data production. Indeed, there have been multiple efforts to construct comprehensive biology databases, including GEO, SRA, and ENCODE, as reviewed in the previous section. However, such databases often suffer from limited metadata integrity20,21. Therefore, researchers often have to mitigate this issue when building training datasets for DL models in RB. A critical step in mitigating training dataset imperfection is to select an appropriate method for training DL models. Here, we review popular DL methods and describe their utility in leveraging RB data.

Supervised learning

In machine learning, the term paradigm refers to the classification of learning methods based on the input data and task37. Supervised learning utilizes labeled data to train a model to accurately predict the labels in the training set, whereas unsupervised learning trains a model to efficiently represent the input data without using labels. The combination of the two methods is termed semi-supervised learning, in which a model is simultaneously trained on large unlabeled datasets and small labeled datasets38. Supervised learning is the dominant choice among these paradigms since it can be directly applied to various downstream prediction tasks. Many predictive applications of DL in RB have also adopted supervised learning, e.g., miRNA target prediction, gene expression prediction, RBP binding prediction, ncRNA biogenesis prediction, and RNA modification prediction.

Supervised learning provides a straightforward framework for DL, but it is not without limitations. Supervised learning usually requires a sufficient amount of reliable labeled data, which is not always available, especially in RB39. Specifically, uncontrolled metadata in public RB databases impede the construction of labeled training datasets. Moreover, supervised learning can easily let DL models pick up biases in data labels, often introduced by experimental artifacts or data analysis pipelines. Even after standard filtering procedures, biological noise in experimentally generated data labels can complicate the training of DL models using supervised learning. Therefore, training a model to learn biological knowledge with supervised learning can occasionally be challenging. In such cases, exploring other DL paradigms can be beneficial.

Self-supervised learning

Labeling data for supervised learning is often expensive and time-consuming, and the amount of labeled data is much smaller than the amount of unlabeled data in most domains. This problem can be overcome by leveraging unsupervised learning. However, unsupervised learning methods are applicable only to limited types of RB tasks, such as cell clustering and substructure detection40,41. To address this limitation, the self-supervised learning paradigm has emerged. In self-supervised learning, a model is trained on unlabeled data as in unsupervised learning. However, the model is trained with a supervised learning objective by generating labels from the unlabeled data itself42,43. This paradigm has gained momentum in natural language processing applications through the success of large language models7,43.

To employ self-supervised training in practice, a model is pre-trained on a large unlabeled dataset in a self-supervised manner, learning generalizable knowledge such as language structure. Then, the pre-trained model is fine-tuned to perform a specific downstream task using a smaller labeled dataset in a supervised manner. For example, scBERT was pre-trained on unannotated single-cell transcriptome data to predict the expression of randomly masked-out genes. Then, the model was fine-tuned to annotate and discover cell types44. Several other studies have applied a self-supervision framework for RB tasks, including disease modeling45, small molecule–miRNA association prediction46, and RBP prediction47,48. The success of these examples demonstrates the feasibility of learning biological context using massive amounts of unlabeled data. Therefore, self-supervised learning offers promising potential for deciphering the complex context of the transcriptome.
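
For illustration, the following minimal sketch shows the masked-prediction idea in this setting: a random subset of gene expression values is hidden, and the model is trained to reconstruct them from the remaining genes. This is not the actual scBERT implementation; the layer sizes and masking rate are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class MaskedExpressionModel(nn.Module):
    """Toy encoder-decoder that reconstructs masked gene expression values."""
    def __init__(self, n_genes=2000, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_genes)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def pretrain_step(model, optimizer, expr, mask_rate=0.15):
    """One self-supervised step: mask random genes and predict their values."""
    mask = torch.rand_like(expr) < mask_rate      # genes to hide in this batch
    corrupted = expr.masked_fill(mask, 0.0)       # zero out the masked genes
    pred = model(corrupted)
    loss = ((pred - expr)[mask] ** 2).mean()      # loss computed only on masked genes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = MaskedExpressionModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.rand(32, 2000)                      # placeholder expression matrix
print(pretrain_step(model, opt, batch))
```

After such pretraining, the encoder could be reused and fine-tuned on a smaller labeled dataset for a downstream task such as cell type annotation.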

Domain adaptation

Training DL models for RB is often obstructed by the scarcity of training data. This issue is not unique to RB49. Multiple DL methods have been proposed for resolving this issue. In some cases, the scarcity of training data in the desired domain can be overcome by leveraging data from other domains. Domain adaptation is a DL technique for such cases in which a model is trained to capture domain-invariant knowledge without using the target domain label50,51. In biology, the domain concept may correspond to different biological levels, organisms, cell lines, or batches of experiments. For example, scDEAL was initially trained with bulk-level cancer drug response prediction tasks using bulk RNA-seq data. Then, domain adaptation was employed to transfer the learned knowledge for single-cell-level drug response prediction tasks using scRNA-seq data52. Several other DL studies in RB have used domain adaptation methods to mitigate the scarcity of training data in the target domain, including isoform function prediction53, transcription factor (TF) binding site prediction54, and single-cell RNA-seq (scRNA-seq) data classification55. Therefore, domain adaptation can be useful in RB when generating training data is too costly, but related data in another domain are available.
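
One widely used way to encourage domain-invariant features, sketched below, is a gradient reversal layer in the style of domain-adversarial training: a domain classifier is trained on shared features, and the reversed gradient pushes the feature extractor to make the domains indistinguishable. This is a generic technique shown only for illustration and is not necessarily the mechanism used by the studies cited above.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient drives the feature extractor toward
        # representations that the domain classifier cannot separate.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch: shared features feed the task head directly and feed a
# domain classifier through the reversal layer.
features = torch.randn(8, 64, requires_grad=True)
domain_logits = torch.nn.Linear(64, 2)(grad_reverse(features))
```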

Meta-learning

Meta-learning, or ‘learning-to-learn’, is a collection of methods that allow a model to improve its ability to learn new tasks. In meta-learning, the model is trained on multiple tasks by iterating over each task to learn the knowledge that is generalizable to all the trained tasks as well as new tasks that it can encounter in the future. Meta-learning techniques are instrumental in few-shot tasks, where a model has to make predictions about classes that have only a few examples in the training set56,57. There have been efforts to employ meta-learning for RB since limited training data are a common issue. For example, MARS was trained on annotated and unannotated single-cell transcriptomes from multiple tissues. A meta-learning framework was employed to train a model that can annotate single-cell transcriptomes of novel tissues containing cell types that were not encountered during training58. Other successful examples of these methods include cancer survival prediction via gene expression59, ncRNA-encoded peptide (ncPEP) disease association prediction60, and lncRNA localization prediction61.

Data augmentation

Data augmentation is a popular technique used in data-limited settings, in which diverse transformations are applied to input data to generate additional synthetic examples62. Ideally, the transformations should not affect the task outcome. For images, the simplest forms of augmentation include cropping, rotation, flipping, resizing, recoloring, and blurring. However, such transformations are not always directly applicable to RB data. Instead, reverse complementing the nucleic acid sequence, shifting the sequence, or adding single-nucleotide insertions can be employed for RB data augmentation. Data augmentation has been used in various deep models for RB tasks, including coding potential prediction63 and gene expression prediction54.
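
The sketch below illustrates two of the sequence-level augmentations mentioned above. The function names are illustrative, and whether a transformation such as reverse complementation preserves the label depends on the task; for strand-specific RNA features it may be inappropriate.

```python
import random

COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def random_shift(seq: str, max_shift: int = 3) -> str:
    """Shift the sequence window by a few bases, padding with N to keep the length."""
    shift = random.randint(-max_shift, max_shift)
    if shift > 0:
        return "N" * shift + seq[:-shift]
    if shift < 0:
        return seq[-shift:] + "N" * (-shift)
    return seq

print(reverse_complement("AUGGC"))   # GCCAU
print(random_shift("AUGGCUAAG"))
```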

Ensemble

Ensemble learning is another technique that has been shown to improve the model performance when training data are limited. In ensemble learning, multiple DL models are combined via various methods, such as training multiple models with sampled data and voting, choosing the best model for each example, and training a model to combine the outputs of multiple models64,65. Multiple RB studies, including RBP motif prediction66, lncRNA identification63,67, splicing variant prediction68, and RNA modification detection via nanopore sequencing69, have leveraged ensemble techniques.
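
As a minimal illustration of one such method, the sketch below averages the predicted probabilities of several independently trained models (soft voting); the probabilities shown are placeholders.

```python
import numpy as np

def soft_vote(prob_list):
    """Average class probabilities from several models (soft voting)."""
    return np.mean(np.stack(prob_list, axis=0), axis=0)

# Probabilities from three hypothetical binding-site classifiers
# evaluated on four candidate sites.
model_probs = [
    np.array([0.9, 0.2, 0.6, 0.1]),
    np.array([0.8, 0.3, 0.4, 0.2]),
    np.array([0.7, 0.1, 0.7, 0.3]),
]
ensemble_probs = soft_vote(model_probs)
print(ensemble_probs)                 # averaged probability per site
predictions = ensemble_probs > 0.5    # final ensemble call
```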

In summary, while supervised learning is a dominant paradigm of DL in RB, its utility is often limited by the availability of experimentally generated labels. To overcome this limitation, various methods, including self-supervised learning, domain adaptation, meta-learning, data augmentation, and ensemble learning, can be utilized. The potential of these methods can be leveraged for RB studies by selecting an appropriate method for a given task after analyzing the characteristics and imperfections of the training data.

Encoding RNA biology data into deep learning input features

DL models require numeric multidimensional arrays, termed tensors, as inputs to extract task-relevant features through matrix multiplication and nonlinear operations. However, many biological data are not provided in such forms. Thus, to utilize DL models in RB, it is crucial to transform these biological data into tensors through a process called encoding. Choosing a suitable encoding technique is cardinal to training a robust DL model. If an encoding method reflects too much information specific to a sample or an experiment, the DL model will fail to generalize. Conversely, the model will fail to learn if an encoding method removes too much information from the original data. Therefore, the essence of encoding is to efficiently represent the generalizable properties of the original data relevant to a given task. In the following, we present representative encoding techniques for RB data that have been successfully utilized in multiple studies.

Nucleic acid sequences

Several types of biological data involving RNA can be leveraged to predict biological processes, including gene expression, RBP binding affinities, and regulatory elements. Among them, nucleic acid sequences are the most widely used data type since, in principle, numerous biological properties can be predicted solely from sequence data. The most common way of encoding nucleic acid sequences is to one-hot encode the bases, which may allow the DL model to automatically capture relevant features (Fig. 1a). In this simple one-hot encoding approach, each base is represented by a binary vector with a single 1, the position of which represents each nucleotide base. For instance, base A is encoded as [1, 0, 0, 0], and base C is encoded as [0, 1, 0, 0]. Moreover, some studies further improved simple mononucleotide one-hot encoding by integrating matrices corresponding to dinucleotides, trinucleotides, purines, pyrimidines, strong hydrogen bonds, weak hydrogen bonds, amino groups, and ketone groups70,71. Other studies have enriched the encoding through one-hot encoding of the k-mers in sliding windows66 and one-hot encoding of the pairwise alignment results72.
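
A minimal sketch of this mononucleotide one-hot encoding is shown below; the base ordering is an arbitrary convention.

```python
import numpy as np

BASES = "ACGU"  # use "ACGT" for DNA sequences

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a sequence as a (length, 4) binary matrix; unknown bases stay all-zero."""
    encoding = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    for i, base in enumerate(seq):
        idx = BASES.find(base)
        if idx >= 0:
            encoding[i, idx] = 1.0
    return encoding

print(one_hot_encode("ACGUN"))
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]]
```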

Fig. 1: Processing RB data into input features of DL models.

a Nucleic acid sequences can be encoded using one-hot encoding, word2vec, embedding layers, or k-mer counts. b When encoding the gene expression data, the normalized expression level or rank vector can be used as an input feature. Gene correlation matrices are also often used to capture gene‒gene interactions. c RNA structure data can be represented using the minimum free energy (MFE), secondary structure matrices, base-pairing probability matrices, or 3D point clouds. d RNA–protein interactions can be represented by counting known binding motifs or encoding CLIP results as binary binding site tensors or continuous coverage tensors. RNA‒RNA interactions can be encoded as 3D matrices to convey base pairing information.

Although one-hot encoding is the most common method for encoding nucleic acid sequences, many studies have adopted techniques from natural language processing to produce more informative encodings of nucleotide sequences. By considering biological sequences as sentences and k-mers in the sequence as words, biological sequences can be encoded using the natural language encoding technique termed word2vec73. In word2vec, each word is represented by a large vector called a word embedding. A word embedding is learned by either predicting a word of interest from the surrounding words or predicting surrounding words from a word. Several studies have used word2vec to encode sequence data in RB to improve performance72,74,75,76. Word2vec can be generalized from single words to entire paragraphs. This technique, named doc2vec77, has also been adopted in RB to generate embeddings for whole gene or miRNA sequences78. Word embeddings can also be generated during training time by randomly initializing embeddings and updating these vectors via a neural network layer called the embedding layer. This method can generate word embeddings that are more relevant to the objective of the model. For example, m6Anet employs this scheme to generate motif embedding vectors79. In addition to the encoding methods mentioned above, counts of nucleotide monomers and oligomers have also been used to encode sequence data80,81.
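
The sketch below illustrates the embedding-layer route for overlapping 3-mers; the vocabulary construction and embedding dimension are illustrative assumptions, and pretrained word2vec vectors could replace the randomly initialized embedding table.

```python
import torch
import torch.nn as nn
from itertools import product

K = 3
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGU", repeat=K))}  # 64 possible 3-mers

def kmer_tokens(seq: str) -> torch.Tensor:
    """Convert a sequence into the indices of its overlapping 3-mers."""
    ids = [VOCAB[seq[i:i + K]] for i in range(len(seq) - K + 1)]
    return torch.tensor(ids, dtype=torch.long)

# A trainable embedding layer maps each 3-mer index to a dense vector,
# which is learned jointly with the downstream model.
embedding = nn.Embedding(num_embeddings=len(VOCAB), embedding_dim=16)
tokens = kmer_tokens("AUGGCUAAGC")
vectors = embedding(tokens)          # shape: (8, 16)
print(vectors.shape)
```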

Expressions

Expression data are important input features since the expression levels of regulatory genes and landmark genes are correlated with diverse biological processes. These data are primarily derived from RNA-seq and CAGE. The simplest method for encoding expression data is to use normalized expression values from RNA-seq data82,83 (Fig. 1b). Aptardi extended this simple encoding method by considering expression value fluctuations in a local region. Specifically, the authors partitioned the region into three equal-sized bins and calculated the differences in expression values between neighboring partitions81. Other studies, such as DMIL-IsoFun and HiCoEx, encoded global interactions of expression data with a 2-dimensional matrix84,85.
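
The sketch below illustrates two simple encodings in this spirit: z-scoring log-transformed expression values, and taking differences between neighboring bins of a local coverage profile. The normalization choices and bin count are illustrative assumptions rather than the scheme of any specific published model.

```python
import numpy as np

def encode_expression(tpm: np.ndarray) -> np.ndarray:
    """Log-transform and z-score a vector of normalized expression values."""
    logged = np.log1p(tpm)
    return (logged - logged.mean()) / (logged.std() + 1e-8)

def local_bin_differences(coverage: np.ndarray, n_bins: int = 3) -> np.ndarray:
    """Split a local coverage profile into equal bins and take neighbor differences."""
    bins = np.array_split(coverage, n_bins)
    means = np.array([b.mean() for b in bins])
    return np.diff(means)            # yields n_bins - 1 features

expr = np.array([12.0, 0.5, 300.0, 45.0])
print(encode_expression(expr))
print(local_bin_differences(np.arange(90, dtype=float)))
```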

Structures

RNA structures are fundamental data for investigating the interactions, functions, and stability of RNAs. Many DL methods utilize the RNA structure in various ways80,82 (Fig. 1c). The simplest way to encode RNA structure is to use the minimum free energy (MFE)86,87. However, as a scalar value, the MFE conveys limited information, and DL models may benefit from more information-rich vector representations of RNA structures. There are several methods for encoding RNA structure data as a 1D tensor. For instance, the nucleotide-wise probability of structural contexts, such as hairpin loops, inner loops, and multiloops, can be used to obtain a continuous or binarized encoding tensor of RNA structures88,89. Furthermore, the frequency and distance of these structural contexts can be added to the encoding86. Apart from encoding structural contexts, the 3D coordinates of each atom or nucleotide in a tertiary structure can be encoded as a 1D tensor90. RNA structure can also be encoded as a 2D tensor of pairwise base-pairing probabilities72,85,91.
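
As a minimal illustration, the sketch below converts a secondary structure in dot-bracket notation (assumed to come from a folding tool) into a per-nucleotide paired/unpaired vector and a 2D base-pair contact matrix.

```python
import numpy as np

def pairing_features(dotbracket: str):
    """Encode a dot-bracket structure as a 1D paired/unpaired vector
    and a symmetric 2D base-pair contact matrix."""
    n = len(dotbracket)
    paired = np.zeros(n, dtype=np.float32)
    contact = np.zeros((n, n), dtype=np.float32)
    stack = []
    for i, ch in enumerate(dotbracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            paired[i] = paired[j] = 1.0
            contact[i, j] = contact[j, i] = 1.0
    return paired, contact

# Example hairpin structure; in practice this would be predicted by a folding tool.
paired, contact = pairing_features("((((....))))")
print(paired)          # 1 for paired positions, 0 for loop positions
print(contact.sum())   # 8.0: four base pairs, counted symmetrically
```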

Bindings

Another type of data frequently used in DL for RB is the binding between RNA and regulatory molecules, including TFs, miRNAs, and RBPs, which play crucial roles in biological processes (Fig. 1d). One simple method for encoding protein‒RNA/DNA binding data is to count the known binding motifs in the region of interest80,92. However, this method does not accurately reflect the in vivo interactions between biological molecules. Several studies have addressed this issue by utilizing in vivo binding data derived from CLIP experiments. To encode the information from CLIP data, coverage vectors93 and binarized binding matrices94 have been used. Unlike protein–RNA interactions, RNA–RNA interaction data are often encoded as multidimensional tensors to convey base-pairing information. One way to encode miRNA‒target interactions is via a 3D matrix, in which the first two dimensions represent the miRNA and target sequence positions, and the last dimension represents the one-hot encoded combination of miRNA bases and target bases95.
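
A minimal sketch of this pairwise one-hot scheme is given below; the dimension ordering and base alphabet are assumptions made for illustration.

```python
import numpy as np

BASES = "ACGU"

def pairwise_one_hot(mirna: str, target: str) -> np.ndarray:
    """3D encoding of a miRNA-target pair with shape (len(mirna), len(target), 16),
    where a 1 marks the base combination at each position pair."""
    tensor = np.zeros((len(mirna), len(target), 16), dtype=np.float32)
    for i, m in enumerate(mirna):
        for j, t in enumerate(target):
            if m in BASES and t in BASES:
                tensor[i, j, BASES.index(m) * 4 + BASES.index(t)] = 1.0
    return tensor

enc = pairwise_one_hot("UGAGGUA", "UACCUCA")
print(enc.shape)        # (7, 7, 16)
```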

Deep learning architectures for leveraging RNA biology data

The DL architecture is the structure of neural networks, including connections between layers and operations in each layer. According to the universal approximation theorem, a neural network with a single sufficiently large hidden layer and a nonlinear function can approximate any continuous function96. However, such shallow architectures usually fall short in practical tasks, as they fail to learn or require too many neurons. Nevertheless, such problems have been alleviated with improved architectures. For instance, before the introduction of an efficient architecture named the convolutional neural network (CNN) by LeCun et al.97, DL was not widely applied to images due to the inefficiency and suboptimal performance. After it was introduced, the CNN architecture was widely adopted in computer vision tasks and exceeded human-level performance6. Therefore, choosing a suitable architecture that can work efficiently with a given data type and task is crucial. Here, we review DL architectures that have been shown to work effectively with various types of biological data and tasks.

Multilayer perceptrons

The multilayer perceptron (MLP) is the simplest form of a DL model and comprises multiple hidden layers of neurons (Fig. 2a). Each neuron receives input from neurons in the previous layer, performs a linear transformation, and applies a nonlinear activation function. During training, the weights of each neuron are iteratively updated via backpropagation, which involves the computation of the gradient of the loss function for each training example98. MLPs can learn to generate informative vector representations of the input data, which allows them to perform predictive tasks such as regression and classification. Several early DL models in RB have been developed based on the MLP architecture for various tasks, including lncRNA prediction, gene expression prediction, tissue-specific alternative splicing prediction, and structure-based RBP binding site prediction80,99,100. Another neural network architecture similar to the MLP is the deep belief network (DBN), where each layer is initially trained individually and then trained together to construct and train a deeper neural network101. The DBN has been utilized for lncRNA identification99, tumor clustering40, and RBP binding site prediction102.
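
A minimal PyTorch sketch of an MLP classifier over a biological feature table is shown below; the layer sizes and data are placeholders.

```python
import torch
import torch.nn as nn

# A small MLP: two hidden layers with nonlinear activations,
# ending in a probability for a binary label.
mlp = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)

features = torch.randn(16, 128)               # 16 examples, 128 input features
labels = torch.randint(0, 2, (16, 1)).float()

optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)
optimizer.zero_grad()
loss = nn.BCELoss()(mlp(features), labels)    # backpropagation updates the weights
loss.backward()
optimizer.step()
print(loss.item())
```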

Fig. 2: DL architectures for RB models.

a MLPs can make probabilistic predictions from biological feature tables. b CNNs can predict biological features from one-hot-encoded biological sequences by capturing local patterns. ResNet, Inception, and dilated convolution can improve the performance and increase the input size. c RNNs can process embeddings of RNA sequences to provide basewise predictions of biological features. d GNNs can operate on gene–gene interaction networks derived from gene correlation matrices to predict genewise biological features. e Transformers can capture long-range interactions from genomic sequences of several hundred kilobases using multi-head self-attention.

Convolutional neural networks

The local patterns and interactions observed in tensor encodings of biological data, such as one-hot sequences, are fundamental features that are relevant to biological processes. For instance, the motifs for regulatory elements can be discovered by examining local interactions on DNA sequences, and RBP binding motifs can be discovered from local interactions of RNA sequences. Among DL architectures, CNNs have achieved robust performance in learning and aggregating local interactions97,103. Inspired by this success, CNN-based models have been frequently utilized for DL in RB. CNNs process data using weight matrices called convolution kernels, whose elements are randomly initialized (Fig. 2b). The core of the CNN is the convolution layer, which can be understood as the convolution kernels sliding across the input tensor. In each sliding step, the sum of the elementwise product for each kernel and each patch of the input tensor is calculated. By iterating through this process, the kernels can be updated to capture task-relevant local interactions in the input data. When biological sequences are used as inputs, convolution kernels can function as automatically learned motif detectors104. Hence, CNNs are naturally suited to tasks driven by local, motif-related features, such as predicting regulatory elements105,106,107, protein–nucleic acid interactions108, nucleic acid–nucleic acid interactions95, 5′ UTR strength109, and polyadenylation (poly(A)) sites70,110. The CNN architecture has outperformed the classical machine learning and MLP architectures in various RB tasks. Several papers benchmarked the CNN and other architectures for specific RB tasks and demonstrated the robust performance of the CNN111,112.
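
The sketch below shows a minimal 1D CNN over one-hot encoded sequences in this spirit: the convolution kernels act as learned motif detectors, global max pooling keeps the strongest match of each motif, and a linear layer produces the prediction; all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class MotifCNN(nn.Module):
    """1D CNN over one-hot sequences: kernels act as learned motif detectors."""
    def __init__(self, n_kernels=32, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=4, out_channels=n_kernels,
                              kernel_size=kernel_size, padding=kernel_size // 2)
        self.head = nn.Linear(n_kernels, 1)

    def forward(self, x):                 # x: (batch, 4, length)
        h = torch.relu(self.conv(x))      # per-position motif match scores
        h = h.max(dim=-1).values          # strongest match of each motif (global max pooling)
        return torch.sigmoid(self.head(h))

seqs = torch.randint(0, 2, (8, 4, 200)).float()   # placeholder batch of one-hot sequences
print(MotifCNN()(seqs).shape)                     # (8, 1)
```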

Sophisticated biological properties require models to capture long-range interactions. This can be achieved by stacking additional convolution layers. However, simply adding more layers to the CNN results in suboptimal performance since the increased distance between the input and the output impedes the propagation of the gradient through the model. To overcome this limitation, He et al. proposed ResNet6, which utilizes skip connections to connect the input directly to the deeper layers. This architecture enables the stable training of deep CNN models. ResNets have been used in RB to capture longer-range interactions among detected motifs in biological sequences, for example, to predict splicing113,114, regulatory activities105, and N6-methyladenosine (m6A) modification115. ResNets can capture longer interactions than can simple CNNs but usually struggle to capture interactions between elements thousands of bases apart, such as splice donor and acceptor sequences. Therefore, several RB tasks, such as splicing prediction, require additional methods to capture long-range interactions. This problem can be addressed by adopting dilated convolution. In dilated convolution, each element of the kernel is applied to every n-th element of the input, increasing the range in which interactions can be captured116. Dilated convolution has been used in various RB tasks that require capturing long-range interactions54,105,113,115. In addition, integrating multiple convolution layers of different sizes, as suggested in the Inception architecture, can improve the performance of CNNs by allowing them to capture interactions of diverse sizes117. The idea was adopted in DeepExpression to predict gene expression from sequences107 and in CUP-AI-Dx to classify metastatic cancers using RNA-seq118.
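
A minimal sketch of a dilated residual block combining these two ideas is shown below; the channel count and dilation are illustrative, and published models typically stack many such blocks with increasing dilations to widen the receptive field.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Residual block built from dilated convolutions; a simplified illustration."""
    def __init__(self, channels=32, dilation=4):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3,
                               dilation=dilation, padding=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3,
                               dilation=dilation, padding=dilation)

    def forward(self, x):
        h = torch.relu(self.conv1(x))
        h = self.conv2(h)
        return torch.relu(h + x)     # skip connection keeps gradients flowing

x = torch.randn(2, 32, 1000)
block = DilatedResidualBlock()
print(block(x).shape)                # (2, 32, 1000)
```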

Recurrent neural networks

The natural analogy between a convolutional kernel and a motif detector has led many researchers to use CNNs to process biological sequence data. However, it is common in natural language processing to process sequence data with recurrent neural networks (RNNs)119. RNNs process input sequences recursively in order (Fig. 2c). In each recursion, the recurrent unit receives information from the prefix of the input sequence, processes the information, and passes it to the next unit. This recursive process results in the intrinsic ability of RNNs to capture interactions among sequence elements120, such as cis-regulatory elements. However, in practice, RNNs exhibit suboptimal performance when processing long input sequences. For long inputs, an extended number of recursion steps can lead to excessive accumulation or forgetting of past information, which leads to a failure to learn. Long short-term memory (LSTM) addresses this problem by automatically learning what proportion of information to forget or remember from the past sequence121. In RB, several studies have utilized LSTM or similar architectures, such as gated recurrent unit (GRU)122, to predict miRNA–gene associations78, poly(A) sites81, coding potentials63, and differential gene expression123. Notably, while most RNN models for natural language processing operate in a single direction, many RNN models for RB utilize bidirectional LSTM (BiLSTM)124 to exploit the bidirectional nature of genomic sequences and biological interactions. In addition, an RNN can be hybridized with a CNN, which, in principle, allows the CNN to detect local features and the RNN to capture higher-level interactions between captured features. This approach is also often used in RB for tasks such as alternative splicing prediction125, isoform function prediction53, RBP binding-altering variant identification126, and poly(A) site prediction127. A benchmark study demonstrated the superior performance of an RNN/CNN hybrid over a CNN or an RNN in RBP binding prediction128.
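
The sketch below illustrates a minimal CNN-BiLSTM hybrid of this kind: a convolution extracts local features from a one-hot sequence, and a bidirectional LSTM relates them across the sequence to produce per-position predictions; the sizes are illustrative.

```python
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    """CNN detects local features; a BiLSTM relates them across the sequence."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(4, 32, kernel_size=9, padding=4)
        self.lstm = nn.LSTM(input_size=32, hidden_size=64,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 64, 1)     # 2x for the two directions

    def forward(self, x):                    # x: (batch, 4, length)
        h = torch.relu(self.conv(x))         # (batch, 32, length)
        h, _ = self.lstm(h.transpose(1, 2))  # (batch, length, 128)
        return torch.sigmoid(self.head(h))   # per-position predictions

x = torch.randn(2, 4, 300)
print(CNNBiLSTM()(x).shape)                  # (2, 300, 1)
```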

Transformers

Both CNNs and RNNs have intrinsic limitations when processing sequence data: they can only capture long-range interactions indirectly through multiple layers or steps. This prevents CNNs and RNNs from learning the dynamic context of the sequence. To overcome these limitations, Vaswani et al. introduced the transformer architecture129. The transformer directly captures long-range interactions among sequence elements by using the self-attention mechanism (Fig. 2d). Self-attention captures every possible pairwise interaction between every sequence element using an attention matrix. Each weight of the matrix represents the relevance of each pairwise interaction to the task. During training, the weights of the attention matrix are updated to reflect the sequence context. Hence, transformers can better capture long-range interactions and dynamic contexts in sequences than CNNs and RNNs. Numerous studies have demonstrated the superiority of transformers in processing natural language, images130, and even biological data8. In RB, the transformer architecture has been applied to poly(A) signal prediction131, gene expression prediction132, cell type annotation44, network-level disease modeling45, and circRNA–miRNA interaction prediction133.
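
The core operation is scaled dot-product self-attention, sketched below for a single head without masking; the projection matrices would normally be learned parameters, and real transformers add multiple heads, feedforward layers, and positional information.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.
    x: (batch, length, d_model); w_*: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # all pairwise interactions
    attn = F.softmax(scores, dim=-1)                         # attention matrix
    return attn @ v, attn

d_model, d_head = 32, 16
x = torch.randn(2, 100, d_model)                 # e.g., embedded sequence windows
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)                     # (2, 100, 16) (2, 100, 100)
```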

Graph neural networks

Several types of RB data, including contact, structure, coexpression, and base-pairing data, can be represented as graphs instead of sequences. Representing biological data as a graph allows a flexible representation of interactions between biological entities. Graph neural networks (GNNs) capture information from such graph-type data. In a GNN, a vector corresponding to each node is updated by aggregating the information from connected nodes (Fig. 2e)134,135,136. The GNN can learn to capture interactions between nodes by repeating such updates137. In RB, GNNs have been applied to predict isoform function from isoform association graphs84, to predict RBP binding sites from graph representations of RNA secondary structures91, to integrate gene interaction networks and other biological networks138, and to predict gene coexpression from gene contact graphs derived from Hi-C data85.
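
A minimal sketch of one message-passing update is shown below: each node's vector is replaced by a transformed average over its neighbors (including itself) using a normalized adjacency matrix. This simplified layer is for illustration only and does not correspond to any specific published model.

```python
import torch
import torch.nn as nn

def gcn_layer(node_feats, adj, weight):
    """One message-passing step: aggregate neighbor features, then transform.
    node_feats: (n_nodes, d_in); adj: (n_nodes, n_nodes); weight: (d_in, d_out)."""
    adj_hat = adj + torch.eye(adj.shape[0])            # add self-loops
    deg = adj_hat.sum(dim=1, keepdim=True)
    aggregated = (adj_hat / deg) @ node_feats          # mean over neighbors
    return torch.relu(aggregated @ weight)

# Toy gene-gene interaction graph with 4 genes and 8-dimensional node features.
adj = torch.tensor([[0., 1., 0., 1.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 0.],
                    [1., 0., 0., 0.]])
feats = torch.randn(4, 8)
weight = nn.Parameter(torch.randn(8, 16))
print(gcn_layer(feats, adj, weight).shape)             # (4, 16)
```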

Applications of deep learning in RNA biology

We have introduced crucial factors in developing DL models for RB, including large-scale data sources, encoding techniques, paradigms, and architectures. In this section, we demonstrate how these factors collectively contribute to building effective DL models for RB by reviewing specific examples. We review successful DL models that have achieved robust performance in important RB tasks or have showcased the potential of DL for biological discoveries (Fig. 3). By reviewing these models, we not only demonstrate the competence of DL in RB research but also suggest good practices for designing and training DL models that effectively leverage biological data.

Fig. 3: Applications of DL in RB.

DL models provide insights into diverse biological processes involving RNA. In epitranscriptomics, DL models can predict and identify RNA modification sites. To understand pre-mRNA processing, DL models can predict alternative splicing and alternative polyadenylation sites, as well as isoform functions. In ncRNA biology, DL models can identify ncRNA targets, predict coding potential, and identify lncRNA precursors. To understand the biology of RBPs, DL models can predict RBP binding sites and protein‒RNA complex structures. To understand gene expression regulation, DL models can predict gene expression levels, coexpression, and regulatory elements. For medical applications, DL models can diagnose diseases, predict RNA degradation, and predict gene editing efficacy.

Noncoding RNAs

Noncoding RNAs (ncRNAs) are essential layers of transcriptional, posttranscriptional, and translational gene regulation and are crucial for tissue functions and developmental programs139,140. Diverse types of ncRNAs have been associated with cancer and genetic diseases141,142. One of the most representative types of ncRNAs is microRNAs, which are short RNAs produced from RNA hairpins. miRNAs are loaded into Argonaute (AGO) proteins and form RNA-induced silencing complexes (RISCs), which target mRNAs through complementary base pairing and reduce their expression143. The development of a model that can accurately predict the genome-wide regulatory effect of an arbitrary miRNA is a core task in RB that can enable a network-level understanding of miRNA-mediated gene regulation and pave the way for miRNA-based therapeutic development144. Several attempts to construct such a model using classical machine learning and DL algorithms have been made, underscoring the importance of the task. Among these models, TargetScan is the current state-of-the-art model. This model has achieved significantly improved accuracy in predicting miRNA targeting efficacy compared to that of previous machine learning methods based on thermodynamic models or correlative approaches95. TargetScan utilizes a CNN-based module to predict the affinities between an AGO–miRNA complex and target mRNAs. To alleviate the scarcity of experimental affinity data, the authors generated a large affinity dataset through AGO-RNA bind-n-seq (AGO-RBNS). The trained affinity predictor model was combined with a biochemical model to predict the repression of individual mRNAs. The model was more accurate than the previous versions of TargetScan, which did not use DL. Although TargetScan achieved robust performance in predicting miRNA targeting efficacy, it still explained only a small fraction of the variability in resulting mRNA expression changes from miRNA transfections. Therefore, this biologically important task needs further improvement, possibly through a more sophisticated approach to utilizing DL.

Long noncoding RNAs (lncRNAs) are ncRNAs longer than 200 nt. They bind to DNA, RNA, and proteins and regulate gene expression via various mechanisms, including chromatin remodeling, gene silencing and activation, nuclear organization, and RNA turnover regulation139. lncRNAs are transcribed from various sources, including intergenic regions, introns, antisense strands, and enhancers. Accurately distinguishing lncRNAs from mRNAs is an important challenge in RB but is often complicated by the presence of various isoforms145. Moreover, it is equally important to predict the functions of identified lncRNAs. To address both tasks, LncADeep utilizes separate modules for each task99. The DBN-based lncRNA identification module was trained on annotated transcript sequences to learn the features distinguishing lncRNAs from mRNAs. The MLP-based functional annotation module was trained on lncRNA‒protein interaction data from the NPInter database146. Functional annotations of the identified lncRNAs were generated through pathway enrichment analyses of the predicted lncRNA-binding proteins using the KEGG147 and Reactome148 pathway databases. An average of 25 KEGG and 67 Reactome pathway annotations were associated with each identified lncRNA, indicating the complexity of lncRNA functions. The resulting model outperformed previous models in both tasks. While lncRNAs pose a unique challenge in RB due to their diverse regulatory mechanisms and sequence similarity to mRNAs, studies have shown that DL can be leveraged to elucidate the complex biology of lncRNAs.

Circular RNA (circRNA) is a subtype of lncRNA, which can accumulate in specific cell types due to the increased stability rendered by a closed ring structure. These accumulated circRNAs often act as sponges for miRNAs and RBPs, inhibiting their regulatory function and adding another layer of posttranscriptional gene regulation149. By modulating posttranscriptional gene regulators, circRNAs are associated with various diseases, such as cancer150. Therefore, identifying circRNAs is an important task. Distinguishing circRNAs from linear lncRNAs based on their sequences is challenging because the circularization process can often be inferred not from the transcript sequence itself but from the surrounding genomic context. This difficulty is reflected in the suboptimal performance of methods that distinguish circRNAs from other lncRNAs. To develop a more accurate circRNA identification model, circDeep uses reverse complement matching and conservation information from flanking regions of input sequences74. The circDeep architecture is a hybrid CNN-BiLSTM model that captures both local interactions and global interactions in input sequence features. This model was trained on 32,914 human circRNAs from the circRNADb151 and other lncRNAs from GENCODE152 to learn the features distinguishing circRNAs from other lncRNAs. Since circRNAs are not fully annotated in standard annotations such as GENCODE and Ensembl, it is necessary to refer to specialized databases such as circBase to construct circRNA-related training datasets153. The resulting model significantly improved the accuracy of the identification of circRNAs.

The utilization of DL to identify or distinguish major species of ncRNAs has significantly improved accuracy compared to previous machine learning methods (Table 2 and Supplementary Table 1). Nevertheless, functional annotation of identified ncRNAs still depends heavily on traditional enrichment analyses and thus remains a challenge for DL. To discover the specific functions or biological roles of ncRNAs, DL methods that can learn from pathways or interaction networks involving ncRNAs, such as GNNs, could be adopted.

Table 2 DL models for ncRNA biology.

Epitranscriptomics

Epitranscriptomics is a field of RB studying various RNA modifications, which are crucial components of posttranscriptional gene expression regulation. RNA modification affects splicing, poly(A), mRNA export, RNA degradation, and translation efficiency154,155. Through these biological processes, RNA modification is associated with cell differentiation, cancer progression, and neurological development156,157. Therefore, understanding the functions and regulatory mechanisms of RNA modification is a significant objective in RNA biology. Investigating the biology of RNA modifications requires profiling the deposition of RNA modifications at the transcriptomic scale. For this purpose, high-throughput experimental methods, including m6A-seq11, miCLIP158, and bisulfite-seq159, have been developed. Although these methods enable systematic investigations of epitranscriptomic landscapes, they often suffer from low resolution, high false positive rates, motif biases, and low concordance between experiments160. To overcome this limitation, researchers have applied DL methods to profile the epitranscriptomic landscape. Transcriptome-wide prediction of RNA modifications via DL can improve the understanding of posttranscriptional gene regulation by elucidating the underlying mechanisms, associated variants, and phenotypic effects of RNA modifications161.

One approach to this objective is to train DL models that predict RNA modification sites from transcript sequences. This approach has been applied to various RNA modifications, including m6A115, 5-methylcytosine (m5C)162, pseudouridine163, and 2′-O-methylation164. One notable example is iM6A, which learned contextual information around the experimentally validated m6A modification sites with a ResNet-based model115. The inputs were pre-mRNA sequences with 5000 flanking nucleotides, and the labels were generated based on m6A sites validated through m6A-CLIP experiments. The iM6A model outperformed classical machine learning methods in predicting m6A modifications and generalized well to the sites validated with m6A-label-seq, MAZTER-seq, m6ACE-seq, and miCLIP2. By analyzing the model, the authors hypothesized that m6A-associated variants accumulate 50 nt downstream of the m6A site. They validated this hypothesis by analyzing an independently generated experimental dataset. They also showed that previously known pathogenic single-nucleotide variants (SNVs) are associated with changes in m6A deposition and suggested that codon usage may affect m6A deposition. While many RNA modification prediction models focus on predicting a single type of modification, multiRM predicts twelve widely occurring modifications with a single model76. multiRM was trained on twenty epitranscriptome profiles generated from fifteen different technologies. The model utilizes three embedding schemes: a 1D convolution, a hidden Markov model, and word2vec. The embedding vectors generated by each scheme were fed into LSTM and attention methods to learn interactions among features relevant to different modifications. Analyses of attention matrix weights revealed modification motifs that were highly concordant with known RNA modification motifs. Moreover, the attention weights of different types of modifications were strongly correlated, indicating crosstalk among different modifications, as reported in previous studies. The above studies demonstrate the feasibility of elucidating the mechanisms regulating the epitranscriptomic landscape by developing DL models that predict the deposition of RNA modifications.

Another approach for applying DL to profile modification depositions is to develop a DL model that captures modification signatures from direct RNA sequencing (DRS) data. DRS is a technique in which native RNA is sequenced without the need for reverse transcription165. In this method, the electrical current changes during the translocation of an RNA molecule inside a nanopore are measured to infer nucleotide identity. Since canonical and modified RNA bases cause different electrical current shifts, DRS can be used to identify modified bases via DL166. m6Anet identifies m6A modifications on transcripts with DRS data79. m6Anet uses an embedding layer to encode 5-mers, and predicts the modification probability of each site using an MLP. The model first predicts the probabilistic measure for each read and then determines the site-level probability from these measures. This technique is called multiple instance learning. m6Anet performed better than previous methods and generalized well to other cell lines and species. In addition to m6A, Dinopore predicts adenosine-to-inosine (A-to-I) editing from DRS data69. Dinopore was constructed based on ResNet and uses multiple branches with various convolutional filter sizes to capture local interactions with different spans. This method outperformed previous methods and showed robust interspecies generalizability. As shown above, the development of DL models that accurately predict RNA modifications has enabled researchers to systematically study the regulatory elements relevant to RNA modifications and phenotypic or disease consequences of RNA modifications (Table 3 and Supplementary Table 2). Integrating diverse types of transcriptomic data, such as gene expression, RNA structure, and RBP binding data, will facilitate the development of multimodal DL models that learn systematic knowledge of the epitranscriptome, expanding the understanding of posttranscriptional regulation.

Table 3 DL models for epitranscriptomics.

RNA-binding proteins

RNA-binding proteins (RBPs) control various aspects of gene expression regulation, including mRNA decay, mRNA-ncRNA interaction, RNA modification, translation efficiency, and RNA processing167. The main challenge in RBP biology is to model the binding preferences and predict the binding sites of RBPs, which are crucial for understanding posttranscriptional and translational regulatory mechanisms. To address this challenge, RBP binding prediction models have been commonly trained using experimental data from RNAcompete168 and CLIP-based experiments9, such as HITS-CLIP169, PAR-CLIP170, and eCLIP171. DeepBind is the foundational research for utilizing DL in RBP biology104. In DeepBind, a CNN motif detector predicts protein-binding motifs from genomic sequences, and the resulting motif feature vector is fed into feedforward layers to yield binding prediction scores. Each DeepBind model accounts for a single type of protein, and the models were trained for 194 RBPs. DeepBind achieved a robust performance and showed that the RBP motif preference knowledge learned from in vitro experiments can be generalized to in vivo transcriptomes via DL.

While DeepBind outperformed previous approaches using a motif detector, subsequent studies showed that integrating contextual information from RNA structure and sequences could improve RBP binding prediction. The biological rationale for this approach is that the binding affinity of RBPs for an RNA target is influenced not only by the local binding motif but also by the sequence composition and structural context of the target RNA172,173. Deepnet-RBP utilizes a multimodal DBN that receives primary, secondary, and tertiary structures102. The primary and secondary structures were encoded as k-mer count vectors, and the tertiary structures were encoded as structural motif indicator vectors. The secondary and tertiary structures were computationally predicted. The model outperformed classical machine learning models and predicted several potential secondary and tertiary structural motifs. Similarly, iDeepS utilized computational RNA structure predictions from RNAshapes and slightly outperformed DeepBind174. However, predicting RNA structures computationally is burdensome and often inaccurate. Because of this limitation, iDeepE attempted to capture structural information from sequences surrounding the RBP binding sites66. Using two CNNs with local and global resolution filters, iDeepE outperformed previous DL models, including Deepnet-RBP.

While the above approaches use dense vector representations, molecular graphs are commonly used in biology and chemistry to model the structures of molecules, including RNA and proteins134,175. RPI-Net utilizes a GNN to learn a graph representation of RNA structure and predict RBP binding, outperforming previous DL models91. Moreover, the study pointed out that some CLIP techniques, including PAR-CLIP and HITS-CLIP, may introduce sequence bias in the training set and that previous machine learning and DL models picked up the bias. When building the training set for RPI-Net, the authors de-biased the PAR-CLIP data by replacing the biased nucleotides with random bases. This example shows the importance of inspecting and de-biasing biological data when training DL models.

PrismNet, one of the latest DL models for RBP binding site prediction, incorporates in vivo RNA secondary structure data produced using icSHAPE-seq instead of computationally folded structures176. In this study, the squeeze-and-excitation network (SENet) architecture, which captures global interdependencies, was employed with ResNet177. PrismNet achieved robust performance in the benchmark conducted by the authors and was applied to discover disease-associated SNVs that affect RBP binding. In Zhou et al., a CNN-based RBP binding prediction model trained using eCLIP data was utilized to predict the effect of noncoding variants on autism spectrum disorders (ASDs), underscoring the utility of RBP prediction in medical genomics178.

Overall, various studies have shown that it is possible to model the binding preferences and binding sites of RBPs accurately (Table 4 and Supplementary Table 3). The performance of RBP prediction models was improved by incorporating sequence context and in vivo structure information, highlighting the importance of non-motif features in RBP binding. Recent breakthroughs in protein structure prediction have enabled accurate prediction of protein‒RNA interactions at the structural level179. Plausibly, integrating protein‒RNA complex structure data with current RBP binding site prediction methods could assist in accurate RBP binding site prediction. Improvements in RBP prediction models will promote the discovery of drug targets and biomarkers since RBP binding is a cardinal component of posttranscriptional and translational gene expression regulation.

Table 4 DL models for RNA‒protein interactions.

Pre-mRNA processing

Pre-mRNA processing is a complex process involving 5′ capping, splicing, and 3′ poly(A). These processes serve as important points of posttranscriptional gene expression regulation, often through alternative splicing and alternative poly(A) (APA)180,181. Dysregulation of pre-mRNA processing can cause various genetic disorders, including muscular dystrophy and progeria182,183. A major task in this domain is the prediction of splicing, which will assist in the discovery of novel disease-associated splicing variants. This task can be further divided into exon usage prediction and splice site prediction. Exon usage prediction involves predicting the exon inclusion rate, termed the percent spliced in (PSI), using predefined exon boundaries184. The sequences and abundances of isoforms produced by alternative splicing inferred from RNA-seq data are commonly used for training and evaluation. Leung et al. were among the first to apply DL for alternative splicing prediction. They utilized MLP to predict tissue-specific alternative splicing from manually extracted genomic sequence features and one-hot encoded tissue types80. In this study, alternative splicing prediction was formulated as a classification task using only three PSI categories—low, medium, and high—and therefore, the predictive capability was limited. In their subsequent work, Xiong et al. regarded the PSI as a continuously distributed variable92. Using this strategy, the authors discovered novel ASD-associated splice variants and inferred the pathogenic mechanism of specific SNVs in Lynch syndrome.

Splice site prediction involves the identification of splice sites from the genomic sequence. SpliceRover employs a CNN motif detector to predict splicing donor and acceptor site probabilities from genomic sequences of 15–402 nucleotides and outperformed previous machine learning methods185. Analysis of this model revealed that it not only detects well-known splice site motifs but also considers additional factors, such as the polypyrimidine tract. SpliceAI adopts dilated convolution kernels to utilize longer sequence contexts of up to 10,000 nucleotides113. The authors showed that capturing long-range context improved splice site prediction by comparing the performance of models with varying input lengths. The SpliceAI prediction score was used to discover a novel type of splicing variant, termed the cryptic splice variant, which creates splice sites that are weaker than those generated by canonical splice site variants. In contrast to canonical splice variants, which exert similar effects across tissues, cryptic splice variants alter splicing in a tissue-specific manner and are associated with intellectual disability and ASD. This example underscores how developing and utilizing DL models can assist in novel scientific discoveries in RB.
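The appeal of dilated convolutions is that exponentially increasing the dilation widens the receptive field without a matching increase in parameters. Below is a minimal PyTorch sketch of such a convolution tower together with the standard receptive-field calculation; the channel count, kernel size, and depth are illustrative and are not SpliceAI's published hyperparameters.

```python
import torch
import torch.nn as nn

class DilatedConvTower(nn.Module):
    """Stack of 1D convolutions whose dilation doubles at every layer."""

    def __init__(self, channels: int = 32, kernel_size: int = 11, n_layers: int = 8):
        super().__init__()
        layers, dilation = [], 1
        for _ in range(n_layers):
            layers += [
                nn.Conv1d(channels, channels, kernel_size,
                          dilation=dilation, padding="same"),
                nn.ReLU(),
            ]
            dilation *= 2                       # doubling dilation widens the context cheaply
        self.tower = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, channels, length)
        return self.tower(x)

# Receptive field of the stack: 1 + (kernel_size - 1) * sum of dilations.
kernel_size, dilations = 11, [2 ** i for i in range(8)]
print(1 + (kernel_size - 1) * sum(dilations))              # 2551 positions of sequence context

x = torch.randn(1, 32, 5000)
print(DilatedConvTower()(x).shape)                          # torch.Size([1, 32, 5000])
```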

Predicting polyadenylation is another major task for DL in mRNA processing, since APA is an essential regulatory mechanism of differential gene expression and is associated with various diseases. Poly(A) position data from PolyA-Seq186 and 3P-seq187 are commonly used to train poly(A) prediction models. DeepPolyA and Leung et al. are initial works that adopted different strategies for utilizing DL to predict poly(A) from genomic sequences112,188. DeepPolyA formulates this task as a binary classification problem between poly(A) sites and non-poly(A) sites, whereas Leung et al. formulate it as a regression of poly(A) site strength. Both models outperformed classical machine learning approaches for predicting poly(A) sites. While these models were trained on in vivo datasets, a subsequent CNN model, APARENT, utilizes a large in vitro dataset generated by a massively parallel reporter assay (MPRA) to capture biological complexity189. APARENT formulates the task as isoform fraction regression and cleavage site distribution prediction. Using APARENT, it was possible to capture several determinants of poly(A) site selection, identify potentially pathogenic poly(A)-affecting variants, and design poly(A) signals that produce desired isoforms. The success of APARENT demonstrates the utility of massive-scale experiments in developing DL models for understanding complex biological processes. Overall, multiple studies have demonstrated the utility of DL in discovering the regulatory sequences and context features that determine pre-mRNA processing (Table 5 and Supplementary Table 4).

Table 5 DL models for pre-mRNA processing.

Gene expression

One fundamental goal in genomics and RB is to model gene expression regulation using computational models. This goal is often formulated as predicting the gene expression level given the genomic sequence190. Solving this task would allow a comprehensive and quantitative analysis of the regulatory functions of noncoding sequences and the prediction of noncoding variant effects in silico. Despite its importance, the complexity of the mechanisms regulating gene expression has impeded conventional machine learning models from solving this task. Consequently, multiple studies have utilized DL to address this fundamental task.

Basenji and ExPecto are pioneering works on this topic105,191. Both are CNN models that predict the expression level of a given genomic sequence. Basenji has a receptive field of 32 kb and utilizes dilated convolution to jointly predict the results of DNase-seq, ChIP-seq, and CAGE. Among these outputs, CAGE corresponds to gene expression. Unlike Basenji, ExPecto uses a sequential approach in which the first module of the model predicts epigenetic markers and TF binding from 40-kb DNA sequences, and the downstream modules predict gene expression levels. The epigenetic module was trained with ENCODE and Roadmap Epigenomics data, while the expression module was trained with CAGE data. Both Basenji and ExPecto yielded gene expression predictions that agreed well with experimental eQTL data. ExPecto was also utilized to predict disease risk alleles and to predict and prioritize causal genome-wide association study (GWAS) variants. Xpresso is another CNN developed to predict gene expression levels from sequence. Unlike Basenji and ExPecto, Xpresso did not utilize epigenetic data for training and relied strictly on the genomic sequence192. The authors showed that this sequence-only model performs comparably to previous models that use epigenetic features, suggesting that gene expression can be inferred from the genomic sequence alone. Moreover, Xpresso exhibits robust generalizability across cell lines and species. Notably, Xpresso was utilized to derive a novel hypothesis that CpGs are enriched in the core promoters of highly expressed genes, potentially expanding the role of CpG in transcriptional regulation and emphasizing that DL models can aid in novel biological discoveries.

In contrast to the models mentioned above, which all use the CNN architecture, Enformer was built on the transformer architecture to effectively capture long-range interactions between regulatory elements132. The input length for this model is 200 kb, which is significantly longer than that of previous models. The model predicts TF ChIP-seq, histone modification ChIP-seq, DNase-seq, ATAC-seq, and CAGE tracks from genomic sequences alone. Enformer outperformed Basenji2 and ExPecto in RNA expression prediction. In addition, this model exhibited robust eQTL variant effect and mutation effect prediction performance. The study verified the benefit of using the attention layer through an ablation study, demonstrating the effectiveness of transformers for processing long biological sequences. Moreover, Enformer showed the possibility of detecting candidate enhancers using attention weights and gradient-based saliency scores. Overall, multiple studies have shown that inferring gene expression solely from genomic sequences is possible using DL. This trend suggests that utilizing longer genomic sequences with an effective architecture would lead to better sequence-based modeling of gene expression.

Although sequence-only approaches have been successful in modeling gene expression, researchers have also modeled gene expression from non-sequence features, which could expand the dimensions of gene expression regulation research. D-GEX is an MLP model developed to predict gene expression using the expression levels of 1000 landmark genes, enabling whole-transcriptome profiling via economical Luminex bead assays83. GEARS is another DL model for learning gene–gene interactions193. In GEARS, the relative change in gene expression upon perturbation of specific genes is predicted using an MLP. DeepChrome is a CNN that predicts gene expression levels from histone modification profiles, and it outperformed classical machine learning models for histone-based expression prediction194. DEcode is a CNN that infers differential gene expression from the binding information of three types of regulatory molecules: RBPs, miRNAs, and TFs. These input features allow modeling at both the transcriptional and posttranscriptional levels of regulation195. The output of DEcode is a tissue-wide expression profile of each gene, composed of relative gene expression levels in 53 human tissues. The authors analyzed the model to measure the importance of each regulatory molecule in differential expression for each tissue. The regulatory significance of RBPs, miRNAs, and TFs inferred by analyzing DEcode was validated with in vivo loss-of-function mutation data and disease associations. This example shows how a DL model can prime the production of new biological knowledge.

Translational regulation is another important axis of gene expression regulation that can be probed by ribosome profiling and protein reporter assays. The sequence and structure of the 5′ UTR are key determinants of translational regulation. Optimus 5-Prime utilized a CNN to predict ribosome load from the 5′ UTR sequence, using an MPRA dataset for training196. Optimus 5-Prime accurately predicted the effect of the 5′ UTR sequence on ribosome load, and analysis of the model revealed that it captures translation initiation site (TIS) sequences, stop codons, and non-canonical start codons. The model was also utilized to design 5′ UTRs of desired strength, demonstrating how such DL models can be utilized for synthetic biology. TISnet is a DL model for TIS prediction from primary sequences and RNA structures, the architecture of which was adopted from PrismNet197. The study showed that both sequence and structure contribute to the accurate prediction of the translation initiation probability of a given AUG. By analyzing the downstream regions of predicted start codons, the authors hypothesized that a downstream hairpin structure dictates start codon selection. This hypothesis was experimentally validated, providing a critical point for rational protein design.

Overall, several studies have shown that DL can be leveraged to model gene regulatory mechanisms at the transcriptional, posttranscriptional, and translational levels (Table 6 and Supplementary Table 5). These DL models for gene expression modeling have paved the way for in silico identification of pathogenic variants, potential biomarkers, drug targets, and synthetic biology.

Table 6 DL models for gene expression.

Medical applications of RNA biology

Fundamental discoveries in RB have made multiple contributions to medicine, including RNA vaccines198, RNA-targeting drugs199,200, and RNA therapeutics201,202. Numerous studies have also shown that transcriptomic data can be leveraged for disease diagnosis203,204. By replacing or complementing conventional methods, RNA-seq can be utilized for precise diagnosis and to provide personalized treatment strategies. For example, Mayhew et al. aimed to detect acute infection from the expression of 29 marker host genes205. Although it is theoretically possible to diagnose infection using the host transcriptome, this approach has been impeded by the transcriptomic heterogeneity between patients. To overcome this limitation, the authors trained an MLP named IMX-BVN-1 using an infection dataset compiled from GEO and ArrayExpress. The model successfully diagnosed infections in an independent ICU dataset. This example shows that using a predefined set of marker genes allows the efficient development of a diagnostic model. However, the limited availability of comprehensive marker gene sets often obstructs this approach for various diseases. Moreover, utilizing the entire transcriptomic landscape, instead of using only a small fraction of the transcriptome, may unleash the potential of DL. For example, Comitani et al. developed a DL model for pediatric cancer classification from the expression of 18,010 genes and pseudogenes profiled by RNA-seq206. Transcriptional diversity in pediatric tumor tissue has been a major issue in transcriptome-based diagnosis. The authors adopted a self-supervised learning framework to overcome this challenge. First, they developed RACCOON, a framework for unsupervised tumor transcriptome clustering and identification. The tumor subtype hierarchy output by RACCOON was utilized to train OTTER, an ensemble of multiclass classifier CNNs. The accuracy of pediatric cancer type prediction by OTTER was 89%, and the predictions were temporally consistent. These examples underline the capability of DL models to generalize from copious amounts of high-dimensional transcriptomic data to yield clinically valuable predictions.

Multiple studies have developed DL models for clinical diagnosis by integrating transcriptomic data with other modalities, including genomic and proteomic data. This multimodal approach improves performance and generalizability by allowing models to generate more comprehensive representations of the biological states of patients. MOGONET is a multimodal and multiomics GNN for patient classification that integrates DNA methylation, mRNA expression, and miRNA expression profiles207. In MOGONET, separate GNNs generate initial prediction vectors from each of the three inputs, and the tensor product of the three prediction vectors is passed to an MLP for patient classification. MOGONET was validated against an Alzheimer’s disease diagnosis task, a glioma grade classification task, a kidney cancer type classification task, and a breast invasive carcinoma classification task. In addition to direct diagnosis, multimodal DL has also been utilized for biomarker discovery, a crucial component of diagnostic development. As another example, CoraL is a multimodal CNN that predicts disease associations of ncRNA-encoded small peptides (ncPEPs) and their originating short ORFs (sORFs) for cancer biomarker discovery (59). The model was trained with meta-learning and demonstrated the generalizability of DL-powered cancer biomarker discovery across various types of cancers. HE2RNA is another multimodal DL model for clinical RB that predicts the gene expression profile from histology images208. HE2RNA is an MLP that generates a patch-level transcriptomic representation from whole-slide images of tumor tissues. The gene expression levels were obtained from the TCGA database. In addition to predicting gene expression levels, HE2RNA can also generate spatial gene expression maps and predict microsatellite instability via transfer learning. These examples show that transcriptomic data can be integrated with other data modalities through DL to produce medical predictions.
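The cross-omics fusion step described for MOGONET can be sketched as a tensor (outer) product of the per-modality class-probability vectors, flattened and passed to a small classifier. The dimensions and the fusion head below are illustrative rather than the published configuration, and the random probability vectors stand in for the outputs of the modality-specific models.

```python
import torch
import torch.nn as nn

def fuse_predictions(p_meth: torch.Tensor, p_mrna: torch.Tensor,
                     p_mirna: torch.Tensor) -> torch.Tensor:
    """Outer product of three per-modality class-probability vectors.

    Each input has shape (batch, n_classes); the output is flattened to
    (batch, n_classes ** 3) and fed to a downstream classifier.
    """
    cross = torch.einsum("bi,bj,bk->bijk", p_meth, p_mrna, p_mirna)
    return cross.flatten(start_dim=1)

n_classes, batch = 5, 4
fusion_head = nn.Sequential(                        # illustrative final classifier
    nn.Linear(n_classes ** 3, 64), nn.ReLU(), nn.Linear(64, n_classes)
)
# Stand-in probability vectors; in practice these come from modality-specific models.
probs = [torch.softmax(torch.randn(batch, n_classes), dim=-1) for _ in range(3)]
logits = fusion_head(fuse_predictions(*probs))
print(logits.shape)                                  # torch.Size([4, 5])
```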

In addition to diagnosis, transcriptomic data have also been leveraged for prognosis prediction via DL. For example, Qiu et al. utilized RNA-seq data from TCGA to train an MLP that predicts patient survival from the expression level of 17,176 genes59. The authors adopted a meta-learning strategy to learn the parameter initialization, which allowed a few-shot training of the final model. This study demonstrated the utility of meta-learning for the few-shot survival prediction task by comparing it with regular pre-training and combined learning. Moreover, the feasibility of few-shot learning in transcriptome-based prognosis prediction was validated by demonstrating that the performance of the few-shot model was comparable to that of a many-shot model.

The success of mRNA vaccines amid the COVID-19 pandemic209 has drawn the attention of the pharmaceutical industry to RNA-based drugs. Beyond COVID-19 vaccines, RNA-based drugs offer promising therapeutic potential for preventing infections and treating various diseases, including cancer, cardiovascular diseases, and neurodegenerative diseases210. DL has also been utilized for RNA-based drug research, since it is already widely used in the pharmaceutical industry to aid in drug target discovery and drug design211,212. A central task in developing effective mRNA-based drugs is improving their stability, which is often a limiting factor in the global distribution of mRNA vaccines. Stanford OpenVaccine is a crowdsourced effort to develop a DL model that accurately predicts the stability of an arbitrary mRNA molecule213. The project was hosted on Kaggle, a Google platform for public DL competitions. The best solution outperformed previous machine learning models by using data augmentation and model ensembling214.

Gene editing, which has recently been approved for clinical use215, is another major field in RNA-based drug research. Several DL methods have been developed for CRISPR‒Cas editing systems, focusing on tasks such as sgRNA optimization and editing outcome prediction. DeepCRISPR utilizes unsupervised representation learning to train a denoising CNN that learns the representation of sgRNAs from sequence and epigenetic features216. The model was fine-tuned to predict on- and off-target effect profiles of sgRNAs in the CRISPR-SpCas9 knockout system. CRISPRon exhibited robust performance in gRNA efficiency prediction by training a CNN that uses protospacer and protospacer adjacent motif (PAM) sequences as input217. For CRISPR-based base editing, BE-DICT, a transformer-based encoder–decoder model, was developed to predict the probabilistic outcomes of CRISPR-SpCas9 base editors given the protospacer sequence218. In addition to CRISPR-Cas9, DL-based guide RNA optimization tools have been developed for CRISPR-Cas13d and prime editing219,220. RNAi drugs such as short hairpin RNAs (shRNAs) have also been employed in RNA therapeutics. shRNA drugs act by repressing target gene expression through complementary base pairing between the shRNA and its target mRNA. Embedding shRNAs into miRNA backbones greatly increases the degree of target gene repression because the endogenous miRNA processing pathway is utilized; these miRNA-embedded shRNAs are referred to as shRNAmirs. Predicting the targeting efficacy of shRNAmirs remains a key step in designing shRNAmir-based RNAi drugs. In this context, deep learning models that predict the efficacy of shRNAmirs have been developed, and one CNN-based model named shRNAI+ has achieved robust performance by using sequence and context features221. These studies suggest that DL can be utilized for designing and optimizing RNA-based therapeutics.
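As a small illustration of the first step of the CRISPR tools discussed above, the sketch below scans the forward strand of a DNA sequence for SpCas9 protospacer candidates, i.e., 20-nt windows immediately 5′ of an NGG PAM. A real design pipeline would also scan the reverse complement and then rank candidates with a trained efficiency predictor such as those described above; the example sequence is arbitrary.

```python
import re

def find_spcas9_candidates(dna: str, protospacer_len: int = 20):
    """Enumerate candidate SpCas9 protospacers on the forward strand.

    A candidate is the `protospacer_len` nucleotides immediately upstream of an
    NGG protospacer adjacent motif (PAM). Overlapping PAMs are all reported.
    """
    dna = dna.upper()
    candidates = []
    for m in re.finditer(r"(?=([ACGT]GG))", dna):
        pam_start = m.start(1)
        if pam_start >= protospacer_len:
            protospacer = dna[pam_start - protospacer_len:pam_start]
            candidates.append((pam_start - protospacer_len, protospacer, m.group(1)))
    return candidates

sequence = "GACCTTACGGATTCAGGTACCATGGCGTACGTAGCTAGGCTAAC"
for start, protospacer, pam in find_spcas9_candidates(sequence):
    print(start, protospacer, pam)
```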

Another important avenue in medical applications of RNA is RNA-targeting drugs. Gao et al. utilized DL to discover therapeutic targets for the splice-correcting drug BPN-15477222. They first selected potentially pathogenic splicing-altering mutations as candidate targets using SpliceAI. To identify drug targets among the candidates, they developed a CNN that predicts drug-induced changes in PSI, given the exon‒intron boundary sequences. Several targets predicted by the CNN were experimentally validated, providing potential therapeutic strategies for diseases, including Lynch syndrome, cystic fibrosis, and Wolman disease. Overall, numerous studies have demonstrated the merit of DL in various medicine-related RB tasks, positioning DL as a core asset in personalized and precision medicine (Table 7 and Supplementary Table 6).

Table 7 DL models for medical applications of RB.

Desiderata for employing deep learning for RNA biology

Numerous examples of successful DL applications in RB demonstrate the competence of DL in elucidating the mechanism of biological processes from large-scale experimental data. Yet, significant challenges remain. First, the imperfect metadata integrity of databases and the scarcity of independent benchmarks hinder the training of effective DL models. Indeed, most DL models in RB have not reached expert-level or experiment-level performance. Second, the difficulty in understanding the prediction of models often impedes the achievement of biological discoveries via DL. Third, the models should be able to integrate diverse modalities of RB data to gain a systematic understanding of the transcriptome. Nevertheless, researchers are actively addressing these challenges. The collective efforts of institutions and laboratories are improving the availability and quality of public RB data. Researchers are adopting novel techniques and integrating multiple modalities to construct models that can perform a wider range of tasks than previous models. In this section, we review the desiderata for exploiting the full potential of DL for RB research, together with prominent DL techniques and efforts to meet the challenges (Fig. 4).

Fig. 4: Desiderata for employing DL in RB.

First column: Integrating multimodal transcriptomic data and constructing well-curated databases are prerequisites for developing effective DL models for RB. Unified analysis pipelines and controlled metadata are desirable properties of public RB databases. Second column: Foundation models for RB can be pre-trained with genomic and transcriptomic corpora, functional genomics data, and gene expression data. These models can learn the generalizable knowledge of genetic systems and language. Employing techniques for enhancing the computational efficiency and interpretability of DL models would further improve the utility of DL in RB. Third column: The RB foundation model can perform a wide range of downstream tasks after fine-tuning, including comprehensive network analysis of the posttranscriptional regulatory network, genome-wide prediction of functional elements, and transcriptome-based precision medicine tasks. Fourth column: The released DL model should comply with safety standards, which can be achieved by ensuring fair representation of the population and protecting the privacy of study participants in the training dataset. Moreover, the released model should be benchmarked using standardized pre-compiled datasets. Parts of this figure were created with BioRender.com.

Multidimensional and well-curated databases

The generation of structured and labeled datasets is central to developing DL models for RB. These datasets are commonly constructed from public databases containing high-throughput experimental data. Therefore, constructing public databases that reflect the various dimensions of the transcriptome, including functional genomics, isoforms, and single-cell transcriptomics, is cardinal to developing DL models for the integrative and systematic study of the transcriptome. As reviewed in the previous section, ENCODE is a representative effort to construct a large-scale functional genomics database223. The ENCODE database is expanding to capture more diverse aspects of the transcriptome and genome. The current phase of ENCODE (ENCODE4) primarily focuses on increasing the diversity of TFs, RBPs, and cell types assayed. Moreover, to profile isoform expression across various cell types, long-read sequencing data, such as PacBio and Oxford Nanopore sequencing data, are included in the current phase. These efforts to increase the breadth and dimensions of the transcriptomic data have improved the utility of ENCODE for RB researchers. Single-cell transcriptomics is one of the most rapidly growing fields in RB, allowing spatiotemporal dissection of gene expression and regulation at the single-cell level. The Human Cell Atlas (HCA) is a representative single-cell transcriptomics database that aims to construct reference single-cell transcriptomes of diverse healthy human and mouse cells and currently contains data from more than 50 million cells224. The HCA played a vital role in early research during the COVID-19 pandemic, illuminating cell- and tissue-level pathology225. These public databases that increase the dimensions of RB data provide core assets for developing DL models in RB.

As reviewed in the previous sections, the metadata integrity and quality control (QC) of public databases are critical for filtering and labeling data during DL dataset construction. However, public databases often suffer from incomplete metadata and a scarcity of functional assay data. Databases therefore need to observe good standards of curation and maintenance. The fields and vocabularies of metadata should be strictly controlled for efficient filtering and labeling. Moreover, the data should be processed through unified pipelines with automated QC and active auditing to alleviate the burden of standardizing or normalizing data generated by disparate pipelines. For example, the ENCODE consortium operates the Data Coordination Center to define standard metadata fields, develop uniform data processing pipelines, and perform extensive data quality audits. Similarly, the HCA is developing standard operating procedures (SOPs) and standardized QC to coordinate the scRNA-seq data submitted by the research community. The efforts made by the ENCODE consortium and the HCA to maintain the quality of data and metadata provide notable examples of how to improve the utility of public RB databases for DL. By adopting standardized and automated QC measures, developing unified processing frameworks, and administering metadata formats and vocabularies, public databases will continue to function as a solid foundation for DL-powered RB research.

Independent benchmarks

Standardized and unbiased benchmarking is a fundamental component of DL research, but this practice is uncommon in RB. Standard benchmark datasets are essential for quantifying the contribution of a new architecture, algorithm, or training strategy, since study-specific benchmarks can contain unintentional bias favoring the authors' own model. Multiple standard benchmark datasets are widely used in conventional fields of DL. For example, ImageNet226 and MS-COCO227 are widely used in computer vision, while SQuAD228 and GLUE229 are commonly used in natural language processing. These benchmarks have enabled recent advances in DL by providing the means for fair and straightforward comparisons between models, allowing researchers to systematically evaluate the contribution of each novel architectural component and training technique. However, such standard benchmark datasets are not readily available in RB.

Individual studies have compared different DL models in RB using uniform criteria. These studies include benchmarks for nanopore sequencing basecalling230, RNA modification detection160, RBP binding site prediction128,231, miRNA–disease association prediction232, and gene expression prediction233,234. Notably, Sasse et al. and Huang et al. both benchmarked DL models for gene expression prediction and highlighted the generalizability issue. Enformer, Basenji2, ExPecto, and Xpresso all performed suboptimally for gene expression prediction in vivo. The authors also analyzed Enformer and showed that the model excessively relied on certain SNVs for expression level prediction. Also, Khan et al. performed an independent reevaluation of scBERT. The study showed that scBERT generalizes well to new independent datasets, but its performance is sensitive to class imbalance235.

These benchmark studies captured the strengths and weaknesses of the models that were not noted in the original studies, highlighting the importance of benchmarking. However, the lack of standard benchmark datasets prevents unbiased benchmarking from being adopted as a common practice in RB. Therefore, collective efforts should be directed toward developing standard benchmark datasets for various domains of RB.

Computational efficiency

One evident trend in DL is the exponential growth of the model size, which improved the performance and generalizability of the models236. However, larger models require high-end hardware237. The limited affordability of this hardware impedes the dissemination of DL in biology, especially in academic laboratories1. Moreover, larger models require longer computation times. This limitation is problematic for time-critical field applications of biological DL, such as intraoperative238 or intensive care unit (ICU)239 patient diagnosis. Therefore, to reduce hardware requirements and computational time, improving the efficiency of RB DL models is crucial.

Several DL models and algorithms have been developed to allow the training and inference of DL models in resource-limited environments such as mobile and embedded systems240. The most common model-agnostic strategy is to utilize low-precision data types to reduce memory usage and increase throughput. Mixed precision is a widely adopted technique following this strategy: a high-precision master copy of the weights is kept for parameter updates, while the forward and backward passes are computed in low precision during training241. Similarly, converting the model weights from floating-point numbers to integers, a technique termed quantization, is an actively investigated avenue for efficient inference242. Other ways to prepare a model for efficient inference include pruning, that is, discarding unnecessary parameters to reduce the model size without substantially impacting performance243.
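As a minimal PyTorch sketch of mixed-precision training (assuming a CUDA device), the loop below keeps the master weights in full precision while running the forward pass under autocast; the toy model and random data are placeholders for a real RB model and dataset.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4 * 100, 256), nn.ReLU(), nn.Linear(256, 1)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()               # rescales the loss to avoid FP16 underflow

for step in range(100):
    x = torch.randn(32, 4 * 100, device="cuda")    # e.g., flattened one-hot sequence windows
    y = torch.randn(32, 1, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                # forward pass runs in low precision
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()                  # gradients are computed on the scaled loss
    scaler.step(optimizer)                         # master (FP32) weights are updated
    scaler.update()                                # the scale factor adapts over time
```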

In addition to model-agnostic methods, various efforts have focused on designing lighter model architectures while maintaining performance. Notable examples include MobileNet244, SqueezeNet245, ShuffleNet246, and EfficientNet236. These CNN models utilize various techniques aimed at capturing local features with fewer parameters. Researchers can use these techniques to increase the efficiency of biological DL models. For example, LegNet utilizes an EfficientNetV2-inspired247 architecture to predict gene expression from genomic sequences by capturing short regulatory elements248.

Recently, transformer-based models have come to dominate multiple fields of DL research, making their computational burden more pronounced. The memory and computational costs of self-attention grow quadratically with the input length, which makes it difficult to apply the transformer architecture to long inputs. To address this issue, multiple approximation algorithms, such as the linear transformer (249), Performer (250), and Linformer (251), have been proposed to scale transformers linearly with respect to the input size. Although these algorithms are not without drawbacks, including numerical instability and constraints on the inputs, they can facilitate the adoption of the transformer architecture and its robust context-learning capability for RNA research. For example, scBERT utilizes the Performer architecture for cell type annotation from scRNA-seq data44. This design enables scBERT to receive an input of 16,000 genes, far longer than the inputs most language models can handle. In summary, leveraging architectures, algorithms, and techniques proven to improve the efficiency of model training and inference can facilitate the application of DL in RB, particularly in resource-limited settings such as clinical and field biology.
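As a concrete illustration of the scaling difference discussed above, the sketch below contrasts standard softmax attention with a kernelized variant that follows the linear-transformer idea of replacing the softmax with a positive feature map; it is not the exact random-feature mechanism used by Performer and scBERT, and the sequence length and dimensions are arbitrary.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    """Standard attention: the (n x n) score matrix makes cost quadratic in length n."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v):
    """Kernelized attention with phi(x) = elu(x) + 1, linear in sequence length."""
    qp, kp = F.elu(q) + 1, F.elu(k) + 1
    kv = kp.transpose(-2, -1) @ v                              # (d x d) summary, independent of n
    normalizer = qp @ kp.sum(dim=-2, keepdim=True).transpose(-2, -1)
    return (qp @ kv) / normalizer

n, d = 1024, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```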

Multimodality and structure

Humans can integrate diverse modes of information, such as audio, text, and vision, to produce a comprehensive understanding of the context. Recently, achieving such multimodality has emerged as a core task in DL, as it would improve the utility of models249. In RB, DL models can learn about biological processes and regulatory networks by integrating multiple biological modalities, including structural, evolutionary, genomic, epigenomic, transcriptomic, and proteomic data. There have been recent endeavors to apply multimodality to RB, including the studies reviewed in the previous section, such as CoraL, MOGONET, HE2RNA, DeepBind, Enformer, Aptardi, and PrismNet. For another example, Ashuach et al. developed a deep generative model, MultiVI, to integrate scATAC-seq, scRNA-seq, and surface protein expression data to generate a multiomics representation of a cell that can be used for data imputation250.

Structural data are an important modality that can be incorporated with transcriptomic data since structure is a determinant of various processes involving RNA251,252,253. These data can be leveraged for various DL tasks, including predictions of RNA‒RNA interactions, gene expression, RNA modification, and splicing. The successful utilization of structural data for RBP binding prediction demonstrates their utility. However, the scarcity of experimentally determined RNA structures often impedes the incorporation of RNA structures as a DL feature. Multiple studies have used computational predictions of RNA structures as inputs to overcome this limitation99,102,127,174, even though the structure predictions have not reached the accuracy of experimental methods. DL can increase the accuracy of RNA structure prediction, thereby improving its utility as an input feature (Table 8 and Supplementary Table 7). ARES utilizes geometric DL to score RNA tertiary structure candidates90, SPOT-RNA and MXfold2 utilize CNNs to predict secondary structure254,255, trRosettaRNA utilizes a transformer to predict tertiary structure256, and RoseTTAFoldNA utilizes a transformer to predict the 3D structure of protein–nucleic acid complexes179. With the increasing performance of deep RNA structure prediction models, it will be possible to include model-predicted RNA structures as input features for various RB tasks. The integration of diverse modalities, including RNA structure, is a prominent direction of DL research in RB that will allow a comprehensive and multiomics-level understanding of the transcriptome.

Table 8 Other DL models for RB.

Interpretability

DL models have long been considered black boxes, a term symbolizing the difficulty of interpreting them. Interpretation is the process of understanding the reasoning behind the decisions of a model257. Unlike simple regression models, DL models cannot be thoroughly interpreted by inspecting their weights. Despite this difficulty, interpreting DL models is an essential step when applying DL to the biological sciences, which pursue mechanistic explanations of biological processes. In addition, interpreting DL models can lead to the identification of novel biomarkers, drug candidates, and drug targets, which are important goals of medical biology211. Moreover, interpretation can contribute to identifying shortcomings and improving the performance of models.

Interpreting deep neural networks involves attributing the relevance of each feature or element to a task. The first approach to attribution is to perturb the input and then observe the change in the output. One example of this concept commonly used in computational biology is in silico saturation mutagenesis (ISM), in which each residue or nucleotide in a biological sequence is modified to every possible alternative, and the model quantifies the effect of each variant. For example, ISM was utilized with ExPecto to profile the effects of promoter-proximal variants on gene expression191. SHAP is another perturbation-based method that approximates results from all possible combinations of input elements to infer their individual contributions258. Janssens et al. and Almeida et al. both developed DL models to predict enhancer activity and interpreted the models with SHAP to assess nucleotide-wise contributions259,260.
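A minimal sketch of ISM is shown below; the `predict` function stands in for a trained model that maps a sequence to a scalar output, and the GC-content toy model is used only to make the example runnable.

```python
import numpy as np

BASES = "ACGU"

def in_silico_saturation_mutagenesis(seq, predict):
    """Score every single-nucleotide substitution of `seq` with a prediction function.

    Returns a (len(seq), 4) matrix of prediction changes relative to the reference.
    """
    reference = predict(seq)
    effects = np.zeros((len(seq), len(BASES)))
    for pos in range(len(seq)):
        for j, base in enumerate(BASES):
            if seq[pos] == base:
                continue                                  # reference base: effect stays 0
            mutant = seq[:pos] + base + seq[pos + 1:]
            effects[pos, j] = predict(mutant) - reference
    return effects

# Toy stand-in "model": GC content of the sequence.
effects = in_silico_saturation_mutagenesis(
    "AUGGCCUAA", lambda s: (s.count("G") + s.count("C")) / len(s)
)
print(effects.round(2))
```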

The other major class of attribution methods is the gradient-based approach, which approximates the element- and feature-wise attributions for each given example by propagating the gradient. Intuitively, the gradient indicates the change in prediction that would result from a change in the input element. Popular methods of this type include saliency map261, gradient * input262, integrated gradients263, layerwise relevance propagation264, and DeepLIFT265. Multiple DL studies in RB have adopted gradient-based attribution methods to draw biological hypotheses from the models. For example, MultiRM utilizes integrated gradients to calculate the contribution of motifs in RNA modification prediction76, SpliceRover utilizes DeepLIFT to calculate nucleotide-wise importance in splicing185, and Enformer utilizes gradient * input to show that the model captures the importance of enhancer sequences in gene expression132. Notably, four attribution methods, gradient * input, integrated gradients, DeepLIFT, and layerwise relevance propagation, were tested using Xpresso, and it was found that all methods yielded similar results in terms of capturing the role of promoter regions in gene expression192. In addition, for transformer-based models, the weights of the attention matrices are often directly interpreted to identify the important input positions. This technique was used in MultiRM to identify the nucleotides involved in RNA modification deposition76 and in scBERT to identify core genes for cell type annotation44. As described above, multiple studies have successfully applied interpretation techniques to RB models to produce biological hypotheses. Nevertheless, caution should be taken when drawing scientific hypotheses from model interpretations. Current interpretation techniques only provide limited information, such as the contribution of each input, instead of demonstrating the systematic knowledge learned by the model266. To overcome this limitation, researchers can utilize biological domain knowledge to integrate interpretation results and discover novel functional and regulatory mechanisms of the transcriptome.
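For the gradient-based family discussed above, a minimal gradient * input sketch over a one-hot encoded sequence is shown below; the tiny untrained model is a placeholder for a real RB model, and the flattened input format is an assumption made for brevity.

```python
import torch
import torch.nn as nn

def gradient_times_input(model, one_hot):
    """Gradient x input attribution for a (length, 4) one-hot encoded sequence."""
    x = one_hot.clone().requires_grad_(True)
    score = model(x.flatten()).sum()       # scalar output so backward() needs no extra args
    score.backward()
    return (x.grad * x).detach()           # element-wise product of gradient and input

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4 * 8, 16), nn.ReLU(), nn.Linear(16, 1))
one_hot = torch.eye(4)[torch.randint(0, 4, (8,))]   # random 8-nt one-hot sequence
print(gradient_times_input(model, one_hot))
```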

Foundation models

Developing a system that can solve complex problems based on the knowledge established from large-scale data has been a major goal in DL. In this regard, models with billions of parameters have been trained on billions or trillions of data points. These models, termed foundation models, have achieved unparalleled performance in a broad range of tasks without extensive task-specific training267. For instance, GPT-4 can describe input images in natural language and has excelled on AP and bar exams268,269. Foundation models are trained on large unlabeled datasets with self-supervised tasks. Representative tasks include causal language modeling, in which the model predicts the next word from the previous words270, and masked language modeling, in which the model regenerates masked-out portions of the original text43. Motivated by the success of foundation models in natural language processing, there have been attempts to develop foundation models for genomics and RB. Theoretically, a foundation model that has learned the language of biological sequences such as the genome, transcriptome, and proteome could systematically apply generalized knowledge to predict broad biological processes, including gene expression regulation, gene‒gene interactions, variant effects, and disease susceptibility.

There have been multiple efforts to develop foundation models for RB and genomics (Table 8 and Supplementary Table 7). DNABERT is one of the earliest attempts to construct such a model271. This model was pre-trained on the human genome sequence with a masked language modeling task in which DNA k-mers were regarded as words43. DNABERT exhibited robust performance in predicting promoters, splice sites, and TF binding sites after fine-tuning with a limited amount of labeled data. The success of DNABERT supports the principle that modeling the genetic language can solve various downstream genomics tasks. While DNABERT was trained on the genomic sequence, Geneformer was trained on the human single-cell transcriptome to learn tissue-specific knowledge of gene expression regulatory networks45. The model was trained on gene expression ranks extracted from 29.9 million human single-cell RNA-seq profiles, using a masked language modeling task in which genes were regarded as words. The trained model showed robust performance in gene dosage sensitivity, chromatin dynamics, and network dynamics tasks. Furthermore, the model identified candidate therapeutic targets for cardiomyopathy by modeling the pathology at the gene network level, suggesting the potential value of RNA foundation models in medicine.
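A toy sketch of the masked-language-modeling setup on overlapping DNA k-mers is shown below; the k-mer length and masking rate are illustrative, and the actual DNABERT pre-training involves additional details (for example, in how overlapping k-mers around a masked position are handled together).

```python
import random

def kmer_tokenize(dna: str, k: int = 6):
    """Split a DNA sequence into overlapping k-mer 'words'."""
    return [dna[i:i + k] for i in range(len(dna) - k + 1)]

def mask_tokens(tokens, mask_rate: float = 0.15, mask_token: str = "[MASK]"):
    """Randomly mask a fraction of tokens; the model must reconstruct the originals."""
    masked, targets = [], []
    for token in tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets.append(token)          # label for the masked position
        else:
            masked.append(token)
            targets.append(None)           # position ignored by the training loss
    return masked, targets

random.seed(1)
tokens = kmer_tokenize("ATGGCGTACGTTAGC", k=6)
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```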

While DNABERT and Geneformer were each pre-trained on a single type of data and task, DNAGPT was pre-trained on mammalian genomes using multiple tasks, including causal language modeling that regards DNA k-mers as words, binary classification of DNA sequence order, and numerical regression of guanine-cytosine (GC) content272. The model was able to predict gene expression and recognize functional regions such as poly(A) signals and translation initiation sites. The authors also showed that DNAGPT can generate artificial human genomes, although sophisticated methods to measure the quality of the generated genomes need to be devised to fully investigate the potential of artificial genome generation. BigRNA combines the predictions of multiple models, a technique called ensembling, to improve generalizability. This model was pre-trained to predict gene expression from genomic sequences and was fine-tuned to predict RBP and miRNA binding273. In contrast to other foundation models that are primarily pre-trained on DNA sequences, RNA-FM was pre-trained on noncoding RNA sequences using masked language modeling274. Notably, although structural data were not used for pre-training, RNA-FM was able to predict RNA structure in terms of secondary structure and 3D distances, in addition to functional features, including RBP and ribosome binding.

While multiple studies have made significant contributions toward developing foundation models in genomics and RB, the performance of these models and the diversity of their downstream tasks still need to be improved. Importantly, foundation models should exhibit emergent properties and solve novel biological problems that have not been answered by task-specific models. By integrating recent advances in network biology and systems biology275,276, it will be possible to develop foundation models that learn comprehensive knowledge of the biological language.

AI safety

As the capabilities of DL models expand at an accelerating pace, concerns regarding their safety are rising. One of the major concerns is racial and sex bias in the models, which often manifests in the form of representational bias and performance disparities267. For instance, if a dataset used to train a DL model underrepresents a specific subpopulation, the model will struggle to make accurate inferences about that group. This problem is not new in biomedical research, where the inclusion of women and minority groups has long been recognized as an essential goal to prevent racial and sex bias and to improve the generalizability of research outcomes277. Since numerous DL models for RB are trained with datasets constructed from public databases, it is vital to maintain the racial, ethnic, and sex diversity of ex vivo data in these databases. However, many public databases exhibit insufficient diversity278. For example, the vast majority of the GTEx v8 data are derived from individuals of European descent (85.3%), significantly underrepresenting the Asian population (1.4%), American Indian or Alaska Native populations (0.2%), and individuals identifying as Hispanic or Latino (1.9%)279. Moreover, the database underrepresents females (33.5%), introducing a source of potential sex bias. TCGA, another major source of ex vivo RNA-seq data, also suffers from the overrepresentation of White individuals (77%) and the underrepresentation of Asian (3%) and Hispanic (3%) individuals280. These imbalances in public databases can lead models to learn social biases against marginalized and underrepresented populations.

The most straightforward measure to mitigate social biases while training DL models is to resample the data to achieve balanced racial and sex distributions281. However, adopting this strategy is difficult since it severely reduces the amount of training data, which is already scarce in most fields of RB. Another strategy is to utilize data augmentation techniques to oversample underrepresented populations. However, this approach is uncommon because it can degrade model performance and introduce additional biases282. Therefore, the ideal solution would be to collect additional biological data from underrepresented populations to counteract the existing bias. The All of Us Research Program of the NIH is one example of such efforts283. This program aims to collect genomic and health data from more than one million participants, focusing on traditionally underrepresented populations. Although this program does not currently collect transcriptomic data, similar efforts can be made in RB to enhance the social impact of biological DL studies.

Protecting the privacy and security of human-derived biological data is another crucial task in AI safety. This issue is especially pronounced in RB, considering the extensive amount of sensitive biological and clinical information that can be mined from transcriptomic data284. The most straightforward way to prevent privacy breaches is to utilize only data that have been consented for public release, as is the practice of the ENCODE and GEO databases285. However, a substantial portion of clinical data does not come with such consent, which creates demand for privacy-preserving machine learning286. It is common practice to de-identify biomedical data before providing them for DL studies287. However, extensive de-identification can remove valuable information from the data, and there remains a risk of re-identification using existing techniques288. Employing fully homomorphic encryption can overcome these limitations and protect privacy289. Fully homomorphic encryption allows DL models to be trained directly on encrypted data because its defining property is that an arbitrary operation on encrypted data, followed by decryption, yields the same result as performing that operation on the unencrypted data290. Therefore, fully homomorphic encryption eliminates the need for extensive de-identification and allows the complete utilization of sensitive transcriptomic data for deep learning-powered RB research without privacy risks.

Another important technique for protecting privacy in DL is federated learning, in which multiple independent DL agents are trained locally and their weight updates are subsequently aggregated by a central agent291. Using federated learning, multiple institutions can collaborate without sharing sensitive transcriptome data outside each institution292,293. It is even possible to eliminate the need for a central server by using blockchain, further reducing the risk of privacy breaches294. In summary, adopting appropriate techniques can improve the safety of DL models in RB, expanding the range of usable training data and ensuring fair distribution of the potential benefits of RNA DL research across society.
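The core of the federated approach can be sketched as a federated-averaging loop in which each site trains a local copy of the model on its own data and only the resulting weights are sent back for averaging. The unweighted average and the toy linear model below are simplifying assumptions; practical federated learning typically weights clients by sample count and adds secure aggregation.

```python
import copy
import torch
import torch.nn as nn

def federated_average(global_model, client_loaders, rounds=3, local_steps=10, lr=1e-2):
    """Minimal FedAvg loop: local training at each site, weight averaging at the server."""
    for _ in range(rounds):
        client_states = []
        for loader in client_loaders:                        # one loader per institution
            local = copy.deepcopy(global_model)              # raw data never leaves the site
            optimizer = torch.optim.SGD(local.parameters(), lr=lr)
            for _, (x, y) in zip(range(local_steps), loader):
                optimizer.zero_grad()
                nn.functional.mse_loss(local(x), y).backward()
                optimizer.step()
            client_states.append(local.state_dict())
        averaged = {key: torch.stack([s[key] for s in client_states]).mean(dim=0)
                    for key in client_states[0]}             # server-side aggregation
        global_model.load_state_dict(averaged)
    return global_model

torch.manual_seed(0)
model = nn.Linear(10, 1)
clients = [[(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(10)] for _ in range(3)]
federated_average(model, clients)
print(model.weight.shape)
```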

Prospects and promises of deep learning in RNA biology

The exponential growth of big data in RB has opened new opportunities for employing DL in RNA research. DL has assisted in numerous discoveries and advances in the study of RNA-binding proteins, ncRNAs, the epitranscriptome, pre-mRNA processing, and gene expression. Moreover, DL models have leveraged RB data for medical applications, including diagnosis and drug development. Researchers have designed architectures to utilize biological data more effectively and have employed algorithms and techniques to counteract data limitations. These efforts have resulted in novel tools that assist and streamline RNA research.

Numerous DL studies reviewed in this article were made possible by public databases providing high-throughput experimental data. This fact highlights the central importance of public databases for employing DL in RB. The construction of better DL models for RB requires better datasets for training and benchmarking. Public databases need to focus on metadata integrity and QC, which can be achieved by developing unified analysis pipelines for each functional genomics assay and sequencing platform. Moreover, efforts should be directed toward compiling unbiased large-scale public datasets that can be used as standards for benchmarking RB models, which are central to engineering DL models.

Reflecting on the inspirational success of foundation models in understanding natural languages, it seems rational to expect that a DL model can be trained to decipher the genomic and transcriptomic language. Developing a foundation model for RB requires appropriate adaptations of self-supervised learning and multimodality paradigms for RB. Specifically, large-scale transcriptomic corpora, pre-training tasks for biological languages, downstream tasks for functional genomics, and standardized benchmarks have to be developed. While there have been various pioneering works in this direction, we believe that the paradigm-shifting potential of DL has yet to be fully realized in RB.

In summary, DL has shown immense potential in RB, and we expect it to play a significant role in unraveling the mysteries of RB in the future. Recently, DL has demonstrated its ability to guide the conception of novel scientific hypotheses295. With the ongoing technological advancements in functional genomics and the increasing availability of big data, researchers need to continue exploring the potential of DL in RB to accelerate and facilitate scientific discoveries.