Background

Gene expression profiles play an important role in biological research. Depicting the gene expression profile helps clarify the role and significance of genes in different diseases and promotes the development and application of drugs. Multiple high-throughput techniques have been used to obtain gene expression profiles, the most common being DNA microarrays and RNA-Seq, which is based on high-throughput sequencing (HTS). Microarrays have been used as a transcriptomics platform since the 1990s [1], while the high-throughput RNA-Seq method has emerged only in the past two decades [2]. Compared to microarrays, RNA-Seq has a higher dynamic range for the detection of low-abundance transcripts and does not necessarily rely on a reference genome, allowing for novel transcript and variant detection [3]. The emergence of HTS has led to the accumulation of a large amount of data and made it possible to depict gene expression maps more conveniently. Benefitting from these public data, many databases have been constructed for users to analyze gene expression profiles.

Exploring the gene expression profiles of experimental animals, such as mice and rats, helps clarify how well genes are conserved across species, which is an essential premise for the safe and effective translation of biological experiments to humans. Mice have a wide variety of strains and mature genome modification techniques, making them the most commonly used laboratory animals. However, in some models, rats are better suited for experimental studies than mice. In some respects, such as cognition and behavior, rats are closer to humans than mice. Compared with mice, rats are larger and their physiological characteristics are easier to study, which makes them suitable for research on behavior, cells, physiology, biochemistry, pharmacology and toxicology [4]. They are important model animals for understanding common human diseases and are widely used in studies of conditions such as hypertension, diabetes, breast cancer and neurological diseases.

There are many expression profile databases for humans and mice, such as The Human Protein Atlas (HPA, https://www.proteinatlas.org), Genotype-Tissue Expression (GTEx, https://gtexportal.org), RNA Seq Atlas [5], Mouse Gene Expression Database (MGD, https://www.informatics.jax), Mouse Phenome Database (MPD, https://phenome.jax.org), and Tabula Muris [6]. In contrast, few databases for querying rat gene expression have been reported. The rat BodyMap database provides RNA-Seq data from 11 organs of rats of both sexes at different ages [7]. The Rat Genome Database [8] (RGD, http://rgd.mcw.edu, 1999) was created at the Medical College of Wisconsin and collects data on rat genomics, genetics, physiology and more, but it does not provide expression profiles. Additionally, some comprehensive databases support queries across multiple species, such as ArrayExpress [9] and Gene Expression Omnibus (GEO) [10]. However, both databases are mainly used to store omics data. Microarray datasets in GEO can be analyzed directly online with GEO2R [10], but neither database offers tools for analyzing HTS datasets. Therefore, users without informatics skills cannot systematically analyze and compare gene expression profiles, particularly those from HTS data.

The analysis of RNA-Seq data generally relies on a target genome for read mapping and a gene transfer format (GTF) file for expression quantification. The latest genome versions, such as the human genome reference T2T-CHM13 [11], mouse genome reference GRCm39 [12], and rat genome reference mRatBN7.2 [13], contain more comprehensive gene information, which facilitates the discovery of previously unnoticed genes. In addition, gene-level quantification generally uses a GTF file containing gene models, with each model representing the structure of the transcripts produced by a given gene. Nevertheless, the rat gene expression profiles in the databases mentioned above are annotated using an earlier version of the rat genome reference and GTF file. As a result, genes are missing from the results when outdated or incomplete GTF files are used for gene quantification.

Since the latest rat genome reference mRatBN7.2 was published, there has not been a rat gene expression database satisfying the retrieval requirement. In addition, the current rat gene expression databases are focused on tissues and lack expression information on primary cells and cell lines. To address this need, we constructed the Omic Horizon Expression Database, which facilitates gene expression profiling and further functional clues at multiple levels.

Construction and content

Data collection

All HTS dataset information about rat samples was obtained by searching GEO DataSets [10] (https://www.ncbi.nlm.nih.gov/geo/) according to processes similar to those described previously [14, 15]. Samples deposited up to March 2022 were included. RNA-Seq data from Rattus norvegicus, with the RNA being total RNA or polyA RNA, were retained. The tissue and cell sources were manually annotated based on sample characteristics and sorted in ascending order. Then, the top several samples of each type of tissue and cell were selected. The selected samples account for approximately 10% of the total. Because the dataset information was associated with the Sequence Read Archive (SRA), the SRA run (SRR) IDs were extracted from the selected dataset information. Based on these accession numbers, RNA-Seq data were downloaded via SRAtools from the SRA database (https://www.ncbi.nlm.nih.gov/sra/) and used for subsequent analysis.

Data processing

The SRA-format files were converted to FASTQ format with the fastq-dump command. The program fastp [16] (https://github.com/OpenGene/fastp) was used for quality control of raw reads according to the standard pipeline. Reads were aligned to the rat reference genome mRatBN7.2 with the STAR program [17]. The GTF file (v105) was downloaded from the Ensembl website (http://ftp.ensembl.org/pub/release-105/gtf/rattus_norvegicus/). The rat reference sequences (https://ftp.ncbi.nlm.nih.gov/refseq/R_norvegicus/) were used to aid gene symbol annotation. Raw counts for each gene were first calculated with featureCounts [18], and the transcripts per million (TPM) values (see below) were then calculated and used for the subsequent analysis. During this process, samples with fewer than 5,000 detected genes were removed.
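The sample filter described above can be illustrated with a minimal Python sketch. The text does not state how a "detected" gene is defined, so the sketch assumes a gene is detected when its raw count is at least 1; the function names, the `min_count` parameter and the toy data are hypothetical and are not taken from the authors' pipeline.

```python
def count_detected_genes(gene_counts, min_count=1):
    """Count genes whose raw read count reaches min_count.

    Assumption: 'detected' means raw count >= min_count (default 1).
    """
    return sum(1 for count in gene_counts.values() if count >= min_count)


def filter_samples(samples, min_genes=5000, min_count=1):
    """Keep only samples with at least min_genes detected genes."""
    return {
        sample_id: gene_counts
        for sample_id, gene_counts in samples.items()
        if count_detected_genes(gene_counts, min_count) >= min_genes
    }


# Toy data with a lowered threshold so the effect is visible.
toy_samples = {
    "sample_A": {"gene1": 12, "gene2": 0, "gene3": 7},
    "sample_B": {"gene1": 0, "gene2": 0, "gene3": 3},
}
kept = filter_samples(toy_samples, min_genes=2)
print(sorted(kept))
```

With the default `min_genes=5000`, the same call would correspond to the threshold stated in the text.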

TPM value calculation

To eliminate the effects of sequencing depth and gene length, we used the TPM value to measure the gene expression level. A TPM value is calculated by the following formula: \(\text{TPM}_i = \frac{N_i / L_i}{\sum_{j=1}^{n} N_j / L_j} \times 10^{6}\), where \(N_i\) indicates the count of reads mapping to gene i, \(L_i\) indicates the total exon length of gene i, and n is the number of genes.
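As an illustration of the TPM definition above, the following Python sketch computes TPM values from per-gene read counts and exon lengths. It is a minimal example under stated assumptions, not the authors' implementation: the function name `compute_tpm` and the toy inputs are hypothetical, and the all-zero case is handled by returning zeros, which the text does not specify.

```python
def compute_tpm(counts, lengths):
    """Return TPM per gene from read counts N_i and exon lengths L_i.

    counts and lengths are dicts keyed by gene ID. The length unit
    (bases or kilobases) cancels out in the ratio.
    """
    # Length-normalized rate N_i / L_i for every gene.
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        # Assumption: a sample with no mapped reads gets TPM 0 everywhere.
        return {gene: 0.0 for gene in counts}
    # Scale so that the values of one sample sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


toy_counts = {"geneA": 10, "geneB": 20, "geneC": 30}
toy_lengths = {"geneA": 1000, "geneB": 2000, "geneC": 1000}
toy_tpm = compute_tpm(toy_counts, toy_lengths)
for gene, value in toy_tpm.items():
    print(gene, round(value, 1))
```

Because every sample is scaled to the same total, TPM values of a gene can be compared across samples with different sequencing depths.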

Construction of the database

The Omic Horizon Expression database follows the MVT (Model/View/Template) design pattern and is implemented with the Python web framework Django. The data are stored and managed in a MySQL relational database (version 8.0.17). Box plots are rendered with Highcharts. Omic Horizon Expression runs on an Apache web server (version 2.4.54) and accesses the database using mysqlclient.

Molecular cloning of Lilrb1 and Lilrb3

Male Sprague-Dawley rats were obtained from Peking University Health Science Center, Department of Laboratory Animal Science. Rats were euthanized with carbon dioxide. Tissues were isolated and temporarily stored in TRIzol reagent (Trans#ET111-01). Total RNA was extracted from the tissues and reverse transcribed into cDNA. The genes were cloned using nested polymerase chain reaction (nested-PCR), and the primers are shown in Table 1. The outer primers were labeled F1 and R1 for the forward and reverse primers, respectively, while the inner primers were labeled F2 and R2.

Table 1 The primers for molecular cloning

Utility and discussion

Basic description of the database

In this study, we reanalyzed the sequencing data, mapped the reads to the reference genome mRatBN7.2, and then quantified gene expression using the v105 version of the GTF file. A new database for rat gene expression, the Omic Horizon Expression Database (abbreviated as OmicHorizon@Expression), was constructed (Fig. 1), covering the majority of tissues and cells in publicly available HTS data for rats. To date, it contains 2,762 samples in terms of sequencing runs (Additional file 1), which are categorized into 48 tissues, 40 primary cell types, and 25 cell lines. In addition, some tissues were further divided into subtissues according to the anatomical site, while some primary cells were described in detail based on their tissue sources. A summary of the datasets collected in the database is shown in Table 2, and the cell lines are described in Table 3. The Omic Horizon Expression database allows users to query the expression profiles of over 30,000 rat genes, which are indexed by Ensembl gene IDs, and supports requests using gene symbols (or aliases), Ensembl gene IDs and Entrez gene IDs.

Fig. 1

Workflow of the Omic Horizon Expression database development. The Omic Horizon Expression database development includes three major steps: data collection, data processing and database construction. The key points of each step are shown in frames with surrounding information

Table 2 The RNA-Seq sample information stored in the Omic Horizon Expression database
Table 3 Description of rat cell lines

Query of the database

The interface is simple and easy to understand (Fig. 2a,b). On the homepage, users first select a search type (tissue, primary cell or cell line) from the drop-down menu. They then select a tissue or cell type of interest, after which a new drop-down menu appears for further selection of subtissues or cell sources. When ‘All’ is selected, all tissues or cells under that layer are queried. Finally, users enter the gene symbol or gene ID and click the ‘GO’ button to open the result page for the gene expression profile. Note that not all labels on the x-axis are displayed when too many tissues or subtissues are retrieved, for example, when ‘All’ is selected in the first step. In this situation, the legend option lets users refine the plot by selecting the legends of interest from the drop-down list box (Fig. 2c). The full retrieved results are also shown in a downloadable table.

Fig. 2

A simplified example of navigating the database. (a) The page to enter the OmicHorizon@Expression database. (b) The interface of the OmicHorizon@Expression database. (c) The result page of the queried gene expression profiles includes three sections: brief information, the expression profile shown as box plots, and a downloadable data table. The data table can be sorted by median. In the second section, the legend option helps users select tissues and cells of interest from the box plots to refine the plot; the reset button restores the initial display

Understand the query result

The results page includes three sections (Fig. 2c). The first section gives basic information on the queried gene, such as the gene description, gene aliases, gene IDs, gene biotype and chromosome location. The second section shows the gene expression profile as box plots. In each plot, the x-axis represents the queried tissues, primary cells or cell lines, while the y-axis represents the expression level in TPM. A box plot consists of five data nodes, listed here from largest to smallest: the upper edge (Q3 + 1.5*IQR), Q3 (75th percentile), median (50th percentile), Q1 (25th percentile) and lower edge (Q1 - 1.5*IQR), where IQR is the interquartile range (Q3 - Q1). These nodes are calculated for each profile. Values that fall outside the range between the lower and upper edges are identified as outliers and are shown as circles in the plot. When there are no outliers, the upper and lower edges are set to the maximum and minimum values, respectively. Each value in the box plots represents a sample source. The third section is a table showing the TPM values of the queried gene. The box plot and table files are downloadable.
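The five box plot nodes described above can be sketched in Python as follows. This is an illustrative example only: the function name `box_plot_nodes` is hypothetical, and the quartile interpolation used by the plotting library on the website is not stated in the text, so the sketch assumes the "inclusive" method of Python's `statistics.quantiles`, which may give slightly different quartiles than the website.

```python
from statistics import quantiles


def box_plot_nodes(values):
    """Return the five box plot nodes and the outliers for one profile.

    Assumption: quartiles use the 'inclusive' interpolation method.
    Requires at least two values.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    upper_edge = q3 + 1.5 * iqr
    lower_edge = q1 - 1.5 * iqr
    # Values beyond the edges are treated as outliers.
    outliers = [v for v in values if v > upper_edge or v < lower_edge]
    if not outliers:
        # Without outliers, the edges collapse to the observed extremes.
        upper_edge, lower_edge = max(values), min(values)
    return {
        "upper_edge": upper_edge,
        "q3": q3,
        "median": median,
        "q1": q1,
        "lower_edge": lower_edge,
        "outliers": outliers,
    }


no_outlier_profile = box_plot_nodes([1, 2, 3, 4, 5])
outlier_profile = box_plot_nodes([1, 2, 3, 4, 100])
print(no_outlier_profile)
print(outlier_profile)
```

In the second toy profile, the value 100 lies above Q3 + 1.5*IQR and is therefore reported as an outlier rather than stretching the upper edge.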

Functions of the database

The Omic Horizon Expression Database provides gene expression profiles in multiple tissues and subtissues, which can offer clues to gene function. The “subtissue” catalog in this database provides a more detailed expression profile. For example, the “brain” tissues comprise subtissues from different brain regions, such as the amygdala, anterior pituitary, forebrain, superficial zone and thalamus; the “breast” tissues are divided into the subtissues “basal fractions” and “luminal fractions” based on sample location. There are a total of 78 subtissues, which are mainly derived from the brain, kidney and muscle tissues. This function offers multiple options for the retrieval and presentation of gene expression.

Expression profiles in primary cells and cell lines help in planning further functional studies, such as gene overexpression, knockdown or knockout. In the current database version, the primary cells are derived from a total of 33 tissues and subtissues. Some primary cells have multiple sources, such as macrophages from bone marrow, brain, lung and peritoneum. Therefore, differential expression can be analyzed among different tissue sources of the same cell type.

The next three sections present application cases that show how the Omic Horizon Expression database can be used to address biological questions based on gene expression profiles.

Application case 1: conventional gene expression profiling

This is the general use case: showing differential gene expression across various tissues, primary cells and cell lines. For example, cytokine-like protein 1 (CYTL1), also called protein C17, is a secretory protein originally identified in human CD34+ cells with predicted structural characteristics of cytokines and interleukins [19]. Sequence alignment showed that CYTL1 genes are conserved in humans, rats and mice [20]. Studies have shown that human CYTL1 is highly expressed in the aorta, placenta, and trachea [20] and that mouse Cytl1 is highly expressed in the cartilage, trachea, lung, and heart [21]. Using OmicHorizon@Expression, we found that rat Cytl1 is highly expressed in cartilage and aorta (Fig. 3a), indicating conserved expression among these species and further suggesting a conserved functional role. The queried expression values of Cytl1 can be downloaded, which allows users to generate expression profiles for their tissues of interest (Fig. 3b). In primary cells, Cytl1 showed relatively high expression in rat germ cells, breast epithelial cells and aortic smooth muscle cells (Fig. 3c).

Fig. 3

Expression profile of rat Cytl1. (a) Cytl1 expression profiles in rat tissues derived from OmicHorizon@Expression. (b) Cytl1 expression profiles in rat tissues of interest, plotted locally from expression values downloaded from OmicHorizon@Expression. (c) Cytl1 expression profiles in primary cells derived from OmicHorizon@Expression.

Application case 2: species-specific gene expression profiling

Given the large number of gene expression databases for humans and mice, comparing them with the current rat database makes it possible to determine whether a gene shows species-specific expression. For example, lysozyme G-like 1 (LYG1) is a classical secretory protein identified through immunogenomics and belongs to the lysozyme G family [22]. The Omic Horizon Expression database shows that rat Lyg1 is highly expressed in the gingiva (Fig. 4a). In contrast, other gene expression databases indicate that human LYG1 is highly expressed in the kidney (Fig. 4b), while mouse Lyg1 is highly expressed in the stomach (Fig. 4c). These differences in the expression profiles suggest that the LYG1 gene may be evolving actively, which may result in functional divergence among species.

Fig. 4

Expression profile of LYG1 genes. (a) Lyg1 expression profiles in rat tissues derived from OmicHorizon@Expression. (b) LYG1 expression profiles in human tissues derived from GTEx. (c) Lyg1 expression profiles in mouse tissues derived from the mouse ENCODE project

Application case 3: novel gene identification

The database uses the latest genome and the corresponding GTF version, which is useful for identifying novel genes. Immune checkpoint therapy (ICT), which is designed to block inhibitory signals mediated by immune checkpoint molecules, such as PD-1 and CTLA-4, has revolutionized the field of cancer immunotherapy because of its clinical success in many cancers [23]. However, only a minority of patients benefit from current immune checkpoint inhibitors, highlighting the need to identify novel drug targets, including promising new inhibitory receptors. Leukocyte immunoglobulin-like receptor subfamily B (LILRB) proteins (LILRBs 1–5) contain cytoplasmic immunoreceptor tyrosine-based inhibitory motifs (ITIMs) and transduce a negative signal in multiple cell types in the tumor microenvironment, providing novel opportunities for anti-cancer immunotherapy [24, 25]. Identifying the homologous genes of LILRBs in other species, such as mice and rats, contributes to their functional studies, target validation and drug development. However, the homologous genes of LILRBs in mice and rats have not been fully identified. In the NCBI reference gene and sequence database, rat Lilrb2 and Lilrb4 have known mRNA sequences, whereas rat Lilrb1 and Lilrb3 are present only as genes predicted by automated computational annotation. In addition, rat Lilrb1 and Lilrb3 genes are not yet recorded in the UCSC Genome Browser (http://genome.ucsc.edu) and are not currently searchable there. Lilrb1 is also missing from the previous rat GTF versions. This prompted us to use HTS data combined with PCR to verify the existence and exon structures of Lilrb1 and Lilrb3 in the rat genome (Fig. 5).

Fig. 5

Identification of rat Lilrb1 and Lilrb3 genes. (a) HTS data provide transcriptional evidence for the rat genes Lilrb1 and Lilrb3. The reference genome mRatBN7.2 was used for read alignment, and the result was viewed in the Integrative Genomics Viewer (IGV) browser. The sequencing data used for Lilrb1 were merged from 30 samples that were randomly selected from 135 bone marrow samples. The sequencing data used for Lilrb3 were merged from 35 spleen samples. (b) Lilrb1 expression profiles in rat tissues derived from OmicHorizon@Expression. (c) Validation of Lilrb1 and Lilrb3 expression in rat tissues by nested-PCR. Tissues are indicated by the numbers as shown. (d) The genomic structure of Lilrb1 and Lilrb3 based on the amplified sequences. Exon numbers are indicated. The sequencing results were confirmed to be correct and supported by multiple clones. (e) Lilrb3 expression profiles in rat tissues derived from OmicHorizon@Expression. (f) Sketch maps of the domains and motifs of LILRB1 and LILRB3 in humans and rats. The human protein motifs are based on the annotation in the UniProt database (https://www.uniprot.org/), while the rat protein motifs are predicted from the SMART web server (http://smart.embl-heidelberg.de/) and protein alignment

HTS data from rat tissues provide transcriptional evidence for the rat genes Lilrb1 and Lilrb3 (Fig. 5a). The Omic Horizon Expression database revealed that Lilrb1 is highly expressed in bone marrow, spleen, blood, bone, and tendons (Fig. 5b). We successfully cloned Lilrb1 from the spleen (Fig. 5c). One Lilrb1 transcript containing a complete open reading frame (ORF) was obtained (Fig. 5d & Additional file 2). However, the sequence similarity between the cloned Lilrb1 transcript and the predicted transcript in the NCBI GenBank database is 91% at the nucleotide level (Additional file 3) and 85% at the amino acid level. The sequence inconsistencies may be due to differences between animal strains, as the NCBI reference genome is derived from the Brown Norway rat whereas Sprague-Dawley rats were used in our experiment, suggesting that Lilrb1 may be evolving actively.

Similarly, the Omic Horizon Expression database shows that Lilrb3 is highly expressed in the spleen, liver, thymus, bone and ear (Fig. 5e). We successfully cloned Lilrb3 from the spleen and liver and confirmed the existence of rat Lilrb3 (Fig. 5c). Lilrb3 has three transcript variants, Lilrb3-v1, -v2 and -v3, with different exon combinations (Fig. 5d & Additional file 4). The similarities between these cloned transcripts and the predicted transcripts of Lilrb3 in the NCBI GenBank database are all 100% at both the nucleotide and protein levels.

Interestingly, sequence alignments based on rat Lilrb1 and Lilrb3 mRNA sequences further confirm the loss of both genes in mice, because the deduced coding regions in the homologous mouse genomic region contain multiple premature stop codons. The protein similarities between human and rat LILRB1 and LILRB3 are 32.5% and 40.6%, respectively (Additional file 5), with similar domains and motifs (Fig. 5f). Therefore, the successful identification of rat Lilrb1 and Lilrb3 will contribute to functional studies based on rat models.

Conclusions

In this study, we constructed a new gene expression database named the Omic Horizon Expression database. It supports gene expression profiling and comparison in 48 tissues, 40 primary cell types and 25 cell lines via queries by gene symbol (or alias), Ensembl gene ID and Entrez gene ID for a total of 30,560 rat genes. Through the database, we verified the absence of Lilrb1 and Lilrb3 in the mouse genome and their presence in the rat genome, which provides a basis for functional studies and drug development in rat models. In the current version, only rat data can be queried. However, additional species will be considered in the future, especially for primary cell and cell line data, which receive less attention in other existing databases. In addition, because GTF files with new gene annotations are updated more frequently than the reference genome, and because the accumulation of public data will add tissue and cell types not yet covered in the current study, we will update the database regularly.