Background

The implementation of zebrafish (Danio rerio) as an animal model to study human disease is growing at an unprecedented pace [1]. The applications span a wide range and include models for neurological disorders, aging, cancer, behavior, pharmacology, and toxicology, among others [2,3,4,5,6,7].

The fact that its embryo is transparent, placed zebrafish as one of the main vertebrate models to study developmental processes [8]. It has been shown that cellular and molecular events leading to and governing gastrulation, the formation of the primitive streak, and organogenesis in zebrafish show great parallels with mammals [9,10,11]. However, less is known about the differences and similarities between the female gametes.

Here, we sought to compare the transcriptome profile of the single matured human and unfertilized zebrafish oocytes at the time of ovulation. Our study shows that despite the significant evolutionary distance between humans and zebrafish, the mature female gametes of both species have significant similarities in gene expression.

Results

Gene expression by type

Our data analysis involve three single-cell RNA-seq datasets for the oocyte, each with three samples: zebrafish data generated by our group (ZF), human dataset 1 (H1) [12] and human dataset 2 (H2) [13]. In Fig. 1, we show the transcripts per million (TPM) distribution for each of the nine samples used in our analysis. As expected, most of the genes showed very low or no expression; on average 75, 65, and 45% of the genes had zero TPM, and 87, 80, and 61% of the genes had less than one TPM in the H1, H2, and ZF datasets, respectively. The smaller percentage of genes with little-to-no expression in zebrafish was due to the lower number of identified pseudogenes in the zebrafish genome, which tend to have low read assignments. In the supplementary data (Supplementary file 1), we break down the TPM distribution for each of the 9 samples (3 samples each coming from the 3 datasets) based on the 46 and 30 gene types described in human and zebrafish, respectively.

Fig. 1
figure 1

Transcripts per million (TPM) distribution for the nine samples used in our analysis: TPM values are divided into five intervals for each sample and the number of genes in each interval are shown. Biological replicates are indicated with lower case letters, a,b,c. Sample order follows the two human datasets (H1 and H2) followed by our zebrafish dataset (ZF)

About 88% of the gene abundance comes from protein-coding genes in human (90% for the H1 and 86% for the H2 datasets), whereas in zebrafish, this ratio is around 79%. In human, most of the noncoding gene abundance comes from mitochondrial ribosomal RNAs (Mt-rRNAs) and long intervening noncoding RNAs (lincRNAs). In zebrafish, the lincRNA abundance is less significant with most of the noncoding gene abundance coming from Mt-rRNAs and rRNAs (Supplementary file 1).

Orthologous gene expression

There are 18,388 orthologous gene pairs defined between the two organisms in the Ensembl database. These gene pairs involve many-to-many mappings, i.e., one human gene may be orthologous to more than one zebrafish gene; and there may be more than one human gene orthologous to the same zebrafish gene. The 18,388 orthologous gene pairs involve 13,963 human genes and 16,546 zebrafish genes. The Ensembl database further groups the orthologous gene pairs as high-confidence and low-confidence orthology. There are 9809 high-confidence orthologous gene pairs between the two organisms, and this mapping involves 9020 human genes and 9495 zebrafish genes. In Fig. 2, we summarize the types of genes involved in the orthology mapping and their confidence levels. Approximately 60% of the human protein-coding genes have an orthologous zebrafish gene.

Fig. 2
figure 2

Gene types that form an orthologous pair between human and zebrafish

In order to identify the expression of orthologous genes between the two organisms, we first identified genes that are “expressed” in a dataset as the genes that have a TPM value higher than one in all three biological replicates used in the dataset. This resulted in 5753, 9917, and 12,383 genes expressed in the H1, H2, and ZF datasets, respectively. There were 5443 genes common in the expressed gene lists for the two human datasets showing ~ 95% overlap between them. We then divided the expressed genes in each dataset into 10 quantiles, i.e., the first quantile consists of the top 10% of the most highly expressed genes in the dataset, etc. We compared the genes in each quantile across pairs of datasets, which we termed “quantile mapping.” In Fig. 3, we show the mapping results for each of the three pairwise comparisons; and in the supplementary data (Supplementary file 2), we show the genes in each of the cells shown in Fig. 3 with corresponding annotations, sample-level signal values, and z-scores. During the quantile mapping between the human and zebrafish datasets, we considered only the high-confidence orthologous genes retaining the cases that render many-to-many mappings as described above.

Fig. 3
figure 3

Quantile mapping between pairs of data sets: (a) H1 vs. ZF, (b) H2 vs. ZF, and (c) H1 vs. H2. For each mapping, a heatmap shows the number of common genes in each quantile. For across organism mappings (a and b), Row 11: genes that are expressed in zebrafish, have a high-confident orthologue in human, but are not expressed in human; Row 12: genes that are expressed in zebrafish but do not have a high-confidence orthologue in human; Column 11: genes that are expressed in human, have a high-confident orthologue in zebrafish, but are not expressed in zebrafish; Column 12: genes that are expressed in human but do not have a high-confidence orthologue in zebrafish. For H1-H2 mapping, Row/Column 11 identify the genes that are expressed in only one of the datasets. For each quantile, we also show the average TPM value shown in data value bars with a yellow background. In (d), we summarize the overlap between the top 30% of highly expressed (the 3 × 3 top-left corner of the quantile mappings in a and b) genes that are high-confidence orthologs across the two organisms for the H1 and H2 datasets

The quantile mapping between H1 and H2 shows that the 95% similarity between the two gene sets also follows the same TPM distribution as the large mapping numbers are observed along a diagonal (Fig. 3c). Therefore, not only do we see a high overlap among the genes expressed in the two human datasets, but these genes are also expressed at approximately the same relative levels in the two oocyte sets, underscoring the quality of the datasets. Our results for across organism mappings suggest that more than 50% of the genes expressed in the human oocyte have an orthologue that is also expressed in the zebrafish oocyte: 3174 for H1 and 5057 for H2 (data not shown). When only the high-confidence orthologs are considered, these numbers drop down to 2314 for H1 and 3657 for H2, accounting for ~ 40% of the genes expressed in the human oocytes (Fig. 3a, b). However, more importantly, these genes are concentrated on the top-left region of the quantile mapping heatmap. In other words, a higher percentage of the genes that are highly expressed in both organisms show high-confidence orthology compared to the lower expressed genes. For example, when H1 is compared to ZF, the 2314 high-confidence orthologous genes are distributed into 10 × 10 = 100 quantile mapping cells (Fig. 3a). Therefore, on average, we would expect ~ 23 genes to be in each cell for a random distribution. However, the very top-left cell, which represents the genes that are in the top 10% in both datasets and are high-confidence orthologs, for example, has 113 genes. This is a very significant occurrence (p < 10− 21, Fisher’s exact test) showing that high-confidence orthologous genes are concurrently highly expressed in the oocytes of the two organisms.

A similar observation holds for H2. Out of the 3657 genes expressed in H2 with a high-confidence ortholog in zebrafish that is also expressed in ZF, 151 are in the top 10% in the two organisms (p < 10− 25). This significance of occurrence does not just hold for the top-left cell in the quantile mapping but for the top-left region, as well. For example, if we focus on the top-left 3 × 3 corner of the quantile mapping results, i.e., high-confidence orthologous genes that are expressed in the top 30% in both of the organisms, we see 425 genes mapped for H1 (p < 10− 12) and 668 genes mapped for H2 (p < 10− 14). On the other hand, out of the genes that are expressed in the human oocyte and have a high-confidence ortholog in zebrafish (2812 for H1 and 4524 for H2; Fig. 3a, b), only about one-fifth are not expressed in the zebrafish oocyte (575 for H1 and 997 for H2; Fig. 3a, b). The genes that are expressed in the human oocyte and have a high-confidence ortholog in zebrafish comprise the total number of “unique” human genes in the quantile mapping that span Rows 1–10 and Columns 1–11. Among these, the unique human genes in Column 11 are the ones not expressed in zebrafish (Fig. 3a, b Supplementary file 2).

Highly concordant orthologous genes

The 425 and 668 genes that are high-confidence orthologs between the two organisms and appeared in the top 30% of the expression bracket for ZF as well as for H1 and H2 datasets, respectively, showed ~ 93% overlap, or 397 genes (Fig. 3d, Supplementary file 3). Based on the average TPM of the 9 samples, in Table 1 we show the top 25 of the 397 genes that we call “highly concordant orthologous genes.” In this table, we show only the top representative of a gene group, e.g., “mitochondrially encoded cytochrome c oxidase,” or “ribosomal protein.”

Table 1 Top 25 genes that are orthologous between human and zebrafish and expressed in the top 30% of all three data sets. Average TPM was calculated using all nine samples. For a gene family, e.g., “ribosomal proteins,” only the top representative is listed. The complete list of genes can be found in Supplementary file 3

In order to assess the similarity between the three datasets, we performed hierarchical clustering and principal components analysis (PCA) for the 9 samples using the 397 highly concordant orthologous genes. The results depicted in Fig. 4 show that the two human datasets are more similar to each other than they are to the zebrafish dataset. However, this similarity is not significantly different as the height of the hierarchical clustering branching between the two human datasets is almost as large as the branching between the human and zebrafish datasets. This is also evident in the PCA plot as the three datasets are almost equidistant from each other. Our ANOSIM analysis did not report significant difference between the pairs of datasets (R ~ 0.8, p < 0.1) while three-way comparison remained significant (R = 0.93, p < 0.005). A similar result was observed in the adonis analysis (pairwise R2 ~ 0.71, p < 0.1; three-way R2 = 0.87, p < 0.005). Although from a different organism, the distance between the zebrafish dataset and the two human datasets was not significantly different than the distance between the two human datasets. These results suggest that based on the highly concordant orthologous genes, zebrafish and human oocytes exhibit transcriptomic similarity as the expected organismal differences are not pronounced.

Fig. 4
figure 4

Sample similarity between the oocytes: (a) Hierarchical clustering and (b) principal components analysis (PCA) of the 9 samples using the 397 highly concordant orthologous genes. In (b), the percent variation explained by each PC is shown in parentheses

Functional analysis of the orthologous genes

We used Ingenuity® Pathway Analysis (IPA) (Ingenuity Systems, Redwood City, CA) to analyze the 397 highly concordant orthologous genes and investigated canonical pathways, downstream effects (functions), upstream regulators, regulator effects, and interaction networks. The complete IPA results are cataloged in the supplementary data (Supplementary file 3). In Fig. 5, we present the top members in each category along with associated functions, which is a summary generated by IPA consolidating the detailed categories with the highest significance presented in Supplementary file 3. In the supplementary data, we present the EIF2 signaling pathway, upstream regulator results for MYCN and HNF4A, along with their target molecules, and one gene interaction network highlighting genes involved in embryonic development (Supplementary Figures 1, 2, 3 and 4).

Fig. 5
figure 5

Summary of IPA results based on the 397 highly concordant orthologous genes. a, b Top Biofunctions and the most significantly enriched Canonical Pathways identified by IPA. Bars represent the number of genes in the functional category or the canonical pathway (primary y-axis) and the orange line represents the significance of the category or the pathway in -Log(p-value) (secondary y-axis). c Upstream regulators that target a significant portion of the genes in the input list. The inferred activation states of the regulators based on the observed expression of their targets are noted (e.g. an increased expression in targets that are induced by a regulator may imply an “activated” state for the regulator). N/A implies an inconclusive activation state of the regulator. d Number of genes and emerging biological functions in the deduced interaction networks that involve input genes. e Sets of regulators with a combined target gene set that show concordant enrichment in biological functions. Bars represent the total number of genes targeted by each set of regulators. On each bar, the biological functions that are significantly enriched by the target genes are noted

We also analyzed the 397 highly concordant orthologous genes using the EpiFactors database [14] to infer their roles in epigenetic regulation. In Table 2, we list the 36 genes that have been identified in EpiFactors as having an epigenetic function. In the supplementary data (Supplementary file 3), we list the detailed results of the EpiFactors analysis.

Table 2 Thirty-six out of the 397 highly concordant orthologous genes with an epigenetic function based on the EpiFactors database. Genes are sorted by decreasing abundance

Individual oocyte data set characterization

In order to identify functional similarity in the three datasets that is irrespective of orthology, we performed a comparative analysis at the systems level. For this purpose, we identified “highly expressed” genes in each dataset as the genes that have a z-score (based on logged TPM value of “expressed” genes) greater than 1.5 in two out of the three replicates in each study. This resulted in 460 H1, 761 H2, and 901 ZF genes (Supplementary file 4); and the two human datasets had 384 (~ 84%) highly expressed genes in common.

We analyzed each of the three highly expressed gene lists separately with the database for annotation, visualization and integrated discovery (DAVID v6.8) [15] to identify enriched Kyoto encyclopedia of genes and genomes (KEGG) pathways [16] and the biological process (BP), molecular function (MF), and cellular component (CC) gene ontology (GO) categories [17]. Detailed results are included in the supplementary data (Supplementary file 4). In Fig. 6, we list the KEGG pathway enrichment analysis results. Our results indicated that the two human datasets showed extreme similarity as expected; moreover, there was significant similarity between the zebrafish and human datasets as well. On average, about 65% of the significantly enriched categories in the zebrafish dataset were also significantly enriched in the human datasets.

Fig. 6
figure 6

Significantly enriched KEGG pathways based on DAVID functional analysis using genes that are highly expressed in the individual datasets. Bars represent the number of highly expressed genes in the pathway for each data set

Oocyte-specific gene expression

Although we observe significant concordance in highly expressed genes when orthologous genes are considered, it is possible that functionally important genes, e.g., genes critical in early development, may be expressed at lower levels in the oocyte. We had previously identified human oocyte-specific genes by comparing metaphase II oocytes with a reference consisting of a mixture of total RNA from 10 different normal human tissues not including the ovary [18]. These genes may be expressed at lower quantiles when all of the expressed genes in the oocyte are considered, but they may still have functional significance.

We explored the expression of those human oocyte-specific genes, which mapped to 3493 unique Ensemble Gene IDs, in all three datasets by identifying them on the quantile mapping described in Fig. 3a, b (Supplementary Figure 5, Supplementary file 5). Out of the 3493 human oocyte-specific genes, 2403 (~ 69%) and 3036 (~ 87%) were also expressed in H1 and H2, respectively. Of those 3493 human genes, 2864 (~ 82%) had a high confident ortholog in zebrafish and 2251 (~ 79%) of these 2864 genes were expressed in zebrafish (i.e. had a TPM value greater than 1 in all three replicates).

Our results indicated that the percentage of the orthologous genes, for the functionally important ones, significantly surpassed the percentage of the overall orthologous genes between the two organisms (~ 82% vs. ~ 60%). Furthermore, the functionally important genes that are orthologous are also concurrently highly expressed in the two organisms, which was true for the complete transcriptome as well. For example, the functionally important orthologous genes that are in the top 3 quantiles of expression levels in both organisms (i.e., the top 3 × 3 region of the heatmaps shown in Supplementary Figure 5) contained ~ 20% and ~ 24% of orthologous, functionally important genes expressed in both organisms for H1 and H2 respectively. Both observations were statistically significant (Fisher’s exact test p-values < 10− 14 and < 10− 16 for H1 and H2, respectively). Still, there were genes that were functionally important but were expressed at low levels in the two organisms e.g., orthologous genes that fall beyond the top 3 × 3 region of the heatmaps in Supplementary Figure 5. Nevertheless, although some at low levels, a large subset of the genes that are important for early development (over 80% on average) showed orthology and concordant expression between the two organisms.

Discussion

A comparative analysis between the zebrafish and the human genome showed high orthology between the two organisms at the gene sequence level [19]. We explored if this similarity was commonly observed at the transcriptome level and investigated the oocyte gene expression across the two organisms. We compared our zebrafish single-cell RNA-seq data to two published single-cell RNA-seq datasets from human oocytes. All three datasets have used three single cell metaphase II (MII) oocytes at the same stage, which are comparable across organisms, considered fully matured and ready to be fertilized. All three datasets have been obtained using standard RNA-seq protocols and our analysis began with the raw data obtained in the three experiments using the same computational workflow to prevent any variations. We found that majority of the genes expressed in the human oocyte have an ortholog that is also expressed in the zebrafish oocyte. However, when we divided the expressed genes in each data set into ten quantiles based on their level of expression, we found that the degree of concordance of both species increased significantly for highly expressed genes. Our results indicate that the number of high-confidence orthologous genes expressed in both organisms was about four times the number of high-confidence orthologous genes expressed in only one organism.

Among the genes expressed in the zebrafish egg, and not expressed in the human oocytes or with no orthologue in human, we found members of the family of the elongation of very long-chain fatty acid (Elovl) genes ELOVL1A, 8B, 7B, 6 L, 1B, and 4B; and those highly expressed were ORG, CLDND, CLDNG, and HMGB3B, in agreement with previous reports [20,21,22,23]. Also, interestingly, several pluripotency-associated genes appear specifically in the zebrafish egg, as NANOG, Pou5f3, GDF3, KLF2 and KLF5, opening a venue to use the zebrafish oocyte as a model for studying the role of maternal “pluripotency” genes that can have a significant role, but different than those described so far in mammals, during post-fertilization reprogramming [24].

We performed IPA analysis on the highly concordant orthologous genes and found that EIF2 signaling and oxidative phosphorylation were the top two activated “canonical pathways” based on statistical significance assessing their enrichment (p < 10− 20). For “upstream regulators,” the top ranked were the protooncogene MYCN (p < 10− 35), as activated, and RICTOR (p < 10− 34), as inhibited where the regulators are ranked by the significance of overlap between their targets and the input gene list to the IPA. Genes associated with regulation of cell cycle progression were involved in regulator effect networks analysis, which identify a congruent theme for the targets of upstream regulators that are among the input genes (the highly concordant orthologous genes). Furthermore, the “top active networks” identified by IPA (based on statistical significance of the involvement of the input genes in the networks using Fisher’s exact test) were RNA post-transcriptional modification and embryonic development (p < 10− 50). Based on our enrichment analysis using DAVID, “spliceosome,” “oxidative phosphorylation,” “RNA transport,” and “ribosome” were among the KEGG pathways common to all three datasets further underlining the functional similarity between the oocytes of the two organisms. Details of the IPA and DAVID functional analysis can be found in Supplementary files 3 and 4.

We used the EpiFactors database to identify genes that are known to be involved in epigenetic modifications; and we found a high number of the highly concordant orthologous genes to be listed (~ 10%). The list includes well characterized genes, such as DNMT1, PCNA, and UHRF1. Also listed was DPY30, which has recently been characterized in the context of embryonic development and required for pluripotency maintenance [25]. Other epigenetic chromatin reprogramming factors, such as TET3 digoxygenase or histone modifier factors (HDAC1, HDAC8, SIRT1, SIRT7, HAT1, or TRAM1), with a crucial role in the zygote following fertilization and nuclear reprogramming also appear listed, further confirming functional concordance between species [26, 27]. Similarly, the expression of MTA3 in all datasets is thought provoking. MTA3 belongs to the family of metastasis associated proteins, known for their role in cancer progression; however, there are no studies describing its role in gametes or embryonic development [28].

When we explored the distribution of oocyte-specific genes in order to assess the levels of expression and orthology for genes that may be important in early development, we observed an improved agreement between the two organisms. Both the percentage and expression levels of the orthologous genes for the oocyte-specific genes were higher than those observed when all of the genes were considered. Therefore, the amount and degree of agreement between the two organisms increased for functionally crucial genes, further validating the use of zebrafish as a model organism in developmental biology.

Conclusions

In order to understand the differences and similarities between human and zebrafish matured oocytes, we performed RNA-seq of individual zebrafish eggs and compared them to two different datasets of individual human oocytes at the same developmental stage. Our results indicate that the degree of orthology between the expressed genes in the two organisms is significantly larger for highly expressed genes. Moreover, the functional relevance based on gene expression in the two organisms’ oocytes shows high concordance. Our study compares the observed transcriptome of the metaphase II oocytes, including genes that are deposited in the oocytes for late usage during the development. Our results represent the similarity between the two organisms for the given stage and indicates not only high orthology but high functional similarity in their transcriptomes.

Despite the significant evolutionary distance between human and zebrafish, the mature female gametes of both species show significant similarities in gene expression. The results of these functional analyses underscore the use of zebrafish as a valuable animal model and provide evidence for future hypothesis-driven experiments related to germ cell development, gametogenesis, and epigenetic inheritance, among others. Future studies may involve comparative transcriptome analysis on other cell types.

Methods

Zebrafish maintenance

All experimental protocols were approved by the Michigan State University Institutional Animal Care and Use Committee, the Spanish Institutional Animal Use and Care Committee regulations and the Regional Andalusian Government (code A/ES/14/43). All animal methods were carried out in accordance with standard practices described in ‘The Zebrafish Book’ and by ZM Varga 2016 [29, 30]. The zebrafish used in this study were obtained from the Zebrafish International Resource Center (ZIRC; http://zebrafish.org). After the study, animals were euthanized according with our American Association for Accreditation of Laboratory Animal Care (AAALAC) approved protocol: Fish were euthanized in sodium bicarbonate buffered tricaine methanesulphonate (MS-222) at a concentration of 250 mg/L according to American Veterinary Medical Association (AVMA) guidelines on euthanasia. This is administered by adding directly to the water to which fish are acclimated. Solution was made fresh for every 10 fish to ensure that the dosage and buffering remain effective. Death was confirmed by absence of heartbeat and corneal reflex.

Zebrafish egg isolation

The F1 zebrafish, a cross between the Tübinguen and AB lines of zebrafish were used as egg donors. Only females proven to give rise to > 90% of fertilized eggs at 24 h post fertilization (hpf) were chosen and subsequently used for egg collection. The eggs were obtained from the same female to minimize variability. Equipment and reagents were prepared beforehand as previously described [31]. Females were transferred into a beaker with a sedation solution (0.02% w/v tricaine solution or MS222). When movement ceased, we rinsed the fish in clean water. Each female was set on a folded Kimwipe® paper in an upturned position so that the genital opening was easily accessible. The abdominal area surrounding the genitalia was dried with Kimwipe paper. A small amount of pressure was gently applied to two sides of the belly using a rounded-edge glass rod and a fingertip. Inactivated eggs were collected in Chinook salmon ovarian fluid (CSOF) to maintain the metaphase-arrested stage of the eggs [32]. The morphological appearance of each egg was examined, and only high-quality eggs were used for sequencing, as previously described [31,32,33]. Eggs were rinsed twice in 0.5% bovine serum albumin in Hank’s Balanced Salt Solution to wash off the CSOF. Then, the eggs were swiftly rinsed in RNAse-free phosphate buffered saline (PBS) three times and immediately transferred to a 1.5-ml tube –one oocyte per tube– with the least amount of PBS left, snap frozen using liquid nitrogen for 1 min of immersion, and then stored at − 80 °C until RNA isolation.

RNA-seq experimental procedure

RNA isolation and purification were performed using Direct-zolTM RNA MicroPrep according with manufacturer instructions (Zymo Research, Irvine, CA) on three individual oocytes. Briefly, frozen eggs were thawed at room temperature and resuspended in Tri Reagent, followed by addition of 95% ethanol, DNase I treatment, and centrifugation in an IC column [34]. Final RNA elution was performed using DNase/RNase-free water. Total RNA was processed for library construction by Cofactor Genomics (St. Louis, MO) according to the following procedure: Briefly, total RNA was reverse-transcribed using an Oligo (dT) primer, and limited cDNA amplification was performed using the SMARTer® Ultra® Low Input RNA Kit for Sequencing – v4 (Takara Bio USA, Inc., Mountain View, CA). The resulting full-length cDNA was fragmented and tagged, followed by limited PCR enrichment to generate the final cDNA sequencing library (Nextera® XT DNA Library Prep, Illumina, San Diego, CA). Libraries were sequenced as single-end 75 base pair reads on an Illumina NextSeq500 following the manufacturer’s instructions.

Human datasets

We used two single-cell RNA-seq datasets from human oocytes (labeled H1 and H2). For dataset H1 (Gene Expression Omnibus (GEO) accession number GSE44183), oocytes were obtained from female patients that were between 26 and 35 years old. Oocytes were vitrified and thawed at the time of RNA isolation. Only oocytes that had an intact cell membrane and zona pellucida were used (n = 3) [12]. For dataset H2 (GEO accession number GSE110798), oocytes were also obtained from women undergoing fertility treatments. RNA isolation was performed in in vivo matured oocytes immediately after follicle aspiration (n = 3) [13].

RNA-seq analysis

We generated single-end, single-cell RNA-seq data for three zebrafish oocytes (labeled ZF). The average read length was 75 base pairs (bp), and the average number of reads per sample was ~ 46.3 million. Both human datasets were processed using paired-end sequencing. The first dataset, H1, had an average read length of 90 bp and an average number of reads per sample of ~ 37.4 million (~ 18.7 million fragments) [12]. The average read length for H2 samples was 100 bp with an average number of reads per sample of ~ 59.9 million (~ 30.0 million fragments) [13].

We processed all three datasets using FastQC (v. 0.11.5) and Trimmomatic (v 0.38) [35, 36]. Overrepresented sequences due to experimental artifacts (e.g., adapters and similar technical sequences) were identified and subsequently removed. Trimmomatic was used in the palindrome mode, based on default alignment detection and scoring parameters. Low quality bases were trimmed using the maximum information quality filtering followed by a minimum average read quality threshold of 25. Following technical sequence and low-quality base removal, reads that were shorter than 40 bp were filtered out. Sequencing quality metrics for all three datasets (both for raw and trimmed+filtered reads) were assessed using FQStat, which resulted in passing values with respect to the default quality control thresholds identified therein [37].

Transcript quantification was done using Salmon (v. 0.8.2) with default parameters based on Ensembl database annotations [38, 39]. Salmon uses sample-specific models, such as correction for guanine-cytosine (GC) content bias, that improve the accuracy of transcription abundance estimates. We used transcripts per million (TPM) in Salmon’s output as the normalized relative abundance measure employed in our downstream analysis. Gene-level abundance estimates were obtained using the R package (tximport) as gene-level signal summaries have shown to be the more robust statistic [40].

Gene expression analysis

For each dataset, we identified genes that are “expressed” in a dataset as the genes that had a TPM value greater than 1 in all three replicates used in the dataset. In order to identify genes that are “highly expressed” in a dataset, we calculated the z-score of the genes (in a dataset) based on the logged TPM values of the “expressed” genes. If the gene’s z-score in a dataset exceeded 1.5 in at least two out of the three replicates, we identified the gene as “highly expressed” in this dataset. Orthologous genes between the two organisms were obtained from the Ensembl database along with the confidence assessment in orthology, which is a binary classification: 1 (high) or 0 (low).

Genes expressed in a human dataset were contrasted with the genes expressed in the zebrafish dataset using quantile mapping. Genes expressed in a human data set were divided into ten quantiles, and genes in each quantile were mapped through orthology to the ten expression quantiles identified for the genes expressed in the zebrafish dataset. For each human quantile, we further assessed the genes that did not have an orthologue in zebrafish and the genes that had an orthologue in zebrafish but were not identified as expressed in the zebrafish dataset. Similarly, for each zebrafish quantile, we identified genes that did not have an orthologue in human and genes that had an orthologue in human but were not identified as expressed in the human dataset. Quantile mapping was done for the two human datasets as well, in order to assess their similarity in gene expression.

Clustering of samples was done using the unweighted pair group method with arithmetic-mean (UPGMA) method with Pearson’s correlation as the distance measure [41]. The expression data matrix was row normalized prior to the application of average linkage clustering. Samples were also used for principal components analysis (PCA), which is a dimension reduction technique that represents samples along orthogonal PCs (axis) that decreasingly capture the variance in the underlying data [42]. Plotting the samples along the PCs reflect the closeness between the samples and inherently imply an alternative clustering view for the samples. Both the hierarchical clustering and PC analyses were done using MATLAB® (The MathWorks, Inc., Natick, MA). Analysis of similarity (ANOSIM) [43] and adonis (permutational multivariate analysis of variance using distance matrices) [44] tests were applied to PCA results using the vegan package (v. 2.3–5 and v. 2.4–2) in R (v. 3.5.3) to evaluate the differences between the three datasets.

Systems biology analysis

Gene sets of interest were analyzed using the database for annotation, visualization and integrated discovery (DAVID v6.8), the Ingenuity® pathway analysis (IPA) (Ingenuity Systems, Redwood City, CA), and the EpiFactors database [14, 15]. Functional analysis of the gene lists was done using DAVID based on the biological process (BP), molecular function (MF), and cellular component (CC) gene ontology (GO) categories and the Kyoto encyclopedia of genes and genomes (KEGG) pathways [15, 16]. The expression analysis systematic explorer (EASE) score was used to assess over represented categories that are biologically relevant and warrant further investigation. The EASE score is the upper bound of the distribution of jackknife iterative resampling of Fisher exact probabilities with Bonferroni multiple testing correction. Categories containing low numbers of genes are under weighted so that the EASE score is more robust than the Fisher exact test. The EASE score is a significance level with smaller EASE scores indicating increasing confidence in over representation. We picked GO categories and KEGG pathways that have EASE scores of 0.05 or lower as significantly overrepresented.

We employed IPA, which uses Ingenuity® Pathways Knowledge Base (IPKB), a manually curated knowledgebase based on scientific literature that involves biological interactions and functional annotations for genes and gene products. Given a gene list, IPA uses enrichment analysis-based approaches to identify canonical pathways, downstream effects, upstream regulators, regulator effects, and interaction networks that best explain the observed expression levels [45, 46]. In the canonical pathways and downstream effects analysis, known pathways or functions that involve a significant number of the input genes are identified. The upstream regulators approach identifies transcription factors, microRNAs, kinases, compounds, drugs, etc., that are known to regulate a significant portion of the input genes. The regulator effects analysis combines the upstream regulator and downstream effects analyses to develop a causal hypothesis combining upstream regulators with supposedly regulated input genes that are significantly represented in a biological function. Finally, the interaction network analysis identifies input (and other) genes that are highly connected based on literature and overlays a number of features, such as expression, function, drug, etc., regarding the genes in the network. IPA also calculates the “state” of the individual analysis target (e.g., “activating” or “inhibiting”) based on the observed expression changes and the known causal relationships from the literature.

In order to investigate the epigenetic involvement of the gene lists of interest, we cross-referenced them with the EpiFactors database. EpiFactors is a manually curated database that catalogues genes, proteins, and complexes that are involved in epigenetic regulation. Functional and relational annotation is provided for regulators and their targets, which may be used to associate gene expression patterns with epigenetic regulation.