Integrating Iso-seq and RNA-seq data for the reannotation of the greater amberjack genome

Zhao, Yuanli; Chen, Zonggui; Hu, Meidi; Liu, Hairong; Zhao, Haiping; Huang, Yang; Jiang, Mouyan; Li, Shengkang; Li, Guangli; Zhu, Chunhua; Hu, Wei; Luo, Daji

doi:10.1038/s41597-024-03495-7


Data Descriptor
Open access
Published: 22 June 2024

Volume 11, article number 675 (2024)

Scientific Data


Yuanli Zhao¹^na1,
Zonggui Chen^1,2^na1,
Meidi Hu ORCID: orcid.org/0000-0001-6177-4606^1,3^na1,
Hairong Liu¹,
Haiping Zhao¹,
Yang Huang^4,5,
Mouyan Jiang⁵,
Shengkang Li⁶,
Guangli Li⁵,
Chunhua Zhu^4,5,
Wei Hu¹ &
Daji Luo ORCID: orcid.org/0000-0002-8051-9624^1,3


Abstract

The greater amberjack (Seriola dumerili) is an important fishery species with high commercial value, and it is distributed worldwide. Transcriptome-based studies on S. dumerili have been limited by an inadequate reference genome and a lack of well-annotated full-length transcripts. In this study, a total of 12 tissues from juvenile and adult fish of both sexes were collected for next-generation RNA sequencing (RNA-seq) and full-length isoform sequencing (Iso-seq). For Iso-seq, a total of 163,218, 149,716, and 189,169 high-quality unique transcript sequences were obtained, with an N50 of 5,441, 5,255, and 5,939, from juvenile, adult male and adult female S. dumerili, respectively. We integrated the Iso-seq and RNA-seq data to construct a comprehensive gene annotation and systematically profiled the dynamics of gene expression across the 12 tissues. Our gene models had greater detail and accuracy than those from NCBI and Ensembl, with more precise polyA locations. These resources serve as a foundation for functional genomic studies and provide valuable insights into the molecular mechanisms underlying the development, reproduction and commercial traits of amberjack.


Background & Summary

The greater amberjack (Seriola dumerili), which is known for its rapid growth and superior flesh quality, is an important aquatic species with high commercial value, and it is widely distributed around the world¹. Furthermore, S. dumerili is known to have higher contents of health-promoting fatty acids, such as docosahexaenoic acid (DHA), eicosapentaenoic acid (EPA), and ω-3 polyunsaturated fatty acids (ω-3 PUFAs), than other fish species, making it a preferable choice for a health-conscious diet². Recently, the genomes of S. dumerili³ and three other Seriola species, Japanese amberjack (S. quinqueradiata)⁴, Almaco jack (S. rivoliana)⁵ and yellowtail amberjack (S. lalandi)⁶, have been successfully sequenced and assembled. Deep sequencing has revolutionized the fields of biology and aquaculture, offering isoform precision and high-throughput capabilities for understanding candidate genes that control commercial traits⁷. However, the greater amberjack genome has so far been assembled only at the scaffold level³, and the available gene models in this genome are mostly derived from computational prediction, which can be incomplete or even inaccurate. This incomplete gene annotation and transcriptome information limits investigation of the molecular mechanisms involved in various biological processes in the greater amberjack.

Current gene annotations for the greater amberjack, found in databases such as NCBI and Ensembl, primarily rely on in silico prediction and were assembled using a limited amount of short-read RNA-sequencing (RNA-seq) data^3,8. These technologies, limited by short read lengths, typically fail to capture the full contiguous sequence of RNAs, leading to incomplete or incorrectly compressed isoform annotations⁹. The recent advancement of long-read isoform sequencing (Iso-seq) can accurately capture full-length (FL) isoforms, using reads that can exceed 10 kb in length. Iso-seq has been successfully used to discover diverse novel isoforms and annotate full-length isoforms from start to end in a wide variety of species¹⁰. Many studies have combined Iso-seq and short-read RNA-seq to enhance the accuracy of isoform expression quantification and obtain a more comprehensive view of isoform diversity^11,12. Thus, there is a need for an integrated application of Iso-seq and RNA-seq to explore the complexity of the transcriptome in the greater amberjack.

By integrating FL isoform structures revealed by Iso-seq with the deep sequencing coverage of RNA-seq, we constructed a more comprehensive view of isoform information for 12 tissues representing juvenile and adult stages of both sexes in the greater amberjack. These tissues included the eye, brain, pituitary, heart, muscle, spleen, gill, liver, intestine, stomach, gonad, and kidney. Our findings uncovered 105,607 novel spliced isoforms of known genes and new isoforms transcribed from 13,827 novel gene loci; these discoveries indicate that the amberjack transcriptome is significantly more intricate and dynamic than current annotations indicate. These data and findings provide a valuable resource for further exploration of critical genes and the molecular mechanisms underlying biological processes and commercial traits of amberjack.

Ethics statement

All the experiments were conducted according to the guidelines and regulations outlined by the ‘Management and Use of Laboratory Animals of Hubei Province’ and complied with China’s existing laws and regulations for biological research. This study did not involve any endangered or protected species.

Methods

Sample collection, library construction and sequencing

The greater amberjack individuals were obtained from Hainan Yonghe Biotechnology Co., Ltd. (Hainan, China) and Fujian Dongshan Fuminyang Aquatic Development Co., Ltd. (Guangdong, China). Twelve tissues (eye, brain, pituitary, heart, muscle, spleen, gill, liver, intestine, stomach, gonad, and kidney) were harvested for RNA sequencing. The sex of juveniles could not be determined, so sex was distinguished only for adult fish. Additionally, harvesting pituitary, spleen, and stomach tissues from juvenile fish is challenging, so these tissues were not included in the juvenile sample set. For each tissue, a paired-end RNA-sequencing library was constructed with an insert size of 300 bp, and the library was then sequenced on the MGI DNBSEQ platform to generate paired-end (PE) reads of 2 × 150 bp. Each tissue was represented by two biological replicates. Three mixtures (juveniles, adult females, and adult males) were also used to construct Iso-seq libraries with fragment sizes of 5–10 kb, which were subsequently sequenced on the PacBio RSII platform (Fig. 1). We used the greater amberjack FASTA genome from NCBI (ID: GCF_002260705.1)¹³, the GTF annotation file from NCBI (ID: GCF_002260705.1)¹³, and the GTF annotation file from Ensembl (ID: Sdu_1.0.103)¹⁴.

Short-read assembly using RNA-seq

The adapter sequences were removed by BGI prior to delivery. The Phred quality score distribution of the RNA-seq data was analyzed using the numpy and pandas packages in Python. To improve the precision and sensitivity of splice junction discovery, paired-end DNBSEQ RNA-seq reads were first aligned to the greater amberjack genome (Sdu_1.0) by using STAR (v2.7.5b)¹⁵ software. This alignment was performed in end-to-end mode, employing the following parameters: “--runMode alignReads --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 10000000000 --readFilesCommand zcat --outFileNamePrefix --genomeDir --genomeLoad --readFilesIn”. Then, the reads were aligned to the genome again, this time supplying the splice junctions detected across the 12 tissues. The mapping statistics reported by STAR were analyzed using the numpy and pandas packages in Python. Only the uniquely mapped alignments were considered for further analysis. The potential PCR amplification bias that was introduced during library construction was reduced by using the MarkDuplicates command of the Picard Toolkit (v2.23.3, http://broadinstitute.github.io/picard/). The Fragments Per Kilobase of exon model per Million mapped fragments (FPKM) values and counts for different genomic features were calculated using StringTie (v2.1.4)¹⁶ with the following parameters: “--rf -e -A gene_abund.tab -C cov_refs.gtf -G input.gtf -o transcripts.gtf”.
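The mapping statistics mentioned above can be collected from the summary report that STAR writes for each run (Log.final.out). The following is a minimal sketch, not the authors' script: the helper names, sample names and numbers are hypothetical, and only three fields of the report are read.

```python
import pandas as pd

def parse_star_log(text: str) -> dict:
    """Parse the 'key | value' lines of a STAR Log.final.out-style report."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNIQUE READS:" carry no value
        key, value = (part.strip() for part in line.split("|", 1))
        if value:
            stats[key] = value
    return stats

def mapping_table(logs: dict) -> pd.DataFrame:
    """Build one row of mapping statistics per sample (sample name -> report text)."""
    rows = []
    for sample, text in logs.items():
        stats = parse_star_log(text)
        rows.append({
            "sample": sample,
            "input_reads": int(stats["Number of input reads"]),
            "unique_pct": float(stats["Uniquely mapped reads %"].rstrip("%")),
            "multi_pct": float(stats["% of reads mapped to multiple loci"].rstrip("%")),
        })
    return pd.DataFrame(rows).set_index("sample")

# Hypothetical report fragments for two made-up samples.
example_logs = {
    "brain_rep1": "Number of input reads |\t1000\n"
                  "UNIQUE READS:\n"
                  "Uniquely mapped reads % |\t91.00%\n"
                  "% of reads mapped to multiple loci |\t7.00%\n",
    "liver_rep1": "Number of input reads |\t2000\n"
                  "Uniquely mapped reads % |\t89.00%\n"
                  "% of reads mapped to multiple loci |\t8.00%\n",
}
table = mapping_table(example_logs)
print(table)
print("mean uniquely mapped %:", table["unique_pct"].mean())
```

In practice the dictionary would be filled by reading each sample's report from disk; the in-memory strings are used here only to keep the sketch self-contained.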

Long-read assembly using PacBio full-length Iso-seq

We utilized several programs from PacBio SMRT Analysis and followed the SMRT Tools Reference Guide in this processing pipeline. For each transcriptome library, circular consensus sequences (CCSs) were constructed from the raw subreads by using ccs (v6.0.0) with the following parameters: “--skip-polish --min-passes 4 --min-length 200 --min-rq 0.99”. The primer sequences in the long CCS reads were removed by using lima (v2.0.0) with the parameter “--isoseq”. The full-length, nonchimeric (FLNC) cDNA reads were identified by the refine subcommand of isoseq3 (v3.4.0) with the following parameters: “--require-polya --min-polya-length 20”. Then, we used the cluster and polish subcommands to cluster the FLNC reads and generate polished transcripts. Finally, the polished transcripts were mapped to the greater amberjack genome (Sdu_1.0) by using minimap2 (v2.17-r941)¹⁷ with the parameter “-x splice”. During SMRT processing, we used the numpy and pandas packages in Python to analyze the CCS reads and generate statistical results, as well as to determine the length distribution of high-quality unique transcript sequences.

Construction of gene annotation and validation of the polyA site

The final gene annotation was generated by integrating the isoforms identified by RNA-seq and Iso-seq. Since long-read assembled isoforms reveal the original structures of transcripts, all long-read assembled isoforms were retained. Short-read assembled isoforms were filtered out according to the following criteria: a) overlap with long-read assembled isoforms in the same orientation and b) low expression level (FPKM lower than 0.5 in all samples). The remaining short-read assembled isoforms were merged with the long-read assembled isoforms to produce the comprehensive gene annotation. In detail, for RNA-seq, we used StringTie (v2.1.4)¹⁶ with the parameter “--rf” to construct isoforms de novo from short-read alignments and merged the assemblies. For Iso-seq, we collapsed the full-length alignments to construct isoforms using a script from TAMA with the parameter “tama_collapse.py -s output.sam -f input.fa -x no_cap -p params.prefix” and merged the assemblies using the tama_merge.py script. The gene character analysis was conducted using the numpy and pandas packages in Python.
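The two filtering criteria for short-read assembled isoforms can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline itself: an isoform is dropped if it meets either criterion, overlap is tested on whole transcript spans rather than on exons, and the `Isoform` record and all coordinates are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Isoform:
    name: str
    chrom: str
    start: int              # 1-based inclusive start, as in GTF
    end: int                # 1-based inclusive end
    strand: str             # "+" or "-"
    max_fpkm: float = 0.0   # highest FPKM observed over all samples

def overlaps_same_strand(a: Isoform, b: Isoform) -> bool:
    """True if the two spans share at least one base on the same strand."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)

def keep_short_read_isoforms(short_read, long_read, min_fpkm=0.5):
    """Keep short-read isoforms that are expressed and not covered by long reads."""
    kept = []
    for iso in short_read:
        if iso.max_fpkm < min_fpkm:
            continue  # criterion b: FPKM below the cutoff in every sample
        if any(overlaps_same_strand(iso, ref) for ref in long_read):
            continue  # criterion a: same-strand overlap with a long-read isoform
        kept.append(iso)
    return kept

# Hypothetical example data.
long_read = [Isoform("L1", "chr1", 100, 500, "+")]
short_read = [
    Isoform("S1", "chr1", 400, 800, "+", max_fpkm=5.0),    # overlaps L1, dropped
    Isoform("S2", "chr1", 400, 800, "-", max_fpkm=5.0),    # opposite strand, kept
    Isoform("S3", "chr1", 1000, 2000, "+", max_fpkm=0.2),  # low expression, dropped
    Isoform("S4", "chr2", 100, 500, "+", max_fpkm=1.0),    # other scaffold, kept
]
kept_names = [iso.name for iso in keep_short_read_isoforms(short_read, long_read)]
print(kept_names)
```

For genome-scale annotations the pairwise scan would normally be replaced by an interval index or a tool such as bedtools; the quadratic loop is kept here for readability.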

Transcripts were identified as internal priming products and removed if the 20-nt sequence downstream of the polyA site contained more than 15 A residues. The retained isoforms were termed long-read assembled isoforms. The final assembly model (ASM) was built from the long-read assembled isoforms and supplemented with short-read assembled isoforms that did not overlap with long-read assembled isoforms on the same strand. Finally, all the transcriptome isoforms were compared to the NCBI annotation using SQANTI3 (v3.6.1)¹⁸ to identify genes and fusion genes. For the NCBI annotation, the versions of the assembly and the gene model annotation that were used were Sdu_1.0 and GCF_002260705.1_Sdu_1.0, respectively.
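The internal-priming rule stated above (more than 15 A residues in the 20 nt downstream of the polyA site) reduces to a simple count. A minimal sketch, assuming the genomic sequence immediately downstream of the site has already been extracted on the transcript strand; the function name is hypothetical.

```python
def is_internal_priming(downstream_seq: str, window: int = 20, max_a: int = 15) -> bool:
    """Flag a polyA site whose downstream genomic window is A-rich.

    downstream_seq: genomic bases immediately 3' of the polyA site, given on
    the transcript strand. Only the first `window` bases are inspected.
    """
    window_seq = downstream_seq[:window].upper()
    return window_seq.count("A") > max_a

# Hypothetical windows: 16 As are flagged, 15 As are not.
print(is_internal_priming("A" * 16 + "CGTC"))   # True
print(is_internal_priming("A" * 15 + "CGTCG"))  # False
```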

Functional annotation of protein-coding genes

The ORFs of the isoforms were predicted using ORFfinder (v0.4.3), and the translated amino acid sequences were aligned to the NCBI NR database using the Diamond (v0.9.14.115)¹⁹ program with an e-value < 1e-5. For a given ORF, the hit with more than 90% identical matches and the lowest e-value was considered to be the homologous protein. For isoforms with more than one predicted ORF, we selected the ORF with the lowest e-value as the final ORF. The GO terms and KEGG pathways were assigned by eggNOG-mapper (v2)²⁰ with the following parameters: “emapper.py -m diamond -i input.fa --output params.prefix -d euk --usem”.
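The hit and ORF selection rules can be expressed as two rounds of "keep the lowest e-value" with pandas. This is a sketch only: the column names (`orf_id`, `isoform_id`, `subject`, `pident`, `evalue`) and the example rows are assumptions, not the layout of the authors' tables.

```python
import pandas as pd

def select_best_orfs(hits: pd.DataFrame, min_identity: float = 90.0,
                     max_evalue: float = 1e-5) -> pd.DataFrame:
    """Return one ORF and one homologous protein per isoform."""
    # Keep hits above the identity cutoff and below the e-value cutoff.
    passed = hits[(hits["pident"] > min_identity) & (hits["evalue"] < max_evalue)]
    # Best hit per ORF: lowest e-value (stable sort keeps input order on ties).
    best_per_orf = (passed.sort_values("evalue", kind="mergesort")
                          .drop_duplicates("orf_id", keep="first"))
    # Best ORF per isoform: again the lowest e-value.
    best_per_isoform = (best_per_orf.sort_values("evalue", kind="mergesort")
                                    .drop_duplicates("isoform_id", keep="first"))
    return best_per_isoform.set_index("isoform_id").sort_index()

# Hypothetical alignment results for two isoforms and three ORFs.
hits = pd.DataFrame({
    "orf_id":     ["o1", "o1", "o2", "o3"],
    "isoform_id": ["t1", "t1", "t1", "t2"],
    "subject":    ["pA", "pB", "pC", "pD"],
    "pident":     [95.0, 99.0, 92.0, 80.0],
    "evalue":     [1e-30, 1e-50, 1e-40, 1e-60],
})
best = select_best_orfs(hits)
print(best)
```

In this example isoform t2 receives no ORF because its only hit falls below the identity cutoff.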

Identification of tissue-specific genes in the greater amberjack

Tissue-specific genes are defined as those that exhibit high expression exclusively within a particular tissue. FPKM is calculated with the following formula: \(\mathrm{FPKM}=\frac{\mathrm{count}\times {10}^{9}}{\mathrm{library\ size}\times \mathrm{length}}\). Initially, genes with an FPKM less than 1 in all the samples were filtered out. Subsequently, the z score for each gene across all the samples was computed. Genes with a z score greater than 1 in a specific tissue and a z score less than 0 in all other tissues were classified as tissue-specific genes. The following formula was used for calculating the z score²¹: \(z=\frac{x-\mu }{s}\), where x is the expression of the gene in a given tissue, μ is the mean expression of the gene across the 12 tissues, and s is the corresponding standard deviation. Short reads from RNA-seq were aligned to the ASM, and gene expression levels were calculated using StringTie with the following parameters: “--rf -e -A output.out/gene_abund.tab -C output.out/cov_refs.gtf -G input.gtf -o”. The correlation analysis was performed using the clustermap function from the seaborn package in Python (https://seaborn.pydata.org/generated/seaborn.clustermap.html), and the clustering algorithm was hierarchical/agglomerative clustering.
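The FPKM formula and the z-score rule for tissue specificity can be written out as below. This is a minimal sketch under assumptions that the text does not fix: replicates are taken to be averaged into one value per tissue beforehand, s is computed as the sample standard deviation (ddof=1), and the function names and the four-tissue example matrix are hypothetical.

```python
import numpy as np

def fpkm(count: float, library_size: float, length: float) -> float:
    """FPKM = count * 1e9 / (library size * feature length in bp)."""
    return count * 1e9 / (library_size * length)

def tissue_specific_genes(expr, tissues, min_fpkm=1.0):
    """Map gene row index -> tissue for genes passing the z-score rule.

    expr: genes x tissues matrix of FPKM values (one column per tissue).
    A gene is tissue-specific if z > 1 in exactly one tissue and z < 0 in
    every other tissue.
    """
    expr = np.asarray(expr, dtype=float)
    specific = {}
    for i, row in enumerate(expr):
        if row.max() < min_fpkm:
            continue  # FPKM below 1 everywhere: filtered out
        s = row.std(ddof=1)  # assumption: sample standard deviation
        if s == 0:
            continue  # constant expression has no defined z score
        z = (row - row.mean()) / s
        high = np.flatnonzero(z > 1)
        if len(high) == 1 and np.all(np.delete(z, high[0]) < 0):
            specific[i] = tissues[high[0]]
    return specific

tissues = ["brain", "liver", "gill", "gonad"]  # shortened hypothetical panel
expr = [
    [100.0, 1.0, 1.0, 1.0],    # high only in brain
    [10.0, 10.0, 10.0, 10.0],  # constant
    [50.0, 50.0, 1.0, 1.0],    # high in two tissues
    [0.5, 0.2, 0.1, 0.1],      # below the FPKM cutoff everywhere
]
print(fpkm(count=500, library_size=10_000_000, length=2000))
print(tissue_specific_genes(expr, tissues))
```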

Data Records

All the RNA-seq and full-length Iso-seq raw reads and relevant information for the greater amberjack were deposited in the Sequence Read Archive (SRA) of the National Center for Biotechnology Information under BioProject accession number PRJNA1035924²². The transcriptome GTFs, FASTA and SQANTI reports for short-read assembled isoforms, long-read assembled isoforms, and our assembled isoforms can be accessed through Zenodo at https://doi.org/10.5281/zenodo.11207647²³.

For RNA-seq, we performed strand-specific RNA-seq (paired-end, 150 bp), generating a total of 3,425,467,779 (approximately 3.43 billion) clean read pairs, averaging 51,901,027 (51.9 ± 6.3 M) pairs per cDNA library (Table 1). Among these reads, 92.31% had a mean base quality higher than or equal to Q30 (Fig. 2a). Furthermore, we randomly extracted 10,000 reads from each sample and calculated the base quality of the paired-end R1 and R2 reads, both of which had an average score of 37 (Fig. 2b), indicating very high base-calling accuracy. On average, 90.34% of the short reads were aligned to a unique location of the genome, and an additional 7.47% of the short reads were aligned to multiple locations, indicating that the collected RNA was almost free of contamination (Fig. 2c,d and Table 1). After genome alignment and removal of PCR duplicates, we obtained 1,658,051,483 (approximately 1.66 billion) read pairs, with an average of 25,121,992 read pairs per cDNA library. In the short-read approach, we assembled 91,901 isoforms (short-read assembled isoforms) of 51,687 genes, averaging 3 isoforms per gene and 13 exons per isoform. Among these genes, 17,491 (33.84%) contained multiple isoforms (>2 isoforms). Among these isoforms, 92,811 (87.34%) contained multiple exons (Figure S1a).
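The Q30 statistic quoted above is based on the mean base quality of each read. A minimal sketch of that calculation, assuming the FASTQ quality strings use the standard Phred+33 encoding; the function names and the four quality strings are hypothetical.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of one read from its FASTQ quality string."""
    scores = [ord(char) - offset for char in quality]
    return sum(scores) / len(scores)

def fraction_q30_reads(qualities, threshold: float = 30.0) -> float:
    """Fraction of reads whose mean base quality is at least `threshold`."""
    qualities = list(qualities)
    passing = sum(mean_phred(q) >= threshold for q in qualities)
    return passing / len(qualities)

# 'I' encodes Q40, '?' encodes Q30 and '#' encodes Q2 under Phred+33.
example = ["IIII", "####", "????", "II##"]
print([mean_phred(q) for q in example])
print(fraction_q30_reads(example))
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q30 means roughly one expected error per 1,000 base calls.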

Table 1 Summary of sequencing and genome alignment results for RNA-seq.


For PacBio full-length Iso-seq, a total of 17,409,419 (17.41 M) polymerase reads were yielded (with an average of 15 subreads per polymerase read) (Table 2). Of these, 53.67%, 56.89%, and 58.42% of the polymerase reads from juvenile, adult male and adult female S. dumerili, respectively, were successfully combined into circular consensus sequences (CCSs). After SMRT Link processing (see Methods), approximately 60% of the zero-mode waveguides (ZMWs) passed filtering (Fig. 2e), and 163,218 (163.2 K), 149,716 (149.7 K), and 189,169 (189.2 K) high-quality (HQ) unique transcript sequences were obtained, with an N50 of 5,441, 5,255, and 5,939, respectively, for each library (Fig. 2f and Table 2). More than 99.9% of the HQ transcript sequences were successfully aligned to the greater amberjack genome. In the long-read approach, we assembled 130,734 isoforms (long-read assembled isoforms) of 17,457 genes, averaging nearly 8 isoforms per gene and 17 exons per isoform. Among these genes, 11,608 (66.49%) contained multiple isoforms. Among these multiple isoforms, 119,012 (95.88%) contained multiple exons (Figure S1a).
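The N50 values reported for the HQ transcript sequences follow the usual definition: the length L such that sequences of length L or longer contain at least half of all bases. A minimal sketch with hypothetical lengths:

```python
def n50(lengths) -> int:
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 requires at least one length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length  # first length at which half of all bases is reached
    return ordered[-1]  # not reached for positive lengths

# Hypothetical transcript lengths (total 32, so half is 16).
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```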

Table 2 Summary of sequencing and processing for Iso-seq.


Technical Validation

Quality control of comprehensive gene annotation

Comprehensive gene annotation was achieved by integrating RNA-seq and Iso-seq data. The assembly statistics are shown in Fig. 3 and Table 3. Compared with the NCBI and Ensembl isoforms, the novel annotations identified more than 3 times more isoforms, and the isoforms were longer (Figure S1a). Furthermore, long-read assembled isoforms exhibited a higher number of isoforms per gene, demonstrating the robust capability of full-length isoform sequencing for detecting isoforms (Figure S1a). Short-read assembled isoforms tended to have fewer exons (1–5 exons per isoform), whereas long-read assembled isoforms often contained more exons (over 10 exons per isoform) (Figure S1b).

Table 3 Summary of gene annotations from different sources.


Our assembly results (ASM-based isoforms) identified a total of 147,218 isoforms from 45,167 genes across all 12 tissues, with an average of 5.44 isoforms per gene (Fig. 3a) and 16 exons per isoform (Fig. 3b). In the comparison between our assembly and the NCBI annotation, 22,040 (14.97%) isoforms corresponded to full splice matches (FSMs) in the NCBI annotation, and 66,603 (45.24%) isoforms contained novel splice sites (Novel Not in Catalog, NNC) (Fig. 3c). When compared against the NCBI and Ensembl annotations, the proportion of NNC isoforms among the ASM-based isoforms was lower than that among both the short-read assembled isoforms and the long-read assembled isoforms, showing that more transcripts in the ASM-based transcriptome matched known reference transcripts; this also serves as one indicator of a more complete transcriptome assembly (Figure S1c). NIC (Novel in Catalog) isoforms increased markedly in the long-read transcriptome compared with the short-read transcriptome (Figure S1c), demonstrating an advantage of the long-read transcriptome. Among these genes, 13,827 (41.44%) were novel genes that were not annotated in the NCBI database (Fig. 3d). According to gene character analysis, genes that were annotated only in our assembly from short-read assembled isoforms or only in the NCBI database were typically shorter and expressed at lower levels than those that were annotated from both sources. In contrast, genes that were identified in both our assembly and the NCBI database tended to be longer and more highly expressed (Fig. 3e). To determine the functions of these genes, we extracted the predicted ORF sequences and searched the NR, eggNOG, Pfam, Swiss-Prot, KOG, GO, KEGG, and COG databases. Overall, 24,801 (74.33%) genes were annotated as protein-coding genes. Among the protein-coding genes, 16,902 (68.15%) genes were successfully assigned to GO terms, and 10,366 (41.80%) genes were assigned to KEGG pathways (Fig. 3f). Within specific GO terms, the number of genes that were identified in the ASM annotation surpassed that in NCBI, as indicated by values exceeding the red line (Fig. 3g).

Quality control of polyA site location

To evaluate the 3′ end of our assembled transcriptome model, we analyzed the nucleotide content surrounding the putative polyA site. The results revealed a tendency toward high adenine (A) content in the 20 base pairs (bp) upstream of the polyA site, while the 20 bp downstream of the polyA site were rich in uracil (U). In the comparison of isoforms from different sources, long-read assembled isoforms showed the strongest tendency, while those from Ensembl displayed an almost random distribution of nucleotides (Figure S2a–d). The pattern observed for Ensembl arises because over half of its isoforms do not contain a 3′ UTR, and the contrast supports the completeness of the isoforms assembled in the ASM transcriptome. Furthermore, we calculated the density of conserved 3′ end processing signals around the polyA sites and found enrichment of AAUAAA and AUUAAA in the 20 bp upstream of the polyA site compared with the NCBI annotations (Fig. 4b).
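Locating the two 3′ end processing signals upstream of a polyA site amounts to a motif search in a short window. The sketch below is an illustration only: it assumes the input is the transcript sequence ending at the polyA site, searches the last 20 nt, and reports positions as negative offsets from the site; the function name and sequences are hypothetical.

```python
SIGNALS = ("AATAAA", "ATTAAA")  # DNA forms of AAUAAA and AUUAAA

def find_polya_signals(seq_ending_at_site: str, window: int = 20):
    """Return (signal, offset) pairs found in the last `window` nt.

    The offset is the start of the motif relative to the end of the
    sequence, so -20 means the motif starts 20 nt upstream of the site.
    """
    region = seq_ending_at_site[-window:].upper().replace("U", "T")
    hits = []
    for signal in SIGNALS:
        pos = region.find(signal)
        while pos != -1:
            hits.append((signal, pos - len(region)))
            pos = region.find(signal, pos + 1)
    return sorted(hits, key=lambda hit: hit[1])

# Hypothetical 3' end: the motif starts exactly 20 nt before the site.
example_end = "GCGCGC" + "AATAAA" + "GCGCGCGCGCGCGC"
print(find_polya_signals(example_end))
print(find_polya_signals("AUUAAAGG"))
```

Counting such hits per position over all annotated polyA sites would give a signal density profile of the kind described in the text.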

The identification of genes with a certain distance between the proximal and distal polyA sites is an indicator of the completeness of our assembled transcriptome. Transcripts with varying 3′ UTR lengths are thought to play a role in regulating gene expression at the posttranscriptional level by influencing RNA stability, translation efficiency, or cellular localization. We determined the distance between the proximal polyA site and the distal polyA site for genes with multiple isoforms. The number of genes with alternative polyadenylation (APA) (with a distance >100 bp) from the ASM-based isoforms was counted and compared with that from the long-read assembled, short-read assembled, NCBI, and Ensembl results. The ASM-based isoforms included 5,949 genes with APA, which was far more than in the other annotations (Fig. 4c). However, the distribution of distances among the genes with APA did not significantly differ among the assemblies (Fig. 4d). To verify the precision of the polyA site annotation, we showed the tracks of the akhr gene across the 12 tissues (Fig. 4d). The polyA sites of akhr were found to vary among the data from NCBI, Ensembl, and our assembly. Our assembly provided greater precision in locating polyA sites than existing gene models and revealed the complexity of the transcriptome at the alternative polyadenylation level.
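The APA criterion (more than 100 bp between the proximal and distal polyA sites of a gene) can be computed from a table of isoform 3′ ends. A minimal sketch, assuming one row per isoform with hypothetical column names `gene_id` and `polya_pos`; the positions are made up.

```python
import pandas as pd

def genes_with_apa(polya_sites: pd.DataFrame, min_distance: int = 100) -> pd.Series:
    """Return proximal-to-distal polyA distance for genes exceeding the cutoff."""
    span = polya_sites.groupby("gene_id")["polya_pos"].agg(["min", "max", "nunique"])
    span["distance"] = span["max"] - span["min"]
    # Require at least two distinct sites and a distance above the cutoff.
    apa = span[(span["nunique"] > 1) & (span["distance"] > min_distance)]
    return apa["distance"]

# Hypothetical polyA positions for three genes.
sites = pd.DataFrame({
    "gene_id":   ["A", "A", "B", "B", "C"],
    "polya_pos": [1000, 1250, 500, 560, 3000],
})
apa_distances = genes_with_apa(sites)
print(apa_distances)
```

Because only the minimum and maximum coordinates are used, the result does not depend on strand; deciding which site is proximal would require the strand of each gene.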

Correlation analysis between samples based on gene expression profiling

Based on the improved gene annotation, we attempted to construct a gene expression profile for the greater amberjack. Short-read data were used to quantify gene expression across the 12 tissues at two developmental stages (juvenile and adult) and from different sexes (male and female). Each sample had two biological replicates. Furthermore, the fragments per kilobase per million mapped reads (FPKM) for our assembled genes were computed using the short-read data, and their similarity and specificity were investigated. The results demonstrated that the correlation coefficients of two biological replicates within a group were mostly greater than 0.8 (Fig. 5a). The FPKM distribution across the 12 tissues showed that the FPKM values from the liver and muscle were slightly lower than those from the other 10 tissues (Fig. 5b). The correlation analysis results of all the samples indicated that, except for the intestines and eyes of the juveniles, the correlation coefficients within the tissues of juveniles, adult females and adult males all reached 0.9, indicating that the samples clustered together well (Fig. 5c).
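Agreement between two biological replicates can be summarised with a correlation coefficient on log-transformed FPKM values. The text does not state which coefficient or transformation was used, so the choices below (Pearson correlation on log2(FPKM + 1)) are assumptions, as are the function name and the example values.

```python
import numpy as np

def replicate_correlation(rep1, rep2, pseudocount: float = 1.0) -> float:
    """Pearson correlation of log2(FPKM + pseudocount) between two replicates."""
    x = np.log2(np.asarray(rep1, dtype=float) + pseudocount)
    y = np.log2(np.asarray(rep2, dtype=float) + pseudocount)
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical FPKM vectors for five genes.
rep_a = [0.0, 1.0, 3.0, 7.0, 15.0]
rep_b = [0.0, 1.0, 3.0, 7.0, 15.0]
rep_c = rep_a[::-1]
print(replicate_correlation(rep_a, rep_b))  # identical replicates
print(replicate_correlation(rep_a, rep_c))  # reversed profile
```

Applying the same function to every pair of samples yields the matrix that a clustered heatmap, such as seaborn's clustermap, can display.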

We identified the genes that were uniquely highly expressed in each tissue and generated a z score normalized expression heatmap clustered by genes. In the present study, tissue-specific genes clustered together, showing similar expression patterns and significantly higher expression levels than those that were expressed in multiple tissues (Fig. 5a). Of all the tissues, the brain had the most tissue-specific genes (524), followed by the pituitary gland (349), spleen (287), gill (282), stomach (270), and liver (245). On the other hand, the gonad (39) and eye (74) had the fewest tissue-specific genes (Fig. 5a). GO term enrichment analysis of these genes was performed based on our functional annotation. The terms that were enriched in certain tissues were consistent with the functions of the tissues; for example, the brain-specific genes were enriched for neuron-related terms, such as “synaptic membrane”, “regulation of synaptic plasticity” and “neuron to neuron synapse”; the pituitary gland-specific genes were enriched for endocrine-related terms, such as “peptide hormone processing”, “hormone activity” and “pituitary gland development”; and the gonad-specific genes were enriched for reproduction-related terms, such as “pole plasm”, “germ plasm”, and “piRNA metabolic process” (Fig. 5a). The details of the enriched GO terms for each tissue are summarized in Table 3. These findings support the reliability of our results, which can be used to guide further research.

Usage Notes

The ASM dataset provides a comprehensive gene annotation with greater detail and accuracy than those from NCBI and Ensembl and with more precise polyA site locations, together with systematically profiled gene expression patterns across the 12 tissues in the greater amberjack. The uploaded binary alignment (BAM) files contain reads from the short-read assembled, long-read assembled, NCBI, Ensembl, and ASM datasets that were already mapped to the reference genome of the greater amberjack. These aligned files can be further analyzed using various bioinformatics program packages, such as STAR, stringtie, TACO, CCS, lima, isoseq3, pbmm2, minimap2, tama, bedtools, ORFfinder, diamond, eggNOG, sqanti3, and samtools, or visualized using, e.g., IGV. The uploaded Illumina and PacBio files were not trimmed and contain terminal poly(A) sequences as well as 5′ and 3′ adapter sequences, which can be used to determine the orientations of the reads.

Code availability

All the software that was used in this study is publicly available, and the parameters that were used are described in the Methods section. If no detailed parameters were mentioned for a particular software, the default parameters were used as suggested by the developer.

The code for the analysis pipelines was deposited at GitHub: https://github.com/Ckenen/integrated-transcriptome-of-seriola-dumerili.

References

1. Bordignon, F. et al. Fatty acid signatures in different tissues of mediterranean yellowtail, Seriola dumerili (Risso, 1810), fed diets containing different levels of vegetable and fish oils. Animals (Basel) 10, 198 (2020).
2. His, H., Hsu, Y., Chang, T. & Chien, L. Methylmercury concentration in fish and risk-benefit assessment of fish intake among pregnant versus infertile women in Taiwan. PLoS One 11, e0155704 (2016).
3. Araki, K. et al. Whole genome sequencing of greater amberjack (Seriola dumerili) for SNP identification on aligned scaffolds and genome structural variation analysis using parallel resequencing. Int. J. Genomics 2018, 7984292 (2018).
4. Yasuike, M. et al. The yellowtail (Seriola quinqueradiata) genome and transcriptome atlas of the digestive tract. DNA Res. 25, 547–560 (2018).
5. Chen, Z., Li, Y., Liang, P. & Li, M. The complete mitochondrial genome of longfin yellowtail S. rivoliana (Perciformes: Carangidae). Mitochondrial DNA Part A 27, 3385–3386 (2016).
6. Purcell, C. M. et al. Insights into teleost sex determination from the Seriola dorsalis genome assembly. BMC Genomics 19, 31 (2018).
7. Zhao, Y. et al. Dynamic gene expression and alternative splicing events demonstrate co-regulation of testicular differentiation and maturation by the brain and gonad in common carp. Front. Endocrinol (Lausanne). 12, 820463 (2022).
8. Sarropoulou, E. et al. Full genome survey and dynamics of gene expression in the greater amberjack Seriola dumerili. Gigascience 6, 1–13 (2017).
9. Torre, D. Isoform-resolved transcriptome of the human preimplantation embryo. Nat. Commun. 14, 6902 (2023).
10. Ebeneezar, S. et al. Full-length transcriptome from different life stages of cobia (Rachycentron canadum, Rachycentridae). Sci. Data 10, 97 (2023).
11. Liao, X. et al. Transcript annotation of Chinese sturgeon (Acipenser sinensis) using Iso-seq and RNA-seq data. Sci. Data 10, 105 (2023).
12. Wang, K. Multi-strategic RNA-seq analysis reveals a high-resolution transcriptional landscape in cotton. Nat. Commun. 10, 4714 (2019).
13. NCBI https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002260705.1 (2017).
14. Ensembl https://ftp.ensembl.org/pub/release-103/gtf/seriola_dumerili (2020).
15. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 29, 15–21 (2013).
16. Pertea, M., Kim, D., Pertea, G. M., Leek, J. T. & Salzberg, S. L. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat. Protoc. 11, 1650–1667 (2016).
17. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 34, 3094–3100 (2018).
18. Pardo-Palacios, F. J. et al. SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms. Nat. Methods 21, 793–797 (2024).
19. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
20. Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, 309–314 (2019).
21. Pala, M. et al. Population- and individual-specific regulatory variation in Sardinia. Nat. Genet. 49, 700–709 (2017).
22. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP470228 (2024).
23. Zhao, Y. The transcriptome GTFs, FASTA and SQANTI reports for short-read assembled isoforms, long-read assembled isoforms and our assembled isoforms. Zenodo https://doi.org/10.5281/zenodo.11207647 (2024).


Acknowledgements

This work was supported by funds from the Fund of Southern Marine Science and Engineering Guangdong Laboratory (Zhanjiang) (ZJW-2019-06) to DL, WH and CZ, the National Natural Science Foundation of China (No. 31922085) to DL, and the Laboratory of Lingnan Modern Agriculture Project (NT2021008) to WH and DL.

Author information

These authors contributed equally: Yuanli Zhao, Zonggui Chen, Meidi Hu.

Authors and Affiliations

Key Laboratory of Breeding Biotechnology and Sustainable Aquaculture, Institute of Hydrobiology, The Innovative Academy of Seed Design, Hubei Hongshan Laboratory, Guangdong Laboratory for Lingnan Modern Agriculture, Chinese Academy of Sciences, Wuhan, 430072, China
Yuanli Zhao, Zonggui Chen, Meidi Hu, Hairong Liu, Haiping Zhao, Wei Hu & Daji Luo
Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
Zonggui Chen
Fisheries College, Ocean University of China, Qingdao, 266001, China
Meidi Hu & Daji Luo
China Southern Marine Science and Engineering Guangdong Laboratory (Zhanjiang), Zhanjiang, 524025, China
Yang Huang & Chunhua Zhu
Fisheries College of Guangdong Ocean University, Guangdong Research Center on Reproductive Control and Breeding Technology of Indigenous Valuable Fish Species, Guangdong Provincial Key Laboratory of Aquatic Animal Disease Control and Healthy Culture, Zhanjiang, 524088, China
Yang Huang, Mouyan Jiang, Guangli Li & Chunhua Zhu
Guangdong Provincial Key Laboratory of Marine Biology, Shantou University, Shantou, 515063, China
Shengkang Li

Contributions

Yuanli Zhao: Investigation, Data curation, Visualization, Software, Writing-original draft, Writing-review & editing. Zonggui Chen: Data curation, Visualization, Software, Resources, Writing-review & editing. Meidi Hu: Data curation, Validation, Resources, Writing-review & editing. Hairong Liu: Sampling. Haiping Zhao: Sampling. Mouyan Jiang: Sampling. Yang Huang: Resources. Shengkang Li: Data curation. Guangli Li: Resources. Chunhua Zhu: Funding acquisition, Conceptualization, Resources. Wei Hu: Funding acquisition, Conceptualization, Resources. Daji Luo: Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing-original draft, Writing-review & editing.

Corresponding authors

Correspondence to Chunhua Zhu, Wei Hu or Daji Luo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhao, Y., Chen, Z., Hu, M. et al. Integrating Iso-seq and RNA-seq data for the reannotation of the greater amberjack genome. Sci Data 11, 675 (2024). https://doi.org/10.1038/s41597-024-03495-7

Received: 31 January 2024
Accepted: 07 June 2024
Published: 22 June 2024
DOI: https://doi.org/10.1038/s41597-024-03495-7
Springer Nature Limited
