Analysis of the transcriptome of the Indonesian coelacanth Latimeria menadoensis
- 5.8k Downloads
Latimeria menadoensis is a coelacanth species first identified in 1997 in Indonesia, at 10,000 Km of distance from its African congener. To date, only six specimens have been caught and just a very limited molecular data is available. In the present work we describe the de novo transcriptome assembly obtained from liver and testis samples collected from the fifth specimen ever caught of this species.
The deep RNA sequencing performed with Illumina technologies generated 145,435,156 paired-end reads, accounting for ~14 GB of sequence data, which were de novo assembled using a Trinity/CLC combined strategy. The assembly output was processed and filtered producing a set of 66,308 contigs, whose quality was thoroughly assessed. The comparison with the recently sequenced genome of the African congener Latimeria chalumnae and with the available genomic resources of other vertebrates revealed a good reconstruction of full length transcripts and a high coverage of the predicted full coelacanth transcriptome.
The RNA-seq analysis revealed remarkable differences in the expression profiles between the two tissues, allowing the identification of liver- and testis-specific transcripts which may play a fundamental role in important biological processes carried out by these two organs.
Given the high genomic affinity between the two coelacanth species, the here described de novo transcriptome assembly can be considered a valuable support tool for the improvement of gene prediction within the genome of L. chalumnae and a valuable resource for investigation of many aspects of tetrapod evolution.
KeywordsCoelacanth Latimeria menadoensis Transcriptome de novo assembly RNA-seq
One of the most important transitions in vertebrate evolution was the arising of terrestrial vertebrates, which entailed considerable morphological changes related to the acquisition of novel functions by pre-existing and, in several cases, pre-adapted structures, like the evolution of lobe fins into tetrapod limbs. The terrestrial vertebrates would have derived from fossil forms of lobe-finned fishes, a highly successful group in the Devonian (400 Mya), with hundred species populating the Gondwana supercontinent’s oceans and river systems [1, 2].
Until 1938 only two sarcopterygian taxa were considered to have survived post-Devonian extinction: the dipnoi (lungfish), with three extant genera, and the tetrapods, with ~23,500 species. Hence the clamour when the first living coelacanth (Latimeria chalumnae), a fish considered extinct, was found off the estuary of river Chalumna, in South Africa [3, 4]. In 1997 a specimen of another Latimeria population was identified by Mark V. Erdmann in a fish market in Manado Tua (Sulawesi, Indonesia) . The distance between the two sites (more than 10,000 Km) and the early molecular findings [6, 7] led to the identification of the latter specimen as belonging to a distinct species, called L. menadoensis. Several individuals of L. chalumnae have been fished to date, as opposed to only six individuals of L. menadoensis. Specimens of Indonesian coelacanth are therefore very rare and constitute a valuable scientific resource and a mine of precious genetic information.
The main molecular and morphological studies of the genus Latimeria have addressed the evolutionary relationships linking lungfishes, coelacanths and tetrapods. Their results have however been discordant, since different datasets have sustained different hypotheses [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. L. menadoensis has also been the subject of other molecular investigations aimed at characterizing some genes of evolutionary interest: Hox genes [21, 22, 23], ParaHox genes , the Protocadherine cluster , the RAG1 and RAG2 genes , Sonic hedgehog gene and its enhancers , visual pigments , a Heat Shock Protein 70, neurohypophysial hormones , and vitellogenins . Furthermore some transposable elements (TEs) were characterized in this species [31, 32, 33, 34]. The importance of L. menadoensis has led to the study of its whole mitochondrial genome [10, 35] as well as to partial sequencing of a BAC library, which has made ~5 Mb of genomic sequences available to databases . Furthermore, correlations between quantitative and compositional characteristics of the genome of L. menadoensis were considered in Makapedua et al. .
Over the last few years next generation sequencing technologies (NGS) have revolutionized the fields of genomics and transcriptomics, providing the opportunity to analyze genomes and transcriptomes with high sequencing depth in a relatively short time in comparison with Sanger sequencing. The molecular data obtained with such technologies, applied to a rising number of organisms, are proving steadily important to study their relationships at the macro- and micro-evolutionary levels. In this respect, having in mind that genes are targets of evolution-driven changes that lead to the different morphology of animals, in the framework of studies of genomic features of coelacanths , we examined the L. menadoensis transcriptomes of liver and testis using deep-sequencing techniques.
The liver is one of the most suitable tissues in that it participates more than other organs in a range of physiological processes and contains cell types endowed with distinct roles and functions. Considering the different sex determination mechanisms occurring across vertebrates, the expression in testis is interesting to better understand the genes involved in these processes and their evolution since the systematic position of Latimeria. Furthermore, this tissue proved to be particularly suitable and useful for deep RNA-seq, as it expresses a broad range of different transcripts, permitting the assembly of a high number of full length sequences. In fact, about a half of the sequences included in the high quality set of 66,308 contigs was estimated to have been correctly assembled to the full length.
The de novo transcriptome assembly was able to significantly enhance the global view of the sequences expressed in coelacanth, overcoming the limitations linked to the automated and conservative, protein coding gene-focused, prediction by Ensembl, and providing a remarkable amount of information concerning expressed sequences produced by non-annotated genes. This knowledge provided significant information not only to investigate important biological processes and metabolic pathways in Latimeria, but also to acquire information on the origin of tetrapods and on the possible evolutionary dynamics relative to the genes involved in the transition from aquatic to terrestrial vertebrates.
Latimeria menadoensis transcriptome sequencing
Number of reads before trimming
Reads kept after trimming
Percentage of discared reads
Reads average length before trimming
Reads median length before trimming
Reads average length after trimming
Reads median length after trimming
Ribosomal RNA reads
Percentage of ribosomal RNA reads
Number of high quality reads
De novo assembly
The de novo transcriptome assembly performed with Trinity  by using both liver and testis reads generated a total of 306,882 contigs. The filtering step used to select only the longest transcript per gene produced 223,365 contigs, and the additional step applied to remove redundant sequences by MIRA 3.4.0  and to filter sequences shorter than 250 bp further reduced the Trinity assembly to a set of 105,653 transcripts. The de novo assembly produced with the CLC Genomic Workbench 4.5.1 (CLC Bio, Katrinebjerg, Denmark) generated 149,339 raw contigs. The high quality subset of protein-coding sequences selected to integrate the Trinity assembly, as described in the methods section, comprised 48,846 sequences. A total of 8,496 CLC contigs were detected by BLASTn as matching existing Trinity contigs and significantly longer than them. The corresponding Trinity contigs were therefore replaced. The remaining 40,350 CLC contigs were discarded, as they could not significantly improve the Trinity assembly.
Assembly quality assessment
Transcript fragmentation was assessed with the Ortholog Hit Ratio method , which relies on the comparison between the observed length of contigs and the full length of known ortholog sequences of other species, detected by BLASTx. This method is strongly influenced by inter-species divergence and by the different substitution rates observed among genes and can often lead to an under-estimation of transcript integrity . To overcome this imperfection of the method we applied a correction considering in the analysis only highly conserved genes (characterized by a BLAST identity higher than 90%, independently from the hit length). By these means, a sufficiently large set of sequences was analyzed (6,024 contigs), permitting to obtain a reliable estimate of fragmentation within the high quality liver and testis transcripts. The comparison with ortholog sequences revealed that about a half of the contigs were assembled to their full length (Additional file 1: Figure S1e). The mean and median ratios resulted to be 0.72 and 0.86, respectively. Approximately a quarter of the high quality transcript set is expected to be composed by highly fragmented contigs (covering less than 50% of the expected length).
Total number of high quality assembled reads
Number of created contigs
Number of bases in contigs
Average length (bp)
Median length (bp)
N50 (see the text for definition)
N80 (see the text for definition)
N90 (see the text for definition)
Longest contig (bp)
Number of contigs > 5 Kb
Number of contigs > 10 Kb
The 25 most abundant InterPro domains revealed by the InterProScan annotation of the high quality coelacanth transcript set
Number of detected contigs
Protein kinase, catalytic domain
Zinc finger, C2H2
Serine/threonine-/dual-specificity protein kinase, catalytic domain
Zinc finger, C2H2-like
Tyrosine-protein kinase, catalytic domain
Protein kinase-like domain
Ankyrin repeat-containing domain
Pleckstrin homology domain
Fibronectin, type III
Src homology-3 domain
Serine/threonine-protein kinase, active site
Zinc finger, RING-type
Small GTPase superfamily, Ras type
Zinc finger C2H2-type/integrase DNA-binding domain
RNA recognition motif domain
Zinc finger, RING/FYVE/PHD-type
WD40/YVTN repeat-like-containing domain
TEs in the coelacanth transcriptome
RNA-seq mapping on the African coelacanth genome
RNA-seq mapping on L. chalumnae genome statistics
Mapping on annotated genes
Reads mapped in pairs
Reads mapped in broken pairs
Fragments mapped on exon-exon junctions
Fragments mapped on exon-intron junctions
Total fragments mapped on exons
Total fragments mapped on introns
Mapping on genomic scaffolds
Total number of reads mapping on genomic scaffolds
Reads mapped on non-annotated genes
RNA-seq mapping on L. menadoensis transcriptome
Percentage of counted fragments
Reads mapped in pairs
Reads mapped in broken pairs
Percentage of mapped reads
Reads not mapped
List of the 20 most expressed genes in liver
Liver expression level (FPKM*)
Testis expression level (FPKM*)
α-2-macroglobulin like 1
Inner centromere protein A
Vitellin layer outer membrane 1
Fibrinogen α chain
Fibrinogen β chain
Elongation factor 1-α
Serine proteinase inhibitor Kazal type 2
ATP synthase F0 subunit 6
Ferritin heavy polypeptide 1
Fibrinogen gamma polypeptide
Serum amyloid P
List of the 20 most expressed genes in testis
Testis expression level (FPKM*)
Liver expression level (FPKM)
Prostaglandin H2D isomerase
Y-box transcription factor
Sjogren syndrome nuclear autoantigen 1
Tubulin α chain, testis-specific
Elongation factor 1-α
H\ACA ribonucleoprotein complex, subunit 2
Tubulin β 2-C
ATP synthase F0 subunit 6
Sperm nuclear basic protein PL-I
Ferritin heavy chain
TP-53 target gene protein-like
Ribosomal protein S6
High mobility group protein B2
De novo transcriptome assembly
The advent of NGS technologies has had an outstanding impact on many fields of biology, including genetics , functional and comparative genomics [46, 47] and molecular ecology . The remarkable potential range of application of these techniques will likely move the focus of high throughput sequencing in the near future from genome and transcriptome sequencing to the use in clinical medicine and diagnostics [49, 50, 51]. Due to its potential application to deep RNA-seq, NGS has been praised as a cost-effective and revolutionary tool for transcriptomics since the very early stages of its development . Although great technical advances have been made in a relatively short lapse of time in the improvement of both sequencing technologies and sequencing data management, significant challenges linked with RNA-seq still remain unsolved. The major computational issues in the management of NGS data is represented by the reliable de novo assembly of transcriptomes [53, 54]. This is a complex task, due to presence of alternatively spliced transcript variants, gene duplications, allelic polymorphisms and noise due to suboptimal sequence quality, which often leads to the generation of a high number of short and poorly assembled contigs .
The massive amount of sequencing reads obtained from L. menadoensis liver and testis allowed us to apply stringent filtering criteria, both in the processing of raw sequencing reads and in the filtering of assembled contigs, in order to achieve a final set of high quality transcripts and to overcome the most common pitfalls of NGS assemblies. We chose to use the Trinity assembler, able to efficiently recover full length transcripts across a broad range of expression levels but somewhat redundant because of the inclusion of alternatively spliced variants . The Trinity assembly was used as a reference sequence set to be appropriately refined and enriched, whenever possible, by a second de novo assembly performed with the assembler included in the CLC Genomic Workbench. The choice of integrating the Trinity output with the CLC assembly was made because of the empirical observation of a more effective reconstruction of full length transcripts and because of the operational speed of its assembly algorithm, based on de Bruijn graph. As this method, although extremely fast, is known to produce assemblies which are quite fragmented in comparison with other assemblers , only a selected set of assembled contigs was used to improve the Trinity assembly, with a particular emphasis on protein-coding transcripts.
De novo assembly quality assessment
One of the problems most commonly arising from the de novo assembly of RNA-seq data is represented by sequence fragmentation . In order to minimize this problem, as described in the methods section, all the contigs with an average coverage lower than 5 were removed prior to further analysis, reducing the number of contigs from 105,653 to a final set of 66,308 high quality sequences, reducing the fraction of short sequences with a proportional enrichment in longer transcripts (Figure 3). Furthermore, the contig processing strategy we used, graphically summarized in Figure 1, contributed to significantly reduce the sequence redundancy of the assembly (which was calculated to be 17.39%), in respect with the Trinity output (Figure 2). Although several factors can negatively influence the outcome of a de novo transcriptome assembly, affecting the reconstruction of full length sequences, the ortholog hit ratio analysis highlighted good mean and median ratio values and a high proportion of transcripts assembled to their full length (Additional file 1: Figure S1e). Therefore, despite the inevitable presence of broken transcripts, the results of the de novo assembly were extremely satisfying, highlighting that about half of the sequences, contained in the final set of transcripts, was assembled to the full length or very close to it and that just about a quarter of the contigs were resulting from highly fragmented transcripts.
The analysis of the top hit species distribution resulting from BLAST (Figure 5) reveals Gallus gallus as the first species, followed by Xenopus tropicalis. The first teleost fish of the list, Danio rerio, ranked at the sixth place of the list, after the mammal Monodelphis domestica. These results are clearly biased towards organisms whose genome has been largely and deeply studied and annotated, mainly because of the higher quality of genome assemblies, of the more accurate gene predictions and of the higher number of protein sequences deposited in public sequence databases. Nevertheless, the absence of a prominent species with extended sequence homologies to L. menadoensis, neither in fishes nor in tetrapods, is consistent with the phylogenetic placement of lobe-finned fishes. However, for an in-depth analysis of the phylogenetic relationship between coelacanth and these two major vertebrate groups, and for an extended discussion on the implications on tetrapod evolution we refer to the whole-genome scale analysis reported by Amemiya and colleagues .
Compared to those having a positive BLAST result, a higher number of contigs (42,667) were annotated by InterProScan. Since the presence of InterPro domains is a strong indication of coding sequences, these data point out that 64.35% of the coelacanth de novo assembled contigs are coding for proteins characterized by known InterPro domains.
Divergence between the two coelacanth species
The evolutionary divergence between the two species of coelacanth has been a subject of debate for a long time. Although the complete sequencing of mitochondrial DNA highlighted a sequence identity of 96%, variable divergence times have been proposed, ranging from 6 to 40 Mya [7, 10]. The sequencing of the genome of L. chalumnae permitted to extend the comparison to large genomic regions with the available BACs of L. menadoensis, evidencing an identity of 98.7%. Our transcriptomic data offered the opportunity to assess the sequence identity within the coding regions, which resulted to be surprisingly high, standing at 99.73% (see Additional file 1 for methodological details). Nevertheless, while the massive amount of information gathered permits a rather easy calculation of divergence rates, the estimate of divergence time is not such a trivial task, given the uncertainties related to the calibration of a molecular clock : the slow generation time, the absence of other closely related living species, and the allegedly low rate of molecular evolution of coelacanths.
Although both the genic and genomic divergence between the two species are similar to those observed between human and chimp , which diverged 6–8 Mya, we also performed a phylogenomic comparison between coelacanths and T. rubripes/T. nigroviridis, two organisms with a completely sequenced genome, which evolved in an aquatic environment and were subject to somewhat similar selective pressure and whose divergence, based on paleontological evidence, is estimated between 32.25 and 56 Mya . Based on the alignment of approximately 40 Kb of ortholog transcribed sequences in Latimeria, we estimated the substitution rate to be 0,49/100 bp, whereas the substitution rate in the same set of selected genes in the Takifugu/Tetraodon pair was approximately 16 times higher (8,25/100 bp). A simple molecular clock correlation would indicate that the dating of divergence between African and Indonesian coelacanth should be placed between 1.9 and 3.3 Mya. Nevertheless the slower rate of molecular evolution, as well as the considerably longer generation time have to be taken into account, likely moving the divergence time much back in time to a date close to the lower end of the estimates based on mitochondrial DNA.
In metazoans repeat elements cover a considerable part of genomes. Moreover, the transcriptome analysis allowed the evaluation of the transcriptional activity of transposable elements (TEs) which play a key role in gene evolution and genome plasticity. TEs are divided in two classes: Class I is composed of Long Terminal Repeat retrotransposons (LTRs) and Non-LTRs (subdivided in LINEs and SINEs); Class II is composed of DNA transposons.
The RepeatMasker analysis revealed that 11.17% of contigs harbored a repeat and the most represented elements belong to SINE families. The latter result is in line with the studies performed in the Indonesian coelacanth genome [31, 32, 33], in which the activity of SINE elements in Latimeria was inferred. The identification of LF-SINEs and DeuSINEs in L. menadoensis transcriptome might confirm that these elements are actually active. Moreover, since their conservation in higher vertebrates, this movement might predate the common ancestor of Crossopterygians, for more than 400 Myr. On the other hand the occurrence of complete SINEs in contigs bearing protein-coding sequence might reveal the gain of new functional roles (exaptation) , as previously described in tetrapod genomes.
Concerning the activity of LINEs, the second most represented interspersed elements, the InterProScan analysis identified amino acidic domains linked to these autonomous retrotransposons. Chicken Repeat 1 (CR1) elements are the most abundant among LINEs. In contrast to the G. gallus genome where these elements are predominant but, with very few exceptions, nonfunctional , in Latimeria they seem to be active. Fragmented LTRs and ERVs (Endogenous RetroViruses) were identified in the transcriptome. This result is in agreement with the analyses on Foamy-like retroviral elements recently discovered in L. chalumnae genome by Han and Worobey  showing many frame-shifts and stop codons. The abundance of the Harbinger DNA transposons in L. menadoensis genome  suggests that Class II elements represent a remarkable fraction of the coelacanth TEs, however our analysis indicates that few DNA elements are expressed. This discordance may be related to the lack of coelacanth specific sequences belonging to this class in the RM database or to their propagation mode. The identification of mobile elements in transcriptomes sheds light on an unexpected genome dynamicity in an organism considered to be a living fossil even from a molecular point of view [21, 25].
RNA-seq mapping on the African coelacanth genome
More than half of the sequence data generated by the RNA-seq of L. menadoensis liver and testis mapped on the genes annotated by Ensembl on the L. chalumnae genome (Table 4), revealing an overall good annotation of the African coelacanth transcripts, even though in some cases the RNA-seq data produced in this study could provide some evidence of additional exons and alternative splicing, given that the 6.97% of the reads corresponded to regions annotated as introns.
Nevertheless, a rather high proportion of reads, close to 40%, could not be mapped on the genes annotated by Ensembl, consistently with the strategy adopted by Ensembl for the annotation pipeline, which is automated and mainly focused on protein-coding gene models. In fact, almost the 35% of the sequencing reads could map on the assembled genomic scaffolds outside from the annotated gene boundaries, revealing that a relevant portion of the transcripts expressed in the Indonesian coelacanth liver and testis might correspond to genes which were not annotated by the Ensembl RNA-seq annotation pipeline (Table 4). Therefore, the deep RNA-seq of liver and testis can be considered as a fundamental tool for the discovery of novel genes, and in particular, of many not yet annotated non-coding transcripts. As a matter of fact, the NGS sequence data will certainly provide a fundamental source of information for the study of atypical transcripts originated by trans- and circular splicing events, a topic which is currently under investigation (Stadler, personal communication).
Slightly more than 3 million reads did not map on the genomic scaffolds. These sequence data could either correspond to mitochondrial RNA (which was esteemed to account for 3.03% and 2.08% of the reads in liver and testis, respectively) or to coding genes harbored in L. chalumnae genomic regions which were not successfully assembled.
Liver and testis transcriptomes comparison
The expression profile of the two organs analyzed was expected to be quite different, considering the largely different tasks they perform and the highly specialized cellular types involved. This difference was immediately evident by the graphical representation of the expression scatter plot (Figure 6). Among the 20 most expressed transcripts in liver, a large fraction is constituted by plasma proteins, whose synthesis is carried out by this organ (such as the three chains constituting fibrinogen, α-2 macroglobulins, apolipoproteins, hemopexin, vitronectin, lipocalin, serum amyloid P and serum albumin) and which constitute the core of the highly expressed genes in this tissue (Table 6). On the other hand testis invests a significant portion of transcription in genes involved in chromatin and cytoskeletal rearrangements. In particular, a testis-specific histone results to be expressed almost 25 times more than the second most expressed gene, prostaglandin H2D-isomerase, and accounts for about 18% of the global testis transcription. A significant amount of the total gene expression is derived from the synthesis of messengers of protamines, used for the replacement of histones and the effective packaging of DNA in the sperm acrosome . The expression of genes involved in chromatin rearrangement is strictly regulated, as testis-specific histones are transiently and selectively expressed only during specific phases of spermiogenesis . In fact, also sperm nuclear basic protein PL-I and histone H1x-like figure among the most representative testis genes. Furthermore a relevant number of other testis-specific genes can be linked to the meiotic process carried out in the testicular germinal cells and to the cytoskeletal rearrangements consequently required (tubulin α chain testis-specific, tubulin β 2-C and centrin-1). Moreover, specific types of microtubules are required for the correct assembly of mitotic and meiotic spindles and of the flagellum axoneme of spermatozoa [64, 65]. The tubulin genes highly expressed in testis are likely linked to these functions.
Although the expression of a large fraction of genes was clearly strictly tissue-specific, thanks to the sequencing depth applied, a relevant overlap between the two transcriptomes (77.37%) was observed. The issue of transcriptome richness was addressed by analyzing the relative contributions of the expression of each contig to the total observed transcription in the two tissues, and in RNA-seq data of L. chalumnae muscle (Figure 7). Highly specialized tissues are expected to invest the most gene expression in a selected set of genes, thus being transcriptionally poor, whereas tissues involved in many different biological functions, characterized by the coexistence of many different cell types are expected to be transcriptionally rich, as they express a broader range of transcripts. Within this picture, muscle is a classic example of a highly specialized tissue, expressing at particularly high levels a limited set of genes involved in the structural organization of muscle fibers and responsible of contraction. Testis expresses a broader range of transcripts, which is in agreement with the assumption that cells in this organ are characterized by drastic morphological and functional changes linked to gamete generation: in this scenario chromatin structure is constantly rearranged and gene expression may therefore be substantially variable during the different stages of spermatogenesis . Despite being transcriptionally poorer than testis, the RNA-seq of liver likely brought a remarkable amount of additional data as pointed out by the extent of the overlap between the two transcriptomes (Figure 8). Therefore, although the RNA-seq of two different organs like testis and liver was particularly effective to approach the coverage of a complete transcriptome, the incomplete overlap observed between the two L. menadoensis transcriptomes and the L. chalumnae muscle suggests that the sequencing of RNAs obtained from additional samples would be useful in order to effectively describe the complete transcriptome of this organism.
The de novo assembly of the Indonesian coelacanth L. menadoensis liver and testis transcriptomes here described contains complete information concerning the expressed sequences involved in the important biological processes held by liver and testis tissues, such as metabolism and reproduction [30, 67]. Furthermore, thanks to the high sequencing depth applied and to the broad range of transcripts expressed, the assembly also contains a great amount of sequence data originated from genes which are not directly linked to liver and testis, permitting to obtain a good overview of the overall coelacanth transcriptome.
In addition, the RNA-seq data generated in the present work provided a valuable resource for the Ensembl annotation of the recently sequenced genome of the African coelacanth L. chalumnae. In fact, the paired-end sequence data from liver and testis were processed through the Ensembl RNA-seq pipeline, generating 9,364 high confidence gene models, which permitted to improve the genome annotation by the addition of 547 new genes and 1,782 related transcripts . This was a considerable improvement with respect to the previous annotation, based both on sequence similarity and on the data provided by the RNA-seq of L. chalumnae muscle, a tissue transcriptionally poor if compared with liver and testis.
Overall, the RNA-seq data here described represent a valuable source of information for the improvement of the genetic and genomic studies of coelacanths, including the analysis of TE activity, the occurrence of atypical splice events and the divergence between the African and Indonesian congeners. The importance of the obtained results mainly regards the origin of terrestrial vertebrates since the key position of Latimeria as the unique extant representative of the lineage from which tetrapods should have arisen. The transcriptome data indicate a higher affinity of this species to several terrestrial vertebrates, even if only in few species the genome and transcriptome have been exhaustively analyzed. Therefore, the de novo transcriptome assembly, for the quality of information it generated, may certainly be considered a step ahead in helping to understand the biology of this living fossil.
On 16th September 2009 a coelacanth was found in a shark net near Talise Island, Indonesia. This male, weighing 27 kg with a total length of 116 cm, was the fifth specimen of L. menadoensis ever caught since the discovery of this species in 1997. The animal was moved to the Faculty of Fisheries and Marine Science, University of Sam Ratulangi, Manado (Indonesia), where the liver and testis tissues used in this study were collected immediately after death and directly fixed in RNAlater (Applied Biosystems, Warrington, UK). Tissue samples were shipped to the Science Faculty, Università Politecnica delle Marche, Ancona, Italy, under the Convention on International Trade in Endangered Species (CITES; permit no. IT/IM/2009/MCE/01585-2009/19713).
Total RNA was isolated from liver and testis using TRIzol reagent (Invitrogen, Carlsbad, CA). Following the treatment with DNase I Amplification Grade (Sigma, Steinheim, Germany), an aliquot of the extracts was used to assess the quality and quantity of RNA by spectrophotometric and capillary electrophoretic analysis. The liver RNA sample resulted to have a 260/280 nm absorbance ratio of 1.74, a 260/230 nm absorbance ratio of 0.94 and a RNA integrity number (RIN, estimated with an Agilent2100 Bioanalyzer) of 6.6. The testis RNA sample resulted to have a 260/280 and a 260/230 nm absorbance ratios of 1.89 and 1.23, respectively, with a RIN of 7.
Sequencing of the liver and testis transcriptomes
Messenger RNA selection and cDNA library preparation were performed by the Istituto di Genomica Applicata (IGA, Udine, Italy). The sequencing of the libraries was performed on an Illumina Genome Analyzer II platform (San Diego, California). Briefly, the poly-A mRNAs were selected using magnetic beads-linked oligo (dT) probes. The fragmentation was obtained with divalent cations. cDNA was synthetized and Illumina sequencing adapters were then ligated to the fragments, according to the manufacturer’s protocol. A smear of ligated fragments of 150 to 400 bp of length was selected by size and excised from an agarose gel. The sequencing of the cDNA libraries was performed on a flow cell using a 100-cycles paired-end strategy.
Data processing and de novo assembly of Latimeria menadoensis transcriptome
The raw sequencing reads were trimmed by removing Illumina adapter sequences and low quality bases (the quality limit was set to 0.05). The resulting trimmed sequences shorter than 75 bp were discarded. All the reads originated from ribosomal RNA were also removed prior to the assembly step.
The de novo assembly of the processed reads was performed with a combined approach, by integrating the outputs of two different methods, which have been specifically developed for de novo assembly of short reads: Trinity  and the commercially available CLC Genomic Workbench 4.5.1 (CLC Bio, Katrinebjerg, Denmark). At first, the two assemblies were performed separately using as input the same sequence set, comprising both liver and testis sequence data. The schematic summary of the procedure used for integrating the outputs of the two assemblers is detailed in Additional file 1.
To ensure the creation of a highly reliable set of assembled transcripts, contigs covered by a low number of reads were discarded, following a global mapping of the complete set of both liver and testis filtered reads (CLC Genomic Workbench, mismatch/insertion/deletion costs set at 2/3/3, length fraction/similarity set to 0.75/0.95). All the transcripts showing an average coverage <5 were considered as possible fragments of longer transcripts, not reliable enough to be included in the high quality coelacanth transcript collection, and were therefore discarded. Only transcripts longer than 249 bp were kept.
Assembly quality assessment
In order to assess the quality of the contigs obtained with the filtering procedure in respect with the non-filtered set, the sequences were grouped into categories according to their sizes (intervals of 100 bp) and the relative abundance of each category was plotted in a histogram. The distributions of transcript lengths pre- and post-filtering were compared.
The sequence redundancy was estimated by the RNA-seq mapping of the reads from both tissues on the contigs created by the original Trinity assembly and to the filtered and non-filtered sets of contigs obtained with the Trinity and CLC combined approach. The RNA-seq analysis tool included in the CLC Genomic Workbench was used for this purpose (minimum length fraction and minimum similarity fraction were set at 0.75 and 0.95, respectively). The total number of reads mapped and the set of reads mapping non-specifically (matching on more than 1 contig) were compared, in order to evaluate the improvement of the assembly quality obtained with the processing steps. Sequence redundancy was calculated as the percentage of reads mapping not-specifically.
The total number of reads originated from mitochondrial RNA was assessed by the mapping of the filtered reads set to the deposited mitochondrial DNA sequence of L. menadoensis (Genebank accession: NC_006921.2). The mapping was performed with the CLC Genomic Workbench, using the same settings described above to estimate sequence redundancy.
Transcript functional annotation
The filtered transcripts were annotated with Blast2GO (version 2.4.4, http://www.blast2go.com/), a tool specifically developed for the annotation of novel sequence sets . Sequence similarity was evaluated with BLASTx  against the NCBI non-redundant (nr) protein database using an e-value cutoff of 1×10-6. The presence of conserved domains was researched and annotated using InterProScan  on the six possible translation frames of each contig. Contigs were functionally annotated according to the Gene Ontology (http://www.geneontology.org/) nomenclature. GO terms were assigned to each transcript and annotated according to the level 2 of the Cell Component, Molecular Function, and Biological Process categories. Furthermore, in order to identify by homology transposable elements and repeated sequences from a database of vertebrate repeats, the contigs were analyzed with RepeatMasker (http://www.repeatmasker.org).
Mapping on L. chalumnae genome
The liver and testis sets of filtered reads were mapped on the annotated L. chalumnae genome Ensembl release e!67 using the Genomic Workbench 4.5.1 RNA-seq tool, assuming a minimum length fraction of 0.75 and a minimum similarity fraction allowed of 0.95. As the sequence similarity between L. menadoensis and L. chalumnae was estimated to be 99.73%, the mapping parameters used were supposed not to significantly influence the mapping outcome. The allowed paired-end read distance was set between 100 and 350 bp. Based on gene annotations, it was possible to categorize the fragments as mapping within exons, within introns and on exon-exon or exon-intron junctions. Furthermore, the number of reads mapping on non-annotated genomic regions was also calculated, to assess the amount of sequence data accounting for the expression of non-annotated genes.
The liver and testis filtered reads were separately mapped to the high quality set of the assembled contigs to assess the expression values in the two tissues. The mapping was carried out with the Genomic Workbench 4.5.1 RNA-seq tool, with a minimum length fraction allowed of 0.75 and a minimum similarity fraction allowed of 0.95. Paired-end read distance was considered to be comprised between 100 and 350 bp. Only intact sequence pairs (fragments) mapping were counted and expression values were calculated as FPKM (Fragments Per Kilobase per Million fragments mapped). Besides liver and testis, also RNA-seq data obtained from the African congener L. chalumnae muscle (Sequence Read Archive sample ID: SRS283232) were used for comparison purpose (the muscle transcriptome was de novo assembled with Trinity and processed to remove redundancy exactly as previously described for the liver and testis assembly).
The transcriptome richness was graphically inspected by plotting the cumulative expression of the 1,000 most transcribed genes in each tissue, normalized on the total expression per tissue. The overlap between liver and testis transcriptomes was estimated by the comparison of the sets comprising the 1,000 most expressed genes per tissue. The comparison was also extended to the L. chalumnae muscle transcript set generated in the frame of the African coelacanth genome sequencing project .
Availability of supporting data
[SRA: SRS362269, SRA: SRS362270]
AP, MG,and GDM used “University research funds”. AC, EO, MB, MF, DMM, and MAB used the Italian Ministry for University and Research grant (UNIVPM6170,UNIVPM6823). GS, FB, and AMF used “University research funds”. JA, FDP, and JT-M used the NIH grant (U54 HG003067-08).
The authors of the Dipartimento di Scienze della Vita e dell’Ambiente, Università Politecnica delle Marche (Ancona, Italy) are affiliated to Istituto Nazionale Biosistemi e Biostrutture (INBB).
- 2.Maisey JG: Discovering fossil fishes. 1996, New York: HoltGoogle Scholar
- 6.Pouyaud L, Wirjoatmodjo S, Rachmatika I, Tjakrawidjaja A, Hadiaty R, Hadie W: Une nouvelle espèce de coelacanthe. Preuves gènètiques et morphologiques. Comptes Rendus de l'Académie des Sciences - Series III - Sciences de la Vie. 1999, 322: 261-267. 10.1016/S0764-4469(99)80061-4.CrossRefGoogle Scholar
- 17.Gorr T, Kleinschmidt T, Sgouros JG, Kasang L: A “living fossil” sequence: primary structure of the coelacanth (Latimeria chalumnae) hemoglobin - evolutionary and functional aspects. Biol Chem H-S. 1991, 372: 599-612.Google Scholar
- 21.Amemiya CT, Powers TP, Prohaska SJ, Grimwood J, Schmutz J, Dickson M, Miyake T, Schoenborn MA, Myers RM, Ruddle FH, Stadler PF: Complete HOX cluster characterization of the coelacanth provides further evidence for slow evolution of its genome. P Natl Acad Sci USA. 2010, 107: 3622-3627. 10.1073/pnas.0914312107.CrossRefGoogle Scholar
- 28.Modisakeng K, Jiwaji M, Pesce E-R, Robert J, Amemiya C, Dorrington R, Blatch G: Isolation of a Latimeria menadoensis heat shock protein 70 that has all the features of an inducible gene and encodes a functional molecular chaperone. Mol Genet Genomics. 2009, 282: 185-196. 10.1007/s00438-009-0456-4.CrossRefPubMedGoogle Scholar
- 35.Sudarto , Lalu XC, Kosen JD, Tjakrawidjaja AH, Kusumah RV, Sadhotomo B, Kadarusman , Pouyaud L, Slembrouck J, Paradis E: Mitochondrial genomic divergence in coelacanths (Latimeria): slow rate of evolution or recent speciation?. Mar Biol. 2010, 157: 2253-2262. 10.1007/s00227-010-1492-7.CrossRefGoogle Scholar
- 38.Amemiya CT, Alfőldi J, Lee A, Fan S, Brinkmann H, Maccallum I, Braasch I, Manousaki T, Schneider I, Rohner N, Organ C, Chalopin D, Smith J, Robinson M, Dorrington R, Gerdol M, Aken B, Biscotti M, Barucca M, Baurain D, Berlin A, Blatch G, Buonocore F, Burmester T, Campbell M, Canapa A, Christoffels A, De Moro G, Edkins A, Fan L: The African coelacanth genome provides insights into tetrapod evolution. Nature. 2013, 496: 311-316. 10.1038/nature12027.PubMedCentralCrossRefPubMedGoogle Scholar
- 39.Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, Di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011, 29: 644-652. 10.1038/nbt.1883.PubMedCentralCrossRefPubMedGoogle Scholar
- 44.Lanfranchi G, Muraro T, Caldara F, Pacchioni B, Pallavicini A, Pandolfo D, Toppo S, Trevisan S, Scarso S, Valle G: Identification of 4370 expressed sequence tags from a 3′-end-specific cDNA library of human skeletal muscle by DNA sequencing and filter hybridization. Genome Res. 1996, 6: 35-42. 10.1101/gr.6.1.35.CrossRefPubMedGoogle Scholar
- 53.Cahais V, Gayral P, Tsagkogeorga G, Melo-Ferreira J, Ballenghien M, Weinert L, Chiari Y, Belkhir K, Ranwez V, Galtier N: Reference-free transcriptome assembly in non-model animals from next-generation sequencing data. Mol Ecol Resour. 2012, 12: 834-845. 10.1111/j.1755-0998.2012.03148.x.CrossRefPubMedGoogle Scholar
- 55.Feldmeyer B, Wheat C, Krezdorn N, Rotter B, Pfenninger M: Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance. BMC Genomics. 2011, 12: 317-10.1186/1471-2164-12-317.PubMedCentralCrossRefPubMedGoogle Scholar
- 59.Gould SJ, Vrba ES: Exaptation - a missing term in the science of form. Paleobiology. 1982, 8: 4-15.Google Scholar
- 63.Martianov I, Brancorsini S, Catena R, Gansmuller A, Kotaja N, Parvinen M, Sassone-Corsi P, Davidson I: Polar nuclear localization of H1T2, a histone H1 variant, required for spermatid elongation and DNA condensation during spermiogenesis. P Natl Acad Sci USA. 2005, 102: 2808-2813. 10.1073/pnas.0406060102.CrossRefGoogle Scholar
- 67.Forconi M, Canapa A, Barucca M, Biscotti MA, Buonocore F, Fausto AM, Makapedua DM, Pallavicini A, Gerdol M, De Moro G, Scapigliati G, Olmo E, Schartl M: Characterization of sex determination and sex differentiation genes in Latimeria. PLoS One. 2013, 8: e56006-10.1371/journal.pone.0056006.PubMedCentralCrossRefPubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.