Phylogenomics of non-model ciliates based on transcriptomic analyses

Ciliates are one of the oldest living eukaryotic unicellular organisms, widely distributed in the waters around the world. As a typical marine oligotrich ciliate, Strombidium sulcatum plays an important role in marine food webs and energy flow. Here we report the first deep sequencing and analyses of RNA-Seq data from Strombidium sulcatum. We generated 42,640 unigenes with an N50 of 1,451 bp after denovo assembly and removing rRNA, mitochondrial and bacteria contaminants. We employed SPOCS to detect orthologs from S. sulcatum and 17 other ciliates, and then carried out the phylogenomic reconstruction using 127 single copy orthologs. In phylogenomic analyses, concatenated trees have similar topological structures with concordance tree on the class level. Together with phylogenetic networks analysis, it aroused more doubts about the placement of Protocruzia, Mesodinium and Myrionecta. While epiplasmic proteins are known to be related to morphological characteristics, we found the potential relationship between gene expression of epiplasmic proteins and morphological characteristics. This work supports the use of high throughput approaches for phylogenomic analysis as well as correlation analysis between expression level of target genes and morphological characteristics.


INTRODUCTION
Ciliates are unique among unicellular organisms in having two types of nuclei thus separating the germline and somatic functions and are therefore commonly used to study genome structure and gene expression (Aury et al., 2006). Over the last ten years, high throughput sequencing has made ciliate genomic research available on a large scale, and the genomes of several species have been sequenced and analyzed, e.g. Tetrahymena thermophila (Eisen et al., 2006), Paramecium tetraurelia (Auryet al., 2006), Ichthyophthirius multifiliis (Coyne et al., 2011) and Oxytricha trifallax (Swart et al., 2013). In addition, the transcriptomes of 13 ciliate species representing 10 genera and five classes are now available (Table  S1) including those of Tetrahymena thermophila (Xiong et al., 2012) and Chilodonella uncinata (Grant et al., 2012).
Here we present Illumina Hiseq 2000 RNA-Seq data from Strombidium sulcatum. The transcriptome data are analyzed, including assembly and annotation. Phylogenomic reconstruction is carried out using 127 single copy orthologs, and evolutionary relationships are revealed by concatenated and concordance tree analyses. Thereafter, we investigate the transcriptional expression of epiplasmic proteins in S. sulcatum to determine whether there is a relationship between morphological characteristics and gene expression of epiplasmic proteins.

Assembly
A total of 24,958,136 paired-end reads (101 bp for each read) were produced for S. sulcatum (5,041,543,472 bases, % GC: 52%). After filtering low quality reads, 95.55% reads were retained. As rRNA had been depleted after extraction of the total RNA, it hardly existed in RNA-seq data (0.60%). The remaining rRNA sequences were removed for downstream analyses.
The remaining sequences were assembled with Trinity into 44,633 contigs (Max length: 18,235 bp, Min length: 201 bp, N50: 1498 bp). 94.33% of reads could be properly aligned to contigs in pairs. In order to eliminate assembly errors, 157 trans chemiras were detected and removed. Subsequently, the contigs were clustered by high similarity threshold (98%) using CD-HIT and one representative contig for each group was retained (43,437 contigs remained). After linking contigs whose terminals were perfectly matched, the contig number dropped to 43,009. These contigs were then BLASTed to Oxytricha trifallax whole genome, and BLAST hit results were contrasted to O. trifallax mitochondrial genomic peptides. This revealed that 18 contigs were related to the mitochondrial genome. Given that we used bacteria as the food source during the cultivation of S. sulcatum, the assembled contigs probably included a few contaminated sequences from bacteria. Based on BLAST results, we determined that 348 contigs are bacterial. After elimination of the contaminating rDNAs and bacterial sequences, raw reads from S. sulcatum were assembled into 42,640 unigenes. Of these unigenes, 27,965 had strong BLAST hits to eukaryotic sequences, and 16,668 had no strong affinity to any domain.

Gene content in transcriptomes
A variety of methods were used to assess the content of the Strombidium sulcatum transcriptome including determining closest relatives for ciliate proteins, identifying most highly expressed genes, CDS region prediction and protein annotation. Strombidium sulcatum's transcriptome had a significant number of BLAST hits to those of S. inclinatum and S. rassoulzadegani and to the genome of Oxytricha trifallax. We identified highly expressed genes as contigs occupping 10% or more of sequence reads with high FPKM (Fragments Per Kilobase of transcript sequence per Millions base pairs sequenced). The functions of about 70% of the contigs with the highest read numbers are unclear. The majority of them have homologous contigs (identity > 90%) in the other 14 ciliates represented in the transcriptome dataset, the exceptions being comp7405_c0 and comp10361_c0, each of which had a blast top hit identity of only about 30%.
After CDS prediction, 40,746 CDS regions were identified from 42,640 unigenes with an average length of 471 aa. At the same time, 40,746 peptide sequences were predicted. With these predicted peptides, we conducted a one-way BLAST search to the GenBank ciliate database against all proteins contained therein; 25,676 of S. sulcatum predicted peptides had BLAST results and could be grouped into 16,553 gene groups, 9765 of which were found in the genomes of the other four ciliates (data not shown). The identity distribution of the top hits is shown in Fig. 1. The remaining S. sulcatum genes did not satisfy the pairing cut-off criteria (E-value ≤ 1 × 10 −5 ).
Annotation of predicted protein products were matched to models and Swiss-Prot accessions. 20,668 HSPs (Highly Scalable Parallel) were found when blasted to Swiss-Prot database, and each match 16,315 domains in Pfam-A database, 12,201 domains in Superfamily database, 8594 domains in TIGRFAMs database respectively. Assessment of the top BLAST hits' species distribution for S. sulcatum is shown in Fig. 2.

Ortholog detection and phylogenomics
A useful approach for surveying the protein-coding gene landscape of a newly sequenced genome is to group genes by orthology, which can provide guidance for functional annotation. Here, we grouped the proteins predicted from S. sulcatum transcriptome with four ciliate genomes (Oxytricha trifallax, Tetrahymena thermophila, Ichthyophthirius multifiliis and Paramecium tetraurelia) using a reciprocal BLAST hit (RBH) approach. After removing short (<50 aa) and less-conserved (<30% identity) proteins, 684 orthologs were identified as shared by them all. Strombidium sulcatum shared most orthologs with O. trifallax, which is consistent with current systematic arrangements based on morphology and traditional molecular phylogeny (Fig. 3A). In addition, 1700 orthologs were identified in the three Strombidium spp. for which transcriptome data are available with S. sulcatum sharing most orthologs with S. inclinatum (Fig. 3B). We detected further orthologs among the four ciliate genomes and the 14 ciliate transcriptomes. One hundred and twenty-seven orthologs shared by all 18 ciliates with BLAST identity ≥ 30% and match length ≥ 50 aa were retained to perform phylogenomic analysis. After the 127 ortholog concatenated alignment dataset was produced, and its model was selected as RtREV + I + G + F (gamma shape = 0.947; proportion of invariable sites = 0.063), a concatenated phylogenomic tree was calculated using both maximum likelihood and Bayesian inference methods (Fig. 4). According to the maximum likelihood tree based on 36,724 aa positions from 127 predicted peptide sequences, genera in the class Oligohymenophorea, which includes Tetrahymena, Ichthyophthirius and Paramecium, clustered as a clade with full support. The genus Platyophrya (class Colpodea) grouped with the class Oligohymenophorea with full support. The genus Litonotus (class Litostomatea) clustered with the class Colpodea and Oligohymenophorea with low support. The ambiguous genus Protocruzia, which is known to have an elevated rate of evolution, did not group within any of the major clades. However, most groups with morphological identity are recovered robustly.
The three species of Strombidium and Oxytricha trifallax clustered together with full support, and then grouped with Euplotes, Strombidinopsis and Favella. The Mesodinium (class Mesodiniea) did not group together with Myrionecta (class Mesodiniea). Instead, Myrionecta and Tiarina (class Prostomatea) formed a clade with moderate support that was sister to Mesodinium (class Mesodiniea) with strong support.
Coalescent analysis was also carried out and a concordance tree was estimated by consolidating all BI trees inferred from the 127 ortholog alignments (Fig. 4). The topology of the concordance tree was similar to that of the concatenated tree. However, Protocruzia's systematic placement was even closer to the class Heterotrichea, while other Spirotrichs clustered as a monophyletic group.
Split decomposition analysis of phylogenetic networks was performed to uncover all possible relationships among the ciliate lineages (Fig. 5). The neighbor-net graphs were calculated based on 36,724 aa from 127 orthologs. The placement of most classes in the networks analysis was generally consistent with both the concatenated and the concordance analyses, and revealed that: (1) Protocruzia was in a position that is separate from all existing classes, and; (2) Myrionecta was grouped with Tiarina which represent the class Prostomatea, as same as the result indicated in maximum likelihood tree.  Comparative transcriptomic analysis reveals the differential expression of epiplasmic proteins in ciliates

Strombidium sulcatum -Four other ciliates BLAST identity distribution
Strombidium is caharacterized by its pyriform body shape, welldeveloped adoral zone of membranelles, somatic cilia arranged as a girdle, and the presence of polysaccharide cortical plates in the posterior half of the cell (Fig. 1). In order to reveal the relationship between expression level difference and cellular geometry, morphological information of the thirteen ciliates (S. sulcatum plus other 12 ciliates whose RNA-seq transcriptome data were available from MMETSP, Condylostoma magnum was not involved because of its inadequate transcriptome data) is summarized in Table 1. Seven potential epiplasmic proteins related to morphology formation were

RESEARCH ARTICLE
Xiao Chen et al.     taken into account. Sequences from thirteen of all fourteen species' transcriptome data (Condylostoma magnum was omitted because of the great differences in its transcriptome compared with that of other ciliates) were BLASTed to the epiplasmic proteins dataset (BLASTP, E-value < 1 × 10 −2 , identity > 30%) and homologs were identified in each species. These species, which represented six classes based on morphological characters, were always separated into two groups based on the distinction of one characteristic for each time Table 1. Then, rank-sum tests were performed to both expression level and number of types of epiplasmic protein under several conditions of different group divisions. Some notable results suggest: (1) Strombidium spp. have significantly higher expression levels of alveolin than other ciliates (Fig. 6A), including other spirotrichs (Fig. 6B); (2) ciliates that possess extrusomes (e.g. Strombidium sulcatum, see (Song et al., 2000) have significantly higher expression levels of alveolin than those that lack extrusomes (Fig. 6C); (3) spirotrichs have significantly lower expression levels of EPC than other ciliates (Fig. 6D), although EPC expression was significantly higher in Strombidium spp. than in other spirotrichs (Fig. 6E); (4) ciliates such as spirotrichs with a highly specialized AZM and well-developed cytostome structure, have significantly lower expression levels of EPC than those whose circumoral kinetids and cytostome structure were less well-developed or degenerated ( Fig. 6F and 6G). On the other hand, the difference of epiplasmic protein homolog types does not have significant impact on morphological characteristics of these ciliates.

Non-model ciliate transcriptome assembly
A problem that has been largely ignored by previous de novo transcriptome analyses is the creation of chimeras (Yang & Smith, 2013). Chimeras can come from misassembly of short reads or PCR-induced recombination during library preparation. Chimeras may also be real biological products from gene fusion or transsplicing. Because most chimeras are trans chimera in Trinity assembly results, and cis chimera detection may lead to false identification according to Yang and Smith (2013), we simply removed all trans chimeras. A second problem in transcriptome assembly of ciliates that cannot be grown axenically is that there may be mitochondrial as well as bacterial transcript-related reads mixed in RNA-seq data. In order to ensure that final unigenes covered only S. sulcatum nuclear transcripts, we BLASTed contigs to databases consisting of Oxytricha trifallax mitochrondrial genome and bacterial genomes because Strombidium is closely related to Oxytricha. Although mitochondrial genomes varied significantly among different classes, genes were poorly conserved even among closely related genera, and the gene content and gene order also varied greatly (Swart et al., 2012), only 18 contigs were identified as transcripts related to mitochondria.

Ortholog detection
The prediction of orthologs and paralogs is a basic step in transcriptome studies, and is usually used to identify the core and auxiliary genes among related organisms (Yang & Smith, 2013). When predicted peptides of four ciliate genomes and the S. sulcatum transcriptome are collapsed into ortholog sets, it is found that S. sulcatum and O. trifallax share more unique orthologs with each other (2005) than are shared between S. sulcatum and any of: T. thermophila (231), P. tetraurelia (440), I. multifiliis (166). Within the scope of ciliates, this finding provides direct evidence for the close evolutionary relationship between S. sulcatum and O. trifallax, and is much stronger than phylogenetic analyses based on sequence data for a single or a few genes.

Phylogenomic analysis
Strombidium sulcatum and two other Strombidium spp. clustered strongly in a group nesting deep within the Spirotrichea clade, and were the closest neighbors of Oxytricha.
(both concatenated tree and concordance tree) and phylogenetic network in current study indicated similar findings that Protocruzia has a strong impact on the monophyly of the class Spirotrichea, and it might be an early branch of the subphylum Intramacronucleata in the evolutionary line from the class Heterotrichea (subphylum Postciliodesmatophora).
The similarity between the concatenated tree and concordance trees indicates that a robust phylogenetic topological structure could be predicted based on hundreds of molecular loci, even though these had markedly divergent rates of molecular evolution. Furthermore, according to Philippe et al. (2004), missing data in large alignments do not significantly affect inferred phylogeny when resolving relationships among eukaryotic groups. Therefore eukaryotic phylogenomic analysis at the rank of class or higher may be reliable when based on a large data set, even when taxon sampling is limited. This is in contrast to phylogenetic analyses based on a single gene such as the small subunit ribosome DNA (SSU rDNA) which is by far the most commonly used gene marker (Lynn, 2008, Adl et al., 2012. By analysing such a significantly expanded data set, phylogenomics is likely to result in reduced statistical errors and to reveal true evolutionary relationships more reliably. On the other hand, a tree-like topological structure of relationships among species is assumed in traditional phylogenetic analyses, but recombination, hybridization, gene conversion, and gene transfer can all lead to evolutionary histories that are not adequately modelled by a single tree (Bryant & Moulton, 2004). It is quite likely that binary trees constructed in common analyses failed to reflect the complexity of true evolutionary histories of organisms (Morrison, I  II  I  II  I  II   I  II  I  II  I II I II 2011). Phylogenetic networks can break the restrictive supposition that the topological structure must be tree-like and can be used to visualize genealogical relationships among organisms (Huson et al., 2010). In addition, another advantage of phylogenetic network is to highlight conflicts for different taxa (Hall, 2011).

Epiplasmic proteins' transcriptional levels and morphological characteristics of ciliates
As the Paramecium epiplasm proteins (Epi1-Epi51) have no counterparts in other organisms (Aubusson-Fleury et al., 2013), we separated the epiplasmins into two types: EpiC (EPC) and Epi1-Epi51 (EPI). So epiplasmic proteins were divided into five classes: (1) the articulins (in various protists) (Marrs & Bouck, 1992, Huttenlauch et al., 1995, Huttenlauch et al., 1998; (2) articulin-like proteins (platein in Euplotes) (Kloetzel et al., 2003a); (3) EpiC (EPC, in Tetrahymena) (Honts & Williams, 2003); (4) Epi1-Epi51 (EPI, in Paramecium) (Damaj et al., 2009, Aubusson-Fleury et al., 2013; and (5) avleolins, perhaps homologues of alveolins and epiplasmins (Gould et al., 2008). Homologs of these five epiplasmic proteins types (articulins, platein, EPC, EPI, avleolins) in thirteen transcriptomes were investigated in this study. In order to test whether these five kinds of epiplasmic proteins' abundance difference in transcriptional level correlate with present systematic arrangement, we established four grouping criteria (Table 1, criteria 1-4). Rank sum test results suggest transcript abundances of only two epiplasmic protein groups, namely alveolins and EpiC, were significantly correlated to grouping criteria, specifically, criteria 3 (Fig. 6A) & 4 (Fig. 6B) to alveolins and criteria 2 (Fig. 6D) & 4 (Fig. 6E) to EpiC. Interestingly, both of these epiplamic protein groups show a strong correlation in criterion 4, i.e. Strombidium versus other species in the class Spirotrichea, which indicates that Strombidium spp. have unique cytoskeletal constituents within the class Spirotrichea. On the other hand, homologs of these five epiplasmic proteins do not show much correlation with grouping by present systematic arrangement. Williams (2004) generated an epiplasm gene EPC1 knockout construct that was successful in transforming T. thermophila cells, and the results suggested that the epiplasm plays a role in the control of either cell shape or cortical development. In this study, preliminary analyses revealed potential correlations between certain morphological characters and differences in the transcriptional levels of epiplasmic proteins. Here we identify seven morphological characteristics as potential grouping criteria for ciliates and which have potential correlations with epiplasmic proteins (Table 1, criteria 5-11). For example, the abundance of transcripts for alveolins is significantly (Sig. = 0.046 < 0.05, criterion 9 in Table 1) correlated with the presence of extrusomes, specifically, the abundance of transcripts for alveolins in species that have extrusomes is higher than that in which lack extrusomes (Fig. 6C). This supports the hypothesis that alveolins are the main constituents of the alveoli, which are part of the cortical epiplasm that includes the extrusomes (Gould et al., 2008). A second example is the correlation between EpiC (EPC) and the presence of circumoral kineties (criterion 6) and the size of the buccal area (criterion 11) ( Fig. 6F and 6G). Williams (2004) reported that the most severe abnormality associated with EPC1 knockout cells was the presence of branched and misaligned membranelles (clusters of cilia) found in the oral apparatus, indicating the importance of this group of epiplasmic proteins in the development of the oral ciliature. The results of the present study of EpiC support these hypotheses. Although present results indicate epiplasmic proteins' transcriptional levels are potentially correlated with morphological characteristics (such like circumoral kineties, extrusomes and buccal area) of ciliates, further experimental evidence is needed in order to confirm this finding.

Cultures
Strombidium sulcatum was grown in 75 cm 2 plastic culture flasks in filtered marine water with a monoclonal population of Escherichia coli as the food source. Cultures were maintained at 25°C for 7-14 days until reaching over 10 6 cells in total.

RNA and Illumina Sequencing
Live species were harvested by centrifugation at 1700 g for 10 min and stored at −80°C until further treatment. The total RNA was extracted using the RNeasy kit (Qiagen, Hilden, Germany) and digested with DNase. The rRNA fraction was depleted using GeneRead rRNA Depletion Kit (Qiagen, Hilden, Germany). The quality of the remaining RNA was assayed using a BioAnalyzer (Agilent Technologies, Palo Alto, CA, USA). In total, 250 ng of the RNA was used to synthesize cDNA using Affymetrix 30 IVT Express Kit (Affymetrix Inc., Santa Clara, CA, USA). The resulting high quality cDNAs were used to construct a library for paired-end sequencing.
Trans chimeras were detected and removed following the protocol of Yang and Smith (2013). Assembled sequences were BLASTXed against the ciliate genome/transcriptome protein dataset. Redundancy of contigs was eliminated by CD-HIT v4.6.1 (Fu et al., 2012) (CD-HIT-EST, with 98% sequence identity threshold). Non-redundant contigs were passed to custom PERL scripts that used a BLAST-based strategy to link contigs with others whose terminals were a perfect match (100%). Oxytricha trifallax mitochondrial genomic peptides and bacterial genomes database were downloaded from GenBank as a mapping reference to remove reads which were probably contaminated by mitochondria or bacteria.

Gene prediction and annotation in transcriptomes
Abundance estimation of contigs was analysed using RSEM v1.2.3 (Li & Dewey, 2011) and PERL scripts in the Trinity package based on Bowtie, following the protocol of Trinity. CDS regions were predicted using ESTScan v2.2.1 (Iseli et al., 1999), and only predicted sequences longer than 50 bp were retained. Predicted protein products were blasted to the Swiss-Prot database, and were restricted to the top five HSP bitscores with E-value ≤ 1.0 × 10 −20 . Predicted proteins were matched to HMM models (Pfam-A, Superfamily and TIGR-FAMs) using HMMER3 (Eddy, 2009), and were restricted to E-value ≤ 1.0 × 10 −5 with the top five hits reported.
Orthologs among S. sulcatum and other ciliates SPOCS (Curtis et al., 2013) (-H -M 2) was employed to identify shared orthologs among S. sulcatum and genome peptide data of the other four ciliates. Transcriptome peptide data of 13 ciliates were BLASTed to the S. sulcatum predicted peptide dataset by BLASTP (E-value ≤ 1 × 10 −2 , identity ≥ 30%, length ≥ 50 aa), and vice versa. Pairwise mutual best-hits were identified as putative orthologs.
In order to visualize all available phylogenetic signals (36,724 aa in total) in the 127 ortholog alignments, split decomposition analyses were calculated with the computer program SplitsTree ver. 4 (Huson, 1998, Huson & Bryant, 2006. Since we were interested in the ambiguity of relationships of taxa at class level, phylogenetic networks were generated for the concatenated dataset of 127 ortholog alignments using the neighbornet algorithm with uncorrected distances (Bryant & Moulton, 2004). To assess the reliability of the phylogenetic networks, bootstrap analyses with 1000 replicates were carried out.

Epiplasmic protein expression analyses based on thirteen ciliates' transcriptome data
Sequences of the major epiplasmic proteins (EPI, Articulin, EPC, Platein, Alveolin) were acquired from GenBank and ParameciumDB (http://paramecium.cgm.cnrs-gif.fr/) and formed the epiplasmic proteins dataset (Acc. No. see Table S2). This epiplasmic proteins dataset were used as queries to BLASTed (E-value ≤ 1.0 × 10 −2 , identity ≥ 30%) aganist the transcriptome data of these ciliate species in order to find homologs of epiplasmic proteins. Relative transcript abundance (number of reads mapping to the gene/total number of all reads of this species) and homolog type numbers of these genes were acquired from their transcriptome mapping data (Resources, 2014).