Background

During the last decade, there has been increasing interest in the use of Schmidtea mediterranea as a model organism for the study of stem cells. These freshwater planarians contain a population of adult stem cells known as neoblasts, which are essential for normal cell renewal during homeostasis and which confer on them remarkable regeneration capabilities [1-4]. Although a number of studies based on massive RNA interference (RNAi) [5], gene inhibition [6], microarray [7], and proteomics [8,9] approaches have been carried out to identify the crucial neoblast genes responsible for their stemness, our understanding of their biology is far from complete. The use of next generation sequencing (NGS) technologies provides an opportunity to study these cells in depth at a transcriptional level. For that to be accomplished, however, reliable transcriptome and genome references are required. Up to eight versions of the transcriptome for this organism have been published to date, making use of different RNA-Seq technologies [10-16], including one meta-assembly which slightly improves on each one separately [17]. Despite all these efforts, a consistent reference transcriptome is still lacking.

Some studies have provided quantitative data on transcripts and their respective assemblies, focusing on regeneration [13,17,18] or directly on neoblasts [11,14,15,19]. However, RNA-Seq suffers from an intrinsic bias that affects the quantification of transcript expression in a length-dependent manner. This bias is independent of the sequencing platform and can be neither avoided nor removed by increasing the sequencing coverage or the length of the reads. Furthermore, it cannot be corrected a posteriori during the statistical analysis (by transcript length normalization, for instance). Consequently, the quantification of the transcripts and the detection of differentially expressed genes are compromised [20-22]. Digital gene expression (DGE) [23] is a sequence-based approach for gene expression analyses that generates a digital output at an unparalleled level of sensitivity [22,24]. The output is highly correlated with qPCR [25-27] and does not suffer from sequence-length bias. The combination of DGE and RNA-Seq data has been shown to help overcome the specific limitations of RNA-Seq [28], and the usefulness of DGE has been thoroughly demonstrated in research ranging from humans [26,29] to non-model organisms [22,24]. However, to date, DGE has not been extensively applied to the study of the planarian transcriptome.

Here, we have compiled and analyzed all the transcriptomic and genomic data available for S. mediterranea using DGE. This has facilitated an improved annotation and provided tools to ease the comparison and browsing of all the information available for the planarian community.

We have taken advantage of the resolution of DGE to quantitatively characterize isolated populations of proliferating neoblasts, their progeny, and differentiated cells through fluorescence-activated cell sorting (FACS) [30,31]. The resulting changes in transcription levels were analyzed to obtain transcript candidates for which an extensive experimental validation was performed. This has yielded new neoblast-specific genes, including many transcription factors and cancer-related homologous genes, confirming the validity of our strategy and the utility of the tools that we have implemented. Moreover, we provide a deeper molecular description of four of those candidates: Smed-meis-like and the three subunits of the Nuclear Factor Y (NF-Y) complex, Smed-nf-YA, Smed-nf-YB-2, and Smed-nf-YC. Both families of genes are attractive candidates to be studied in planarians. The Meis family of transcription factors specifies anterior cell fate and axial patterning [32], whereas the NF-Y complex is a heterotrimeric transcription factor that promotes chromatin opening and is involved in the regulation of a wide number of early developmental genes [33].

Results and discussion

Three DGE libraries were obtained from FACS-isolated cell populations X1 (proliferating stem cells, S/G2/M), X2 (a mix of stem cell progeny and proliferating cells, G0/G1), and Xin (differentiated cells, G0/G1) [30] (Additional file 1). A total of 8,298,210 reads were sequenced (X1: 3,641,099; X2: 3,488,712; Xin: 1,168,399), representing 98,156 distinct tags (X1: 70,849; X2: 24,621; Xin: 25,221), with an average of 84.5 reads per tag (X1: 51.4; X2: 141.7; Xin: 46.3). The distribution of the tags in each cell population can be observed in Additional file 2A. DGE is reported to achieve near saturation in genes detected after 6-8 million tags [22]. Furthermore, for moderately to very highly expressed genes (>2 cpm), saturation occurs with three or even just two million tags [22,34]. Figure 1 shows that saturation was reached at around two million tags for most of the data sets to which the distinct tags were mapped, although the slope for the total number of distinct tags decreases without saturating. It is worth noting that all the reference transcriptome sets performed similarly, achieving a maximum near 20,000 mapped tags. However, when looking at how many distinct tags map to any of those transcriptomes, about 5,000 tags appear not to be shared among all of them (see the “All mapped” and the “All distinct” data series in Figure 1, and further details on mapping below).

Figure 1

Saturation plot for the distinct tags mapped over each reference data set. Tag sequences were randomly taken to build, by steps of 200,000 tags, increasing-size libraries that were then mapped against the reference data sets. Saturation is reached for libraries around two million tags.
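
The subsampling behind such a saturation plot can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: the function name `saturation_curve`, the cumulative sampling scheme, and the use of a plain set lookup in place of a real mapping step are all assumptions made for the example.

```python
import random

def saturation_curve(tag_reads, reference_tags, step=200_000, seed=0):
    """Count distinct mapped tags in random sub-libraries of growing size.

    tag_reads: list with one tag sequence per sequenced read.
    reference_tags: set of tag sequences that map to a reference data set
    (hypothetical stand-in for an actual mapping step).
    Returns a list of (library_size, distinct_mapped_tags) pairs.
    """
    rng = random.Random(seed)
    shuffled = list(tag_reads)
    rng.shuffle(shuffled)          # random order = random sub-libraries
    seen = set()
    curve = []
    for start in range(0, len(shuffled), step):
        for tag in shuffled[start:start + step]:
            if tag in reference_tags:
                seen.add(tag)
        curve.append((min(start + step, len(shuffled)), len(seen)))
    return curve

# Toy library of 100 reads drawn from four tags, three of which "map".
reads = ["AAAA"] * 50 + ["CCCC"] * 30 + ["GGGG"] * 15 + ["TTTT"] * 5
curve = saturation_curve(reads, {"AAAA", "CCCC", "GGGG"}, step=20)
```

Saturation corresponds to the point at which the second value of each pair stops growing as the library size increases.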

A critical point in this kind of experiment is the number of times a tag has to be seen before it can be considered reliable. Discarding too many tags in an attempt to increase reliability will result in a loss of information, whereas keeping all of them may generate background noise. To estimate the specificity of our tags and to establish an optimal cutoff for the minimum number of counts a tag should have in order not to be considered artefactual, we performed a series of simulations, iteratively mapping randomized sets of our data. The results are summarized in Additional file 3 for the different cutoffs tested (1, 5, 10, 15 and 20 minimum occurrences of tags). For cutoffs higher than five there is no substantial gain in terms of specificity (the number of hits decreases by less than one order of magnitude). Thus, we defined reliable tags as those sequenced five times or more and discarded the rest; only these tags were considered in the subsequent computational and experimental analyses. From the initial set of 98,156 distinct tags, 40,670 passed that cutoff (Additional file 2B).
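
The idea of testing several count cutoffs against randomized data can be illustrated with a short sketch. The text does not specify how the randomization was performed, so shuffling the bases of each tag is an assumption made here, as are the function names `reliable_tags` and `random_hits` and the toy tag sequences.

```python
import random

def reliable_tags(tag_counts, min_count=5):
    """Keep only tags sequenced at least min_count times."""
    return {tag: n for tag, n in tag_counts.items() if n >= min_count}

def random_hits(tag_counts, reference_tags, cutoffs=(1, 5, 10, 15, 20), seed=0):
    """For each cutoff, shuffle the bases of every retained tag and count
    how many shuffled tags still match the reference purely by chance.
    Returns {cutoff: (tags_retained, chance_hits)}."""
    rng = random.Random(seed)
    result = {}
    for cutoff in cutoffs:
        kept = reliable_tags(tag_counts, cutoff)
        hits = 0
        for tag in kept:
            bases = list(tag)
            rng.shuffle(bases)  # assumed randomization: permute the bases
            if "".join(bases) in reference_tags:
                hits += 1
        result[cutoff] = (len(kept), hits)
    return result

# Toy counts (hypothetical tags, not real data).
counts = {"CATGAAAC": 120, "CATGCCGT": 12, "CATGGGTA": 5,
          "CATGTTAC": 4, "CATGACGT": 1}
summary = random_hits(counts, {"CATGAAAC", "CATGCCGT"})
kept = reliable_tags(counts)  # default cutoff of five occurrences
```

Comparing the chance hits across cutoffs shows where raising the threshold stops buying additional specificity.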

The low technical variability of DGE and its high reproducibility, together with the digital quantification of transcripts, enable direct comparison of samples across different experiments, even from different laboratories [21,22,24-26,29,35]. That property allowed us to contrast our results with those from Galloni [36], who used DGE to identify neoblast genes by comparing irradiated versus control animals of the same strain of clonal S. mediterranea. A Venn diagram showing the similarity of the strategies can be seen in Additional file 4. From the total distinct tags, 31.38% (30,806 out of 98,156) were sequenced 10 times or more in our study, compared with just 11.28% (42,159 out of 373,532) in the irradiation strategy, indicating a greater representation of each tag. This suggests, as expected, that the cell-sorting approach has higher specificity. In addition, the strand-specific nature of DGE allows the discrimination of sense and antisense transcripts. Almost 30% of the transcripts successfully identified also presented antisense transcription, albeit at lower levels than canonical transcription. This confirms the findings of the aforementioned study in planarians [36] and others [37], and shows that a large proportion of the genome is transcribed from both strands of the DNA. Although the purpose of these transcripts is still open to debate, the evidence points to a post-transcriptional gene regulatory function [38].

Tag mapping to reference sequence data sets

An essential step in DGE is the recovery of the transcript represented by each tag. The nature of the DGE methodology, which generates reads of only 21 nucleotides, implies mapping short reads against a reference genome or a collection of ESTs to retrieve full-length sequences for the original transcripts. On the other hand, the short length facilitates the fast mapping of the tags against the reference sequence data set. To obtain the maximum number of transcripts, tags were mapped against the 94,876 S. mediterranea ESTs from the NCBI dbEST [39-42] and all the available transcriptomes (formally, those can also be considered EST libraries). A total of 26,822 tags (65.95%) mapped over at least one set of ESTs/transcripts, leaving a large fraction (34.05%) unmapped.

In an attempt to recover tags that did not map over the transcripts, tags were also mapped over the S. mediterranea genome assembly draft AUVC01 masked with the S. mediterranea repeats [23,43-45] (Table 1 and Figure 2). The overlap between transcriptomes was high. Tags mapping to a single transcriptome were rare in most cases, with two exceptions in which a relatively small number of tags mapped to only one transcriptome: 327 tags (1.1%) for Labbé et al. 2012 and 208 tags (0.7%) for Rouhana et al. 2012. Remarkably, 3,231 tags (10.7%) mapped only to the genome, and 26.1% of tags (10,617 out of 40,670) did not map at all. For tags sequenced 10 times or more, the proportion of unmapped tags is similar: 20.5% (6,327 out of 30,806) (Additional file 2B). Even allowing up to two mismatches, 9.36% of the reads remain unmappable to the genome. This is still a substantial fraction, considering that two mismatches is very permissive (it represents almost 10% nucleotide substitution in the 21-nucleotide read with respect to the reference sequence).
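
To make the mismatch tolerance concrete, the following sketch maps a tag against a reference allowing a fixed number of substitutions, and computes the substitution rate that two mismatches represent in a 21-nucleotide tag. It is a brute-force illustration only; the helper names `hamming` and `map_tag` are hypothetical, and real short-read mappers rely on indexed references rather than a linear scan.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def map_tag(tag, reference, max_mismatches=2):
    """Start positions where tag matches reference with at most
    max_mismatches substitutions (no insertions or deletions)."""
    k = len(tag)
    return [i for i in range(len(reference) - k + 1)
            if hamming(tag, reference[i:i + k]) <= max_mismatches]

TAG_LENGTH = 21  # DGE tag length stated in the text
substitution_rate = 2 / TAG_LENGTH * 100
print(round(substitution_rate, 1))  # 9.5 (percent)

exact = map_tag("CATGAC", "TTCATGACTT", max_mismatches=0)
relaxed = map_tag("CATGAC", "TTCATGTCTT", max_mismatches=1)
```

In the toy example the second reference carries one substitution, so the tag is recovered only when at least one mismatch is allowed.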

Table 1 Summary of mapped tags
Figure 2

Venn stave showing the proportions of the distinct tags mapped over the different reference data sets. Integrating data into Venn diagrams for more than four or five sets can be a challenging task, so a linear projection of such a diagram is provided in the stave (showing the 20 topmost scoring comparisons from 752 different subsets, accounting for 62.26% (18,710 out of 30,053) of total mappings) for ten reference sequence sets: eight transcriptomes, the S. mediterranea ESTs from NCBI dbEST [39-42], and the latest genome draft AUVC01 [43,44]. The color gradient scale is provided on the bottom bar and is proportional to the number of unique tags mapped over each sequence subset. X-axis ticks present the number of tags and their relative percentage; the numbers on the right Y-axis correspond to the total number of tags mapped in a given comparison of sequence sets. It is easy to see that 15% of the unique reads map onto all the sequence sets.

These results indicate that a significant number of transcripts are not yet represented in either the current transcriptomic sets or the reference genome, despite their coverage depth [46-49], and may correspond, for instance, to weakly expressed genes [50]. Mapping tags are expressed on average at 50.78 cpm, while non-mapping tags are expressed at only 19.85 cpm. Nonetheless, since transcriptomes currently available lack the complete annotation of 3’-UTR regions and the DGE libraries were made from the 3’-ends, reads that map to genomic sequences but not to current transcripts may potentially come from the 3’-UTR ends not yet sequenced. To evaluate this possibility, we have projected the transcriptome from Kao et al. 2013 [17] over the genome and looked for the proximity of the tags mapping next to the 3’-end of the transcripts (Additional file 5). Downstream sequenced DGE tags account for 4.12% of all the possible CATG targets. This small number of sequenced tags mapping only to the genome may correspond to potential novel unsequenced transcripts, alternative 3’-UTR exons of splicing isoforms, misannotated or alternative polyadenylation sites, or even to non-coding RNAs not yet represented in the present transcriptome sets. Future RNA-Seq experiments may provide further sequence evidence supporting transcripts for those tags.

Functional annotation

A total of 8,903 contigs from Smed454_90e (Smed454 from now on) [10] showing significant expression changes (p < 0.001) were selected and, from those, 7,735 contigs presented a hit to a Pfam domain model (Figure 3). For those sequences having a significant hit to a known domain/protein, gene ontology (GO) analysis was performed in order to summarize changes in the biological processes and molecular functions due to the observed expression patterns of the enriched sets of transcripts. Those transcripts were classified according to the cell type in which they were mostly expressed; their significant GO annotations were then clustered (also taking into account their parent nodes in the ontology) to calculate the log-odds ratio of term abundance. Comparison of GO categories between transcripts predominantly expressed in X1, X2 or Xin cell fractions revealed significant patterns of enrichment as indicated in Additional file 6 (see also the “Transcriptomes” tables available from the web site, planarian.bio.ub.edu/SmedDGE, for specific GO terms assigned to each transcript).

Figure 3

Predicted functional domains for several of the selected transcript candidates. Functional domain annotation is based on Pfam hidden Markov models. The legend box shows a classification of the domain hits based on their match to the complete domain model; the box height is proportional to the E-value score provided for each match. Significant matches were considered for HMMER [117] E-value < 0.001; however, low-significance matches are also shown, as well as hits to Pfam-B models produced by automated alignment protocols. Further annotation over Smed454 transcripts is already available at the GBrowse2 URL planarian.bio.ub.edu/gbrowse/smed454_transcriptome; an example can also be found in Figure 4.

The GO comparison between the neoblast population (X1) and the differentiated cells (Xin) reflects distinct functional signatures: X1 is enriched in ubiquitin-dependent protein catabolic process, nucleic acid binding, RNA-binding, helicase activity, ATP binding, translation, and nucleosome assembly; the most represented categories in Xin include actin binding, actin cytoskeleton organization, small GTPase mediated signal transduction, proteolysis, and calcium ion binding; whereas in X2, markers of secretory activity such as vacuolar transport are more abundant.

Browsing data

All tag mappings over the different transcriptome versions are available in the form of dynamic tables from our web site (planarian.bio.ub.edu/SmedDGE, Figure 4A). The relationship between the Smed454 contigs, along with their domains and functional annotation, and the other reference transcriptomes described in this manuscript can be browsed on a subset of those tables. In order to establish the correspondence between the transcriptomes, a megablast search (NCBI BLAST+ 2.2.29 [51]) was performed, and the resulting hits were then filtered by three levels of coverage (90%, 95% and 98%). Although the focus is set on Smed454, the user can reorder those tables by columns containing identifiers for other transcriptome versions or jump to the summary table specific to a given transcriptome version.

Figure 4

Online data sets and DGE data on Smed454 GBrowse2. A - To facilitate browsing of mapped tags over the transcripts we have worked with, we provide a dynamic table interface that paginates through the large lists of records. This jQuery [112] interface allows the user to easily sort the output table by a given column (just by clicking on the column label) or to search for specific values in the cells (using either the form box just below the column labels or the advanced search available from the magnifying glass icon at the bottom of the table). Three tables, like the one in the background, contain the equivalences between contigs from different transcriptomes, as well as functional annotations, always focusing on the Smed454 data set. The other tables, like the one in the foreground, contain the tag mappings for each single transcriptome considered to date. B - The previously published Smed454 database [10] has been ported to GBrowse2 in order to facilitate navigating through the transcript annotations, such as predicted domains from Pfam, assembly reads mapping, etc. This panel shows the annotations on the Smed-wi-3 homologous contig as an example. A customized track allows the integration of information about mapped DGE tags into single or combined tracks; tags are represented as boxes with height proportional to the log of the normalized tag counts, and the rank and the strand for the tag hit are shown in the label just below that box. The bottom left blue box zooms into one of those combined tracks to visualize the pop-up box that the user can recover when moving the mouse over a given tag feature. In addition, the bottom right red box displays the details page one can get when clicking on a tag feature.

Moreover, the Smed454 contig browser [10,52] has been revamped into a more flexible interface based on GBrowse2 (planarian.bio.ub.edu/gbrowse/smed454_transcriptome). One can find there different types of annotation tracks: read coverage, homology to known genes/proteins, hits to Pfam domains, and also the information on the tags mapped over the sequence. One track-specific GBrowse2 Perl module was modified to display DGE tag data, such as the sequence, counts and rank position. Further customization of the GBrowse2 configuration facilitates access to most of that information in the form of pop-up summary boxes, and also by means of an additional “Details” page (see yellow panel on the right side of Figure 4B).

This browser has been developed under the principle of easy accessibility, in the hope that it will become a useful and informative user friendly tool for experimental researchers in their daily work.

Experimental validation

The validity of our approach is corroborated by the expression levels detected in 40 already known and well-characterized neoblast genes (Table 2), plus another 29 genes described in the literature with evidence of also being neoblast related (Table 3). As can be observed in Figure 5, both sets of genes show the expected expression pattern along the vertical right hyperbola, indicating a clear X1 specificity, with two exceptions overrepresented in X2: Smed-nlk-1 and Smed-prog-1, which is described to be found in postmitotic cells [53]. Smed-dlx and Smed-sp6-9 are key genes in eye formation [54]; despite their localized activation, DGE was sensitive enough to identify both of them predominantly in the X1 subfraction. Moreover, we could detect expression of genes such as Smed-smg-1—which is described as broadly expressed through all tissues, including neoblasts [55]—in both neoblasts and differentiated cells. Finally, 133 clones from two different studies [6,56] focussing on regeneration, stemness and tissue homeostasis are, indeed, significantly overexpressed in neoblasts (Additional file 7).

Table 2 Neoblast genes
Table 3 Likely neoblast genes
Figure 5

Splashplot projection of the X1/X2 versus Xin expression changes. The X-axis represents the fold change of tags in X1 with respect to Xin, while the Y-axis corresponds to fold change differences between X2 and Xin. Fold change is here calculated as the log base 2 of the absolute value of the difference between X1, or X2, and Xin, while the direction of the change is given by the sign of that subtraction. Each of the figure quadrants provides insights on tag expression considering the three cell fractions simultaneously. The upper right quadrant contains tags overexpressed in both X1 and X2 with respect to Xin; the bottom left quadrant has those tags overexpressed in Xin versus the other two fractions. Points over the X-axis or Y-axis correspond to tags for which expression levels change only in one cell fraction, X1 or X2, with respect to Xin. The shift of most points towards the right vertical hyperbola reflects a higher expression level in X1 when compared to X2 or Xin (otherwise points would fit closer to both diagonals).
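
The coordinate calculation described in this caption can be sketched as follows. The function names `signed_log2_diff` and `splash_label` are hypothetical, and mapping differences with an absolute value of one or less to zero is an assumption made here so that the sign stays unambiguous; the caption does not state how such small differences were handled.

```python
import math

def signed_log2_diff(fraction, xin):
    """log2 of |fraction - xin|, carrying the sign of the subtraction.
    Differences with |diff| <= 1 are mapped to 0 (assumption)."""
    diff = fraction - xin
    magnitude = math.log2(abs(diff)) if abs(diff) > 1 else 0.0
    return math.copysign(magnitude, diff)

def splash_label(x1, x2, xin):
    """Describe where a tag falls in the splashplot."""
    x = signed_log2_diff(x1, xin)   # X-axis: X1 versus Xin
    y = signed_log2_diff(x2, xin)   # Y-axis: X2 versus Xin
    if x > 0 and y > 0:
        return "higher in X1 and X2 than in Xin"
    if x < 0 and y < 0:
        return "higher in Xin than in X1 and X2"
    if x == 0 and y == 0:
        return "no change"
    if y == 0:
        return "change in X1 only"
    if x == 0:
        return "change in X2 only"
    return "opposite changes in X1 and X2"

# Toy normalized counts (hypothetical values).
print(signed_log2_diff(72, 8))   # 6.0
print(splash_label(72, 40, 8))   # higher in X1 and X2 than in Xin
```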

Based on their X1/Xin expression ratio, we selected a collection of potential new neoblast genes among the most represented in the X1 population. With the chosen candidates we performed expression pattern analysis by whole mount in situ hybridization (WISH) in irradiated animals. At different times after irradiation, as the neoblasts and their progeny decline, the hybridization signal disappears [57]. The expression of 42 out of 47 genes tested was diminished or completely lost in irradiated animals (Table 4 and Additional file 8).

Table 4 New neoblast genes experimentally validated

Although neoblasts are also essential during homeostasis for normal cell renewal, the phenotype becomes more evident during regeneration. Functional analyses were therefore carried out by RNAi followed by head and tail amputation in order to visualize defects in the regenerating process. Of the 42 genes whose expression was affected by irradiation, 24 showed a phenotype after RNAi (Additional file 9), most of them preventing successful regeneration and leading to the death of the animals, the usual phenotype for neoblast genes [58,59].

New neoblast genes

Interestingly, several of the new genes identified as neoblast genes correspond to transcription factors, which are key elements implicated in cell fate decisions. Furthermore, many are also homologous to cancer-related genes. We briefly describe those that impair planarian regeneration after RNAi (Additional file 9). The inhibition of six of them produces a reduced blastema with defective head and eyes. Smed-atf6A is a cyclic AMP-dependent transcription factor that interacts with the Nuclear Transcription Factor Y (NF-Y) complex (further analyzed later). Smed-ccar1 is a perinuclear phospho-protein that functions as a p53 coactivator modulating apoptosis and cell cycle arrest [60]. Smed-hnrnpA1/A2B1, a component of the ribonucleosome, is involved in the packaging of pre-mRNA into hnRNP particles in embryonic invertebrate development [61] and in stem cells [62]. Smed-srrt modulates sensitivity to arsenic, a carcinogenic compound that inhibits DNA repair [63]. Smed-med7 and Smed-med27 belong to a mediator complex essential for the assembly of general transcription factors. Smed-ranbp2 is a member of the nuclear pore complex and is implicated in nuclear protein import. Within the same family, Smed-nup50 also shows a stronger phenotype. The knockdown of the other 14 genes prevents the formation of the blastema completely. Smed-gtf2E1 and Smed-gtf2F1 are components of the general transcription factors IIE and IIF. Smed-ncapD2 is necessary for chromosome condensation during mitosis [64]. Smed-pes1 is required in zebrafish for embryonic stem cell proliferation [65]. Smed-rack1 is an intracellular adaptor of protein kinase C in a variety of signaling processes. Smed-lin9 is related to the retinoblastoma pathway, interacting with Retinoblastoma 1, which is required for cell cycle progression [66]. All six different retinoblastoma binding proteins produce a non-blastema phenotype. The retinoblastoma pathway has been described to regulate stem cell proliferation in planarians [67] and some of its genes have already been identified. Despite that, most of them are yet to be analyzed. Finally, Smed-rrM2B is a subunit of the ribonucleotide reductase (RNR) complex required for DNA repair [68]. Details on these genes as well as the rest of the genes tested from the X1 population can be examined in Additional file 10.

The four remaining genes presenting an aberrant phenotype during regeneration when inhibited by RNAi are described in detail in the following two sections: the Smed-meis-like, a new member of the Meis family, and the three components of the Nuclear Factor Y complex, all of them found to be overexpressed in neoblasts.

Smed-meis-like

Smed-meis-like is a member of the TALE-class homeobox family, similar to Meis genes, which was found to be overexpressed in the X1 subpopulation. This gene family is characterized by the presence of a homeobox domain with three extra amino acids between helices 1 and 2 [69]. Some of its members can act as cofactors for Hox genes [32]. In S. mediterranea, other members of the family have been described: Smed-prep [70], Smed-meis [54] and Smed-pbx [71,72].

WISH on intact animals shows that it is expressed in the cephalic ganglia, the pharynx, the tip of the head, and the parenchyma (Figure 6A). The downregulation observed three days after irradiation suggests that the parenchyma-associated expression is related to neoblasts and early postmitotic cells. To corroborate this, a double fluorescence in situ hybridization (FISH) together with the neoblast marker Smed-h2b [59] was carried out (Figure 6B and Additional file 11A). Confocal microscopy shows colocalization of both genes in some cells, which confirms the expression of Smed-meis-like in neoblasts and, thus, the DGE results. Nevertheless, not all Smed-meis-like positive cells express Smed-h2b, reinforcing the idea that Smed-meis-like is not exclusive to neoblasts.

Figure 6

Smed-meis-like is essential for anterior regeneration. A - WISH reveals that Smed-meis-like is expressed in the cephalic ganglia, the pharynx, the tip of the head (arrowhead) and the parenchyma, from where it is downregulated three days after irradiation. B - Double FISH of Smed-meis-like together with the neoblast marker Smed-h2b shows that Smed-meis-like is expressed in neoblasts (arrowheads) as well as in differentiated cells (asterisk). DAPI labels the cell nuclei. See Additional file 11A for the separate channels of fluorescence. C - Smed-meis-like(RNAi) produces defects in anterior regeneration, which range from a squared head with elongated eyes, through cyclops, to complete loss of anterior regeneration. The marker of brain branches Smed-gpas also shows this different penetrance. D - Double FISH with Smed-opsin and Smed-tph shows aberrant eyes in the less severe phenotype. E - The anterior markers Smed-notum, Smed-sfrp-1, Smed-cintillo, and the eye progenitor marker Smed-ovo disappear after Smed-meis-like(RNAi), while the posterior marker Smed-wnt-1 remains. F - Quantification of mitotic cells by α-H3P immunohistochemistry in the whole animal (p < 0.001, t-test). All the experiments were done on bipolar regenerating trunks, at 11 days of regeneration after three rounds of injection.

Knockdown of Smed-meis-like through RNAi produced a diverse range of anterior regeneration phenotypes (Figure 6C), which can be explained by a different penetrance. The mildest phenotype produced a squared head with elongated and disorganized eyes. This phenotype was also clearly visible with fluorescence in situ hybridization (FISH) against Smed-opsin [5] and Smed-tph [73], which label the photoreceptor and the pigment cells of the eye (Figure 6D). In an intermediate phenotype, cyclopic animals are obtained, whereas in the strongest one there is no anterior blastema formation. This range of phenotypes can also be observed with the marker of brain branches Smed-gpas [74], which shows a gradual reduction of brain regeneration after Smed-meis-like inhibition. These results are also confirmed by the reduction of the brain signal of the pan-neural marker α-SYNAPSIN (Additional file 11B). Posterior regeneration was normal.

In the strongest phenotype, there is also no expression of the anterior markers Smed-notum [75] and Smed-sfrp-1 [76,77], and the marker of sensory-related cells Smed-cintillo (Figure 6E) [78]. This indicates that Smed-meis-like is necessary for anterior identity. In contrast, expression of the posterior marker Smed-wnt-1 [77] remains after Smed-meis-like inhibition. Thus, we can conclude that Smed-meis-like is necessary for anterior, but not for posterior regeneration.

Finally, immunohistochemistry against H3P (Figure 6F) shows a slight but significant decrease in proliferation in the whole animal (133.8 ± 5.22 mitoses/mm2 in n = 9 controls versus 94.6 ± 4.06 mitoses/mm2 in n = 9 Smed-meis-like(RNAi) animals, mean ± s.e.m.). This decline in mitosis is matched by the lack of progenitors of some anterior structures, which also indicates defects in differentiation. Thus, eye progenitor cells, which are labeled with Smed-ovo [54], are not present in Smed-meis-like(RNAi) animals (Figure 6E).
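
As a rough check on the reported significance, a t statistic can be recomputed from the summary values quoted above. The text only states that a t-test was used, so the unpaired Welch form below is an assumption, and the function name `welch_t_from_sem` is hypothetical.

```python
import math

def welch_t_from_sem(mean_a, sem_a, n_a, mean_b, sem_b, n_b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom
    computed from means, standard errors of the mean and sample sizes."""
    # The squared s.e.m. already equals variance / n for each group.
    se_diff = math.sqrt(sem_a ** 2 + sem_b ** 2)
    t = (mean_a - mean_b) / se_diff
    df = (sem_a ** 2 + sem_b ** 2) ** 2 / (
        sem_a ** 4 / (n_a - 1) + sem_b ** 4 / (n_b - 1)
    )
    return t, df

# Values quoted in the text: controls versus Smed-meis-like(RNAi).
t, df = welch_t_from_sem(133.8, 5.22, 9, 94.6, 4.06, 9)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Under these assumptions the statistic is roughly 5.9 with about 15 degrees of freedom, which is consistent with the p < 0.001 reported in Figure 6F.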

The requirement for Smed-meis-like in anterior regeneration is similar to another member of the family, Smed-prep [70]. This differential phenotype is also observed after the inhibition of other genes, such as Smed-egr4 [79], Smed-zicA [80,81] and Smed-FoxD [82]. The milder phenotype, showing elongated eyes, is similar to the effect of Smed-meis(RNAi) [54], and also to the mild inhibition of Smed-bmp4 [83]. Altogether, these results suggest that Smed-meis-like is important for eye and anterior regeneration, similarly to other members of the TALE-class homeobox family. However, given the lack of expression of Smed-meis-like in the eyes, the abnormal eye formation could be a consequence of the anomalous brain regeneration.

Nuclear Factor Y complex

The Nuclear Factor Y complex (NF-Y) is an important transcription factor composed of three subunits (NF-YA, NF-YB and NF-YC), each one encoded by a different gene. This heterotrimeric complex acts as both an activator and a repressor, and it regulates other transcription factors, including several growth-related genes, through the recognition of the consensus sequence CCAAT localized in the promoter region [84-88]. In addition, it has been reported that the NF-Y complex regulates the transcription of many important genes like Hoxb4, γ-globin, TGF-beta receptor II, or the Major Histocompatibility Complex class II and Sox gene families [89]. This large number of interactions makes the NF-Y complex an important mediator in a wide range of processes, from cell-cycle regulation and apoptosis-induced proliferation to development and several kinds of cancer [90].

In the sexual strain of S. mediterranea, an NF-YB is necessary to maintain spermatogonial stem cells [91]. We have isolated a different NF-YB subunit (NF-YB-2), and also a member of each of the other two subunits (NF-YA and NF-YC). WISH shows that the three genes are expressed ubiquitously and in the cephalic ganglia (Figure 7A). Moreover, the expression decreases one day after irradiation, indicating a link with stem cells, as described in other organisms [92]. Double FISH of each NF-Y subunit together with Smed-h2b confirms the expression of this complex in neoblasts and also in some determined cells (Figure 7B and Additional file 12A).

Figure 7

Smed-nf-Y gene complex is required for proper neoblast differentiation and localization. A - WISH shows that the three Smed-nf-Y genes are expressed ubiquitously and in the cephalic ganglia, and that one day after irradiation their expression decreases. B - Double FISH of Smed-nf-YA, Smed-nf-YB-2, and Smed-nf-YC together with the neoblast marker Smed-h2b shows colocalization with the NF-Y subunits (arrowheads), demonstrating the expression of this complex in neoblasts as well as in differentiated cells (asterisk). DAPI labels the cell nuclei. See Additional file 12A to check each channel of fluorescence separately. C - Smed-nf-Y(RNAi) animals regenerate thinner blastemas with poorly formed eyes and shape defects, and fail to differentiate a proper brain, with reduced cephalic ganglia as revealed with Smed-gpas. FISH with the neoblast marker Smed-h2b shows an accumulation of neoblasts in the region in front of the eyes, while the early progeny marker Smed-nb.21.11e reveals a decrease of early postmitotic cells in Smed-nf-Y(RNAi) animals. D - Immunohistochemistry with the mitotic marker α-H3P shows a reduction in the number of mitoses. E - Quantification with category markers indicates a significant increase of Smed-h2b + cells in Smed-nf-YB-2(RNAi) and Smed-nf-YC(RNAi) animals and a significant decrease of nb.21.11e + cells in all of the RNAi animals, whereas Smed-agat-1 + cells do not show significant changes (p < 0.001, t-test). Counts refer to the whole body. ph: pharynx. All the experiments were done on bipolar regenerating trunks, at 11 days of regeneration after one round of injection.

It has been suggested that each NF-Y component could have a specific role [93]. Therefore, to better understand the function of this complex, we knocked down each subunit separately. Although the penetrance varies depending on the subunit inhibited, the phenotype observed after RNAi treatment is the same. In intact non-regenerating animals, RNAi resulted in head regression, ventral curling and, finally, death by lysis (data not shown), as described for other neoblast-related genes [58,59]. After 11 days, head and tail amputated animals failed to regenerate properly, with a smaller brain and fewer brain ramifications as revealed by Smed-gpas (Figure 7C) and by α-SYNAPSIN (Additional file 12B). Furthermore, we observe an increase in the number of Smed-h2b + cells (Figure 7C,E), including in the area in front of the eyes, where there should not be undifferentiated neoblasts, even though the number of mitoses is reduced (Figure 7D). There is also a decrease in the number of early postmitotic cells (Smed-nb.21.11e +) (Figure 7C,E), whereas late postmitotic cells (Smed-agat-1 +) do not present significant differences (Figure 7E) [53]. These early progeny markers have recently been associated with epidermal renewal [94]. Hence, the accumulation of neoblasts and the decrease of the subepidermal postmitotic population suggest a defect in the early stages of the differentiation process affecting the epidermal lineage. The neural lineage may also be compromised, as suggested by the atrophied cephalic ganglia.

Conclusions

This work presents experimental validation of a collection of putative neoblast genes obtained from a DGE assay on cell fractions. As clearly depicted in the splashplot for the comparison of expression levels between X1, X2 and Xin fractions (Figure 5 and Additional file 13A), there are only a few transcripts specific to X2. The plot produced with the data provided by Labbé [14] from their RNA-Seq analysis on X1, X2 and Xin cell fractions for S. mediterranea shows a similar pattern (Additional file 13B). Moreover, comparison among the three sets using Pearson and Spearman correlations indicates that X1 and X2 are the most correlated populations (Additional file 14). According to these results, most of the transcripts expressed in X2 are also expressed in X1. Hence, X2 is a heterogeneous population that cannot be transcriptionally differentiated from X1 without a deeper discrimination method. In this regard, the strategy recently applied by van Wolfswinkel and collaborators, using the latest sequencing technology to obtain the transcriptome of individual cells [94], represents the most promising approach to deciphering the heterogeneity of the neoblast progeny.

Randomization simulations also illustrate the specificity of the 21 bp tags in detecting real transcripts, corroborating previous estimations [29,46,48,49,95,96]. Furthermore, those results reinforce the assumption that most of the non-mapping tags will correspond to real transcripts [46-49] that are still lacking from the reference data sets for this species. Antisense transcription was also detected, confirming previous reports [25,36,49]. Although further analysis will be required to determine whether this could explain a fraction of the “novel” tags, our primary focus was to characterize the canonical protein-coding transcripts. Due to the heterogeneity of this species' genome, we would expect some variability, both at the sequence and at the expression level, arising from individuals (the pool of animals taken for the samples) and from cells (as they do not come from a cell culture). This could explain another fraction of tags not mapping onto the reference transcriptomes. Consequently, we were quite strict in the current manuscript and looked only for exact tag matches, taking into account that one or more mismatches represent a mappability issue even for finished transcriptomes of the quality of human [97] or Drosophila melanogaster [98].

DGE has proven to be reliable for transcript quantification and new gene identification in planaria. In this work, we have described a new member of the TALE-class homeobox family, Smed-meis-like. Similar to other members of this family, this gene seems to be involved exclusively in anterior polarity determination during regeneration. Given that the expression of this gene is not restricted to neoblasts, its role may also be important in committed cells. Our results with the NF-Y complex suggest that the knockdown of this complex blocks early differentiation of the epidermal and, probably, neural lineages, both belonging to the ectodermal line, generating a neoblast accumulation and deregulation. This effect has been shown in other organisms such as Drosophila, in which NF-Y knockout blocks differentiation of R7 neurons through senseless [89,99]. The majority of the new neoblast genes reported and validated in this study were found to participate in cell proliferation, cell cycle regulation, embryogenesis or development in other models, and many of them are involved in processes related to cancer. The pathways participating in tumorigenic processes and stem cell regulation are often the same, as has been proposed previously for planarians [100]. These genes are probably fundamental for stem cell maintenance and the control of proliferation in organisms with the capacity to regenerate [101], thus reinforcing the potential value of S. mediterranea as an in vivo model for stem cell research [102].

Our DGE analysis revealed a high resemblance among all the transcriptomes available for S. mediterranea. We have also shown the redundancy of the transcriptomes currently available for S. mediterranea, in agreement with Kao [17], together with their incompleteness in light of the DGE data. Although our results provide a comprehensive comparison among them, it would be desirable to agree on a unique transcriptome to be used by the whole community. To this end, the PlanMine initiative [103] is attempting to obtain consensus among the researchers on an appropriate reference. Nonetheless, the need for a completely sequenced and well-annotated genome remains. The DGE strategy can help in this endeavour, since short sequences can be rapidly projected over the reference genome or the transcriptome, even from different laboratories, in order to improve their annotation [46]. Similarly, DGE allows the data generated to be reassessed as many times as required, as more complete genome and transcriptome references for this species become available. Hence, the quantitative data provided here by DGE will prove useful in order to recover and annotate more undescribed genes in the future.

Methods

Animal samples

Planarians used in this study were from the asexual clonal line of S. mediterranea BCN10. Animals were maintained in artificial water and were starved at least seven days prior to experimentation.

Cell dissociation, cell sorting and RNA extraction

To trigger neoblast proliferation and differentiation, animals at two days of regeneration after head and tail amputation were used for the preparation of the libraries. Three animals per library were used in order to obtain the required amount of RNA. Cell dissociation and FACS were carried out as described by Möritz [31] and Hayashi [30]. Briefly, after cell staining with Calcein AM and Hoechst 33342 (Molecular Probes, Life Technologies), one million cells were separated for each population in a FACSAria sorter (Becton Dickinson) at the Scientific and Technological Centers of the University of Barcelona (CCiTUB) cytometry facilities. A representative plot of the cell populations after the sorting can be seen in Additional file 1A. Cells were directly collected in TRIzol LS (Life Technologies) at 4°C and kept on ice to preserve RNA integrity. RNA extraction followed to obtain 1 μg of total RNA for each library. RNA was quantified with a Nanodrop ND-1000 spectrophotometer (Thermo Scientific), and its quality was checked by capillary electrophoresis in an Agilent 2100 Bioanalyzer (Agilent Technologies) prior to library preparation.

DGE sequencing

Unlike RNA-Seq, this method only sequences a short read of fixed length, named a tag, derived from a single site proximal to the 3’-end of polyadenylated transcripts. This short read is later used to identify the full transcript. The number of times that the very same tag has been sequenced (its number of occurrences) is proportional to the abundance of the transcript to which it belongs. Since it only counts one sequence per transcript, its ability to quantify is not affected by the transcript length. For that reason, DGE is better suited than RNA-Seq for the detection of short transcripts and lowly expressed genes [20-22].

Sequence tag preparation was done with Illumina’s DGE Tag Profiling Kit according to the manufacturer’s protocol as described [104]. In short, the most relevant steps included the incubation of 1 μg of total RNA with oligo-dT beads to capture the polyadenylated RNA fraction, followed by cDNA synthesis. Then, samples were digested with NlaIII to retain a cDNA fragment from the most 3’ CATG proximal site to the poly(A)-tail. Subsequently, a second digestion with MmeI was performed, which cuts 17 bp downstream of the CATG site, thus generating the 21 bp tags.

Cluster generation was performed after applying 4 pM of each sample to the individual lanes of the Illumina 1G flowcell. After hybridization of the sequencing primer to the single-stranded products, 18 cycles of base incorporation were carried out on the 1G analyzer according to the manufacturer’s instructions. Image analysis and base calling were performed using the Illumina pipeline, where tag sequences were obtained after purity filtering. Generation of expression matrices, data annotation, filtering and processing were performed by using the Biotag software (SkuldTech, France) [104].

Raw sequencing data in FASTQ format as well as processed tag sequences and their associated expressions have been deposited at NCBI Gene Expression Omnibus (GEO) [105] and are accessible through GEO Series accession number GSE51681 [106].

Comparison of expression data

Tag raw expression was normalized to counts per million (cpm). The statistical value of DGE data comparisons, as a function of tag counts, was calculated by assuming that each tag has an equal chance to be detected, in fair agreement with a binomial law. An internal algorithm allows the comparison between different libraries and measures the significance threshold for the observed variations and p-value calculation (see Mathematical Appendix of Piquemal et al. 2002 [104]).
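The normalization and the kind of comparison described above can be illustrated with a minimal Python sketch. It is not the internal algorithm of the Biotag software, which is not reproduced here. It assumes a conditional exact binomial test: if a tag is equally likely to be detected in either library, then out of its k1 + k2 total occurrences, k1 follows a binomial law with success probability n1 / (n1 + n2), where n1 and n2 are the library sizes. The function names and the two-sided definition of the p-value are illustrative choices.

```python
from math import comb


def cpm(count, library_size):
    """Normalize a raw tag count to counts per million (cpm)."""
    return count * 1_000_000 / library_size


def binomial_two_sided_p(k1, k2, n1, n2):
    """Exact two-sided binomial p-value for one tag in two libraries.

    k1, k2: occurrences of the tag in library 1 and library 2.
    n1, n2: total number of tags sequenced in each library.
    Assumption (not taken from the paper): under the null hypothesis,
    k1 ~ Binomial(k1 + k2, n1 / (n1 + n2)).
    """
    n = k1 + k2
    p = n1 / (n1 + n2)
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k1]
    # Sum the probabilities of all outcomes at least as unlikely as the
    # observed one; the small tolerance guards against rounding error.
    total = sum(x for x in pmf if x <= observed * (1 + 1e-9))
    return min(1.0, total)


# Example: a tag seen 50 times in a library of two million tags.
print(cpm(50, 2_000_000))  # 25.0
# Example: 10 versus 0 occurrences in two libraries of equal size.
print(binomial_two_sided_p(10, 0, 1000, 1000))
```

For tags with very large counts, the direct evaluation of the probability mass function may underflow; a log-space implementation or a statistics library would be preferable in that case.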

Different Perl [107] scripts were designed for the subsequent analyses. All of them are available from the web site planarian.bio.ub.edu/SmedDGE.

Tag mapping

A database with all the possible CATG + 17 bp theoretical tag sequences was constructed for each one of the reference data sets. Tags were compared to these databases to identify all perfect matches and, when more than one tag mapped over the same transcript, only the tag closest to the 3’-end was considered. For the genome reference, tags that could not be mapped exactly were also mapped allowing up to 2 mismatches with the SeqMap mapper [108,109].
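The exact-match mapping strategy can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Perl scripts used in the study: only the forward strand of each transcript is scanned, sequences are assumed to be given 5’ to 3’, and the names theoretical_tags, build_index and map_tags are hypothetical.

```python
from collections import defaultdict

SITE = "CATG"             # NlaIII recognition site
TAG_LEN = len(SITE) + 17  # CATG + 17 bp = 21 bp tag


def theoretical_tags(seq):
    """Yield (position, tag) for every complete CATG + 17 bp tag in seq."""
    seq = seq.upper()
    pos = seq.find(SITE)
    # Once a site is too close to the end to give a full tag,
    # every later site is too, so the scan can stop.
    while pos != -1 and pos + TAG_LEN <= len(seq):
        yield pos, seq[pos:pos + TAG_LEN]
        pos = seq.find(SITE, pos + 1)


def build_index(transcripts):
    """Map each theoretical tag to the (transcript, position) pairs it hits."""
    index = defaultdict(list)
    for name, seq in transcripts.items():
        for pos, tag in theoretical_tags(seq):
            index[tag].append((name, pos))
    return index


def map_tags(observed_tags, index):
    """Keep, for each transcript, the perfectly matching tag nearest the 3'-end."""
    best = {}
    for tag in observed_tags:
        for name, pos in index.get(tag, ()):
            if name not in best or pos > best[name][0]:
                best[name] = (pos, tag)
    return {name: tag for name, (pos, tag) in best.items()}


# Toy transcript with two CATG sites; the second one is closer to the 3'-end.
toy = {"t1": "AAACATG" + "A" * 17 + "GGCATG" + "C" * 17 + "TT"}
index = build_index(toy)
print(map_tags(["CATG" + "A" * 17, "CATG" + "C" * 17], index))
```

A dictionary keyed by tag sequence keeps the lookup of each observed tag at constant cost, which matters when millions of tags are compared against several reference data sets.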

In addition, tags were also mapped against a database of 8,662,308 CDS and 5,189 genomic sequences from bacteria directly downloaded from GenBank [110] repositories to check sample contaminations. Only two tags mapped on bacterial transcripts, confirming the purity of our libraries.

For the 3’-UTR prediction, all 23,020 contigs of the transcriptome from Kao et al. 2013 [17] were mapped over the genome using Exonerate 2.2.0 [111] to characterize the putative 3’-UTR ends (poly-A sites were not predicted, though). Apart from aligning the transcripts to the genomic contigs, the strand of the longest ORF contained in each transcript was also considered to ensure proper transcript orientation. For each transcript, the regions 1,000 bp upstream and downstream of the genomic coordinate of the putative 3’-UTR end were considered to retrieve DGE tags (noted as transcripts 3’-end relative position in Additional file 5).

Libraries and reference sequence data sets randomization

Libraries and reference data sets were randomized using Perl [107] scripts and the Inline::C library to generate analogous sets of random sequences. The sets produced by this method resemble the original data sets in terms of size and nucleotide abundance, in contrast with other approximations which generate virtual sequences based on mathematical distributions [49]. We generated 500 randomizations for each library and 100 for each reference data set. Mapping was performed using cutoffs of 1, 5, 10, 15 and 20 occurrences (Additional file 3).
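One simple way to obtain random sequences that keep the size and nucleotide abundance of the originals is to shuffle the bases of each sequence. The Python sketch below illustrates that idea only; whether the original Perl and Inline::C scripts shuffled each sequence independently or used another scheme is not stated here, so this per-sequence shuffling is an assumption, as are the function names.

```python
import random


def shuffle_sequence(seq, rng):
    """Return a random permutation of seq (same length, same base counts)."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def randomize_set(sequences, seed=0):
    """Shuffle every sequence of a data set with a reproducible generator."""
    rng = random.Random(seed)
    return [shuffle_sequence(seq, rng) for seq in sequences]


# Example: two 21 bp tags; each shuffled copy keeps its composition.
tags = ["CATGAAACCCGGGTTTACGTA", "CATGTTTTTTTTTTTTTTTTT"]
for original, shuffled in zip(tags, randomize_set(tags, seed=42)):
    print(original, shuffled, sorted(original) == sorted(shuffled))
```

Passing a different seed for each replicate gives independent randomizations while keeping every run reproducible.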

Browsing data sets

Mapped tags are also available from the web site through a set of dynamic tables (Figure 4A). They were implemented using the jQuery jqGrid-4.5.2 [112] library, an Ajax-enabled JavaScript control to represent and manipulate tabular data on the web. Those tables summarize the tags along with their mappings on the different publicly available transcriptomes (which were downloaded from the locations cited in the respective papers [10-17]), their correspondence with the Smed454 transcriptome, and their annotation.

The transcriptome browser shown in Figure 4B was initialized with the Smed454 [10] contigs using the GBrowse2 engine [113]. The browser also includes high-scoring segment pairs (HSPs) from whole-transcriptome BLAST searches performed over the UniProt database [114] (NCBI BLAST+ 2.2.29 [51] with default parameters), as well as the Pfam [115,116] domains mapped by HMMER [117] (with E-val = 1 and domain E-val = 1) on the six-frame translations of the contig sequences. DGE tag sequences, together with the corresponding counts, normalized scores, their ranks, etc., were uploaded to the GBrowse2 MySQL database, and they are shown in the browser using a customized version of the Bio::Graphics::Glyph::xyplot module.

Functional annotation was projected from the UniProt GO annotations over the homologous Smed454 contig sequences. A two-tailed hypergeometric test, which accounts for significantly over-represented (positive tail) or under-represented (negative tail) terms, was performed by comparing the set of GO terms assigned to the transcriptome contigs over-represented in each of the cell fractions against the set of GO annotations for the whole set of contigs. The significance threshold was set to p < 10^-5, and the results are summarized in Additional file 6 for the different cell fraction sets.
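As an illustration of the test described above, the following Python sketch computes both tails of the hypergeometric distribution for a single GO term. The parameter names are assumptions made for this example: N is the number of annotated contigs in the whole transcriptome, K the number of those carrying the GO term, n the number of contigs in the cell fraction set, and k the number of contigs in that set carrying the term. It is not the script used in the study.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Probability of drawing exactly k annotated contigs in a sample of n."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def hypergeom_tails(k, N, K, n):
    """Return (lower, upper) tail p-values for observing k.

    lower = P(X <= k): evidence for under-representation (negative tail).
    upper = P(X >= k): evidence for over-representation (positive tail).
    """
    k_min = max(0, n - (N - K))  # smallest feasible value of X
    k_max = min(K, n)            # largest feasible value of X
    lower = sum(hypergeom_pmf(i, N, K, n) for i in range(k_min, k + 1))
    upper = sum(hypergeom_pmf(i, N, K, n) for i in range(k, k_max + 1))
    return min(1.0, lower), min(1.0, upper)


# Toy example: 10 contigs, 5 with the term, a set of 5 contigs all annotated.
lower, upper = hypergeom_tails(5, N=10, K=5, n=5)
print(lower, upper)
# A term would be called significant when either tail falls below 1e-5.
```

Reporting the two tails separately makes it explicit whether a significant GO term is enriched or depleted in the cell fraction.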

Gene nomenclature

New genes were named following the nomenclature proposed for S. mediterranea [118]. Names were based on the BLASTx homology (NCBI BLAST+ 2.2.29 [51] with default parameters against the UniProt database [114]) to the human homolog, using the official gene name approved by the HUGO Gene Nomenclature Committee (HGNC) [119] whenever possible, while trying to honor the names of other members of the family if they had already been described for S. mediterranea. When no significant homology was available for a gene, its characteristic domain, as found at the Pfam site [115,116], was used to identify it.

Gene sequences and primers used for cloning are deposited at the GenBank [110] site; see Table 4 for the accession numbers of the sequences.

Irradiation

For experimental protocols requiring irradiated animals, irradiation was carried out at 75 Gy (1.66 Gy/minute) in an X-ray cabinet MaxiShot 200 (Yxlon Int.) at the facilities of the Scientific and Technological Centers of the University of Barcelona (CCiTUB).

In situ hybridization

WISH was conducted for gene expression analysis, as previously described [120,121]. Images from representative organisms of each experiment were captured with a ProgRes C3 camera (Jenoptik) through a Leica MZ16F stereomicroscope. Animals were fixed and hybridized at the indicated time points.

Fluorescence in situ hybridization

For double FISH, animals were treated as described elsewhere [122]. Confocal laser scanning microscopy was performed with a Leica SP2.

Immunohistochemistry

Immunostaining was carried out as described previously [123]. The following antibodies were used: α-SYNORF-1, a monoclonal antibody specific for SYNAPSIN, which was used as a pan-neural marker [124] (1:50; Developmental Studies Hybridoma Bank); and α-phospho-histone H3 (H3P), which was used to detect mitotic cells (1:500; Cell Signaling Technology). Alexa 488-conjugated goat α-mouse (1:400) and Alexa 568-conjugated goat α-rabbit (1:1000; Molecular Probes) were used as secondary antibodies.

RNAi experiments

Double-stranded RNAs (dsRNA) were produced by in vitro transcription (Roche) and injected into the gut of the planarians as previously described [5]. Three aliquots of 32 nl (400-800 ng/μl) were injected on three consecutive days with a Drummond Scientific Nanoject II injector. Pre- and post-pharyngeal amputation of head and tail followed on the fourth day. If no phenotype was observed after two weeks, a second round of injection and amputation was carried out in the same manner, unless otherwise stated. Control organisms were injected with gfp dsRNA.

Availability of supporting data

All data sets are fully available without restriction. Although the relevant data sets are already included within this article and its additional files, further supporting material, as well as updates, will be publicly available through the project web site [https://planarian.bio.ub.edu/SmedDGE].

Raw sequencing data in FASTQ format, along with processed tag sequences and their associated expressions, have been deposited at NCBI Gene Expression Omnibus (GEO) [105]; they are accessible through GEO Series accession number GSE51681 [106]. Gene sequences and primers used for cloning are deposited at the GenBank [110] repository; the corresponding accession numbers for the gene sequences are listed in Table 4.