Background

The proper development and viability of an organism is dependent on a group of genes called essential genes. In humans, gene essentiality has been long associated with many diseases such as miscarriages [1, 2], heritable diseases, and cancer [3]. Recent studies have shown that over-expression of some essential genes promotes cell proliferation in cancer [4]. Due to its importance for survival, essential genes have been targets for new therapeutics or antimicrobials [5]. To effectively study essential genes, generating lethal alleles in model systems is required. In the nematode Caenorhabditis elegans, the essential gene set is the largest set of genes and is estimated to contain 25% of all the genes [69]. Using RNAi, about 3500 genes have been annotated as essential (data collected from WormBase [10, 11]). Inparanoid, a sequence based orthology inference tool, detects about 40% of the C. elegans genes are orthologous to the human genes. But approximately 60% of the essential genes show clear human orthologs, showing high conservation of essential genes, which makes C. elegans an excellent platform for examination of essential gene functions that are relevant to human health. Many important genes, such as let-60/Ras [12] and let-740/dcr-1[13, 14], were first discovered through C. elegans genetics. However, the genetic resource for studying these genes is severely lacking. Even with the concerted community effort such as the C. elegans Deletion Mutant Consortium [15], mutations in many essential genes are still lacking in the knock-out collection. The consortium have generated close to 6000 knock-out strains since 1998, but only 1436 essential genes are in the current collection [16, 17]. In addition to the considerable time and effort required to generate a single knock-out allele, an outstanding disadvantage of the targeted deletion approach is that extra effort is needed to balance the lethal mutation [18]. Recently, the Consortium has adopted a procedure of random mutagenesis followed by whole genome sequencing (WGS) to generate and identify a large number of mutations [15]. Although this project can generate more mutations in shorter time, their method does not capture mutations that exhibit lethal phenotypes, and thus, essential genes are selected against. This outcome indicates thousands of essential genes do not have knockout alleles.

To complement the effort of the C. elegans community, we took advantage of the balancer system, which was developed 30 years ago for capturing and maintaining lethal mutations, with the next-generation DNA sequencing technologies. Almost 70% of the C. elegans genome have been successfully balanced by large genomic rearrangements [18]. By mutagenizing a pre-balanced strain removes the need to perform additional genetic crosses to balance a lethal mutation. The balancer system, designed specifically to capture and maintain lethal mutations, is the system of choice for generating mutations in essential genes. Such screens have been carried out for regions in chromosome I [19, 20], II [21], III [22], IV [2326], V [27], and X [28, 29]. In our laboratories, we have generated over 1350 lethal mutations that fall into 486 complementation groups.

The next hurdle in the analysis of essential genes is the molecular identification of the genomic lesion, which to date has involved an enormous effort. Traditional methods of gene cloning that rely upon candidate identification of mapped mutations can take months or years. This gene-by-gene approach was only able to characterize 30 essential genes from our library to date. This problem has been difficult to solve until the recent advances in sequencing technology. To address the problem of coding region identification, we have recently developed a fast and scalable pipeline that takes advantage of whole genome sequencing and bioinformatics analysis to identify the causal mutation responsible for the lethal phenotype [30]. Recent studies, including our initial analysis of let-504[30], have shown that whole genome sequencing is an efficient and cost-effective approach to identifying the encoded gene product especially when there are additional alleles that can be sequenced to provide confirmation [3034]. In this paper, we describe our approach of combining an established mutagenesis technique with the latest sequencing technology in order to close the gap in the essential gene knock-out collection.

Results and discussion

Chromosome I left has a high percentage of essential genes

The leftmost 7.3 Mbp of chromosome I has the highest percentage of mapped essential genes and closest to saturation with 237 essential genes isolated and mapped [19]. The mutant strains were derived by mutagenizing KR235 [dpy-5 (e61), +, unc-13 (e450)/dpy-5(e61), unc-15(e73), +; sDp2] with a low dose of EMS and isolating let-x dpy-5 unc-13 homozygotes rescued with a third wild-type allele of dpy-5 and let-x balanced by free duplication sDp2[35] (see schematic in Additional file 1). In order to position the genes, mutations were mapped into 60 zones using a combination of three-factor mapping and complementation to a series of duplications and deficiencies [19]. Within zones, lethal mutants were inter-complementation tested. The earliest developmental arrest stages were determined for each complementation group [19]. The candidate lesions are present in two copies and rescued by a third wild-type allele on sDp2. Thus, our high throughput identification method focused on finding heterozygous mutations that exhibit an allelic ratio between the range of 40% to 90% [30]. In order to assess the accuracy of our recently developed high throughput method [30], we selected 81 genes from this set with the criteria that they formed a complementation group having more than one allele (Additional file 2). The extra alleles provide an added resource for validation. We sequenced 10 indexed genomic DNA samples per Illumina HiSeq lane and obtained a total of 385 Gbp of sequence. The sequencing reads were aligned using BWA [36] to the WS200 C. elegans reference sequence. We achieved 30X coverage on average across the whole genome and an average of 35X coverage in coding elements. In the case of two strains, only 6X coverage was obtained: let-369(h125) and let-594(h407). Genomes from these two strains were removed from subsequent analysis.

The mutational landscape provided a quality check

Our first analytical step, as a quality check, was to confirm the presence of the dpy-5 (e61) and unc-13 (e450) mutations in each genome. For unc-13, the expected variant ratio should be 100% because the duplication does not extend far enough to provide an additional wild-type allele. For dpy-5 however, there is a wild-type allele on sDp2, and thus we would expect to see a 66% variant ratio. We found the expected ratios in 76 of the 79 genomes. Three genomes deviated from the norm: let-516(h144) is missing both e450 and e61 (all the reads supported the reference sequence); let-388(h88) is missing e61; let-393(h225) has e61, but with a 33% ratio rather than the expected 66%. We examined these strains for the presence of the duplication sDp2. When the duplication is present, the read depth is 33% greater in the first 7Mbp of chromosome I than for rest of the chromosome. Our analysis showed that none of these three genomes showed any depth difference (Additional file 3). It is likely these strains do not carry sDp2. Although sDp2 does not crossover with the normal homologs at a readily detectable frequency, rare exchange events can occur resulting in subsequent loss of the duplicated fragment [37]. These three strains were not analyzed further.

Coding sequence correlated with high confidence

We analyzed the parental strain KR235 and identified 571 SNVs and 167 small indels that show >40% read support on Chromosome I when compared to the C. elegans WS200 reference using VarScan (see Methods). These mutations represent the background mutations in which the lethal mutations were maintained. For the remaining 76 genomes, we filtered out the background mutations and found on average 44 SNVs that show >40% allelic ratio in the sDp2 region. Most of the SNVs are G > A or C > T changes as expected and previously observed after EMS treatment [30, 38]. We also found an average of 7 small indels of 1–2 bps. We categorized each mutation as either nonsense, missense, synonymous, splice signal disruption, frame shifting indel, frame preserving indel, or noncoding mutation. Noncoding mutations were defined as any mutation located outside of coding regions. A full list of SNVs and indels, for each strain, is available on our server at http://lethal.mbb.sfu.ca/jschu/essential_genes.

We identified candidate mutations for the 76 genomes using our bioinformatics pipeline that we developed previously [30] (see also Methods) and validated a subset of our candidates by sequencing a second allele or by complementation testing (Table 1). Nine of our candidate lesions were in genes that had been previously identified and published. In a few cases, candidates expected to be in separate genes were located in the same coding region. These observations were confirmed by further genetic complementation tests (Additional file 4). Previously identified let-631 and let-103 were found to be allelic to let-363. As a result, let-363 gains three new sDp2-balanced alleles (h216, h451, h502) in addition to the nine existing ones. let-519 and let-104 are allelic to let-526 and thus let-526 gains four new alleles: h799, h373, h405 and h526. let-630 fails to complement let-596 and now has five alleles: h355, h702, h432, h782, and h258. Thirty-five candidates were tested by sequencing a second allele using previously published complementation data [19]. Of these, we confirmed 29 identities. All in all, we tested 48 candidates and confirmed 42 (87.5%). For the remaining 28 genomes, we have high confidence in the identity of 22 genes based on their map position. Thus, including previously described let-504, we now have coding region assignments for 64 let- genes in the sDp2 region. Because the genes in this study all have multiple alleles, thus by inference, we have confidently identified the coding regions affected in a total of 259 mutant strains (Additional file 2).

Table 1 Coding DNA Sequence (CDS) identifications of let- genes

Seven of these genes have been molecularly identified and phenotypically described. let-603, an aurora kinase [46], and let-605, the cyclin E, had severe gonadal defects [56]. let-355, a DEAD box helicase, and let-384, an integrator subunit, failed to develop gametes [56]. let-370, let-599, and let-604 produced malformed embryos that were not laid or hatched [56]. let-370 encodes a hexaprenyl pyrophosphate synthetase that is associated with Parkinson’s disease [51]. let-599 encodes the N-acetyl transferase nath-10. let-604 encodes mdt-18, a mediator subunit. A comprehensive summary of the let- encoded products is given in Table 1.

Novel knock-out alleles provide new genetic resources

We have generated new alleles for 13 genes that currently have no knock-out alleles available: let-595 (imb-1), let-362 (Y71G12B.8), rnp-6 (let-147), aars-2 (let-366), let-598 (F27C1.6), let-355 (T05E8.3), let-384 (C06A5.1), fars-1 (let-396), let-611 (C48E7.2), mdt-18 (let-604), acdh-5 (let-383), rpb-5 (let-397), and let-630 (Y110A7A.19). Eight of these genes are predicted to have roles in essential basic functions such as transcription or translation. This is not surprising, because we expect genes that function in basic cellular processes to be essential and are best captured using balancer systems. Besides these novel alleles, we have provided additional loss of function alleles for many characterized genes (Table 1). Additional alleles affecting different parts of the gene may disrupt different domains providing an allelic series correlating with different phenotypes.

Genetic strains carrying heritable mutational changes provide a lasting resource that can be used in a variety of experimental conditions and compared to information gained from RNAi knock-down experiments. We cross-checked our high confidence list with the RNAi data annotated in WormBase to see if the lethal phenotype was observed in at least two RNAi experiments. Although for the most part, RNAi data agrees with our mutational data, not every gene was supported by RNAi. We found nine genes showing no lethal phenotype with RNAi and three genes showing lethal phenotype of variable penetrance (Table 1). Of the nine genes that show no RNAi lethal phenotype, six (inx-12, coq-1, lim-7, tag-146, let-381, and let-503) have additional knock-out alleles that are lethal, suggesting RNAi did not reveal the null phenotype of these genes. The additional information provided by genetic mutation highlights the importance of our collection.

Essential genes in sDp2function in cell cycle and cytokinesis, transcriptional regulation, and RNA processing

To identify the processes that are essential, we investigated the function of our high confidence gene set along with their orthologs in D. melanogaster (fly), S. cerevisiae (yeast), and H. sapiens (humans). Essential genes are often conserved due to their important biological roles. Fifty-four of our identified essential genes have readily identifiable orthologs in humans [57] (Table 1). We further categorized each gene into at least one of eight functional groups based on their GO annotations (Figure 1). To have a better picture of the roles of different essential genes, multi-functional genes were categorized into more than one functional group. The cell cycle & cytokinesis, transcriptional regulation, transport, RNA processing, and transcription categories contained more genes than did the groups representing translation, signal transduction, and the other groups that includes metabolic and structural processes.

Figure 1
figure 1

Functional categorization of essential genes identified in this study using GO terms. The Y-axis indicates the GO term categories. The X-axis represents the number of genes in each category. Random sampling of 1000 iterations was done by selecting equal number of genes from either all sDp2 genes or the set of all essential genes identified by RNAi. Error bars represent standard error.

Of these eight functional groups, we found three groups that were significantly enriched in the sDp2 region when compared to the non-essential genes in sDp2: cell cycle & cytokinesis (p = 3.61e-9, χ2 test), regulation of transcription (p = 6.21e-8, χ2 test), and RNA processing (p = 6.35e-12, χ2 test). Our analysis indicates that members of these processes are enriched in essential genes. We have previously shown that components of the spindle assembly checkpoint are essential for survival [58]. Here we showed that genes in the sDp2 region function in various phases of the cell cycle. For instance, let-380 (knl-2) is critical for loading hcp-3 (CENP-A) to chromatin and forming the kinetochore [59]. let-603 (air-2), let-597 (hcp-4), and let-106 (hcp-6) remove cohesions for proper resolution of centromeric connections and segregation of homologous chromosomes during meiosis [6062]. let-365 (sep-1) is essential for chromatid separation and proper anaphase. In addition, let-364 (mat-1), a member of the anaphase promoting complex (APC), is crucial for the transition from metaphase to anaphase [63]. lin-6 (mcm-4) is required for DNA replication and activates a checkpoint when entering into M phase [39]. let-599 (nath-10) and let-354 (dhc-1) are crucial for cytokinesis during cell division [64, 65]. let-385 (teg-4) is a component of splicing complex A that functions in the meiosis entry decision [66, 67]. Our data indicate that disrupting any phase of the cell cycle process can lead to lethality.

Are functions of the essential genes identified in this study representative of all essential genes? Random sampling simulation from 3500 essential genes indicated by RNAi shows a very different GO term distribution (Figure 1). In the larger set samples, we observed that cell cycle and cytokinesis (p = 1.02e-22, χ2 test), regulation of transcription (p = 2.48e-20, χ2 test), and RNA processing (p = 5.43e-10, χ2 test) are under-represented compared to our sequenced set. Although we acknowledge that comparing lethal mutants to RNAi phenocopies is not fully equivalent, at the present time there is not a large enough mutant essential gene collection to do this comparison. It is intriquing nevertheless to raise the question of regional differences in essential gene functions and we look forward to having a more complete dataset that can be used to address this issue.

Essential gene transcripts are supplied maternally

From the set of 59 essential genes, 34 of them arrest development as embryos or early larvae, indicating that they are important early in development. To test this hypothesis, we analyzed the temporal expression of these genes using RNA-seq divided into 23 separate 30-minute embryonic stages, 4 larval stages, pre-gravid young adult stage, and the young adult stage. The normalized RNA-seq data was obtained from the modENCODE project [68, 69].

Seven distinct patterns were seen from the heatmap (Figure 2). Five genes (colored red) express highly during mid-embryonic stage (300 min – 600 min), six genes (colored blue) express highly during late-embryonic stage (600 min – hatch), and seven genes (colored green) express highly in both mid-embryonic and late-embryonic stages. Eighteen genes (colored purple) show elevated expression very early in embryonic development (0 min – 300 min). Most of these genes, however, had a dramatic drop in expression level at 150 min, which is when gastrulation occurs [70]. Observing that many of these genes also show strong expression in young adults but not in larval stages suggests that these messages are highly transcribed in the germline and are likely maternally derived in the embryo. On the other hand, nine genes (colored brown) show some early embryonic expression but have their strongest expression during mid-embryonic stages. A group of four genes (colored orange) show specific expression during gastrulation. Lastly, eight genes (colored black) have elevated expression during specific larval stages.

Figure 2
figure 2

This figure represents the normalized transcript level (read number per coding length per million reads) for each gene across the developmental stages including 23 embryo stages separated by 30 minute interval, four larval stages (L1-L4), pre-gravid young adult, and gravid young adult. For comparing germline expression, we’ve included the transcript level from JK1107 carrying a mutation in glp-1, which is essential for mitotic germ cell proliferation [71]. The heatmap represents normalized transcript level from high (yellow) to low (blue). Seven distinct clusters that are based on their expression pattern are shown by colored branches. Purple: early-embryonic; Brown: early- and mid-embryonic; Red: mid-embryonic; Blue: late-embryonic; Green: mid- and late-embryonic; Orange: gastrulation; Black: larval.

From the RNAseq data, we observed 18 genes with expression patterns that indicated maternal contribution during early embryogenesis. This ratio is not significantly different from the set of all essential genes. However, when compared with the set of non-essential genes, our essential gene list is significantly enriched for genes with strong maternal contribution (1.24e-5, χ2 test). These data indicate that many essential genes important for early embryonic development have maternal contribution.

Conclusions

The function of essential genes is poorly understood. Having a combination of genetic strains for which the molecular identity is known would provide a powerful resource for their study. However, even in the model system C. elegans, only about 25% of the essential genes have a knockout alleles. RNAi has also been used to identify essential genes [72, 73]. Despite the success of these studies, only a small subset (~800 genes) have been profiled phenotypically [72]. We have a large collection of mutant strains, but only now has it been technically feasible to easily identify their corresponding coding regions. Our library currently consists of 1350 lethal mutations maintained by balancers in chromosomes I, III, IV, and V, of which chromosome I is the closest to saturation [19]. Recent whole genome screening experiments using the CRISPR/Cas9 system have opened up the possibility of identifying essential genes using this targeted approach. However, targeted approaches directed towards identifying essential genes in an intact multicellular organism are still limited in terms of recovery and maintenance of lethal mutations and impractical for large scale screens. The relative ease of capturing and maintaining lethal mutations makes balancer systems the method of choice for essential gene studies. However, using random mutagenesis is not possible to achieve 100% saturation (finding all essential genes). Small targets have a smaller chance of being mutated and are likely missed in mutagenesis experiments. Also, finding new essential genes in subsequent screenings becomes more and more difficult because the screens follow (approximately) a Poisson distribution giving diminishing returns. Thus, a combination of targeted and forward mutational approaches is best.

We previously developed a pipeline and applied it to the identification of let-504[30]. In the analysis presented here, we applied the pipeline to further analyze 76 essential genes on Chromosome I and produced high confidence identification for 64 genes. Some of the confirmed candidates were found outside the mapped region suggesting that the boundaries of the genetically identified zones can be further refined. We have shown that our approach is much more efficient and cost-effective than the traditional method. Assessments from this study will help us improve our identification pipeline and give us the confidence to apply this technique to the rest of our collection of essential genes.

Our results here provide additional alleles to known genes as well as provide new alleles. The added alleles will be valuable for establishing allelic series that may exhibit different phenotypes. For instance let-147/rnp-6 has 4 alleles each showing a different arrest stage [19], suggesting different protein domains are being disrupted. More importantly, our results provided 13 new alleles in essential genes where no alleles existed. The genetic resources provided with our method will be beneficial to the field of essential gene research.

We have demonstrated here that Let mutants can be used, not only individually to study the gene’s function, but analyzed as a group to better understand the functions a living multi-cellular animal needs for survival. Understanding the function of individual essential genes has applications for medicine. Essential genes in bacteria have been exploited to develop new antimicrobials [5]. An understanding of essential genes can be exploited for new medical uses. For example, the human ortholog of let-400/prpf-4, has been found to induce G1/S arrest and may function as a cancer suppressor [55]. Therefore, a resource such as described here for identifying and studying essential genes in model organisms has direct benefit.

We have shown that essential genes in the left half of chromosome I in C. elegans function in cell cycle control, transcriptional regulation, and RNA processing. Previous reports studying other genomic regions have shown different gene classes such as those regulated by the GATA transcription factor [74] and the sex-regulated genes [75] are non-randomly distributed in the genome. Thus, we believe the organization of these genes within the genome is also non-random. With our method, it is now possible to generate genetic resources to capture the majority of the essential genes. The study of which will provide us with a global picture of the minimum set of genes and pathways that is needed for the survival of a multi-cellular organism, and their organization in the genome. An increased understanding of the nature of essential genes is relevant not only to our knowledge of the biological survival of the organism but also has the potential for better medical procedures.

Methods

Strains

The strains used in this study are listed in Table 1. We have listed all the other available alleles for each let- gene in Additional file 2. The strains were grown and maintained on nematode growth medium streaked with E. coli OP50 [76]. The strains used in this study were generated by mutagenizing KR235 [dpy-5 (e61), +, unc-13 (e450)/dpy-5(e61), unc-15(e73), +; sDp2] with 12 mM EMS [35]. Briefly, the treated gravid wildtypes were individually plated on 5 cm plates and wildtype gravid F1s were also individually plated 5 days later. Their progeny (F2s) were screened for the absence of Dpy-5 Unc-13 individuals (Additional file 1). A single Unc-13 animal was transferred to confirm the existence of a lethal mutation. A balanced lethal would exhibit Unc-13 and developmentally arrested Dpy-5 Unc-13 [35]. All the strains were maintained at 20°C and by selecting Unc animals. Each strain was grown from one hermaphrodite and expanded to 20 2-inch plates. The worms were collected by rinsing the plates with M9 (6 g Na2HPO4, 3 g KH2PO4, 5 g NaCl, 0.2 g MgSO4 in 1 L of H20). The worms were washed with 12 ml of M9 three times and incubated at room temperature for 2 hours. The final pellet was frozen in -80°C.

Genomic DNA extraction and sequencing

Genomic DNA was extracted by phenol/chloroform as described previously [30]. Briefly, the worm pellet was lysed in 0.5% SDS and 100ug of Proteinase K in 50°C for two hours. DNA was extracted with phenol/chloroform three times and precipitated with 100% ethanol. 20 ug of RNase A was added to the eluted sample to remove RNA contaminants and this was followed by three more rounds of phenol/chloroform extraction and ethanol precipitation. 10 ug of purified genomic DNA was sequenced at the BC Cancer Agency Genome Sciences Centre using Illumina PET HiSeq technology.

Mutation identification procedure

Sequencing reads were aligned to the WS200 C. elegans genome using BWA [36] under default settings. Duplicated reads were filtered with GATK [77]. Further realignment around indels was also done with GATK. The BAM files were analyzed for SNV and small indels using Varscan [78]. The SNVs or indels returned by Varscan were filtered by 1) mutations in the parental strain KR235 mutation, 2) variant ratio (90% > x > 40%), and 3) genomic location (in coding sequences only). Allelic ratio was calculated as the ratio of mutant allele:reference allele. The effect for each CDS from the accumulate effect of the mutations in the genome was analyzed using Coovar [79]. Mutational landscape analysis was done using SNVs exhibiting G > A or C > T transitions as described previously [30]. Each genes in the sDp2 carrying a non-synonymous mutation was considered and ranked according to the severity of the mutation. Mapping information from [19] was used as a guide to find the most likely mutation. The mutations for each strain can be downloaded from http://lethal.mbb.sfu.ca/jschu/essential_genes.

Sequencing of a second allele was done with Sanger sequencing or WGS. PCR primers were designed using Primer3 [80, 81] spaced 250 bp apart with staggered orientation. This allowed sufficient overlap so that each position was covered at least twice. The Sanger reads were aligned to the wildtype transcript sequence using Clustal [82]. The alignments from each Sanger read were merged and analyzed with Bioedit. A mutation was confirmed if it was supported by all the Sanger reads and the sequencing traces show a clear double peak. A prediction was also confirmed when WGS of a second allele has a different mutation in the same gene.

Confirmation by complementation testing

Allelic combinations were established previously by complementation testing as described in [19] with the following exceptions. In a few cases, candidate SNVs were found for mutations, which were previously described as mapping to separate zones, in a single coding region. In these cases complementation testing was done between mutations predicted to be in the candidate coding region and confirmed that they did form a single complementation group as shown in Additional file 3.

Strains carrying a lethal mutation were selected for complementation testing with other lethal-carrying strains based on the identification of candidate mutations in the same gene. In order to determine allelism, let-x dpy-5 unc-13/let-x dpy-5 unc-13; sDp2 hermaphrodites were mated to wild-type males. F1 males (let-x dpy-5 unc-13/ + + +) were crossed to hermaphrodites carrying a second lethal (let-y dpy-5 unc-13/let-y dpy-5 unc-13; sDp2). The diagnostic phenotype indicating complementation in the progeny of the cross was Dpy Unc males and fertile hermaphrodites (let-x dpy-5 unc-13/let-y dpy-5 unc-13). A minimum of ten wild-type males on one plate was considered sufficient to conclude that the absence of Dpy Unc animals was not due to poor mating.

Gene ontology analysis

Orthologs were predicted by a set of programs consisting of Inparanoid [83], OrthoMCL [84], and Ensembl-Compara [85] with methods as previously described [57]. The protein sets used were: C. elegans (WS230), S. cerevisiae (64-1-1), D. melanogaster (r5.46), and H. sapiens (GRCh37.66). GO annotation was done using Blast2GO [86]. GO profile comparison was done using all the genes under sDp2 and all the essential genes as identified by RNAi collected from WormBase WS230.

RNA-seq expression analysis

Normalized RNA-seq data were downloaded from the modEncode website (http://www.modencode.org). The average normalized read count for each CDS was calculated as the total normalized read count of all coding base-pairs divided by the length of CDS. The expression profile clustering was done using agnes clustering in R.