Background

Genome sequencing provides the opportunity to assess fundamental biological processes of genome evolution [1]. With the increasing of finished genome sequences, the field of genome evolution is experiencing a renaissance of activity and many questions of genome architecture as well as genome evolution are resolved using computational studies. However, the identification and characterization of highly homologous sequences in the genome remain problematic. Segmental duplications (SDs), defined as low-copy repeats of DNA segments (blocks of sequence ≥ 1 kb in length and showing ≥ 90% sequence identity), are a class of homologous sequences. Since SDs are hotspots of copy number variance (CNV) as well as pools of gene innovation and disease-causing rearrangement [215], they have long been regarded to be involved in functional redundancy, adaptive evolution, and structure dynamics of chromosomal evolution. Thus, identification and annotation of SDs are important for understanding the structure and evolution of a genome.

Up to now, although the analyses of SDs have been done in many organisms whose genome sequences were completed [211], no analysis has ever been performed in the domesticated silkworm, Bombyx mori. The silkworm genome sequence has been released [16, 17] and the amounts of hierarchical bacterial artificial chromosome (BAC) data are available, this provides us an opportunity to identify and annotate SDs in the silkworm genome. In this study, we used two computational methods to identify the SDs. The first one, named whole-genome assembly comparison (WGAC), is a BLAST-based approach that performs an all-by-all comparison of assembled genome sequence [18]. The second one, whole-genome shotgun detection (WSSD), develops a model for distinguishing unique and duplicated sequence on the basis of the depth of coverage after whole-genome shotgun sequence reads were aligned to a reference genomic segment [4]. Duplication regions would display a higher reads depth than depths-of-average. Experiments (real time fluorescent quantitative PCR (q-PCR)) have been used to validate these large duplication sequences [1921]. Here, we present a set of the silkworm SDs that provides a framework for future evolutionary study. In addition, this resource also provides invaluable information in finishing the silkworm genome.

Results

Genome-wide identification of the silkworm SDs

We used two well-established computational methods, whole genome assembly comparison (WGAC) and whole genome shotgun sequence detection (WSSD), to detect putative SDs in the silkworm genome. The central aspect of WGAC is to generate a compact version of the silkworm genome sequence by firstly removing high-copy repeats from the genome using RepeatMasker (http://www.repeatmasker.org/). This pipeline has two advantages: (1) the BLAST search is faster because of the overall reduction in sequence content; (2) it enhances the ability to detect duplications riddled with high-copy repeats that would be missed. Remarkably, we identified 5.17 M or 73937 pairwise alignments as duplications by WGAC analysis (≥90% identity and ≥1 kb in size) (Figure 1). And about 42% of these duplications were mapped to the unassigned chromosome – ChrUn.

Figure 1
figure 1

The comparison of silkworm SDs identified by WGAC and WSSD pipelines. There are 5.18 Mb (orange) and 1.44 Mb (Green) putative duplications predicted by WGAC and WSSD algorithms, respectively. The overlapped region was SDs predicted using both analyses. The shaded orange was SDs defined by WGAC with sequence identity less than 94%.

We also applied the WSSD strategy separately to the silkworm genome (432 Mb) and BACs (8.2 Mb) sequences. This step is to map 3.8 million reads against both the genome sequence and BACs data to assess the reads depth (RD) and divergent read ratio in 5-kb windows sliding in 1 kb steps (see Methods). We found that there were 12 large duplication segments in BACs and 117 in the genome (≥94% identity and ≥10 kb in size; Figure 1; Table 1). Like the SDs in Drosophila genome [22], the silkworm genome seems to be significantly poor in large block (>10 kb; Figure 1).

Table 1 q-PCR validation of a subset of WSSD duplications in BACs

In this study, we totally detected 6.6 Mb SDs in B. mori, which cover ~1.4% of the silkworm genome (6.6 Mb/432 Mb; Figure 1), size ranging from 1 kb to 23 kb (Additional file 1). Previous studies suggested that high-identity duplications (identity > 94%) frequently collapsed within working draft sequence assemblies [23] and may represent artificial duplications within an assembly [18]. We compared the WGAC results to those detected by WSSD approach and found that 45.1% of the SDs identified by WSSD were not detected by WGAC, which may be caused by collapsed duplications (Figure 1). In addition, we also found that 0.79 Mb of the duplications detected by WGAC were also detected by WSSD, and these are the high-confidence SDs in the silkworm genome.

Our results showed a bias toward interchromosomal duplications compared to intrachromosomal alignments (Figure 2), which greatly differs from previous observations [6, 22, 24]. However, this is not enough for us to make a conclusion that the silkworm genome had a bias for interchromosomal duplication since about 39.5% of these interchromosomal duplications were assigned to the ChrUn (Figure 3; Additional file 2). We noted that a fraction of the silkworm genome sequence (~60 Mb, 13% of the genome) has not been assigned to the chromosomes, which is about six silkworm chromosomes long and relatively larger compared to other sequenced genomes [4, 6, 24]. In this study, we treated the unmapped sequences as a separate chromosome, ChrUn. The intrachromosomal duplications can be further categorized as tandem duplications and interspersed duplications. We found that there were about 29.5% of intrachromosomal duplications (n = 450) mapped within 1 Mb of one another and within these duplications, the majority of them were tandem duplications with no one gene separating them.

Figure 2
figure 2

Silkworm SDs and its paralogous regions in the genome. ( A ) SDs in the silkworm (only one paralogous region of SDs was shown). ( B ) Patterns of intrachromosomal segmental duplications.

Figure 3
figure 3

Sequence property of the silkworm SDs. The proportions intrachromosomal and interchromosomal duplications are shown. The shaded yellow region represented the SDs whose paralogous sequences distributed on ChrUn.

Patterns of the silkworm SDs

The distribution of SDs on chromosomes (not including ChrUn) is largely nonrandom (Figure 4). Firstly, as expected, the “uncharacterized chromosome” (ChrUn), which can not be uniquely mapped to the genome, contained the majority of SDs (~42%; Figure 3; Additional file 2). And different chromosomes contained various SDs contents (Figure 2; Figure 4). Chromosomes 1, 5, 22 and 27 had the highest SDs densities (Figure 2; Figure 4) with >1.5 folds of the duplication content of the genome average (unplaced contigs were excluded), while the values for chromosomes 8, 19, 20 were much less than half of those (Figure 2; Figure 4). Besides, previous studies demonstrated that SDs are enriched in pericentromeric and subtelomeric regions [22, 24]. Although the silkworm chromosomes are holocentromere, we did find increased contents of SDs in some regions along some chromosomes (p-value < 0.05, Chi-square tests) (Figure 4; Additional file 3). Further survey showed that some gene families are enriched in these SDs, such as odorant receptor gene cluster, ras-related protein and alkaline phosphatase gene cluster, which is basically consistent with the findings in Drosophila melanogaster[22].

Figure 4
figure 4

Silkworm SDs for each chromosome. Different color bars represented different lengths of SDs. The large gaps between each scaffold were shown in black rectangle. Genes embedded in each SDs were shown as black bars located under each chromosome.

Among duplication segments, there is a class of large tracts, termed as “duplication blocks” (if some other SDs were identified within 100 kb from the coordinates of a SD, this whole large region is termed as a duplication block and gaps were excluded) [25]. We found that such duplication blocks contained protein-coding genes (Figure 4). The SDs are distributed near the gaps (within 1 Mb of the gaps) of the reference genome sequence for some chromosomes, for example, chromosomes 6, 14, 17 and 27 (45.5%-83.3%), indicating that these gaps themselves would be high-copy duplications. Furthermore, our results showed that a large proportion of SDs in the silkworm genome were on ChrUn. Thus, probably SDs may be the problems of the silkworm genome assembling.

Sequence properties of the silkworm segmental duplications

We analyzed the composition of genes in the SDs. In total, 320 putative genes were identified in the SD regions. Among these 320 genes, 304 were located in the SDs identified by WGAC, while only 65 were in SDs identified by WSSD. 49 genes were overlapping between the two methods. Besides, 50% (159/320) of the silkworm segmental duplication intervals identified by WGAC and WSSD contain gene duplicates (Additional file 4). Although functions of some genes are unknown or hypothetical, a large proportion of 320 genes belong to multigene families, such as Lipoprotein receptors, Histone and P450s (Additional file 5). In order to test the hypothesis that particular gene classes were overrepresented in the SDs [24], we used Gene Ontology (GO) to annotate all the genes. The genes with the functions of binding, catalytic and genes related to metabolic process were enriched in the SDs (Figure 5). Pfam was also used to predict the functions of genes in the SDs and the results showed a similar trend (Additional file 6).

Figure 5
figure 5

GO annotation of the genes in SDs. Gene classes overrepresented in SDs were indicated by blue arrow. In molecular function, binding and catalytic were much higher than other categories (**p < 0.01, T-test). In biological process, metabolic process was much higher than other categories (*p < 0.05, T-test).

On the basis of annotated functions, the genes in the SDs are classified into three categories. The first category includes the genes associated with detoxification (i.e. cytochrome P450); the second one contains the genes related to innate immune response (i.e. serine protease) and the last category includes the genes with functions of environmental signal recognition (i.e. olfactory receptor) (Table 2). Indeed, a previous study suggested that as many as 50 cytochrome P450 genes were present in gene cluster on chromosomes and 78 cytochrome P450 genes were functional in the silkworm genome [26]. Recent studies revealed that the glucose-methanol-choline (GMC) oxidoreductases and 30 K proteins (30 KPs) families experienced lineage-specific expansions in the silkworm [27, 28]. For the GMC gene family, two members (BGIBMGA012997-TA and BGIBMGA012998-TA) of the GMC β subfamily which made a major contribution to expansion of the silkworm GMC genes are located in the SDs. Furthermore, such lineage-specific GMC β subfamily expansion was associated with immunity [27]. In addition, there is a lepidopteran-specific Lipoprotein_11 family in the silkworm, whose members were involved in various physiological processes such as energy storage, embryomic development and immune response [28]. We identified 9 lepidopteran-specific Lipoprotein_11 genes in the SDs. These results showed that SDs might play important roles in the evolution of the silkworm lineage-specific functions.

Table 2 Repertoires and evolutionary mechanisms of selected duplicated genes or gene families in mammals and silkworm

We also analyzed transposable elements (TEs) composition by comparing SDs to the sequences drawn nearby with identical sizes (Table 3, Methods). Strikingly, we found that the content of short interspersed elements (SINEs) in the SDs is much lower than the genome average (2.57% vs. 12.8%). And, SINE content increased when SDs’ flanking sequences were taken into consideration (Table 3). An opposite trend was observed with respect to DNA transposons and long terminal repeat (LTR) retrotransposons. Unlike segmental duplications in human, which were rich in SINE [12], the silkworm SDs were characterized by enrichment of DNA transposons and LTR retrotransposons (Table 3). A similar trend was also observed in the flanking regions of SDs. The high TE enrichment in SDs suggests a potential implication of repeats in SD formation, as described previously [22].

Table 3 Repeat properties of the silkworm genome, duplication, and flanking region

Experimental validation of a subset of SDs

SDs, defined as low-copy repeats of genome segments, can be detected by qRT-PCR-based copy number screening [21]. By qRT-PCR, we validated the SDs in 11 BACs that were determined by WSSD strategy (Table 1). The lengths of these SDs in the BACs range from 11 kb to 13 kb and the corresponding reads depths are listed in Table 1. Previous studies showed that the copy number of SDs should be less than that of TEs [22]. Our qRT-PCR and reads mapping results both confirmed this (Figure 6, Additional file 7), except for the BAC-AP009006.1 (Figure 6). This BAC clone contained a SD whose qPCR result is similar to the TEs. We examined the sequences of this SD and found that this SD contained a large CR1 transposable element (NonLTR). We aligned them with the repeat database and the identity was less than 90%. Consequently, we did not mask this region by RepeatMasker (the cutoff we set before, see Methods).

Figure 6
figure 6

qPCR confirmation. X-axis shows the different BAC clones while Y-axis represents the relative quantification (RQ) value of the reference region.

The SDs in (91%, 10/11) BACs were confirmed to be positive duplications by qRT-PCR (Figure 6). It should be emphasized that not all true duplications could be detected by qRT-PCR, especially low-copy duplications with divergent reads ratio > 0.8 are difficult to be detected. Thus, 9% for false positive rate is a conserved estimate in our WSSD strategy.

Discussion

Quality of SD detection

SDs have been extensively studied in many organisms including vertebrates and invertebrate [5, 6, 8, 18, 22, 24, 25, 29]. Here we performed a systematic analysis of segmental duplication in the silkworm genome using two different approaches, a sequence assembly-based approach (WGAC) and a whole-genome shotgun sequence detection method (WSSD). The power of SD detection depends largely on the quality of the underlying sequence assembly and strategy used. There are four factors that would influence the detection of SDs in genome assembly: (1) depth of genome sequencing, (2) methodology of assembly, (3) quality of genome annotation and (4) level of allelic variation. In order to take advantage of low level of allelic variation, we implemented a modification of WSSD approach described before which is entailed a quality assessment of underlying reads to calculate percent identity and determine the proportion of variants within a certain region in a genome [6].

It should be noted that there are some limitations in this study. On one hands, the number of SDs may be underestimated. For example, regions of extremely high sequence identity may collapse during assembly, which may result in the underestimation of fraction of genome showing relatively high identity. That is why some highly homologous gene families, such as carboxylesterases [30], were not detected in our study, although other highly homologous gene families were detected (i.e., cytochrome P450 genes, serine protease, histones). Besides, the power of SD analysis depends largely on genomic sequence and its annotation. The presence of sequence gaps as well as contig orientation may influence the detection of SDs in a genome. The current silkworm genome sequence only covers about 85% of the genome size and has many gaps. Thus, this study may underestimate the SD content in the silkworm. On the other hand, there may be some false positives in the SDs identification using WSSD. This may be due to the incomplete annotation of repeats in the current silkworm genome. In order to get the accurate information about the large SDs and exclude false positive cases, further annotation of repeats as well as FISH hybridization are needed in further study. In the silkworm genome, about 85% of the SDs are shorter than 2 kb. This suggests that SDs in the silkworm are much smaller that in mammals, which is consistent with other invertebrates such as D. melanogaster[22]. Thus, PCR validation would be more favorable.

Despite these limitations, some other important trends regarding the SDs in the silkworm appeared. Our estimation of the SD content is consistent with that in D. melanogaster[22] but much lower than mammals [6, 15, 24] (Additional file 8). We proposed that this difference may be due to biological reasons to be investigated. A previous study also supposed that SDs in invertebrates are much less than that in vertebrates [31].

Based on a new assembly of the silkworm genome [32], we found that the SDs were distributed in a nonuniform fashion across the genome (Figure 4). For example, there are some SD enrichments on chromosomes (Additional file 3). And there are some SDs that reside within 1 Mb of the “gaps” on chromosomes (Chrs 6, 14, 17 and 27) (Figure 4), suggesting that SDs may be the problematic regions for both clone-based and whole-genome shotgun sequencing methods.

Enlightenment for genome assembling

The published silkworm genome sequence represents one of the first attempts to sequence and assemble a lepidopteran genome mainly based on shotgun sequencing read data. One of the greatest challenges of genome assembly lies in the segmental duplications, because of high degree of sequence identity comparing with each other [31, 3335]. There are three possibilities when SDs are encountered during sequencing and assembling: (1) these SDs may be recognized distinctly and resolved properly; (2) because of the presence of virtually identical sequence reads in the database, the sequences may be underrepresented and (3) SDs may be mistakenly assembled into the genome. The second and the third outcomes create numerous gaps [31]. Thus, genome-wide studies of segmental duplication contents become an effective measure to assess the quality of whole-genome sequence assemblies [36] and provide important information for the users of genome sequence.

There are a few important conclusions drawn from this study with respect to genome assembling. The complex, highly duplicated nature of SDs is not amenable to high-throughput assembly methods without refinement. For example, some whole-shotgun sequence approach, such as Arachne, would collapse highly identical duplications [37]. Currently, three types of gaps are recognized within the working draft sequence [31, 38]. The first type, named as trivial gaps, is no more than 100 bp in length. Gaps between ordered clones or sequence contigs are the second type, which is easily closed by sequencing of bridging clones obtained from pair-end sequence data. However, the third type is more complicated because it is associated with SDs. The solution for this kind of gaps is difficult because we should recognize the SDs first in the genome. Some gaps in the silkworm genome belong to this type, since some SDs are distributed in the flanks of these gaps (Figure 4). The “unplaced” chromosome (ChrUn) showed a significant enrichment for SDs (Additional file 2), with almost 42% of the duplications assigned to ChrUn. Further efforts should target on these regions if we want to get the better sequence of the silkworm genome. Figure 1 showed the comparison of SDs detected by WSSD and WGAC and the results suggested that 9.82% of SDs could only be detected by WSSD. If we use the experimental qRT-PCR data to estimate false positive rate (9%), we conclude that 0.065 Mb SDs have not been resolved within the genome. Thus, our results suggest that, at present, clone-ordered-based approaches for sequence assembly appear to be a more effective resolution for identifying the true locations, organization and complexity of SDs. Furthermore, the intrachromosomal SDs are comparatively less based on the current silkworm genome assembly. Two reasons would contribute to this: (1) as many as 39.5% of interchrmosomal duplications were found to have paraloguous sequences on ChrUn. The gaps on the chromosome might lead to underestimate of the intrachromosomal SDs; (2) the silkworm genome has some distinctive features: there are 28 chromosomes while the genome is only about 432 Mb. The chromosome sizes are relatively small (about average 15.4 Mb for each chromosome); and TEs content is large in the genome (~35%). There is another possibility. Because of short chromosomes, intrachromosomal duplications are so few. A previous study showed that interchromosomal duplications are shorter (median length 2.5 kb) while intrachromosomal duplications are much larger (median length 20 kb) in the bovine genome [24]. However, the silkworm genome is lack of large duplications and most of the duplications were less than 2.5 kb.

SD content analysis

The correct assembly of SDs is not considered to be high priority, especially the draft phase of a genome sequence, due to the gene-poor content of such regions [6]. However, in some organisms, such as human, highly segmental duplications (~6%-7%) were rich in TEs and genes [12]. A similar pattern is also found in the silkworm. The gene content in the silkworm SDs occupies about ~2% of the genome but the SDs constitute only 1.4% of the genome sequence. In addition, some TEs were enriched in the SDs, such as DNA transposons and LTR retrotransposons (Table 3). Comparing with other insects (e.g. fruit fly), the silkworm genome harbors a lot of TEs, about 35% of the genome [39] and LTR retrotransposons are the most common TEs in B. mori[40]. Thus, TEs could be involved in the formation of SDs in the silkworm. Besides, many duplicated genes and gene families were found to reside in the SDs and some of them were implied in lineage-specific adaptations of organisms to a particular environment. Antimicrobial peptide (AMPs) genes, which play important roles in innate immune system in insects [41], were found to be enriched in the silkworm SDs (Additional file 6). Some of GMC genes, which expanded in the silkworm and associated with immunity, were also found in the SDs. The members of the lepidopteran-specific Lipoprotein_11 family and serine protease gene family related to immune response were enriched in the SDs [42]. Furthermore, since frequently encountered a wide variety of secondary products in the mulberry leaves, such as plant allelochemicals, the silkworm has evolved special enzymes to adapt to the digestion of secondary products in mulberry leaves [26, 43]. For example, cytochrome P450 enzymes are involved in such biological processes in the silkworm. In this study, we found that some members of cytochrome P450 gene family are located in the silkworm SDs. Besides, some genes which were involved in silk production were also found in SDs, such as proteasome. In this sense, SDs may play important roles in the evolution of species specific functions.

There are some practical and biological implications for the identification of genes in SDs. Previous studies showed that SDs are candidates for the evolution of organism-specific genes [44, 45]. Some gene families under selection in vertebrates were identified, such as cytochrome P-450, olfactory receptor [46, 47]. However, the functions of many genes in the silkworm SDs are still unclear on the basis of BLASTP searching against nr databases. We used these unannotated genes located in SDs as references to search against the protein sets of related insects, especially Lepidopteran species. We found that most of these unannotated genes had orthologs in other insects, especially in Lepidoptera (Additional file 9). For example, BGIBMGA003910-PA, which is poorly annotated in the silkworm database, has orthologus in other insects (such as monarch butterfly, Danaus plexippus, Heliconius melpomene, Dendroctonus ponderosae, Nasonia vitripennis), but the identity was much higher in Lepidoptera (Additional file 10). The silkworm is an important economic insect and it is also the model organism for molecular genetic and genomic studies of order Lepidoptera [48]. Our study presented invaluable information for the SDs in the silkworm, which facilitates understanding the evolution of the silkworm genome as well as the biology of the silkworm.

Conclusion

We for the first time analyzed the SDs in the silkworm genome and found that SDs constitute ~1.4% of the silkworm genome sequence (≥1 kb in length and ≥90% in the identity of sequence). This number is similar to that in D. melanogaster but smaller than mammalian organisms. Almost half (42%) of the SD sequences are not assigned to chromosomes, suggesting that the SDs are challenges for the assembling of genome sequences. Large duplications were also validated by qPCR experiments. The genes related to immunity, detoxification, reproduction, and environmental signal recognition are significantly enriched in the silkworm SDs, implying that SDs may have important biological significances in the above physiological processes. Our results provide insight into the evolution of the silkworm genome and an invaluable resource for insect genomics research.

Methods

Genome resources

We downloaded the silkworm genomic sequence (9×) from the silkworm genome database (SilkDB, http://silkworm.genomics.org.cn/) and the whole genome shotgun sequence (WGS) reads from [49]. The source of the BAC library DNA was NCBI http://www.ncbi.nlm.nih.gov/. This BAC library contained 46 clones which are distributed in 22 chromosomes, representing 1.8% of the silkworm genome.

Whole-genome alignment comparison (WGAC)

We performed a combination of sequence analysis software and a list of Perl scripts to optimize the detection of large segmental duplications (length ≥ 1 kb and identity ≥ 90%) [6].

The large contigs in the silkworm genome were broken into tractable 400 kb segments. Using RepeatMasker (Smit and Green http://www.repeatmasker.org/, version open-3.3.0), we identified the high-copy repeats. The silkworm genome is rich in TEs (~35%) [39]. We used our own TE dataset as repeat database (http://gene.cqu.edu.cn/BmTEdb/) in the running of RepeatMasker. These reference contigs were masked at 10% divergence level from TEs. After that, all these high-copy repeats were deleted out of the sequences. The resulting unique genome DNAs then underwent global BLASTN searches with reduced affine gap extension parameters, which allowed large gaps up to 1000 bp to be traversed. Alignments between these 400 kb segments were generated using the parameters (−G 180 –E 1 –q 80 –r 30 –z 3 × 10-9 –Y 3 × 10-9 –e 1e-20 –F F). We discarded self-alignments, and wrote a list of perl scripts to reinsert the high-copy repeats back to these alignments. BLASTN results were parsed for alignments with length ≥1 kb and identity ≥88%. These initial seed alignments were subsequently reintroduced to create local alignments and then trimmed to define their end points. We then performed an optimal global alignment to generate accurate alignment statistics. Only alignments with length ≥ 1 kb and identity ≥ 90% were considered in our analysis.

Whole-genome shotgun sequence detection of duplications (WSSD)

We used the WSSD strategy previously developed during the analysis of human genome [4] to assess duplication content in the silkworm. For a given genomic sequence, this method assesses depth-of-coverage and compares it with the average coverage depth. In regions of duplications, depth-of-coverage shows a statistically significant increase due to recruitment of paralogous reads. WSSD prefers to identify large SDs (≥10 kb in length, ≥ 94% sequence identity). We used two classes of sequences: (1) all finished BACs sequences deposited in GenBank; (2) whole silkworm genome sequence.

Firstly, short genome reads (<50 bp) and vector sequences were filtered out. After filtration, there were ~1.83 G clean reads left (size ranging from 52–964 bp long, ~4.5 converge of the genome) (Additional file 11). Each reference silkworm genome sequence masked for repeat sequences was compared by Megablast against the entire set of the silkworm whole-genome shotgun sequence reads (WGS, 3,810,411 reads). Our analysis was on the basis of a comparison of 3,810,411 WGS reads against the 432 Mb silkworm genome sequences. About 86.4% of (3,290,836) reads were remapped to the assembly. We used the following parameters (−D 3 –J F –P 93 –U T –F m –s 220), which allows for greed-algorithm extension into adjacent repetitive regions [6]. We wrote a perl script to detect every segment. Alignments were considered if they represented 90% of the reads with a rescored similarity of > 94%.

We used sliding window method in WSSD pipeline to calculate the reads depth (RD) value. Reads were firstly counted in overlapping (1 kb), sliding 5 kb windows. Initial calls were selected if six of seven or more sequential 5 kb overlapping windows had RD values that differ significantly from the average. Since the reads length varied significantly, the STDEV (~ 380) of the reads length was high in the silkworm. Furthermore, no segmental reference was previously reported in the silkworm, and it is impossible to identify the accurate RD value in SDs in the silkworm. And there is no information about a set of unique regions validated by FISH or other experiments. Thus, we removed the SDs regions identified by WGAC and 10 kb flanking regions of SDs. We defined significant alignment depth that greater than 3 standard deviations from the mean (Additional file 11). Only SD calls greater than 10 kb in length were kept in the final dataset. Because the silkworm strain Dazao (the sequenced strain of silkworm) is an experimental line and highly inbred, there is a reduced allelic variation in Dazao. We used a more sensitive metric for the detection of SDs [6]. This method increased sensitivity for detecting large single-duplications events (including recent, but low-frequency tandem duplications). In this way, we kept candidate segmental regions in which the divergent read (defined as those with identity higher than 99.8% aligned to the reference sequence with ratio higher than 0.5.

Gene content

Gene content of the silkworm segmental duplications was assessed using the glean consensus gene set (http://silkworm.genomics.org.cn/) [17]. We obtained a total of 14,623 silkworm peptides from SilkDB. In addition, using Gene Ontology (GO) [50], we tested the whether the molecular function, biological process, and pathway terms were under- or overrepresented in SDs [24]. Pfam was also used to annotate the function of the genes in the SDs [51].

We also investigated the distribution of genes and segmental duplications on genomic sequences. It should be noted that a portion of genes in the silkworm have been not well-annotated or have been annotated with the designation “Unknown function”, which may result in the underestimation of the influence of genes in SDs.

Quantitative real-time PCR (qRT-PCR) validation

Primer Premier 5.0 was used to design primers for qRT-PCR experiments (Additional file 12). Each PCR reaction was prepared as follows: 10 μl of SYBR-Green PCR master mix, 1 μl of each primer (10 μM), 7 μl of water, and 1 μl of genome template (whole genome DNA). Quantitative real-time PCR was carried out using the ABI Stepone plus system. The thermocycler program had an initial 95 denaturation step followed by 40 cycles consisting of a 10-s denaturation at 95, a 40-s annealing at 60, and a 30-s extension step at 72. At the end of each reaction, a disassociation curve was created, which was used to help to detect the presence of primer dimers of other unwanted amplification products that may produce a detectable cycle threshold (Ct) value.

We chose three regions (control_1, control_2, control_3) as controls for all qRT-PCR experiments, which represented single copy, 4 copies and TEs. Copy number was analyzed according to comparative Ct method. The Δ CT and ΔΔ CT were calculated by the formulas Δ CT = CT target – CT control (single copy) and ΔΔ CT = Δ CT SD samples - Δ CT single copy sample, respectively. To detect the accuracy of this method, we used the pipeline [52] to calculate the copy number of control_2, which was identified to be 4 copies in silico. The result showed that this gene was ~3.95 copies based on our method. Thus, it is reasonable to apply this pipeline to assess the segmental duplications.