Development and analysis of a germline BAC resource for the sea lamprey, a vertebrate that undergoes substantial chromatin diminution
- Cite this article as:
- Smith, J.J., Stuart, A.B., Sauka-Spengler, T. et al. Chromosoma (2010) 119: 381. doi:10.1007/s00412-010-0263-z
Over the last several years, the sea lamprey (Petromyzon marinus) has grown substantially as a model for understanding the evolutionary fundaments and capacity of vertebrate developmental and genome biology. Recent work has produced a preliminary assembly of the lamprey genome and led to the realization that nearly all somatic cell lineages undergo extensive programmed rearrangements. Here we describe the development of a bacterial artificial chromosome (BAC) resource for lamprey germline DNA and use sequence information from this resource to probe the subchromosomal structure of the lamprey genome. The arrayed germline BAC library represents ∼10× coverage of the lamprey genome. Analyses of BAC-end sequences reveal that the lamprey genome possesses a high content of repetitive sequences (relative to human), which show strong clustering at the subchromosomal level. This pattern is not unexpected given that the sea lamprey genome is dispersed across a large number of chromosomes (n ∼ 99), and it suggests a low-copy DNA targeting strategy for efficiently generating informative paired-BAC-end linkages from highly repetitive genomes. This library therefore represents a new and biologically informed resource for understanding the structure of the lamprey genome and the biology of programmed genome rearrangement.
Lampreys are a vestige of an ancient vertebrate group that branched from the majority of extant vertebrate lineages prior to the advent of jaws and paired appendages, approximately 500 million years ago (Janvier 2006). The lamprey therefore occupies a position in the vertebrate tree of life from which it can provide unique insight into the cellular and developmental processes that define the fundaments of vertebrate biology. For example, recent studies on lamprey have revealed fundamental features of the vertebrate immune system (Amemiya et al. 2007), neural crest regulatory network (Sauka-Spengler et al. 2007), and the diversification of mesodermal derivatives (Kusakabe and Kuratani 2007). The deep evolutionary history of lamprey also makes it an attractive system for understanding how basal cellular and developmental processes can be modified at the molecular level. This is because the extensive conservation of the basic vertebrate cellular and developmental mechanisms is seasoned by the evolution of novel genes and genetic pathways, which have been selected to regulate these processes over the last 500 million years of lamprey evolution. The lamprey genome thus represents a vast source of information regarding basal aspects of vertebrate cellular and developmental processes and novel genetic strategies for manipulating these deeply conserved processes. Consequently, the National Institutes of Health invested in the sequencing of the sea lamprey genome. Whole genome shotgun (WGS) sequencing was performed on liver DNA to approximately 7× genome coverage (Washington University Genome Sequencing Center 2007). Several attempts have been made to assemble this WGS dataset into a contiguous genome assembly; however, the current version remains highly fragmented (Washington University Genome Sequencing Center 2007; Rogozin et al. 2007; Libants et al. 2009). Elucidation of the broad-scale structure of the lamprey genome will presumably require the development of additional computational and genomic resources.
The biology of the lamprey genome differs significantly from that of other known vertebrate genomes. All vertebrate species undergo a small number of programmed local rearrangements during development (e.g., remodeling of immune receptors) (Dudley et al. 2005; Kapitonov and Jurka 2005; Kim et al. 2007; Rogozin et al. 2007), though a limited number of species are known to undergo much more extensive reorganizations (Kubota et al. 1997; Goto et al. 1998; Kubota et al. 2001; Smith et al. 2009). These changes mimic the dysregulated changes in genome architecture that give rise to cancers or other genomic disorders (Ye et al. 2007; Mitelman et al. 2007) but are presumably highly regulated and reproducible from generation to generation (Smith et al. 2009). We have recently reported the existence of widespread programmed genome rearrangements (PGRs) in the sea lamprey (Petromyzon marinus) (Smith et al. 2009). These rearrangements are tightly regulated, occur early in development, and result in the loss of transcribed genes. This discovery is significant with respect to existing lamprey genome resources because the large WGS dataset was derived from a somatic tissue (liver), which is missing approximately 20% of the DNA that is present in the germline progenitor lineages. This new understanding of the dynamic nature of the lamprey genome and the fragmentary status of the existing assembly argue strongly for the development of genomic resources that are targeted at the germline and effectively span assembly gaps (including gene-encoding regions that are discarded due to PGRs).
The bacterial artificial chromosome (BAC) system can stably accommodate exogenous inserts that are very large (100–300 kilobases, kb), allowing entire eukaryotic genes (including flanking regulatory regions) to be encompassed in a single clone. The BAC system is based on plasmid vectors that are essentially composed of an F-factor origin of replication and an antibiotic resistance gene (Shizuya et al. 1992; Osoegawa et al. 1998; Amemiya et al. 1999). The F-factor replicon allows propagation of the bacterial plasmid as a single-copy entity in Escherichia coli, thus permitting stable propagation of cloned inserts greater than 100 kb. The ability to accommodate such large inserts is advantageous for many applications in genome biology, including positional cloning, targeted genomic sequencing, and the generation of transgenic animals. The entire procedure is conceptually simple, although the actual generation and arraying of a library is technically challenging, highly empirical, and labor intensive (Osoegawa et al. 1998; Miyake and Amemiya 2004).
In this paper, we report the construction of a germline-specific BAC library from the lamprey. This represents the first lamprey genomic resource that is specifically targeted to the definitive germline genome and therefore provides representation of the ∼20% of single- and multi-copy genomic sequence that is not represented in any other existing genomic resource for this species. Indeed, sequences from this library have proven valuable for identifying germline-specific DNAs for the sea lamprey and demonstrating that the species undergoes PGR on a global scale (Smith et al. 2009). The library contains 168,960 clones with an average insert size of ∼140 kb, corresponding to ∼10× coverage of the 2.31 Gb sperm genome (Smith et al. 2009). Analysis of 3,072 clone-end reads from this library reveals that (1) relative to the human genome, the lamprey genome contains a large amount of long-repetitive DNA, (2) low-copy regions (e.g., containing single-copy genes) are strongly clustered and distributed non-randomly relative to high-copy regions, and (3) many repetitive sequences are unique to lamprey or are vestiges of repeats that were present in the early chordate lineage but lost in “higher” vertebrates. These observations are consistent with expectations given the lamprey’s evolutionary history and complex karyotype (n ∼ 99) and indicate that this BAC resource can provide critical long-distance linkages that will be necessary to improve the existing and highly fragmented lamprey genome assembly (Washington University Genome Sequencing Center 2007). Moreover, the resource provides access to long-insert clones that contain germline-specific sequences, thereby filling a “biological” gap in the existing WGS dataset.
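The ∼10× coverage figure follows directly from the clone count, mean insert size, and genome size quoted above. The short Python sketch below reproduces that arithmetic; it is an illustration rather than the authors' own calculation. The function names are ours, and the estimate of the chance that a locus is absent from the library assumes clones are sampled uniformly and independently across the genome (a Poisson approximation), which the text does not claim.

```python
import math


def library_coverage(n_clones: int, mean_insert_bp: float, genome_bp: float) -> float:
    """Fold coverage: total cloned bases divided by haploid genome size."""
    return n_clones * mean_insert_bp / genome_bp


def prob_locus_missing(coverage: float) -> float:
    """Chance that a given locus is in no clone.

    ASSUMPTION: clones sample the genome uniformly at random (Poisson model);
    real libraries show cloning bias, so treat this as a rough lower bound.
    """
    return math.exp(-coverage)


# Values taken from the text: 168,960 clones, ~140 kb inserts, 2.31 Gb genome.
coverage = library_coverage(168_960, 140_000, 2.31e9)
p_missing = prob_locus_missing(coverage)

print(f"coverage: {coverage:.2f}x")
print(f"P(locus missing) under the random-sampling assumption: {p_missing:.2e}")
```

Under these inputs the computed coverage is about 10.2×, in line with the ∼10× stated in the text.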
Materials and methods
BAC library construction
Preparation of high molecular weight (HMW) DNA
A BAC library was constructed from agarose-embedded sperm nuclei that were isolated from a single individual. Sperm was isolated from the testes of a single male adult lamprey. The specimen was first anesthetized in MS222 [1 g/l in 0.5× Marc's modified Ringer's solution] (Nikitina et al. 2009), and the testes were removed and immediately minced in 1× lamprey PBS (7.0 g/l NaCl, 0.2 g/l KCl, 0.29 g/l MgSO4·7H2O, 0.21 g/l MgCl2·6H2O, 0.46 g/l KH2PO4, 3.82 g/l Na2HPO4·7H2O, and 0.13 g/l CaCl2·2H2O). Sperm cells were dispersed from minced testes by extensively triturating in fresh PBS. The sperm cell suspension was filtered through 20-µm mesh to remove connective tissue, and spermatozoa were pelleted by centrifugation at 1,000×g for 15 min at 4°C. The sperm pellet was resuspended in PBS to a concentration of 20 million cells per milliliter, equilibrated to 45°C for 5 min, and embedded in agarose. Preparation of DNA-embedded agarose plugs for library construction was performed using previously described methods (Amemiya et al. 1996).
Partial digestion of HMW DNA
Prior to partial digestion, plugs were equilibrated to 0.5× TE for 48 h at 4°C, then to 0.5× TBE overnight at 4°C. A pulsed field gel electrophoresis (PFGE) prerun was performed in order to remove unwanted, smaller-sized DNA molecules prior to restriction digestion (Osoegawa et al. 1998). Plugs were recovered from the wells of the gel and then equilibrated to 0.5× TE overnight at 4°C. Pilot partial digestions using varying amounts of HindIII were carried out in order to optimize the digestion conditions prior to scale-up (Amemiya et al. 1996). DNA fragments were separated by PFGE on a 1% agarose gel (Pulse-Field Certified, Bio-Rad) using a CHEF Mapper XA (Bio-Rad) in 0.5× TBE buffer using previously described methods (Osoegawa et al. 1998). Gel slices were taken from the preparative lane that contained the HMW DNA fragments. A total of eight fractions ranging from 50 to 300 kb were excised from the gel. A sliver from each fraction was used in a step ladder gel to determine the size range of the DNA fragments. Fractions 3 (∼110–140 kb) and 4 (∼125–180 kb) were chosen for further processing and were equilibrated in 0.5× TBE. Electroelution, ligation into the pCC1BAC vector (Epicentre), and transformation were performed as previously described (Strong et al. 1997; Lang et al. 2006).
Insert size screening
An initial screening of clones was performed using Epilyse (Epicentre) on 52 random white colonies (26 per fraction) to determine the frequency of inserts and obtain a rough estimate of insert size. For further analysis, DNA from 24 clones was isolated using a standard alkaline lysis miniprep procedure. Each clone was digested using NotI, and sizing was accomplished using PFGE (15 h run time, 1 s initial switch time, 20 s final switch time, 14°C, field angle 120°, and 6 V/cm) with the low-range PFG marker.
Transformants were plated on Luria Bertani media (LB)/1.5% agar plates that were supplemented with 12.5 μg/ml chloramphenicol, 0.1 mM IPTG, and 120 μg/ml X-Gal. These were incubated overnight at 37°C and picked into 384-well microtiter plates (Genetix) containing LB supplemented with 12.5 μg/ml chloramphenicol and 5% v/v glycerol, using a colony-picking robot (Norgren Systems). A total of 440 plates were picked for this library. A Total Array System (BioRobotics) machine was used to spot high-density nylon filter sets (22 cm × 22 cm) containing BAC DNA.
Sequencing and analysis
Four representative 384-well plates of BAC clones were sequenced by the Washington University Genome Sequencing Center. Base calls were generated, and sequences were quality-trimmed to Q20 using phred (Ewing and Green 1998; Ewing et al. 1998) and were vector-trimmed using phrap (Green 1994). Lamprey WGS reads were downloaded from the NCBI TraceArchives database (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?), trimmed in the same manner, and formatted into a blast database consisting of 18,506,949 reads totaling 9,799,055,754 nucleotides. All BAC-end reads were aligned to all lamprey WGS reads using megablast (Zhang et al. 2000). Alignments were post-processed to select long/high-identity alignments (≥400 bp in length, ≥95% nucleotide identity) that approximate a relevant range of lengths and sequence identities for whole genome assembly. Depth of WGS coverage was calculated for all end reads, and these coverage estimates were used to define sequences as low or high copy. Here, depth of coverage is defined as the average number of aligning reads per nucleotide unit length along the entire length of the query (BAC-end) sequence.
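The depth-of-coverage definition above can be made concrete with a small Python sketch. It is a minimal illustration, not the authors' pipeline. The `Alignment` record, the function names, and the example coordinates are all hypothetical, and the low/high copy cutoff is left as a required argument because the text does not state the threshold that was used. The length and identity filters (≥400 bp, ≥95%) are taken from the text.

```python
from typing import Iterable, NamedTuple


class Alignment(NamedTuple):
    """One WGS read aligned to a BAC-end read (hypothetical record layout)."""
    q_start: int          # 1-based start on the BAC-end (query) read
    q_end: int            # 1-based inclusive end on the query read
    pct_identity: float   # percent nucleotide identity of the alignment


MIN_LENGTH = 400      # bp, from the text
MIN_IDENTITY = 95.0   # percent, from the text


def passes_filter(aln: Alignment) -> bool:
    """Keep only long/high-identity alignments."""
    length = aln.q_end - aln.q_start + 1
    return length >= MIN_LENGTH and aln.pct_identity >= MIN_IDENTITY


def depth_of_coverage(alignments: Iterable[Alignment], query_length: int) -> float:
    """Average number of aligning reads per query nucleotide.

    Summing aligned bases over all retained alignments and dividing by the
    query length is equivalent to averaging per-position read counts.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    aligned_bases = sum(a.q_end - a.q_start + 1 for a in alignments if passes_filter(a))
    return aligned_bases / query_length


def copy_class(depth: float, cutoff: float) -> str:
    """Label a read low or high copy. The cutoff is NOT given in the text."""
    return "high" if depth > cutoff else "low"


# Hypothetical 800-bp BAC-end read with four candidate alignments.
hits = [
    Alignment(1, 800, 99.0),    # kept: 800 bp
    Alignment(101, 600, 96.5),  # kept: 500 bp
    Alignment(1, 300, 99.0),    # dropped: shorter than 400 bp
    Alignment(1, 700, 90.0),    # dropped: identity below 95%
]
depth = depth_of_coverage(hits, query_length=800)
print(depth, copy_class(depth, cutoff=10.0))  # cutoff of 10 is illustrative only
```

In practice the alignment records would be parsed from megablast output; the parsing step is omitted here because the output format used by the authors is not described.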
A second set of representative BAC and WGS datasets from a single diploid vertebrate was also selected for comparison with the lamprey genome. Paired-end reads from BAC and WGS libraries representing a single human genome were downloaded from the TraceArchives database and trimmed to remove vector and low-quality sequence (BAC dataset: n = 229,578 reads; WGS dataset: n = 12,110,821 reads). Depth of coverage for human BAC ends was calculated using the same methods that were used for lamprey.
Lamprey repetitive reads that were identified in our ab initio screen were further characterized on the basis of sequence similarity to other sequences. Reads that corresponded to the Germ1 element were identified using blastn (Altschul et al. 1990), and the extended sequence of Germ1 was generated by assembling these sequences with the known fragment of Germ1, using ContigExpress (Vector NTI v11, Invitrogen). Other repeats were characterized by searching for similarity to a database of known repetitive elements (RepBase Update 20080801) (Jurka et al. 2005) using RepeatMasker (version open-3.2.5) (Smit et al. 2004).
Chromosomes were prepared from lamprey testes and gill by first disaggregating the tissues in hypotonic KCl (75 mM) via gentle grinding in a Dounce homogenizer with a loose pestle. Single cells were allowed to swell in suspension for 1 h, prefixed by adding an equal volume of 3:1 methanol/glacial acetic acid (Farmer’s solution), then fixed through three changes of Farmer’s solution. Suspensions of fixed cells were dropped onto microscope slides and permitted to air dry at room temperature. Chromosome spreads were counterstained with DAPI (4′,6-diamidino-2-phenylindole).
Results and discussion
Insert length and genome coverage
It is notable that lamprey appears to possess a much higher fraction of long/high-identity repetitive DNA (≥400 bp in length, ≥95% nucleotide identity) than does the human genome. The proportions of repetitive reads in the lamprey and human genomes are 0.581 and 0.045, respectively. Taken at face value, this extremely high repeat content might be interpreted as evidence that it will be extremely difficult to generate contigs from Sanger WGS sequencing data using existing automated assembly algorithms. However, it is also important to consider how these repetitive sequences are distributed throughout the genome. Essentially every vertebrate chromosome contains obligatory large stretches of highly repetitive DNA at the centromeres and near the telomeres, and the lamprey genome is no exception (Boan et al. 1996). Moreover, the lamprey possesses a karyotype (reported n ∼ 82–84) (Potter and Rothwell 1970) that is more complex than that of most vertebrates, including human (n = 23).
Chromosome counts (1N) for eight metaphase spreads from gill (mitotic) and eight metaphase spreads from testes (meiotic metaphase 1)
In light of lamprey’s karyotypic complexity, it seems well within reason that the lamprey should carry an additional burden of repetitive DNA. This is simply because the lamprey genome contains several times as many centromeres and telomeres as are present in the typical vertebrate genome. Moreover, these chromosomes are parsed from a genome that is only two-thirds the size of the human genome. Importantly though, this architecturally obligatory repetitive DNA is expected to cluster distinctly from the majority of (assembly-relevant) low-copy DNA and should therefore prove much less disruptive to assembly of genic regions than if it were randomly distributed throughout the genome.
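The arithmetic behind this argument can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an analysis from the paper: it assumes every chromosome is linear with one centromere and two telomeres, uses the haploid counts quoted in the text (lamprey n ∼ 99, human n = 23), and takes the "two-thirds" relative genome size at face value. The function and variable names are ours.

```python
LAMPREY_N = 99                # haploid chromosome number quoted in the text (n ~ 99)
HUMAN_N = 23                  # haploid chromosome number for human, from the text
RELATIVE_GENOME_SIZE = 2 / 3  # lamprey genome size relative to human, from the text


def structural_elements(haploid_n: int) -> dict:
    """Centromere and telomere counts per haploid genome.

    ASSUMPTION: each chromosome is linear and monocentric,
    giving one centromere and two telomeres per chromosome.
    """
    return {"centromeres": haploid_n, "telomeres": 2 * haploid_n}


lamprey = structural_elements(LAMPREY_N)
human = structural_elements(HUMAN_N)

# Fold difference in the number of centromeres (and, equally, telomeres).
fold_count = LAMPREY_N / HUMAN_N

# Fold difference in elements per unit of genome, given the smaller lamprey genome.
fold_density = fold_count / RELATIVE_GENOME_SIZE

print(lamprey, human)
print(f"fold more centromeres/telomeres: {fold_count:.1f}")
print(f"fold higher density per unit genome: {fold_density:.1f}")
```

Using the earlier reported karyotype (n ∼ 82–84) in place of 99 lowers both ratios somewhat but leaves the qualitative conclusion unchanged.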
Analysis of paired-end depths for human and lamprey BACs
Content of the repetitive fraction
As the repetitive sequences represent a major component of the lamprey genome, we sought to further classify these sequences. It is known that the sequence element Germ1 is enriched in germline, relative to soma, and represents a substantial fraction of the germline genome (Smith et al. 2009). As our BAC resource provides the most extensive sequence survey of lamprey germline to date, we reasoned that it might be possible to extend the known sequence of Germ1. By aligning the known 10,120-bp sequence of Germ1 to our BAC-end sequences, we were able to assign 357 BAC-end sequences to this repeat. Assembling these with the known sequence allowed us to extend Germ1 by an additional 1,194 bp at the 5′ end and 576 bp at the 3′ end, for a final length of 11,900 bp.
Classification of repetitive reads that were identified among lamprey germline BAC-end sequences
Here we describe a BAC resource for the lamprey germline genome. This resource provides access to long-insert clones spanning the majority of the lamprey genome, and represents the only existing clone/sequence resource that provides representation of the ∼20% of the lamprey genome that is lost during early development, including transcribed genes (Smith et al. 2009). Analysis of end reads from this library reveals that the lamprey genome possesses an exceptionally large fraction of repetitive DNA and that this repetitive DNA is strongly clustered at the subchromosomal level. This library represents a new and biologically informed resource for dissecting the structure of the lamprey genome and the biology of programmed genome rearrangement.
This work was supported by the National Institutes of Health [grant number GM079492] and the National Science Foundation [grant number MCB-0719558] to CTA, and by the National Institutes of Health [grant numbers T32-HG00035, F32-GM087919] to JJS. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of NIH.