Sequencing the Full-Length of the Phosphatase and Tensin Homolog (PTEN) Gene in Hepatocellular Carcinoma (HCC) Using the 454 GS20 and Illumina GA DNA Sequencing Platforms
- Cite this article as:
- Rodriguez, J.A., Guiteau, J.J., Nazareth, L. et al. World J Surg (2009) 33: 647. doi:10.1007/s00268-008-9852-x
Phosphatase and tensin homolog (PTEN) is a tumor-suppressor gene that is mutated in cancer of the liver, pancreas, endometrium, and prostate. PTEN-dependent pathways are involved in mediating cell growth and invasion. To sequence the whole gene (including introns and exons), we have taken advantage of new technologies that allow for rapid, inexpensive sequencing to great depth.
DNA from 15 HCC specimens was pooled, and long-range PCR was performed by using the GeneAmp XL PCR kit. Primer parameters included: length of 20–30 base pairs (bp), melting temperature of 64–70°C, and G/C content of 50–60%. PCR products were then column-purified and pooled, and DNA libraries were prepared for “shotgun sequencing” on both the 454 GS and Illumina GA sequencing platforms.
We successfully amplified approximately 98.4% of the PTEN gene by using one long-range PCR protocol applied to 24 primer sets, resulting in 20 amplicons ~6.5 kilobases (kb) in length, 2 amplicons ~10 kb in length, and 2 amplicons ~2.5 kb in length. Sequencing of fragmented PCR products on both sequencing platforms identified six high-frequency SNPs that were catalogued in dbSNP as known variants.
Shotgun sequencing based on a single long-range PCR protocol in pooled samples is an efficient and relatively inexpensive way to sequence an entire gene.
Until now, most large-scale genomic sequencing projects have focused on exonic regions because of the implications of amino acid change on protein function [1, 2]. However, the significance of intron mutations is becoming increasingly recognized because of their potential effects on gene splicing and transcription factor binding. Additionally, genetic alterations of introns are important to catalogue because they may be linked to alterations in other regions.
The last few years have seen a revolution in sequencing technology, including the development of the 454 GS and Illumina GA sequencing platforms. The 454 technology utilizes emulsion-based sample preparation to sequence DNA. In essence, randomly fragmented segments of the sequence in question are attached to specialized adaptors, which are then captured individually on beads. The beads are then emulsified in oil and the fragments are amplified. These steps are performed in open wells of a fiber-optic slide. Each well is 55 μm in depth with a center-to-center distance of 50 μm, allowing for 480 wells per mm². A single emulsified bead containing one amplified fragment of DNA is present in each well. Each slide contains approximately 1.6 million wells and, after the slide is filled, it is placed in a flow chamber in which sequencing reagents are run over the wells. These reagents simultaneously extend the DNA in each well and, with each reaction, emit photons from the bottom of the wells. These photons are imaged by the 454 instrument and translated into DNA sequence. In this manner, large amounts of sequence information can be collected in a single run.
The Illumina technology incorporates many of the same principles as the 454; however, it differs in its use of a bridge amplification method to sequence DNA. In this method, DNA is randomly fragmented and ligation adaptors are attached to both ends. The fragments are then attached to a planar, optically transparent surface, which also has fixed primers attached. Unlabeled nucleotides and enzymes are added, initiating bridging: the free end of each fragment, which is attached to the surface at its other end, anneals to a nearby fixed primer. The fragments become double-stranded after this series of reactions. After denaturing the fragments back to single-stranded DNA, the process is repeated, amplifying the sequences to several million. The first sequencing cycle is initiated by adding the four labeled reversible terminators (DNA bases), primers, and DNA polymerase. Laser excitation is then used to cause the reversible terminators to emit a detectable fluorescence, which is read as the first base in each cluster. This cycle is continued one base at a time until the sequence of all the fragments is known. By incorporating parallel sequencing, the Illumina technology, much like the 454, is capable of sequencing enormous quantities of DNA quickly.
Hepatocellular carcinoma, like most human cancers, represents a complex and gradual interaction between environmental and genetic factors culminating in clinical disease. Through increased susceptibility to an environmental insult or direct stimulation of a malignant phenotype, genes are thought to play a crucial role in numerous oncogenic pathways. Furthermore, this genetic variation may be congenital or acquired in nature. Similar to the exponential growth in sequencing technology, interest in the genomics of cancer has grown rapidly in recent years, demanding a cost-effective and efficient mechanism for gene sequencing.
Phosphatase and tensin homolog (PTEN) is a 105 kb tumor-suppressor gene found on chromosome 10q23. PTEN-associated pathways have been implicated in many cancers, including hepatocellular carcinoma (HCC), pancreatic cancer, glioblastoma, melanoma, prostate cancer, and endometrial cancer. Down-regulation has been consistently observed in these tumors, but the mechanisms underlying the decreased expression of PTEN are unclear. In HCC, deletions, inactivating mutations, and promoter methylation have been detected [3–8]. The degree of PTEN expression in HCC has been linked to tumor differentiation, invasion and metastasis, and patient survival. This differential expression in HCC versus noncancerous tissue may be the result of specific alterations within the gene itself.
To examine the feasibility of whole-gene sequencing in assessment of genetic variation in exons and introns, we amplified the entire PTEN gene using long-range PCR and then sequenced fragmented PCR products using the 454 sequencing technology and the Illumina GA platform.
Material and methods
Tumor samples were collected by the Baylor College of Medicine (BCM) Department of Pathology under strict protocol guidelines after therapeutic tumor resection. Samples were deidentified, coded, and stored in a freezer at −80°C. Representative slides were made from each sample and reexamined; 15 specimens with histologically confirmed HCC were selected for inclusion. DNA was isolated from the tumor samples by first digesting the tissue in a lysis solution. Then phenol solution was added and centrifuged. The top layer was removed and placed in a new centrifuge tube into which a phenol:chloroform solution was added and centrifuged again. After this step, the top layer was again removed and a chloroform solution was added before the contents of the tube were centrifuged. The top layer was removed and isopropanol added. This solution was rocked until DNA strands could be seen and then recentrifuged. All the solution was then removed and 70% ethanol added and centrifuged. This step was repeated with 100% ethanol. The ethanol was poured off and the tube was left to dry on the bench before dissolving the DNA in TE buffer. DNA concentration was measured by using the Quant-iT PicoGreen dsDNA assay kit per manufacturer instructions (Invitrogen, Carlsbad, CA). An equivalent amount of each DNA sample was combined to achieve a final pooled concentration of 20 ng/μl.
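The equal-mass pooling step can be sketched as a small calculation. This is only an illustration: the per-sample concentrations and the 100 ng mass per sample are hypothetical values, not figures from the study; only the 20 ng/μl target comes from the text.

```python
def pool_equal_mass(concs_ng_per_ul, mass_each_ng, target_ng_per_ul):
    """Return (per-sample volumes in ul, buffer volume in ul) for an equal-mass pool."""
    if any(c <= 0 for c in concs_ng_per_ul):
        raise ValueError("concentrations must be positive")
    # Volume of each sample that contributes the same DNA mass.
    volumes = [mass_each_ng / c for c in concs_ng_per_ul]
    total_mass = mass_each_ng * len(concs_ng_per_ul)
    # Final volume needed so the pool ends up at the target concentration.
    final_volume = total_mass / target_ng_per_ul
    buffer_volume = final_volume - sum(volumes)
    if buffer_volume < 0:
        raise ValueError("samples are too dilute to reach the target concentration")
    return volumes, buffer_volume


# Hypothetical PicoGreen readings (ng/ul) for three samples; target is 20 ng/ul.
volumes, buffer_ul = pool_equal_mass([50.0, 100.0, 200.0],
                                     mass_each_ng=100.0,
                                     target_ng_per_ul=20.0)
print(volumes, buffer_ul)
```

If any sample is more dilute than the target allows, the function raises rather than silently producing a pool below 20 ng/μl.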
To design an efficient and gene-specific primer set, known SNPs and regions of the gene that contain repeated sequences were masked. Primers were designed that were 20–30 base pairs in length, had a melting temperature (Tm) of 64–70°C, and had a G/C content of 50–60%. Once designed, primer specificity was tested with In-Silico PCR (http://genome.ucsc.edu/cgi-bin/hgPcr), a software package that simulates the PCR process given a genomic build and primer set. Consecutive primer sets were designed to overlap by at least 300 base pairs to ensure good coverage and continuity at the ends of amplicon sequences (the distinct region of DNA amplified by a particular primer set is called an amplicon).
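The design rules above can be expressed as a simple filter. This is a minimal sketch under stated assumptions: the paper does not say how Tm was calculated, so the basic GC-content formula used here is a stand-in and will disagree with nearest-neighbour calculators; the primer strings and amplicon coordinates are hypothetical, and coordinates are assumed to be 0-based and half-open.

```python
def gc_fraction(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def approx_tm_celsius(primer):
    """Rough Tm estimate. ASSUMPTION: basic GC-content rule for oligos longer
    than 13 nt; not necessarily the method the authors used."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(primer)


def passes_design_rules(primer):
    """Length 20-30 nt, G/C content 50-60%, Tm 64-70 C (limits from the text)."""
    return (20 <= len(primer) <= 30
            and 0.50 <= gc_fraction(primer) <= 0.60
            and 64.0 <= approx_tm_celsius(primer) <= 70.0)


def overlaps_ok(amplicons, min_overlap=300):
    """Check that consecutive (start, end) amplicons overlap by >= min_overlap bp."""
    ordered = sorted(amplicons)
    return all(prev_end - next_start >= min_overlap
               for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]))


# Hypothetical primers, for illustration only.
print(passes_design_rules("GCATGCATGCATGCATGCATGCATGCGCGA"))
print(passes_design_rules("ACGTACGTACGTACGTACGT"))
print(overlaps_ok([(0, 6500), (6200, 12700)]))
```

Specificity testing against the genome, which the authors did with In-Silico PCR, is outside the scope of this sketch.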
Specific PCR was performed on the pooled DNA with each of the 24 primer sets using the GeneAmp XL PCR kit per manufacturer instructions (Applied Biosystems, Foster City, CA). Thermocycling conditions were as follows: (1) 94°C for 5 min; (2) 18 cycles of 94°C for 15 s followed by 64°C for 10 min; (3) 19 cycles of 94°C for 15 s followed by 64°C for 10 min with an increment of 15 s per cycle; (4) 72°C for 10 min; and (5) 4°C for storage. A single PCR protocol was applied to all 24 primer sets.
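As a rough check on how long this program runs, the hold times can be summed. This sketch makes two assumptions that are not stated in the text: the 15 s increment is applied starting with the first cycle of step 3, and ramp times between temperatures are ignored.

```python
def program_seconds():
    """Total programmed hold time, in seconds, for the thermocycling steps."""
    initial_denature = 5 * 60           # step 1: 94 C for 5 min
    denature = 15                       # 94 C for 15 s
    extend = 10 * 60                    # 64 C for 10 min
    stage2 = 18 * (denature + extend)   # step 2: 18 fixed-length cycles
    # Step 3: 19 cycles, extension lengthened by 15 s per cycle.
    # ASSUMPTION: cycle i (1-based) has an extension of 10 min + 15*i s.
    stage3 = sum(denature + extend + 15 * i for i in range(1, 20))
    final_extend = 10 * 60              # step 4: 72 C for 10 min
    return initial_denature + stage2 + stage3 + final_extend


total = program_seconds()
print(total, "s =", round(total / 3600, 2), "h")
```

Under these assumptions the program holds for a little over seven hours, excluding the 4°C storage step.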
454 library creation and sequencing
PCR products were purified on QIAquick PCR columns (Qiagen, Germantown, MD) and pooled in equal amounts after determining the concentration. DNA libraries were prepared from the PCR products for sequencing on the 454 platform using a standard protocol from the vendor (454 Life Sciences, Branford, CT). In brief, the pooled PCR products were nebulized to an average size of 800 bp. DNA fragments were end polished by using T4 DNA polymerase and T4 polynucleotide kinase. The 454-specific “A” and “B” adaptors were ligated to each DNA fragment. The ligated reaction products were purified by immobilizing them onto magnetic streptavidin-coated beads via the biotin moiety in the B adaptor. Fragments with B adaptors at their 5′ and 3′ ends were bound to the bead at both ends. The beads were washed twice with A adaptors to remove adaptor dimers and fragments. The bound fragments with A adaptors at their free end were submitted to a fill-in reaction. The immobilized library (double-stranded) was then made single-stranded by treating with a melt solution containing NaOH. The supernatant containing the single-stranded DNA library was collected and neutralized. The library was cleaned using MinElute PCR purification columns (Qiagen). The library was quality-checked on an Agilent 2100 Bioanalyzer using an RNA Pico 6000 Lab Chip (Agilent Technologies, Santa Clara, CA). The concentration was determined using a RiboGreen assay (Invitrogen, Carlsbad, CA), and the library was diluted to 10¹⁰ molecules. The dilutions of the stock library were prepared for emulsion PCR (emPCR). Briefly, the pooled fragments were bound to Capture Beads at a resolution of one DNA molecule per bead. The DNA-bound beads were then resuspended separately with the amplification mix. 
The mixture was emulsified as a water-in-oil mixture by vigorous mechanical shaking to form aqueous phase “microreactors” of 50–100 μm in diameter containing a full amplification mix and no more than a single bead, insulated from other beads by the surrounding oil, allowing for the maintenance of the clonality during the amplification. The emulsified beads were then thermocycled (emPCR), and the amplification products were clonally captured by the oligonucleotides in excess on the beads. After amplification, the emulsion was broken chemically and the beads recovered, washed, and enriched. The second strands of the amplification products were melted away, leaving single-stranded DNA bound to the beads. The sequencing primers were annealed to the immobilized DNA templates, and the beads and the enzymes were loaded to a PicoTiterPlate (Roche Diagnostics, Mannheim, Germany) at a resolution of 1 bead/well. The PicoTiterPlate was inserted into the Genome Sequencer 20 (GS20) instrument (454 Life Sciences) and the nucleotides were sequentially flowed over the plate, one at a time, in a cyclical order (TACG) for a total of 42 dNTP flow cycles. The signal generated by each nucleotide addition was captured by a camera and processed by an integrated computer to determine the base sequence and quality score in each well.
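The way a cyclical nucleotide flow turns per-flow signals into a base sequence can be illustrated with a toy decoder. This is a simplified sketch, not the instrument's base-calling software: it assumes signals are already normalized so that a value near n means a homopolymer of n bases, and the example signal values are hypothetical. Only the TACG flow order and the 42 flow cycles come from the text.

```python
FLOW_ORDER = "TACG"   # cyclical nucleotide order stated in the text
FLOW_CYCLES = 42      # number of flow cycles stated in the text


def decode_flowgram(signals, flow_order=FLOW_ORDER):
    """Convert normalized per-flow signals into a base string.

    ASSUMPTION: a signal close to n corresponds to n consecutive
    incorporations of the nucleotide flowed at that step.
    """
    bases = []
    for i, signal in enumerate(signals):
        count = int(round(signal))
        if count > 0:
            bases.append(flow_order[i % len(flow_order)] * count)
    return "".join(bases)


# Hypothetical signals for two flow cycles (T, A, C, G, T, A, C, G).
print(decode_flowgram([1.02, 0.05, 0.98, 2.10, 0.03, 1.01, 0.0, 0.0]))
print(FLOW_CYCLES * len(FLOW_ORDER), "individual nucleotide flows per run")
```

Because a homopolymer is read from the magnitude of a single signal rather than from separate events, rounding errors at longer homopolymers are a natural source of the homopolymer errors filtered during SNP calling.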
Illumina GA library creation and sequencing
PCR products were purified on QIAquick PCR columns (Qiagen) and pooled in equal amounts after checking the concentration. Libraries were prepared from the pooled PCR products for sequencing on the Illumina GA platform using the manufacturer’s protocol (Illumina, San Diego, CA). The PCR products were nebulized to <800 bp, and the DNA fragments were end repaired using T4 DNA polymerase, Klenow polymerase, and T4 polynucleotide kinase. An “A” base was then added to the 3′ end of the blunt phosphorylated DNA fragments by Klenow. This allowed for subsequent ligation, with DNA ligase, of the Illumina GA-specific adaptors, which have a single “T” base overhang at their 3′ end. The ligated products were run on a 2% agarose gel and a band was cut at 125–175 bp. DNA was eluted from the gel using Qiagen’s Gel extraction kit. The DNA was then amplified by PCR according to the manufacturer’s recommendations. The library was quality-checked on a DNA chip using the Agilent 2100 Bioanalyzer and the concentration was determined using a PicoGreen assay. The library was then diluted to 10 nM.
Sequencing libraries at 2 pM were then used in cluster generation on the Illumina flow cell using the Illumina cluster station according to the manufacturer’s protocol. This single-molecule clonal amplification involved six steps: template hybridization, template amplification, linearization, blocking 3′ ends, denaturation, and primer hybridization. For each single molecule, a few thousand identical copies were formed into a “cluster” with densities of up to 10 million per square centimeter. The flow cell was then sequenced on the Illumina GA using Illumina SBS (Sequencing-by-Synthesis) technology. Thirty-six cycles of sequencing were performed according to the manufacturer’s specifications. Imaging analysis and base calling were performed with Illumina GAPipeline. On average, approximately 2.5 million successful reads, consisting of 36 bases of each fragment, were generated on each lane of a flow cell. Finally, the sequences were mapped using Eland, the alignment tool in GAPipeline, to the PTEN region (hg18_knownGene_NM_000314 chr10:89610925-89717632).
Read analysis and SNP-calling
Reads from the 454 platform were aligned to the reference sequences with the MosaikAligner (http://bioinformatics.bc.edu/marthlab/Mosaik). Candidate SNPs were selected if they exhibited a frequency that was clearly above the “background noise” of systematic error, or inaccurate base-calling. SNPs were rejected if the fraction of reads with the variant allele was less than 20% or the coverage at that position was less than 100 reads. Candidate SNPs resulting from mapping errors, homopolymer errors, and/or sequencing errors were also rejected. The specific positions of putative SNPs identified in the analysis of 454 reads were then analyzed for similar variation in the Illumina GA read data. Putative SNPs were then referenced in the Single Nucleotide Polymorphism database (dbSNP), an online database established by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). The database serves as a central archive of both single-base nucleotide substitutions and short deletion and insertion polymorphisms.
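The two numeric rejection criteria can be written as a short filter. This is a minimal sketch of the thresholds stated above only; it does not model the additional rejection of mapping, homopolymer, or sequencing errors, and the function and constant names are illustrative rather than taken from the authors' pipeline.

```python
MIN_VARIANT_FRACTION = 0.20   # reject if fewer than 20% of reads carry the variant
MIN_COVERAGE = 100            # reject if fewer than 100 reads cover the position


def keep_candidate(variant_reads, total_reads):
    """Return True if a candidate SNP passes both numeric thresholds."""
    if total_reads < MIN_COVERAGE:
        return False
    return variant_reads / total_reads >= MIN_VARIANT_FRACTION


# Hypothetical read counts at three positions.
for variant, total in [(30, 150), (19, 100), (50, 99)]:
    print(variant, total, keep_candidate(variant, total))
```

The coverage test runs first, so a position with too few reads is rejected regardless of how many of them carry the variant allele.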
[Table: 454 read distribution for putative SNP positions; column: No. of 454 reads]
[Table: Illumina GA read distribution for putative SNP positions; column: No. of Illumina GA reads]
We successfully amplified approximately 98.4% of the PTEN gene using long-range PCR. This was performed with only one set of thermocycling conditions; a single protocol was therefore applied efficiently to all 24 primer sets. Sequencing of fragmented PCR products on the 454 sequencing platform allowed us to identify six high-frequency SNPs that were confirmed with the Illumina GA instrument and that were catalogued in dbSNP. The detection of known PTEN variants validated our approach to whole-gene sequencing.
The 454 sequencing platform was used for primary SNP detection because of its ability to generate hundreds of thousands of sequencing reads in a single, 7.5-h run of the instrument. The multiple reads generated for each base pair provided greater coverage of each SNP position. With a larger sample number at each position, low-frequency SNPs can more easily be detected with statistical significance. This feature is particularly important when sequencing cancer types, such as HCC, in which the large amounts of necrosis in the specimens dilute the amount and quality of cancer-associated DNA and, therefore, make cancer-associated mutations more difficult to detect.
Although both the 454 GS20 and Illumina GA sequencing platforms provide deeper coverage at each position, they are limited by the fact that sequencing reads are short—typically fewer than 100 base pairs and 35 base pairs, respectively. This poses a significant challenge when attempting to sequence long segments of DNA. With such a limitation, one option would have been to divide the PTEN gene into more than 500 amplicons for PCR amplification. Optimization of PCR conditions for each of the hundreds of primer sets would have been time-consuming and expensive. Instead, we used a technique called “shotgun sequencing.” We amplified the gene in long segments using only 24 primer sets, and then we split the resultant long-PCR products into random smaller fragments, each containing approximately 200 base pairs. Once these smaller fragments were sequenced, reads were aligned to the reference sequence and SNPs were called. This approach relies on sophisticated, bioinformatics-based data analysis and SNP calling, in that thousands of randomly generated reads must be aligned, and systematic and statistical error must be filtered.
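The arithmetic behind this trade-off can be sketched briefly. The figures are taken from the text (a 105 kb gene, ~200 bp fragments, the hg18 region coordinates, and about 2.5 million 36-base reads per lane); the depth estimate is an idealized upper bound that assumes every read in a lane maps to the PTEN region and that reads are spread evenly, neither of which is claimed by the authors.

```python
import math

# Region mapped in the study: chr10:89610925-89717632 (hg18).
REGION_START, REGION_END = 89610925, 89717632
REGION_BP = REGION_END - REGION_START


def amplicons_needed(target_bp, amplicon_bp):
    """Number of non-overlapping amplicons needed to tile the target."""
    return math.ceil(target_bp / amplicon_bp)


def mean_depth(n_reads, read_len, target_bp):
    """Idealized mean coverage: total sequenced bases divided by target length."""
    return n_reads * read_len / target_bp


print(REGION_BP, "bp in the mapped region")
print(amplicons_needed(105_000, 200), "short amplicons for a 105 kb gene")
print(round(mean_depth(2_500_000, 36, REGION_BP)), "x idealized depth per lane")
```

Tiling 105 kb with 200 bp amplicons gives the "more than 500" primer sets mentioned above, compared with the 24 long-range primer sets actually used.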
The technology behind DNA sequencing is constantly changing and improving. Both technologies used in this study have seen vast improvements even within the past few months, much like the tremendous growth in the field in general. The 454 technology has introduced new buffers, newer and more stable reagents, and updated software. These changes alone have greatly increased efficiency, from a maximum fragment size of 100 base pairs and a maximum of 50 million bases per run with the original model GS20 to 400 base pairs per fragment and 400 million bases per run with the newest model, the XLR. Similar advances have been seen with the Illumina technology, whose maximum fragment size has increased from 36 to 75 base pairs. The Illumina technology can also produce up to 4 billion bases per run, depending on the mode of sequencing chosen. These changes have increased the information available from each run and helped lower costs. Increasing the size of the fragments decreases the overlap of sequences needed to determine the gene in question or an entire genome. Increasing the number of bases per run also allows for greater depth, which helps ensure the accuracy of the sequences. With time, the ability to sequence genes or whole genomes will continue to become more efficient and cost-effective, making these technologies ideal for the continued development of genomic medicine.
Additional characterization of each SNP identified in the study is warranted. First, each locus should be resequenced in individual tumor samples to obtain a genotype for the SNP (i.e., whether the variant is homo- or heterozygous). Second, resequencing of specific SNP loci in normal samples from the same patients would determine whether the SNP was germline or somatic. Finally, functional studies of each SNP are needed to determine the effects on protein structure, intracellular signaling pathways, and tumor proliferation, invasion, and metastasis.
The project was supported by grants from the NIH–National Human Genome Research Institute (Grant #2 U54 HG003273) and the Effie and Wofford Cain Foundation. The authors thank the HGSC faculties who contributed to the completion of this work, including Donna M. Muzny, Yi Han, John McPherson, David A. Wheeler, R. Gerald Fowler, and David N. Parker; Huyen H. Dinh, Sandra Lee, Christie L. Kovar, and Michael Holder from the HGSC 454 Group; James C. Durbin, Anthony San Lucas, and Adam M. Dunn from the HGSC Bioinformatics Group; Kaiyi Li for assistance with DNA extraction of banked liver tissue; and Drs. Changyi Chen and F. Charles Brunicardi for their support and advice.