Abstract
We developed an automated pipeline for the detection of single nucleotide polymorphisms (SNPs) in expressed sequence tag (EST) data sets by combining three DNA sequence analysis programs: Phred, Phrap and PolyBayes. This application requires access to the individual electropherogram traces. First, a reference set of 65 SNPs was obtained by sequencing 30 gametes in 13 maritime pine (Pinus pinaster Ait.) gene fragments (6671 bp), corresponding to a frequency of 1 SNP every 102.6 bp. Second, the parameters of the three programs were optimized to retrieve as many true SNPs as possible while keeping the rate of false positives as low as possible. Overall, the efficiency of detection of true SNPs was 83.1%. However, this rate varied widely with the frequency of the rare SNP allele: from as low as 41% for rare SNP alleles (frequency < 10%) up to 98% for allele frequencies above 10%. Third, the detection method was applied to the 18498 assembled maritime pine ESTs, which allowed the identification of a total of 1400 candidate SNPs in contigs containing between 4 and 20 sequence reads. These genetic resources, described for the first time in a forest tree species, were made available at http://www.pierroton.inra/genetics/Pinesnps. We also derived an analytical expression for the SNP detection probability as a function of the SNP allele frequency, the number of haploid genomes used to generate the EST sequence database, and the sample size of the contigs considered for SNP detection. The frequency of the SNP allele was shown to be the main factor influencing the probability of SNP detection.
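The abstract does not reproduce the analytical expression itself, so the following is only an illustrative sketch of how such a detection probability could depend on the three quantities named above. It is not the authors' formula. It assumes a simple model: each of the haploid genomes in the library independently carries the rare allele with probability p, the reads of a contig are drawn at random (with replacement) from those genomes, and a SNP counts as detectable when both alleles appear at least once among the reads. The function name and parameter names are hypothetical.

```python
from math import comb


def snp_detection_probability(p: float, n_genomes: int, n_reads: int) -> float:
    """Probability that both alleles of a SNP appear in a contig.

    Assumed model (not taken from the paper):
      * p         -- population frequency of the rare allele
      * n_genomes -- haploid genomes used to build the EST library
      * n_reads   -- reads in the contig, drawn with replacement
                     from the n_genomes library genomes
    """
    total = 0.0
    for k in range(n_genomes + 1):
        # Binomial probability that exactly k library genomes carry the rare allele.
        weight = comb(n_genomes, k) * p**k * (1.0 - p) ** (n_genomes - k)
        # Fraction of library genomes carrying the rare allele.
        q = k / n_genomes
        # Both alleles are seen unless every read shows the same allele.
        both_seen = 1.0 - q**n_reads - (1.0 - q) ** n_reads
        total += weight * both_seen
    return total


if __name__ == "__main__":
    # Hypothetical library size and contig depths, chosen only for illustration.
    for freq in (0.05, 0.10, 0.30, 0.50):
        row = [snp_detection_probability(freq, 20, depth) for depth in (4, 10, 20)]
        print(freq, [round(value, 3) for value in row])
```

Under these assumptions the probability is zero for a single read or a single genome and rises with the rare-allele frequency up to 0.5, which is in line with the abstract's statement that allele frequency is the main factor. A stricter detection rule, such as requiring the rare allele in at least two reads, would lower these values.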
Cite this article
Dantec, L.L., Chagné, D., Pot, D. et al. Automated SNP Detection in Expressed Sequence Tags: Statistical Considerations and Application to Maritime Pine Sequences. Plant Mol Biol 54, 461–470 (2004). https://doi.org/10.1023/B:PLAN.0000036376.11710.6f