Identification of protein-coding regions in Arabidopsis thaliana genome based on quadratic discriminant analysis


A new method, MZEF, has been developed for predicting internal coding exons in genomic DNA sequences. It is based on a prediction algorithm that uses the quadratic discriminant function for multivariate statistical pattern recognition. With improved feature measures, an Arabidopsis thaliana-specific implementation of MZEF has been completed and made available to the plant genome community.
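
To illustrate what a quadratic discriminant function does, the following is a minimal sketch of textbook Gaussian quadratic discriminant analysis on two features, written with the Python standard library only. It is not the MZEF implementation: the abstract does not list the feature measures or training data, so the two features, the toy numbers, and the names `fit_gaussian`, `log_density`, and `qda_score` are all assumptions made for illustration. Each class (exon, pseudo-exon) gets its own mean and covariance, which is what makes the decision boundary quadratic rather than linear.

```python
import math

def fit_gaussian(rows):
    """Estimate the mean and 2x2 sample covariance of two-feature rows."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def log_density(x, mean, cov):
    """Bivariate normal log-density, dropping the -log(2*pi) term shared by both classes."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * (math.log(det) + maha)

def qda_score(x, exon_model, pseudo_model, prior_exon=0.5):
    """Log posterior odds of 'exon' versus 'pseudo-exon'; positive favours exon."""
    log_prior_odds = math.log(prior_exon / (1.0 - prior_exon))
    return log_density(x, *exon_model) - log_density(x, *pseudo_model) + log_prior_odds

# Hypothetical, made-up feature vectors (not data from the paper).
exons = [(0.9, 0.8), (1.1, 1.2), (1.0, 0.7), (1.3, 1.1), (0.7, 1.0)]
pseudo = [(-1.0, -0.5), (0.2, -1.4), (-0.6, 0.3), (-1.5, -1.0), (0.1, -0.2)]
exon_model = fit_gaussian(exons)
pseudo_model = fit_gaussian(pseudo)

candidate = (1.0, 1.0)
label = "exon" if qda_score(candidate, exon_model, pseudo_model) > 0 else "pseudo-exon"
print(label)
```

In this sketch a candidate is called an exon when the score is positive; shifting `prior_exon` moves that threshold, which is one simple way to trade sensitivity against specificity.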





Zhang, M. Identification of protein-coding regions in Arabidopsis thaliana genome based on quadratic discriminant analysis. Plant Mol Biol 37, 803–806 (1998).


  • exon prediction
  • quadratic discriminant analysis
  • Arabidopsis thaliana