Abstract
The accuracy of prediction methods based on power spectrum analysis depends on the threshold used to discriminate between protein coding and non-coding sequences in eukaryotic genomes. Because gene structure varies among eukaryotes, it is difficult to determine the best prediction threshold for a given organism from prior biological knowledge alone. To improve the accuracy of these methods, we developed a novel method based on a bootstrap algorithm that infers organism-specific optimal thresholds for eukaryotes. As prior information, the method requires only a few annotated protein coding regions from the organism being studied. Our results show that, with the calculated optimal thresholds, the average prediction accuracy on our test datasets is 81%, an increase of 19% over that obtained using the same empirical threshold P = 4 for all datasets. The proposed method is simple and convenient, and it can be readily applied to infer optimal thresholds for predicting coding regions in the genomes of most organisms.
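The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes the score P is the ratio of spectral power at frequency N/3 to the mean power over all non-zero frequencies, computed from binary indicator sequences for A, C, G and T, and that a threshold can be estimated by bootstrapping a low quantile of the scores of a few annotated coding regions. The function names, the choice of quantile, and the number of replicates are hypothetical and are not taken from the article.

```python
import cmath
import random


def period3_snr(seq: str) -> float:
    """Power at frequency N/3 divided by the mean power over k = 1..N-1.

    Assumed definition of the score P; the article may normalize differently.
    """
    seq = seq.upper()
    n = len(seq) - len(seq) % 3          # truncate so that N/3 is an integer
    if n == 0:
        raise ValueError("sequence must contain at least 3 bases")
    seq = seq[:n]
    w = cmath.exp(-2j * cmath.pi / 3)    # DFT kernel at k = N/3 has period 3
    peak = 0.0
    total = 0.0
    for base in "ACGT":
        # Counts of this base at codon positions 0, 1 and 2.
        c = [seq[p::3].count(base) for p in range(3)]
        count = sum(c)
        # DFT of the indicator sequence at k = N/3 reduces to three terms.
        peak += abs(c[0] + c[1] * w + c[2] * w * w) ** 2
        # Parseval: sum of |U(k)|^2 over all k is N * count; drop the k = 0 term.
        total += n * count - count ** 2
    if total == 0:
        return 0.0                       # homopolymer: no power off k = 0
    return peak / (total / (n - 1))


def bootstrap_threshold(scores, quantile=0.1, n_boot=1000, seed=0):
    """Average a low quantile of the scores over bootstrap resamples.

    Hypothetical estimator: the quantile and replicate count are assumptions.
    """
    rng = random.Random(seed)
    m = len(scores)
    estimates = []
    for _ in range(n_boot):
        sample = sorted(rng.choices(scores, k=m))   # resample with replacement
        estimates.append(sample[int(quantile * (m - 1))])
    return sum(estimates) / n_boot


# Toy "annotated coding regions" (made-up strings, not real genes).
annotated = ["ATGGCC" * 20, "ATGAAAGGT" * 15, "GCTGAAGAT" * 12]
scores = [period3_snr(s) for s in annotated]
threshold = bootstrap_threshold(scores)
```

A window of genomic sequence would then be called coding when its `period3_snr` value exceeds `threshold`, in place of the fixed empirical value P = 4 mentioned above.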
Acknowledgment
This work was supported by the National Natural Science Foundation of China (Grant No. 60571047).
Cite this article
Xu, S., Rao, N., Chen, X. et al. Inferring an organism-specific optimal threshold for predicting protein coding regions in eukaryotes based on a bootstrapping algorithm. Biotechnol Lett 33, 889–896 (2011). https://doi.org/10.1007/s10529-011-0525-8
DOI: https://doi.org/10.1007/s10529-011-0525-8