Inferring an organism-specific optimal threshold for predicting protein coding regions in eukaryotes based on a bootstrapping algorithm

  • Original Research Paper
  • Published in Biotechnology Letters

Abstract

The accuracy of prediction methods based on power spectrum analysis depends on the threshold used to discriminate between protein-coding and non-coding sequences in eukaryotic genomes. Because gene structure varies among eukaryotes, it is difficult to determine the best prediction threshold for a given organism from prior biological knowledge alone. To improve the accuracy of these methods, we developed a novel method based on a bootstrap algorithm that infers organism-specific optimal thresholds for eukaryotes. As prior information, the method requires only a few annotated protein-coding regions from the organism being studied. With the calculated optimal thresholds, the average prediction accuracy on our test datasets is 81%, an increase of 19% over that obtained when the same empirical threshold, P = 4, is used for all datasets. The proposed method is simple and convenient, and it can readily be applied to infer optimal thresholds for predicting coding regions in the genomes of most organisms.
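
The general idea can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not specify the details it fills in. It assumes the score P is the ratio of spectral power at frequency N/3 to the mean non-DC power of the four base-indicator sequences, and that the bootstrap estimates the threshold as the average of a low quantile of P over resampled annotated coding regions. The function names, the 5% quantile, and the number of replicates are all hypothetical choices made for illustration.

```python
import cmath
import random

BASES = "ACGT"


def period3_snr(seq):
    """Power at frequency N/3 divided by the mean non-DC spectral power.

    Assumed definition of the score P; the paper may define it differently.
    """
    seq = seq.upper()
    n = len(seq) - len(seq) % 3          # trim so that N/3 is an integer index
    seq = seq[:n]
    if n < 3:
        raise ValueError("sequence must contain at least 3 bases")
    w = cmath.exp(-2j * cmath.pi / 3)    # DFT kernel at k = N/3 has period 3
    rot = (1, w, w * w)
    peak = 0.0
    total = 0
    sum_sq_counts = 0
    for base in BASES:
        coef = sum(rot[i % 3] for i, c in enumerate(seq) if c == base)
        peak += abs(coef) ** 2
        count = seq.count(base)
        total += count
        sum_sq_counts += count * count
    # Parseval: the power summed over all k equals n * total; subtracting the
    # DC term (sum of squared counts) leaves the power at k = 1 .. n-1.
    mean_power = (n * total - sum_sq_counts) / (n - 1)
    return peak / mean_power if mean_power > 0 else 0.0


def bootstrap_threshold(coding_seqs, n_boot=1000, quantile=0.05, seed=0):
    """Average a low quantile of P over bootstrap resamples (assumed scheme)."""
    rng = random.Random(seed)
    scores = [period3_snr(s) for s in coding_seqs]
    idx = int(quantile * (len(scores) - 1))
    estimates = []
    for _ in range(n_boot):
        sample = sorted(rng.choices(scores, k=len(scores)))
        estimates.append(sample[idx])
    return sum(estimates) / n_boot


def is_coding(seq, threshold):
    """Call a window coding when its score reaches the threshold."""
    return period3_snr(seq) >= threshold
```

Under these assumptions, a call such as `bootstrap_threshold(annotated_exons)` would produce a threshold that replaces the fixed value P = 4 in `is_coding`. Here `annotated_exons` stands for the few annotated coding regions that the abstract says the method needs as input.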

Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant No. 60571047).

Author information

Corresponding author

Correspondence to Nini Rao.

About this article

Cite this article

Xu, S., Rao, N., Chen, X. et al. Inferring an organism-specific optimal threshold for predicting protein coding regions in eukaryotes based on a bootstrapping algorithm. Biotechnol Lett 33, 889–896 (2011). https://doi.org/10.1007/s10529-011-0525-8

  • DOI: https://doi.org/10.1007/s10529-011-0525-8
