Quantitative Biology

Volume 1, Issue 4, pp 261–271

NPEST: a nonparametric method and a database for transcription start site prediction

  • Tatiana Tatarinova
  • Alona Kryshchenko
  • Martin Triska
  • Mehedi Hassan
  • Denis Murphy
  • Michael Neely
  • Alan Schumitzky
Research Article


In this paper we present NPEST, a novel tool for the analysis of expressed sequence tag (EST) distributions and transcription start site (TSS) prediction. The method estimates the unknown probability distribution of ESTs using a maximum likelihood (ML) approach, and this estimate is then used to predict TSS positions. Accurate identification of TSSs is an important genomics task, since the position of regulatory elements relative to the TSS can have large effects on gene regulation, and the performance of promoter motif-finding methods depends on correct TSS identification. Our probabilistic approach extends recognition to multiple TSSs per locus, which may help improve understanding of alternative splicing mechanisms. This paper presents an analysis of simulated data as well as a statistical analysis of promoter regions of the model dicot plant Arabidopsis thaliana. Using our statistical tool we analyzed 16520 loci and developed a database of TSSs, which is now publicly available at
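The estimation idea summarized in the abstract can be illustrated with a minimal sketch. This is not the NPEST implementation: the Gaussian kernel, its fixed width `sigma`, the fixed grid of candidate positions, the weight threshold `min_weight`, and all function names are assumptions chosen for illustration. The sketch treats EST 5′-end positions at one locus as draws from a mixture whose components sit on candidate TSS positions, estimates the mixture weights by expectation-maximization (EM), and reports the grid points that retain substantial weight.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a normal distribution with the given mean and sigma."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def npml_weights(positions, grid, sigma=5.0, n_iter=200):
    """Maximum likelihood mixture weights on a fixed grid, fitted by EM.

    positions: observed EST 5'-end coordinates for one locus.
    grid: candidate TSS coordinates (assumed to span the observations).
    sigma: assumed spread of ESTs around a TSS (hypothetical value).
    """
    n, k = len(positions), len(grid)
    # Likelihood of every observation under every grid component.
    lik = [[gaussian_pdf(y, g, sigma) for g in grid] for y in positions]
    weights = [1.0 / k] * k  # start from a uniform mixing distribution
    for _ in range(n_iter):
        updated = [0.0] * k
        for row in lik:
            denom = sum(w * p for w, p in zip(weights, row))
            for j in range(k):
                # Responsibility of component j for this observation.
                updated[j] += weights[j] * row[j] / denom
        weights = [v / n for v in updated]
    return weights


def predict_tss(positions, grid, sigma=5.0, min_weight=0.1):
    """Return (position, weight) pairs for grid points with weight >= min_weight."""
    weights = npml_weights(positions, grid, sigma)
    return [(g, w) for g, w in zip(grid, weights) if w >= min_weight]


# Toy locus with two clusters of EST starts (made-up coordinates).
est_starts = [96, 98, 99, 100, 100, 101, 102, 104,
              157, 159, 160, 160, 161, 163]
candidates = list(range(50, 211, 10))
for position, weight in predict_tss(est_starts, candidates):
    print(position, round(weight, 3))
```

Because the weights are estimated jointly, a locus with several well-separated clusters of EST starts yields several candidate positions, each with a weight roughly proportional to the share of ESTs it explains. A fixed grid keeps the sketch short; a practical method would also need to choose or adapt the support points and the kernel width.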


transcription start site (TSS) nonparametric maximum likelihood 



Copyright information

© Higher Education Press and Springer-Verlag GmbH 2013

Authors and Affiliations

  • Tatiana Tatarinova (1)
  • Alona Kryshchenko (1)
  • Martin Triska (1, 2)
  • Mehedi Hassan (2)
  • Denis Murphy (2)
  • Michael Neely (1)
  • Alan Schumitzky (1)
  1. Children’s Hospital Los Angeles and Keck School of Medicine, University of Southern California, Los Angeles, USA
  2. Genomics and Computational Biology research group, University of South Wales, Treforest, Wales, UK
