Journal of Mathematical Biology, Volume 69, Issue 2, pp 469–500

Robust \(k\)-mer frequency estimation using gapped \(k\)-mers

  • Mahmoud Ghandi
  • Morteza Mohammad-Noori
  • Michael A. Beer

Abstract

Oligomers of fixed length, \(k\), commonly known as \(k\)-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. \(k\)-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of \(k\)-mers as sequence features is that as \(k\) is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific \(k\)-mer becomes very small, and the table of \(k\)-mer counts rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using \(k\)-mers will be susceptible to noisy estimation of \(k\)-mer frequencies once \(k\) becomes large. Because all molecular DNA interactions have limited spatial extent, gapped \(k\)-mers often carry the relevant biological signal. Here we use gapped \(k\)-mer counts to more robustly estimate the ungapped \(k\)-mer frequencies, by deriving an equation for the minimum norm estimate of \(k\)-mer frequencies given an observed set of gapped \(k\)-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the \(k\)-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome.
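
The idea of a minimum norm estimate can be illustrated with a small numerical sketch. The code below is not the closed-form equation derived in the paper; it swaps in a generic numerical Moore-Penrose pseudoinverse on a toy problem. All names (`incidence_matrix`, `min_norm_estimate`, the wildcard character `-`) and the convention that `l` is the full word length and `k` the number of non-gap positions are assumptions made for this illustration.

```python
import itertools
import numpy as np

ALPHABET = "ACGT"


def all_lmers(l):
    """Every ungapped word of length l over the DNA alphabet."""
    return ["".join(p) for p in itertools.product(ALPHABET, repeat=l)]


def gapped_kmers(l, k):
    """Every length-l pattern with k fixed bases and l - k wildcards ('-')."""
    feats = []
    for pos in itertools.combinations(range(l), k):
        for bases in itertools.product(ALPHABET, repeat=k):
            pat = ["-"] * l
            for p, b in zip(pos, bases):
                pat[p] = b
            feats.append("".join(pat))
    return feats


def incidence_matrix(l, k):
    """A[i, j] = 1 if gapped pattern i matches ungapped word j, else 0.

    With this matrix, gapped counts y and ungapped counts x satisfy y = A @ x.
    """
    lmers = all_lmers(l)
    gapped = gapped_kmers(l, k)
    index = {g: i for i, g in enumerate(gapped)}
    A = np.zeros((len(gapped), len(lmers)))
    for j, w in enumerate(lmers):
        for pos in itertools.combinations(range(l), k):
            pat = ["-"] * l
            for p in pos:
                pat[p] = w[p]
            A[index["".join(pat)], j] = 1.0
    return A, lmers, gapped


def min_norm_estimate(gapped_counts, A):
    """Smallest-Euclidean-norm x with A @ x = gapped_counts (when consistent).

    Uses a dense numerical pseudoinverse, which is only practical for tiny l.
    """
    return np.linalg.pinv(A) @ gapped_counts


if __name__ == "__main__":
    A, lmers, gapped = incidence_matrix(l=3, k=2)
    rng = np.random.default_rng(0)
    x_true = rng.poisson(2.0, size=len(lmers)).astype(float)  # toy counts
    y = A @ x_true                      # gapped counts implied by x_true
    x_hat = min_norm_estimate(y, A)     # estimate recovered from y alone
    print("residual:", np.linalg.norm(A @ x_hat - y))
```

The dense pseudoinverse scales poorly because the number of ungapped words grows as 4^l, so this sketch is only meant to make the definition concrete on very short words.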

Keywords

DNA sequence, Oligomer, \(k\)-mer, Frequency estimation, Statistical learning

Mathematics Subject Classification

92D20 Protein sequences, DNA sequences; 92-08 Computational methods; 15A09 Matrix inversion, generalized inverses

Notes

Acknowledgments

We thank the reviewers for their comments and suggestions, which significantly improved the manuscript. We also thank the users of the math.stackexchange.com online community, specifically Joriki and Siva, for their useful comments, which helped us develop the proof. Dongwon Lee graciously provided the processed CTCF sequence data. The research of M.M. was supported in part by a grant from IPM (No. CS1390-4-07), and M.B. was supported by the Searle Scholars Program and in part by NIH grant NS062972.

Supplementary material

Supplementary material 1: 285_2013_705_MOESM1_ESM.r (7 KB)

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Mahmoud Ghandi (1, 2)
  • Morteza Mohammad-Noori (3, 4)
  • Michael A. Beer (1), corresponding author

  1. Department of Biomedical Engineering and McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, USA
  2. Broad Institute, Cambridge, USA
  3. School of Mathematics, Statistics and Computer Science, University of Tehran, Tehran, Iran
  4. School of Computer Science, Institute for Research in Fundamental Sciences, Tehran, Iran
