Abstract
Oligomers of fixed length, \(k\), commonly known as \(k\)-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. \(k\)-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of \(k\)-mers as sequence features is that as \(k\) is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific \(k\)-mer becomes very small, and the table of \(k\)-mer counts rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using \(k\)-mers will be susceptible to noisy estimation of \(k\)-mer frequencies once \(k\) becomes large. Because all molecular DNA interactions have limited spatial extent, gapped \(k\)-mers often carry the relevant biological signal. Here we use gapped \(k\)-mer counts to more robustly estimate the ungapped \(k\)-mer frequencies, by deriving an equation for the minimum norm estimate of \(k\)-mer frequencies given an observed set of gapped \(k\)-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the \(k\)-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome.
References
Albert AE (1972) Regression and the Moore–Penrose pseudoinverse. Academic Press, New York
Beer MA, Tavazoie S (2004) Predicting gene expression from sequence. Cell 117:185–198
Ben-Hur A, Ong CS, Sonnenburg S, Schölkopf B, Rätsch G (2008) Support vector machines and kernels for computational biology. PLoS Comput Biol 4:e1000173
Boyle AP, Song L, Lee B-K, London D, Keefe D, Birney E, Iyer VR, Crawford GE, Furey TS (2011) High-resolution genome-wide in vivo footprinting of diverse transcription factors in human cells. Genome Res 21:456–464
Cameron PJ (2003) Notes on Counting. http://www.maths.qmul.ac.uk/pjc/notes/counting.pdf. Accessed 25 Jan 2012
Elemento O, Tavazoie S (2005) Fast and systematic genome-wide discovery of conserved regulatory elements using a non-alignment based approach. Genome Biol 6:R18
Göke J, Schulz MH, Lasserre J, Vingron M (2012) Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts. Bioinformatics (Oxford, England)
Graham RL, Knuth DE, Patashnik O (1994) Concrete mathematics: a foundation for computer science, 2nd edn. Addison Wesley Publishing Company, Boston
van Helden J (2004) Metrics for comparing regulatory sequences on the basis of pattern counts. Bioinformatics 20:399–406
Kantorovitz MR, Kazemian M, Kinston S, Miranda-Saavedra D, Zhu Q, Robinson GE, Göttgens B, Halfon MS, Sinha S (2009) Motif-blind, genome-wide discovery of cis-regulatory modules in Drosophila and mouse. Dev Cell 17:568–579
Lee D, Karchin R, Beer MA (2011) Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res 21:2167–2180
Leslie CS, Eskin E, Cohen A, Weston J, Noble WS (2004) Mismatch string kernels for discriminative protein classification. Bioinformatics 20:467–476
Meinicke P, Tech M, Morgenstern B, Merkl R (2004) Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites. BMC Bioinformatics 5:169
Sonnenburg S, Schweikert G, Philips P, Behr J, Rätsch G (2007) Accurate splice site prediction using support vector machines. BMC Bioinformatics 8:S7
Sonnenburg S, Zien A, Rätsch G (2006) ARTS: accurate recognition of transcription starts in human. Bioinformatics 22:e472–480
Stormo GD (2000) DNA binding sites: representation and discovery. Bioinformatics 16:16–23
Wilson RM (1990) A diagonal form for the incidence matrices of \(t\)-subsets vs. \(k\)-subsets. Eur J Combin 11:609–615
Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3’ UTRs by comparison of several mammals. Nature 434:338–345
Acknowledgments
We thank the reviewers for their comments and suggestions, which significantly improved the manuscript. We also thank the users of the math.stackexchange.com online community, specifically users Joriki and Siva, for their useful comments, which helped us in the development of the proof. Dongwon Lee graciously provided the processed CTCF sequence data. The research of M.M. was in part supported by a grant from IPM (No. CS1390-4-07), and M.B. was supported by the Searle Scholars Program and in part by NIH grant NS062972.
Appendix: Proofs of some of the Propositions
Proof of Proposition 1
The first identity is proved as follows
To prove (16), denote the left side by \(\tau _{p,k,\ell }\), then using \(\left( {\begin{array}{c}k-n\\ p-t\end{array}}\right) =\left( {\begin{array}{c}k-n-1\\ p-t-1\end{array}}\right) +\left( {\begin{array}{c}k-n-1\\ p-t\end{array}}\right) \), we obtain
Then by replacing \(k\) by \(k^{\prime }\) in \(\tau _{p,k,\ell }-\tau _{p,k-1,\ell }=(-1)^{k-p}\left( {\begin{array}{c}\ell \\ p\end{array}}\right) \left( {\begin{array}{c}\ell -p\\ \,k-p\end{array}}\right) x^{p}+\tau _{p-1,k-1,\ell }\) and summing up over \(k^{\prime }\), \(0\le k^{\prime }\le k\), we get
whence
Now we prove the identity \(\tau _{p,k,\ell }={\small \left( \begin{array}{l}k-\ell \\ k-p\end{array}\right) }\sum _{n=0}^{p} {\small \left( \begin{array}{l}\ell \\ n\end{array}\right) }x^n\) for \(p=0,1,\cdots ,k\) by bounded induction on \(p\). For \(p=0\), using (34) we obtain \( \tau _{0,k,\ell }={\small \left( \begin{array}{c}k-\ell \\ k\end{array}\right) }\) which proves the required identity in this case. Now let \(0<p\le k\) and suppose that the result is true for \(p-1\). We prove the validity of the identity for \(p\) by using the induction hypothesis and (34), as follows:
\(\square \)
Proof of Proposition 2
(i) It is enough to prove that the identity
$$\begin{aligned} \sum _{y\in M_{\ell k}(u)} \nu (y,v^{\prime })=\left( \begin{array}{l}\ell -n\\ \ell -k\end{array}\right) \nu (u,v^{\prime }) \end{aligned}\tag{35}$$
holds for any \(u\in U_\ell \). The nonzero summands of the summation on the left are obtained from the words \(y\in V_{\ell k}\) in which all symbols \(g\) appear in positions \(G_{v^{\prime }}\); since \(|G_{v^{\prime }}|=\ell -n\), there are \(\left( {\begin{array}{c}\ell -n\\ \ell -k\end{array}}\right) \) such summands. On the other hand, since \(y\in M_{\ell k}(u)\), for any such \(y\) we have \(\nu (y,v^{\prime })=\nu (u,v^{\prime })\). Thus the summation simplifies to \(\left( {\begin{array}{c}\ell -n\\ \ell -k\end{array}}\right) \nu (u,v^{\prime })\), as required.
(ii) It is enough to prove that the identity
$$\begin{aligned} \sum _{u\in M^{\prime }_{\ell k}(v)} \nu (u,v^{\prime })=b^{\ell -k}\nu (v,v^{\prime }) \end{aligned}\tag{36}$$
holds for any \(v\in V_{\ell k}\). We prove this in two cases.

Case (a). Suppose that \(G_{v}\subseteq G_{v^{\prime }}\). Then for any \(u\in M^{\prime }_{\ell k}(v)\) we have
$$\begin{aligned} \nu (u,v^{\prime })=\prod _{i\in \overline{G}_{v^{\prime }}}\nu (u_i,v^{\prime }_i)=\prod _{i\in \overline{G}_{v^{\prime }}}\nu (v_i,v^{\prime }_i)=\nu (v,v^{\prime }) \end{aligned}$$
and since there are \(b^{\ell -k}\) such words \(u\), equation (36) follows.

Case (b). Suppose that \(G_{v}\not \subseteq G_{v^{\prime }}\), so that \(G_v {\setminus } G_{v^{\prime }}\ne \emptyset \). For any \(i\in G_v {\setminus } G_{v^{\prime }}\) we have \(\nu (v_i,v^{\prime }_i)=0\), thus the right side of (36) is \(0\). We prove that the left side is also \(0\) as follows. The nonzero summands in the summation are obtained from elements \(u\in X\), where the subset \(X\subseteq U_\ell \) is given by
$$\begin{aligned} X=\{u\in U_\ell : u_i\le v^{\prime }_i \mathrm{\,\, for \,\,} i\in G_{v}{\setminus } G_{v^{\prime }} \mathrm{\,\,\quad and\,\,} u_i=v_i \mathrm{\,\, for \,\,} i\in \overline{G}_v \}. \end{aligned}$$
Thus we obtain
$$\begin{aligned} \sum _{u\in M^{\prime }_{\ell k}(v)} \nu (u,v^{\prime })&= \sum _{u\in X} \nu (u,v^{\prime })\\&= \sum _{u\in X} \prod _{i=1}^\ell \nu (u_i,v^{\prime }_i)\\&= \sum _{u\in X} \left( \prod _{i\in \overline{G}_v} \nu (u_i,v^{\prime }_i) \prod _{i\in G_v \setminus G_{v^{\prime }}} \nu (u_i,v^{\prime }_i) \prod _{i\in G_v \cap G_{v^{\prime }}} \nu (u_i,v^{\prime }_i)\right) \\&= \sum _{u\in X} \left( \prod _{i\in \overline{G}_v} \nu (v_i,v^{\prime }_i) \prod _{i\in G_v \setminus G_{v^{\prime }}} \nu (u_i,v^{\prime }_i) \prod _{i\in G_v \cap G_{v^{\prime }}} \nu (u_i,g)\right) \\&= \prod _{i\in \overline{G}_v} \nu (v_i,v^{\prime }_i) \prod _{i\in G_v \setminus G_{v^{\prime }}} \sum _{u_i=0}^{v^{\prime }_i}\nu (u_i,v^{\prime }_i) \prod _{i\in G_v \cap G_{v^{\prime }}} \sum _{u_i=0}^{b-1}\nu (u_i,g)\\&= b^{|G_v \cap G_{v^{\prime }}|}\prod _{i\in \overline{G}_v} \nu (v_i,v^{\prime }_i) \prod _{i\in G_v \setminus G_{v^{\prime }}} \sum _{u_i=0}^{v^{\prime }_i}\nu (u_i,v^{\prime }_i)\\&= 0. \end{aligned}$$
The last identity holds because for any \(i\in G_v{\setminus } G_{v^{\prime }}\) we have \(\sum _{u_i=0}^{v^{\prime }_i}\nu (u_i,v^{\prime }_i)=\sum _{u_i=0}^{v^{\prime }_i-1}1-v^{\prime }_i=0\).
(iii), (iv) These are immediate consequences of parts (i) and (ii). \(\square \)
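Identities (35) and (36) can be checked by brute force for small parameters. The sketch below is not taken from the paper. It assumes, as the proof suggests, that \(\nu (a,g)=1\) for every symbol \(a\), that \(\nu (g,j)=0\) for a non-gap symbol \(j\), and that for ordinary symbols \(\nu (a,j)\) equals \(1\) if \(a<j\), \(-j\) if \(a=j\), and \(0\) if \(a>j\). It also assumes that \(M_{\ell k}(u)\) is the set of gapped \(k\)-mers matching the \(\ell \)-mer \(u\), and that \(M^{\prime }_{\ell k}(v)\) is the set of \(\ell \)-mers matching \(v\). All function names are hypothetical.

```python
from itertools import combinations, product
from math import comb

GAP = "g"  # gap symbol; ordinary symbols are the integers 0..b-1


def nu1(a, j):
    """Assumed single-position value nu(a, j)."""
    if j == GAP:
        return 1          # nu(a, g) = 1 for every a
    if a == GAP:
        return 0          # nu(g, j) = 0 when j is not a gap
    if a < j:
        return 1
    return -j if a == j else 0


def nu(w, vp):
    """nu(w, v') as the product of the single-position values."""
    out = 1
    for a, j in zip(w, vp):
        out *= nu1(a, j)
    return out


def gapped_words(ell, n_solid, symbols):
    """All length-ell words with exactly n_solid non-gap positions."""
    words = []
    for solid in combinations(range(ell), n_solid):
        for fill in product(symbols, repeat=n_solid):
            w = [GAP] * ell
            for pos, s in zip(solid, fill):
                w[pos] = s
            words.append(tuple(w))
    return words


def matches(v, u):
    """True if the gapped word v agrees with the ell-mer u off its gaps."""
    return all(a == GAP or a == c for a, c in zip(v, u))


def check_identities(ell, k, b):
    """Brute-force check of (35) and (36) for all u, v and v' with n <= k."""
    U = list(product(range(b), repeat=ell))
    V = gapped_words(ell, k, range(b))
    for n in range(k + 1):
        for vp in gapped_words(ell, n, range(1, b)):
            for u in U:  # identity (35)
                lhs = sum(nu(y, vp) for y in V if matches(y, u))
                if lhs != comb(ell - n, ell - k) * nu(u, vp):
                    return False
            for v in V:  # identity (36)
                lhs = sum(nu(u, vp) for u in U if matches(v, u))
                if lhs != b ** (ell - k) * nu(v, vp):
                    return False
    return True


print(check_identities(3, 2, 3))
```

The enumeration grows exponentially in \(\ell \), so this is only meant as a sanity check on tiny alphabets and word lengths.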
Proof of Proposition 3
(i) If \(i\in P\) and \(v_i>0\) then
$$\begin{aligned} \phi _i(u,v)&= \sum _{j=v_i}^{b-1} \frac{\nu (v_i,j)\nu (u_i,j)}{j(j+1)}\\&= \sum _{j=v_i}^{b-1} \frac{\nu (v_i,j)^2}{j(j+1)}\\&= \frac{v_i^2}{v_i(v_i+1)}+\sum _{j=v_i+1}^{b-1} \frac{1}{j(j+1)}\\&= \frac{v_i}{v_i+1}+\bigg (\frac{1}{v_i+1}-\frac{1}{b}\bigg )\\&= \frac{b-1}{b}. \end{aligned}$$
If \(i\in P\) and \(v_i=0\), then for any \(j\ge 1\) we have \(\nu (u_i,j)=\nu (v_i,j)=1\), hence
$$\begin{aligned} \phi _i(u,v)&= \sum _{j=1}^{b-1} \frac{1}{j(j+1)}\\&= \frac{b-1}{b}. \end{aligned}$$
The case \(i\in Q\) is done similarly.
(ii) Without loss of generality suppose that \(v=v_1\cdots v_k g^{\ell -k}\). Let \(P=\{1,\ldots ,p\}\) and \(Q=\{p+1,\ldots ,k\}\), and for a subset \(\overline{G}\in {\small \left( \begin{array}{l}\{1,\ldots ,\ell \}\\ n\end{array}\right) }\) let \(X^{\prime }_{\ell n}(\overline{G})=\{v^{\prime }\in V^{\prime }_{\ell n}: \overline{G}_{v^{\prime }}=\overline{G}\}\). Denote the left side of (17) by \(S\). Since the nonzero summands in (17) are obtained from the words \(v^{\prime }\in V^{\prime }_{\ell n}\) which have \(g\)’s at the \(\ell -k\) rightmost positions, we need only consider words \(v^{\prime }=v^{\prime }_1 \cdots v^{\prime }_k g^{\ell -k}\) with \(|v^{\prime }_1 \cdots v^{\prime }_k|_g=k-n\). Then we have
$$\begin{aligned} S&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2\in {Q \atopwithdelims ()n-t}} \sum _{v^{\prime }\in X^{\prime }_{\ell n}(T_1\cup T_2)}\frac{\nu (v,v^{\prime }) \nu (u,v^{\prime })}{{\displaystyle \prod \nolimits _{i \in \overline{G}_{v^{\prime }}}v^{\prime }_i(v^{\prime }_i+1)}} \\&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2 \in {Q \atopwithdelims ()n-t}}\sum _{v^{\prime }\in X^{\prime }_{\ell n}(T_1 \cup T_2)}\frac{ {\displaystyle \prod \nolimits _{i \in \overline{G}_{v^{\prime }}}\nu (v_i,v^{\prime }_i) \nu (u_i,v^{\prime }_i)}}{{\displaystyle \prod \nolimits _{i \in \overline{G}_{v^{\prime }}}v^{\prime }_i(v^{\prime }_i+1)}}\\&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2 \in {Q \atopwithdelims ()n-t}}\sum _{v^{\prime }\in X^{\prime }_{\ell n}(T_1 \cup T_2)} {\displaystyle \prod _{i \in T_1\cup T_2} \frac{ \nu (v_i,v^{\prime }_i) \nu (u_i,v^{\prime }_i)}{v^{\prime }_i(v^{\prime }_i+1)}}\\&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2\in {Q \atopwithdelims ()n-t}}\prod _{i \in T_1\cup T_2} \bigg ( \sum _{v^{\prime }_i=\max \{1,v_i\}}^{b-1} \frac{\nu (v_i,v^{\prime }_i) \nu (u_i,v^{\prime }_i)}{v^{\prime }_i(v^{\prime }_i+1)} \bigg )\\&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2\in {Q \atopwithdelims ()n-t}}\prod _{i \in T_1\cup T_2}\phi _i(u,v)\\&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2\in {Q \atopwithdelims ()n-t}}\left( \frac{b-1}{b}\right) ^{t} \left( \frac{-1}{b}\right) ^{n-t} ~~\hbox {(by part (i))}~~\\&= \frac{1}{b^n}\sum _{t=0}^n (-1)^{n-t}\left( \begin{array}{l}p\\ t\end{array}\right) \left( \begin{array}{l}k-p\\ n-t\end{array}\right) (b-1)^t. \end{aligned}$$
\(\square \)
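Both parts of this proof lend themselves to an exact numerical check. The following sketch is an illustration, not the authors' code. It assumes the same single-position values of \(\nu \) as in the proof (\(1\) if \(a<j\), \(-j\) if \(a=j\), \(0\) if \(a>j\)), takes \(P\) to be the positions where \(u_i=v_i\) and \(Q\) the positions where they differ, and restricts \(u\) and \(v\) to the \(k\) non-gap positions of \(v\). The helper names `phi`, `lhs` and `rhs` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, prod


def nu1(a, j):
    """Assumed value nu(a, j) for symbols a in 0..b-1 and j in 1..b-1."""
    if a < j:
        return 1
    return -j if a == j else 0


def phi(ui, vi, b):
    """phi_i(u, v) computed exactly with rational arithmetic."""
    # Terms with j < v_i vanish because nu(v_i, j) = 0 there, so summing
    # from j = 1 agrees with the lower limit max{1, v_i} in the proof.
    return sum(Fraction(nu1(vi, j) * nu1(ui, j), j * (j + 1))
               for j in range(1, b))


def lhs(u, v, n, b):
    """Sum over all n-subsets T of positions of the product of phi_i."""
    k = len(v)
    return sum(prod((phi(u[i], v[i], b) for i in T), start=Fraction(1))
               for T in combinations(range(k), n))


def rhs(p, k, n, b):
    """Closed form on the last line of the derivation of S."""
    total = sum((-1) ** (n - t) * comb(p, t) * comb(k - p, n - t) * (b - 1) ** t
                for t in range(n + 1))
    return Fraction(total, b ** n)


b = 3
u = (0, 1, 2, 1)
v = (0, 2, 2, 0)
p = sum(a == c for a, c in zip(u, v))  # number of matching positions

# Part (i): phi is (b-1)/b on matches and -1/b on mismatches.
print(all(phi(a, c, b) == (Fraction(b - 1, b) if a == c else Fraction(-1, b))
          for a in range(b) for c in range(b)))
# Part (ii): subset sum equals the closed form for every n.
print(all(lhs(u, v, n, b) == rhs(p, len(v), n, b) for n in range(len(v) + 1)))
```

Using `Fraction` avoids floating-point round-off, so the equalities are tested exactly.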
Proof of Proposition 5
(i) These are immediate consequences of Definition 3, except the last equation. To calculate the norm of \(x_{v^{\prime }}\), without loss of generality let \(\overline{G}_{v^{\prime }}=\{1,\ldots ,n\}\), hence \(v^{\prime }=v^{\prime }_1\cdots v^{\prime }_n g^{\ell -n}\), and it is evident that the nonzero elements \(\nu (w,v^{\prime })\) of \(x_{v^{\prime }}\) are obtained from the words \(w=r s \in V_{\ell k}\) with \(|s|=\ell -n\), \(|s|_g=\ell -k\), \(r=r_1 \cdots r_n\) and \(0\le r_i\le v^{\prime }_i\) for \(i=1,\ldots ,n\). Moreover, the value \(\nu (w,v^{\prime })\) is independent of \(s\) and there are \({\small \left( \begin{array}{l}\ell -n\\ \ell -k\end{array}\right) }b^{k-n}\) choices for \(s\). We thus obtain
$$\begin{aligned} ||x_{v^{\prime }}||^2&= \left( \begin{array}{l}\ell -n\\ \ell -k\end{array}\right) b^{k-n}\prod _{i=1}^n \sum _{r_i=0}^{v^{\prime }_i}\nu (r_i,v^{\prime }_i)^2\\&= \left( \begin{array}{l}\ell -n\\ \ell -k\end{array}\right) b^{k-n}\prod _{i=1}^n (v^{\prime }_i+{v^{\prime }_i}^2), \end{aligned}$$
as required.
(ii) The first equation is proved easily considering Remark 2. To prove the second one, observe that
$$\begin{aligned} Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top }=\Delta _{\ell k} E \Lambda ^{-1} E^{\top } \Delta _{\ell k}^{\top }, \end{aligned}$$
in which the matrix \(D=E \Lambda ^{-1} E^{\top }\) is a diagonal matrix of the form \(D=\mathrm{diag}((d_{v^{\prime }})_{v^{\prime } \in V^{\prime }_{\ell ,\le k}})\) with entries \(d_{v^{\prime }}=\frac{1}{\lambda _{|\overline{G}_{v^{\prime }}|}||x_{v^{\prime }}||^2}\) or
$$\begin{aligned} d_{v^{\prime }}=\frac{1}{\left( \begin{array}{l}\ell -|\overline{G}_{v^{\prime }}|\\ \ell -k\end{array}\right) ^2 b^{\ell -|\overline{G}_{v^{\prime }}|}{\displaystyle \prod \nolimits _{i\in \overline{G}_{v^{\prime }}}(v^{\prime }_i+{v^{\prime }_i}^2)}}. \end{aligned}$$
(iii) Since the columns of \(Q_{\ell k}\) are orthonormal eigenvectors, the first identity holds. To prove the second one, using the notation of Remark 2, we begin by claiming
$$\begin{aligned} A_{\ell k}^{\top } N_{\ell k}=0. \end{aligned}$$
Indeed, if \(y\in {\ker }(A_{\ell k}A_{\ell k}^{\top })\) then from \(A_{\ell k}A_{\ell k}^{\top }y=0\) we obtain \(y^{\top }A_{\ell k}A_{\ell k}^{\top }y=0\), that is, \(||A_{\ell k}^{\top }y||=0\), thus \(A_{\ell k}^{\top }y=0\), which proves the claim. Now, by using the decomposition \(P_{\ell k}=[Q_{\ell k}\,\, N_{\ell k}]\) and the equation \(P_{\ell k} P_{\ell k}^{\top }=I\), we obtain \(Q_{\ell k}Q_{\ell k}^{\top }+N_{\ell k}N_{\ell k}^{\top }=I\), hence \(A_{\ell k}^{\top } Q_{\ell k} Q_{\ell k}^{\top }=A_{\ell k}^{\top }-A_{\ell k}^{\top } N_{\ell k} N_{\ell k}^{\top }\), which yields \(A_{\ell k}^{\top } Q_{\ell k} Q_{\ell k}^{\top }=A_{\ell k}^{\top }\) using the above claim. Transposing both sides yields the result.
\(\square \)
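The norm formula in part (i) can be confirmed by direct enumeration. The sketch below is an illustration under stated assumptions, not the paper's implementation. It treats \(x_{v^{\prime }}\) as the vector indexed by all gapped \(k\)-mers \(w\in V_{\ell k}\) with entries \(\nu (w,v^{\prime })\), and uses the same assumed single-position values of \(\nu \) as the proofs above (\(\nu (a,g)=1\), \(\nu (g,j)=0\) for non-gap \(j\), and \(1\), \(-j\), \(0\) for \(a<j\), \(a=j\), \(a>j\)). The function names are hypothetical, and the closed form is only evaluated for \(n\le k\).

```python
from itertools import combinations, product
from math import comb, prod

GAP = "g"  # gap symbol; ordinary symbols are the integers 0..b-1


def nu1(a, j):
    """Assumed single-position value nu(a, j)."""
    if j == GAP:
        return 1
    if a == GAP:
        return 0
    if a < j:
        return 1
    return -j if a == j else 0


def gapped_words(ell, n_solid, symbols):
    """All length-ell words with exactly n_solid non-gap positions."""
    for solid in combinations(range(ell), n_solid):
        for fill in product(symbols, repeat=n_solid):
            w = [GAP] * ell
            for pos, s in zip(solid, fill):
                w[pos] = s
            yield tuple(w)


def norm_sq(vp, k, b):
    """Squared norm of x_{v'} by summing nu(w, v')^2 over all w in V_{ell k}."""
    ell = len(vp)
    return sum(prod(nu1(a, j) for a, j in zip(w, vp)) ** 2
               for w in gapped_words(ell, k, range(b)))


def closed_form(vp, k, b):
    """C(ell-n, ell-k) * b^(k-n) * prod(v'_i + v'_i^2), assuming n <= k."""
    ell = len(vp)
    solid = [j for j in vp if j != GAP]
    n = len(solid)
    return comb(ell - n, ell - k) * b ** (k - n) * prod(j + j * j for j in solid)


vp = (2, GAP, 1, GAP)  # ell = 4, n = 2
print(norm_sq(vp, 3, 3), closed_form(vp, 3, 3))
```

For this example both values equal 72: there are two placements of the single gap of \(w\), three choices for the free symbol, and the two constrained positions contribute \(2+2^2=6\) and \(1+1^2=2\).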
Proof of Proposition 6
First note that
Second, we have
Finally, observe that \(W_{\ell k}A_{\ell k}\) and \(A_{\ell k}W_{\ell k}\) are real symmetric matrices. It is concluded that \(W_{\ell k}\) is the Moore–Penrose pseudo-inverse of \(A_{\ell k}\).
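For reference, the Moore–Penrose pseudo-inverse of a real matrix \(A\) is the unique matrix \(W\) satisfying the four Penrose conditions. They are written here in generic notation as a reminder of the standard definition, not as a reproduction of the displayed equations of this proof:

```latex
\begin{aligned}
A W A &= A, & W A W &= W,\\
(A W)^{\top} &= A W, & (W A)^{\top} &= W A.
\end{aligned}
```

The symmetry of \(W_{\ell k}A_{\ell k}\) and \(A_{\ell k}W_{\ell k}\) noted above corresponds to the second row, with \(A=A_{\ell k}\) and \(W=W_{\ell k}\).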
To prove the last statement, we show that the sum of the elements in each row of the matrix \(W_{\ell k}A_{\ell k}\) is equal to \(1\). The same result for columns follows from the symmetry of \(W_{\ell k}A_{\ell k}\). Let \(J=[1,1,\ldots ,1]^{\top }\) and \(F=[1,0,\ldots ,0]^{\top }\). Note that \(J=z^{\ell 0}_{gg \cdots g}\) and \(F=x^{\ell k0}_{gg \cdots g}\), and that the row sums of \(W_{\ell k}A_{\ell k}\) are the entries of \(W_{\ell k}A_{\ell k}J\). Given \(M=\left( {\begin{array}{c}\ell \\ k\end{array}}\right) b^{k}\), the number of all possible gapped \(k\)-mers, we have
\(\square \)
Cite this article
Ghandi, M., Mohammad-Noori, M. & Beer, M.A. Robust \(k\)-mer frequency estimation using gapped \(k\)-mers. J. Math. Biol. 69, 469–500 (2014). https://doi.org/10.1007/s00285-013-0705-3
DOI: https://doi.org/10.1007/s00285-013-0705-3