Robust \(k\)-mer frequency estimation using gapped \(k\)-mers

Journal of Mathematical Biology

Abstract

Oligomers of fixed length, \(k\), commonly known as \(k\)-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. \(k\)-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of \(k\)-mers as sequence features is that as \(k\) is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific \(k\)-mer becomes very small, and the table of \(k\)-mer counts rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using \(k\)-mers will be susceptible to noisy estimation of \(k\)-mer frequencies once \(k\) becomes large. Because all molecular DNA interactions have limited spatial extent, gapped \(k\)-mers often carry the relevant biological signal. Here we use gapped \(k\)-mer counts to more robustly estimate the ungapped \(k\)-mer frequencies, by deriving an equation for the minimum norm estimate of \(k\)-mer frequencies given an observed set of gapped \(k\)-mer frequencies. Using a sample of CTCF binding sites in the human genome, we demonstrate that this approach provides a more accurate estimate of the \(k\)-mer frequencies in real biological sequences.

Acknowledgments

We thank the reviewers for their comments and suggestions, which significantly improved the manuscript. We also thank the users of the math.stackexchange.com online community, specifically users Joriki and Siva, for their useful comments, which helped us develop the proof. Dongwon Lee graciously provided the processed CTCF sequence data. The research of M.M. was supported in part by a grant from IPM (No. CS1390-4-07), and M.B. was supported by the Searle Scholars Program and in part by NIH grant NS062972.

Author information

Corresponding author

Correspondence to Michael A. Beer.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (r 7 KB)

Appendix: Proofs of some of the Propositions

Proof of Proposition 1

The first identity is proved as follows:

$$\begin{aligned}&\left( {\begin{array}{c}k\\ p\end{array}}\right) \left( {\begin{array}{c}p\\ t\end{array}}\right) \left( {\begin{array}{c}k-p\\ n-t\end{array}}\right) = \left( {\begin{array}{c}k\\ t\end{array}}\right) \left( {\begin{array}{c}k-t\\ p-t\end{array}}\right) \left( {\begin{array}{c}k-p\\ n-t\end{array}}\right) =\left( {\begin{array}{c}k\\ t\end{array}}\right) \left( {\begin{array}{c}k-t\\ k-p\end{array}}\right) \left( {\begin{array}{c}k-p\\ k-n-p+t\end{array}}\right) \\&\quad =\left( {\begin{array}{c}k\\ t\end{array}}\right) \left( {\begin{array}{c}k-t\\ k-n\end{array}}\right) \left( {\begin{array}{c}k-n\\ p-t\end{array}}\right) =\left( {\begin{array}{c}k\\ t\end{array}}\right) \left( {\begin{array}{c}k-t\\ n-t\end{array}}\right) \left( {\begin{array}{c}k-n\\ p-t\end{array}}\right) =\left( {\begin{array}{c}k\\ n\end{array}}\right) \left( {\begin{array}{c}n\\ t\end{array}}\right) \left( {\begin{array}{c}k-n\\ p-t\end{array}}\right) . \end{aligned}$$
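As a quick sanity check, the chain of equalities above can be tested numerically. The following is a minimal Python sketch and not part of the original proof; the function name `first_identity_holds` and the bound `k_max` are illustrative choices. It compares the first and last expressions of the chain for all small non-negative arguments with \(t\le \min (p,n)\) and \(p,n\le k\).

```python
from math import comb


def first_identity_holds(k_max):
    """Compare both ends of the chain for 0 <= t <= min(p, n) and p, n <= k <= k_max."""
    for k in range(k_max + 1):
        for p in range(k + 1):
            for n in range(k + 1):
                for t in range(min(p, n) + 1):
                    # First expression of the chain.
                    left = comb(k, p) * comb(p, t) * comb(k - p, n - t)
                    # Last expression of the chain.
                    right = comb(k, n) * comb(n, t) * comb(k - n, p - t)
                    if left != right:
                        return False
    return True


print(first_identity_holds(8))
```

A finite check of this kind does not replace the algebraic argument; it only guards against transcription errors in the indices.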

To prove (16), denote its left side by \(\tau _{p,k,\ell }\). Using Pascal's rule \(\left( {\begin{array}{c}k-n\\ p-t\end{array}}\right) =\left( {\begin{array}{c}k-n-1\\ p-t-1\end{array}}\right) +\left( {\begin{array}{c}k-n-1\\ p-t\end{array}}\right) \), we obtain

$$\begin{aligned} \tau _{p,k,\ell }&= \tau _{p-1,k-1,\ell }+\sum _{t=0}^k (-1)^{k-t}\left( \begin{array}{l}\ell \\ k\end{array}\right) \left( \begin{array}{l}k\\ t\end{array}\right) \left( \begin{array}{c}-1\\ p-t-1\end{array}\right) x^t\\&\quad +\tau _{p,k-1,\ell }+\sum _{t=0}^k (-1)^{k-t}\left( \begin{array}{l}\ell \\ k\end{array}\right) \left( \begin{array}{l}k\\ t\end{array}\right) \left( \begin{array}{c}-1\\ p-t \end{array}\right) x^t\\&= \tau _{p-1,k-1,\ell }+\tau _{p,k-1,\ell }+\sum _{t=0}^k (-1)^{k-t}\left( \begin{array}{l}\ell \\ k\end{array}\right) \left( \begin{array}{l}k\\ t\end{array}\right) \left( \left( \begin{array}{c}-1\\ p-t\end{array}\right) \!+\! \left( \begin{array}{c}-1\\ p-t-1\end{array}\right) \right) x^t\\&= \tau _{p-1,k-1,\ell }+\tau _{p,k-1,\ell }+\sum _{t=0}^k (-1)^{k-t}\left( \begin{array}{l}\ell \\ k\end{array}\right) \left( \begin{array}{l}k\\ t\end{array}\right) \left( \begin{array}{c}0\\ p-t\end{array}\right) x^t\\&= \tau _{p-1,k-1,\ell }+\tau _{p,k-1,\ell }+(-1)^{k-p} \left( \begin{array}{l}\ell \\ k\end{array}\right) \left( \begin{array}{l}k\\ p\end{array}\right) x^p. \end{aligned}$$

Then by replacing \(k\) by \(k^{\prime }\) in \(\tau _{p,k,\ell }-\tau _{p,k-1,\ell }=(-1)^{k-p}\left( {\begin{array}{c}\ell \\ p\end{array}}\right) \left( {\begin{array}{c}\ell -p\\ \,k-p\end{array}}\right) x^{p}+\tau _{p-1,k-1,\ell }\) and summing up over \(k^{\prime }\), \(0\le k^{\prime }\le k\), we get

$$\begin{aligned} \tau _{p,k,\ell }=\sum _{k^{\prime }=0}^k (-1)^{k^{\prime }-p}\left( \begin{array}{l}\ell \\ p\end{array}\right) \left( \begin{array}{l}\ell -p \\ k^{\prime }-p\end{array}\right) x^{p}+\sum _{k^{\prime }=0}^k \tau _{p-1,k^{\prime }-1,\ell }, \end{aligned}$$

whence

$$\begin{aligned} \tau _{p,k,\ell }=\left( \begin{array}{l}\ell \\ p\end{array}\right) \left( \begin{array}{l}k-\ell \\ k-p\end{array}\right) x^p+\sum _{k^{\prime }=0}^k \tau _{p-1,k^{\prime }-1,\ell }. \end{aligned}$$
(34)
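
The passage from the previous display to (34) uses a standard evaluation of an alternating partial sum of binomial coefficients. As a sketch, assuming that binomial coefficients with a negative lower index are zero and that those with a negative upper index are read in the generalized sense, the substitution \(j=k^{\prime }-p\) gives

$$\begin{aligned} \sum _{k^{\prime }=0}^{k} (-1)^{k^{\prime }-p}\binom{\ell -p}{k^{\prime }-p} = \sum _{j=0}^{k-p} (-1)^{j}\binom{\ell -p}{j} = (-1)^{k-p}\binom{\ell -p-1}{k-p} = \binom{k-\ell }{k-p}, \end{aligned}$$

where the last step is upper negation, \(\binom{r}{m}=(-1)^m\binom{m-r-1}{m}\). Multiplying by \(\binom{\ell }{p}x^p\) yields the first term on the right side of (34).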

Now we prove the identity \(\tau _{p,k,\ell }={\small \left( \begin{array}{l}k-\ell \\ k-p\end{array}\right) }\sum _{n=0}^{p} {\small \left( \begin{array}{l}\ell \\ n\end{array}\right) }x^n\) for \(p=0,1,\ldots ,k\) by bounded induction on \(p\). For \(p=0\), the sum over \(\tau _{-1,k^{\prime }-1,\ell }\) in (34) vanishes, so we obtain \( \tau _{0,k,\ell }={\small \left( \begin{array}{c}k-\ell \\ k\end{array}\right) }\), which is the required identity in this case. Now let \(0<p\le k\) and suppose that the result is true for \(p-1\). We prove the validity of the identity for \(p\) by using the induction hypothesis and (34), as follows:

$$\begin{aligned} \tau _{p,k,\ell }&= \left( \begin{array}{l}k-\ell \\ k-p\end{array}\right) \left( \begin{array}{l}\ell \\ p\end{array}\right) x^p+\sum _{k^{\prime }=0}^k \tau _{p-1,k^{\prime }-1,\ell }\\&= \left( \begin{array}{l}k-\ell \\ k-p\end{array}\right) \left( \begin{array}{l}\ell \\ p\end{array}\right) x^p+\sum _{k^{\prime }=0}^k \left( \begin{array}{l}k^{\prime }-\ell -1\\ k^{\prime }-p\end{array}\right) \sum _{n=0}^{p-1} \left( \begin{array}{l}\ell \\ n\end{array}\right) x^n\\&= \left( \begin{array}{l}k-\ell \\ k-p\end{array}\right) \left( \begin{array}{l}\ell \\ p\end{array}\right) x^p+ \left( \begin{array}{l}k-\ell \\ k-p\end{array}\right) \sum _{n=0}^{p-1} \left( \begin{array}{l}\ell \\ n\end{array}\right) x^n\\&= \left( \begin{array}{l}k-\ell \\ k-p\end{array}\right) \sum _{n=0}^{p} \left( \begin{array}{l}\ell \\ n\end{array}\right) x^n. \end{aligned}$$

\(\square \)
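
The induction step can also be checked numerically. The sketch below is illustrative rather than part of the proof: `gbinom`, `closed_form` and `induction_step_holds` are hypothetical helper names, and `gbinom` assumes the generalized reading of the binomial coefficient (integer upper index of either sign, value zero for a negative lower index). It verifies that the claimed closed form satisfies recursion (34) for small parameters.

```python
from math import comb, factorial


def gbinom(r, m):
    """Generalized binomial coefficient for an integer r of either sign; zero when m < 0."""
    if m < 0:
        return 0
    numerator = 1
    for i in range(m):
        numerator *= r - i
    # Exact division: a product of m consecutive integers is divisible by m!.
    return numerator // factorial(m)


def closed_form(p, k, l, x):
    """Claimed closed form: binom(k - l, k - p) * sum_{n=0}^{p} binom(l, n) x^n."""
    return gbinom(k - l, k - p) * sum(comb(l, n) * x**n for n in range(p + 1))


def induction_step_holds(p, k, l, x):
    """Check that the closed form satisfies recursion (34) for a given p >= 1."""
    rhs = comb(l, p) * gbinom(k - l, k - p) * x**p
    rhs += sum(closed_form(p - 1, kp - 1, l, x) for kp in range(k + 1))
    return closed_form(p, k, l, x) == rhs


print(all(induction_step_holds(p, k, l, x)
          for l in range(7) for k in range(7)
          for p in range(1, k + 1) for x in (-2, 1, 3)))
```

The inner sum over \(k^{\prime }\) is the parallel-summation step used in the third line of the display above.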

Proof of Proposition 2

  1. (i)

    It is enough to prove that the identity

    $$\begin{aligned} \sum _{y\in M_{\ell k}(u)} \nu (y,v^{\prime })=\left( \begin{array}{l}\ell -n\\ \ell -k\end{array}\right) \nu (u,v^{\prime }) \end{aligned}$$
    (35)

    holds for any \(u\in U_\ell \). The nonzero summands on the left are obtained from the words \(y\in V_{\ell k}\) in which all symbols \(g\) appear in positions \(G_{v^{\prime }}\); since \(|G_{v^{\prime }}|=\ell -n\), there are \(\left( {\begin{array}{c}\ell -n\\ \ell -k\end{array}}\right) \) such summands. On the other hand, since \(y\in M_{\ell k}(u)\), for any such \(y\) we have \(\nu (y,v^{\prime })=\nu (u,v^{\prime })\). Thus the sum simplifies to \(\left( {\begin{array}{c}\ell -n\\ \ell -k\end{array}}\right) \nu (u,v^{\prime })\), as required.

  2. (ii)

    It is enough to prove that the identity

    $$\begin{aligned} \sum _{u\in M^{\prime }_{\ell k}(v)} \nu (u,v^{\prime })=b^{\ell -k}\nu (v,v^{\prime }) \end{aligned}$$
    (36)

    holds for any \(v\in V_{\ell k}\). We prove this in two cases: Case (a) Suppose that \(G_{v}\subseteq G_{v^{\prime }}\). Then for any \(u\in M^{\prime }_{\ell k}(v)\) we have

    $$\begin{aligned} \nu (u,v^{\prime })=\prod _{i\in \overline{G}_{v^{\prime }}}\nu (u_i,v^{\prime }_i)=\prod _{i\in \overline{G}_{v^{\prime }}}\nu (v_i,v^{\prime }_i)=\nu (v,v^{\prime }) \end{aligned}$$

    and since there are \(b^{\ell -k}\) such words \(u\), equation (36) follows. Case (b) Suppose that \(G_{v}\not \subseteq G_{v^{\prime }}\), so that \(G_v {\setminus } G_{v^{\prime }}\ne \emptyset \). For any \(i\in G_v {\setminus } G_{v^{\prime }}\) we have \(\nu (v_i,v^{\prime }_i)=0\), thus the right side of (36) is \(0\). We prove that the left side is also \(0\) as follows. The nonzero summands in the summation are obtained from elements \(u\in X\), where the subset \(X\subseteq U_\ell \) is given by

    $$\begin{aligned} X=\{u\in U_\ell : u_i\le v^{\prime }_i \mathrm{\,\, for \,\,} i\in G_{v}{\setminus } G_{v^{\prime }} \mathrm{\,\,\quad and\,\,} u_i=v_i \mathrm{\,\, for \,\,} i\in \overline{G}_v \}. \end{aligned}$$

    Thus we obtain

    $$\begin{aligned} \sum _{u\in M^{\prime }_{\ell k}(v)} \nu (u,v^{\prime })&= \sum _{u\in X} \nu (u,v^{\prime })\\&= \sum _{u\in X} \prod _{i=1}^\ell \nu (u_i,v^{\prime }_i)\\&= \sum _{u\in X} \left( \prod _{i\in \overline{G}_v} \nu (u_i,v^{\prime }_i) \prod _{i\in G_v \setminus G_{v^{\prime }}} \nu (u_i,v^{\prime }_i) \prod _{i\in G_v \cap G_{v^{\prime }}} \nu (u_i,v^{\prime }_i)\right) \\&= \sum _{u\in X} \left( \prod _{i\in \overline{G}_v} \nu (v_i,v^{\prime }_i) \prod _{i\in G_v \setminus G_{v^{\prime }}} \nu (u_i,v^{\prime }_i) \prod _{i\in G_v \cap G_{v^{\prime }}} \nu (u_i,g)\right) \\&= \prod _{i\in \overline{G}_v} \nu (v_i,v^{\prime }_i) \prod _{i\in G_v \setminus G_{v^{\prime }}} \sum _{u_i=0}^{v^{\prime }_i}\nu (u_i,v^{\prime }_i) \prod _{i\in G_v \cap G_{v^{\prime }}} \sum _{u_i=0}^{b-1}\nu (u_i,g)\\&= b^{|G_v \cap G_{v^{\prime }}|}\prod _{i\in \overline{G}_v} \nu (v_i,v^{\prime }_i) \prod _{i\in G_v \setminus G_{v^{\prime }}} \sum _{u_i=0}^{v^{\prime }_i}\nu (u_i,v^{\prime }_i)\\&= 0. \end{aligned}$$

    The last identity holds because for any \(i\in G_v{\setminus } G_{v^{\prime }}\) we have \(\sum _{u_i=0}^{v^{\prime }_i}\nu (u_i,v^{\prime }_i)=\Big (\sum _{u_i=0}^{v^{\prime }_i-1}1\Big )-v^{\prime }_i=0\).

  3. (iii), (iv)

    These are immediate consequences of parts (i) and (ii). \(\square \)
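
Identities (35) and (36) can be checked by brute force for small parameters. The Python sketch below is not from the original text and rests on assumptions that are inferred from the sums in this proof rather than quoted from the paper's definitions: the single-position weight is taken to be \(\nu (a,j)=1\) for \(a<j\), \(-j\) for \(a=j\) and \(0\) for \(a>j\), with \(\nu (a,g)=1\) and \(\nu (g,j)=0\); the non-gap symbols of \(v^{\prime }\) are taken from \(1,\ldots ,b-1\); and the gap symbol \(g\) is represented by `None`. All function names are hypothetical.

```python
from itertools import combinations, product
from math import comb

GAP = None  # stands for the gap symbol g


def nu1(a, j):
    """Assumed single-position weight nu(a, j), inferred from the sums in the proof."""
    if j is GAP:
        return 1
    if a is GAP:
        return 0
    if a < j:
        return 1
    return -j if a == j else 0


def nu(w, vp):
    """Product of the single-position weights over all positions."""
    out = 1
    for a, j in zip(w, vp):
        out *= nu1(a, j)
    return out


def gapped_words(length, informative, symbols):
    """All words of the given length with exactly `informative` non-gap positions."""
    words = []
    for positions in combinations(range(length), informative):
        for values in product(symbols, repeat=informative):
            w = [GAP] * length
            for i, s in zip(positions, values):
                w[i] = s
            words.append(tuple(w))
    return words


def matches(v, u):
    """True when the gapped word v agrees with the l-mer u at every non-gap position."""
    return all(a is GAP or a == c for a, c in zip(v, u))


def check_prop2(l, k, n, b):
    """Brute-force check of (35) and (36) for all u, v and v' at the given sizes."""
    U = list(product(range(b), repeat=l))
    V = gapped_words(l, k, range(b))
    Vp = gapped_words(l, n, range(1, b))
    for vp in Vp:
        for u in U:  # identity (35)
            lhs = sum(nu(y, vp) for y in V if matches(y, u))
            if lhs != comb(l - n, l - k) * nu(u, vp):
                return False
        for v in V:  # identity (36)
            lhs = sum(nu(u, vp) for u in U if matches(v, u))
            if lhs != b ** (l - k) * nu(v, vp):
                return False
    return True


print(check_prop2(3, 2, 1, 3))
```

The enumeration grows as \(b^{\ell }\), so the check is only practical for very small \(\ell \) and \(b\).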

Proof of Proposition 3

  1. (i)

    If \(i\in P\) and \(v_i>0\) then

    $$\begin{aligned} \phi _i(u,v)&= \sum _{j=v_i}^{b-1} \frac{\nu (v_i,j)\nu (u_i,j)}{j(j+1)}\\&= \sum _{j=v_i}^{b-1} \frac{\nu (v_i,j)^2}{j(j+1)}\\&= \frac{v_i^2}{v_i(v_i+1)}+\sum _{j=v_i+1}^{b-1} \frac{1}{j(j+1)}\\&= \frac{v_i}{v_i+1}+\bigg (\frac{1}{v_i+1}-\frac{1}{b}\bigg )\\&= \frac{b-1}{b}. \end{aligned}$$

    If \(i\in P\) and \(v_i=0\), then given \(j\ge 1\), we have \(\nu (u_i,j)=\nu (v_i,j)=1\), hence,

    $$\begin{aligned} \phi _i(u,v)&= \sum _{j=1}^{b-1} \frac{1}{j(j+1)}\\&= \frac{b-1}{b}. \end{aligned}$$

    The case \(i\in Q\) is handled similarly and gives \(\phi _i(u,v)=-\frac{1}{b}\).

  2. (ii)

    Without loss of generality suppose that \(v=v_1\cdots v_k g^{\ell -k}\). Let \(P=\{1,\ldots ,p\}\) and \(Q=\{p+1,\ldots ,k\}\) and for a subset \(\overline{G}\in {\small \left( \begin{array}{l}\{1,\ldots ,\ell \}\\ \ell -n\end{array}\right) }\) let \(X^{\prime }_{\ell n}(\overline{G})=\{v^{\prime }\in V^{\prime }_{\ell n}: \overline{G}_{v^{\prime }}=\overline{G}\}\). Denote the left side of (17) by \(S\). Since the nonzero summands in (17) are obtained from the words \(v^{\prime }\in V^{\prime }_{\ell n}\) which have \(g\)’s at the \(\ell -k\) rightmost positions, we may just consider words \(v^{\prime }=v^{\prime }_1 \cdots v^{\prime }_k g^{\ell -k}\) with \(|v^{\prime }_1 \cdots v^{\prime }_k|_g=k-n\). Then we have

    $$\begin{aligned} S&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2\in {Q \atopwithdelims ()n-t}} \sum _{v^{\prime }\in X^{\prime }_{\ell n}(T_1\cup T_2)}\frac{\nu (v,v^{\prime }) \nu (u,v^{\prime })}{{\displaystyle \prod \nolimits _{i \in \overline{G}_{v^{\prime }}}v^{\prime }_i(v^{\prime }_i+1)}} \\&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2 \in {Q \atopwithdelims ()n-t}}\sum _{v^{\prime }\in X^{\prime }_{\ell n}(T_1 \cup T_2)}\frac{ {\displaystyle \prod \nolimits _{i \in \overline{G}_{v^{\prime }}}\nu (v_i,v^{\prime }_i) \nu (u_i,v^{\prime }_i)}}{{\displaystyle \prod \nolimits _{i \in \overline{G}_{v^{\prime }}}v^{\prime }_i(v^{\prime }_i+1)}}\\&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2 \in {Q \atopwithdelims ()n-t}}\sum _{v^{\prime }\in X^{\prime }_{\ell n}(T_1 \cup T_2)} {\displaystyle \prod _{i \in T_1\cup T_2} \frac{ \nu (v_i,v^{\prime }_i) \nu (u_i,v^{\prime }_i)}{v^{\prime }_i(v^{\prime }_i+1)}}\\&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2\in {Q \atopwithdelims ()n-t}}\prod _{i \in T_1\cup T_2} \bigg ( \sum _{v^{\prime }_i=\max \{1,v_i\}}^{b-1} \frac{\nu (v_i,v^{\prime }_i) \nu (u_i,v^{\prime }_i)}{v^{\prime }_i(v^{\prime }_i+1)} \bigg )\\&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2\in {Q \atopwithdelims ()n-t}}\prod _{i \in T_1\cup T_2}\phi _i(u,v)\\&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2\in {Q \atopwithdelims ()n-t}}\left( \frac{b-1}{b}\right) ^{t} \left( \frac{-1}{b}\right) ^{n-t} ~~\hbox {(by part (i))}~~\\&= \frac{1}{b^n}\sum _{t=0}^n (-1)^{n-t}\left( \begin{array}{l}p\\ t\end{array}\right) \left( \begin{array}{l}k-p\\ n-t\end{array}\right) (b-1)^t. \end{aligned}$$

\(\square \)
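
Both parts of this proof lend themselves to a small exact check. The sketch below is an illustration under stated assumptions, not the authors' code: the weight \(\nu (a,j)\) is taken to be \(1\) for \(a<j\), \(-j\) for \(a=j\) and \(0\) for \(a>j\), as inferred from the sums in part (i); \(P\) is read as the set of positions where \(u\) and \(v\) agree and \(Q\) as the set where they differ; and the helper names are hypothetical. It confirms that \(\phi _i\) equals \((b-1)/b\) or \(-1/b\), and that summing products of these factors over \(n\)-subsets reproduces the final closed form.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def nu1(a, j):
    """Assumed single-position weight nu(a, j) for integers a >= 0 and j >= 1."""
    if a < j:
        return 1
    return -j if a == j else 0


def phi(u_i, v_i, b):
    """Sum over j = 1..b-1 of nu(v_i, j) nu(u_i, j) / (j (j + 1)); small j contribute zero."""
    return sum(Fraction(nu1(v_i, j) * nu1(u_i, j), j * (j + 1)) for j in range(1, b))


def s_bruteforce(p, k, n, b):
    """Sum over n-subsets of k positions; the first p positions agree, the rest differ."""
    factors = [Fraction(b - 1, b)] * p + [Fraction(-1, b)] * (k - p)
    total = Fraction(0)
    for subset in combinations(range(k), n):
        term = Fraction(1)
        for i in subset:
            term *= factors[i]
        total += term
    return total


def s_closed(p, k, n, b):
    """Closed form from the last line of the display in part (ii)."""
    numerator = sum((-1) ** (n - t) * comb(p, t) * comb(k - p, n - t) * (b - 1) ** t
                    for t in range(n + 1))
    return Fraction(numerator, b ** n)


print(phi(2, 2, 4), phi(0, 2, 4))
print(s_bruteforce(2, 4, 2, 4) == s_closed(2, 4, 2, 4))
```

Exact rational arithmetic via `Fraction` avoids any floating-point tolerance in the comparison.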

Proof of Proposition 5

  1. (i)

    These are immediate consequences of Definition 3, except the last equation. To calculate the norm of \(x_{v^{\prime }}\), assume without loss of generality that \(\overline{G}_{v^{\prime }}=\{1,\ldots ,n\}\), so that \(v^{\prime }=v^{\prime }_1\cdots v^{\prime }_n g^{\ell -n}\). The nonzero elements \(\nu (w,v^{\prime })\) of \(x_{v^{\prime }}\) are then obtained from the words \(w=r s \in V_{\ell k}\) with \(|s|=\ell -n\), \(|s|_g=\ell -k\), \(r=r_1 \cdots r_n\) and \(0\le r_i\le v^{\prime }_i\) for \(i=1,\ldots ,n\). Moreover, the value \(\nu (w,v^{\prime })\) is independent of \(s\), and there are \({\small \left( \begin{array}{l}\ell -n\\ \ell -k\end{array}\right) }b^{k-n}\) choices for \(s\). We thus obtain

    $$\begin{aligned} ||x_{v^{\prime }}||^2&= \left( \begin{array}{l}\ell -n\\ \ell -k\end{array}\right) b^{k-n}\prod _{i=1}^n \sum _{r_i=0}^{v^{\prime }_i}\nu (r_i,v^{\prime }_i)^2\\&= \left( \begin{array}{l}\ell -n\\ \ell -k\end{array}\right) b^{k-n}\prod _{i=1}^n (v^{\prime }_i+{v^{\prime }_i}^2), \end{aligned}$$

    as required.

  2. (ii)

    The first equation is proved easily considering Remark 2. To prove the second one, observe that

    $$\begin{aligned} Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top }=\Delta _{\ell k} E \Lambda ^{-1} E^{\top } \Delta _{\ell k}^{\top }, \end{aligned}$$

    in which the matrix \(D=E \Lambda ^{-1} E^{\top }\) is a diagonal matrix of the form \(D=diag((d_{v^{\prime }})_{v^{\prime } \in V^{\prime }_{\ell ,\le k}})\) with entries \(d_{v^{\prime }}=\frac{1}{\lambda _{|\overline{G}_{v^{\prime }}|}||x_{v^{\prime }}||^2}\) or

    $$\begin{aligned} d_{v^{\prime }}=\frac{1}{\left( \begin{array}{l}\ell -|\overline{G}_{v^{\prime }}|\\ \ell -k\end{array}\right) ^2 b^{\ell -|\overline{G}_{v^{\prime }}|}{\displaystyle \prod \nolimits _{i\in \overline{G}_{v^{\prime }}}(v^{\prime }_i+{v^{\prime }_i}^2)}}. \end{aligned}$$

  3. (iii)

    Since the columns of \(Q_{\ell k}\) are normalized orthogonal eigenvectors, the first identity holds. To prove the second one, using the notation of Remark 2, we begin by claiming

    $$\begin{aligned} A_{\ell k}^{\top } N_{\ell k}=0. \end{aligned}$$

    In fact, if \(y\in {\ker }(A_{\ell k}A_{\ell k}^{\top })\) then from \(A_{\ell k}A_{\ell k}^{\top }y=0\) we obtain \(y^{\top }A_{\ell k}A_{\ell k}^{\top }y=0\), that is, \(||A_{\ell k}^{\top }y||^2=0\), thus \(A_{\ell k}^{\top }y=0\), which proves the claim. Now, by using the decomposition \(P_{\ell k}=[Q_{\ell k}\,\, N_{\ell k}]\) and the equation \(P_{\ell k} P_{\ell k}^{\top }=I\), we obtain \(Q_{\ell k}Q_{\ell k}^{\top }+N_{\ell k}N_{\ell k}^{\top }=I\), hence \(A_{\ell k}^{\top } Q_{\ell k} Q_{\ell k}^{\top }=A_{\ell k}^{\top }-A_{\ell k}^{\top } N_{\ell k} N_{\ell k}^{\top }\), which yields \(A_{\ell k}^{\top } Q_{\ell k} Q_{\ell k}^{\top }=A_{\ell k}^{\top }\) using the above claim. Transposing both sides yields the result.

\(\square \)
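
The norm formula of part (i) can be compared with a direct enumeration. This Python sketch is illustrative only and relies on assumptions inferred from the proofs above rather than on the paper's Definition 3: the entries of \(x_{v^{\prime }}\) are taken to be \(\nu (w,v^{\prime })\) for \(w\in V_{\ell k}\), with \(\nu (a,j)=1\) for \(a<j\), \(-j\) for \(a=j\), \(0\) for \(a>j\), \(\nu (a,g)=1\) and \(\nu (g,j)=0\). The gap symbol is represented by `None` and the function names are hypothetical.

```python
from itertools import combinations, product
from math import comb, prod

GAP = None  # stands for the gap symbol g


def nu1(a, j):
    """Assumed single-position weight nu(a, j)."""
    if j is GAP:
        return 1
    if a is GAP:
        return 0
    if a < j:
        return 1
    return -j if a == j else 0


def nu(w, vp):
    """Product of the single-position weights over all positions."""
    return prod(nu1(a, j) for a, j in zip(w, vp))


def gapped_words(length, informative, symbols):
    """All words of the given length with exactly `informative` non-gap positions."""
    words = []
    for positions in combinations(range(length), informative):
        for values in product(symbols, repeat=informative):
            w = [GAP] * length
            for i, s in zip(positions, values):
                w[i] = s
            words.append(tuple(w))
    return words


def norm_sq_bruteforce(vp, l, k, b):
    """Sum of nu(w, v')^2 over every gapped k-mer w of length l."""
    return sum(nu(w, vp) ** 2 for w in gapped_words(l, k, range(b)))


def norm_sq_closed(vp, l, k, b):
    """Closed form: binom(l - n, l - k) * b^(k - n) * prod(v'_i + v'_i^2)."""
    informative = [j for j in vp if j is not GAP]
    n = len(informative)
    return comb(l - n, l - k) * b ** (k - n) * prod(j + j * j for j in informative)


vp = (2, GAP, GAP)
print(norm_sq_bruteforce(vp, 3, 2, 3), norm_sq_closed(vp, 3, 2, 3))
```

For \(v^{\prime }=2gg\) with \(\ell =3\), \(k=2\), \(b=3\), both routes give \(36\), and the same agreement holds when \(v^{\prime }\) has its non-gap positions elsewhere.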

Proof of Proposition 6

First note that

$$\begin{aligned} W_{\ell k}A_{\ell k}W_{\ell k}&= (A_{\ell k}^{\top }Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top })A_{\ell k} (A_{\ell k}^{\top }Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top })\\&= A_{\ell k}^{\top }Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top }(A_{\ell k} A_{\ell k}^{\top })Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top }\\&= A_{\ell k}^{\top }Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top }(Q_{\ell k}\Lambda Q_{\ell k}^{\top })Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top }\\&= A_{\ell k}^{\top }Q_{\ell k}(\Lambda ^{-1}Q_{\ell k}^{\top } Q_{\ell k}\Lambda ) (Q_{\ell k}^{\top } Q_{\ell k})\Lambda ^{-1}Q_{\ell k}^{\top }\\&= A_{\ell k}^{\top }Q_{\ell k} \Lambda ^{-1}Q_{\ell k}^{\top }\\&= W_{\ell k}. \end{aligned}$$

Second, we have

$$\begin{aligned} A_{\ell k}W_{\ell k}A_{\ell k}&= A_{\ell k}(A_{\ell k}^{\top }Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top })A_{\ell k}\\&= (A_{\ell k}A_{\ell k}^{\top })Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top }A_{\ell k}\\&= Q_{\ell k}(\Lambda Q_{\ell k}^{\top } Q_{\ell k}\Lambda ^{-1})Q_{\ell k}^{\top }A_{\ell k}\\&= Q_{\ell k} Q_{\ell k}^{\top }A_{\ell k}\\&= A_{\ell k} ~~\hbox {(by Proposition 5 (iii)).}~~ \end{aligned}$$

Finally, observe that \(W_{\ell k}A_{\ell k}\) and \(A_{\ell k}W_{\ell k}\) are real symmetric matrices. Together with the two identities above, these are the four Penrose conditions, so \(W_{\ell k}\) is the Moore–Penrose pseudo-inverse of \(A_{\ell k}\).

To prove the last statement, we show that the sum of the elements in each row of the matrix \(W_{\ell k}A_{\ell k}\) is equal to \(1\). The same result for columns follows from the symmetry of \(W_{\ell k}A_{\ell k}\). Let \(J=[1,1,\ldots ,1]^{\top }\) and \(F=[1,0,\ldots ,0]^{\top }\). Note that \(J=z^{\ell 0}_{gg \cdots g}\) and \(F=x^{\ell k0}_{gg \cdots g}\), and that the row sums of \(W_{\ell k}A_{\ell k}\) are the entries of \(W_{\ell k}A_{\ell k}J\). With \(M=\left( {\begin{array}{c}\ell \\ k\end{array}}\right) b^{k}\) denoting the number of all possible gapped \(k\)-mers, we have

$$\begin{aligned} W_{\ell k}A_{\ell k}J&= A_{\ell k}^{\top }Q_{\ell k} \Lambda ^{-1} Q_{\ell k}^{\top } A_{\ell k} z^{\ell 0}_{gg \cdots g} \\&= A_{\ell k}^{\top }Q_{\ell k} \Lambda ^{-1} Q_{\ell k}^{\top } b^{\ell -k} x^{\ell k0}_{gg \cdots g} ~~\hbox {(by Proposition 2 (ii)),}~~ \\&= A_{\ell k}^{\top }Q_{\ell k} \Lambda ^{-1} \frac{M}{\sqrt{M}} b^{\ell -k} F ~~\hbox {(using } Q^{\top } x^{\ell k0}_{gg \cdots g}=\sqrt{M}[1,0,0,..,0]),~~\\&= A_{\ell k}^{\top }Q_{\ell k} \frac{1}{\left( {\begin{array}{c}\ell \\ k\end{array}}\right) b^{\ell -k}} \frac{M}{\sqrt{M}} b^{\ell -k} F\\&= A_{\ell k}^{\top }\frac{1}{\sqrt{M}} \frac{1}{\left( {\begin{array}{c}\ell \\ k\end{array}}\right) b^{\ell -k}} \frac{M}{\sqrt{M}} b^{\ell -k} x^{\ell k0}_{gg \cdots g}\\&= \left( {\begin{array}{c}\ell \\ k\end{array}}\right) \frac{1}{\sqrt{M}} \frac{1}{\left( {\begin{array}{c}\ell \\ k\end{array}}\right) b^{\ell -k}} \frac{M}{\sqrt{M}} b^{\ell -k} z^{\ell 0}_{gg \cdots g}~~\hbox {(by Proposition 2 (i)),}~~ \\&= z^{\ell 0}_{gg \cdots g}\\&= [1,1,\ldots , 1]^{\top }. \end{aligned}$$

\(\square \)
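
The row-sum computation above uses two counting facts taken from Proposition 2: each gapped \(k\)-mer matches \(b^{\ell -k}\) \(\ell \)-mers, and each \(\ell \)-mer matches \(\left( {\begin{array}{c}\ell \\ k\end{array}}\right) \) gapped \(k\)-mers. The following Python sketch checks both for small sizes. It assumes that \(A_{\ell k}\) is the 0/1 matrix with rows indexed by gapped \(k\)-mers and columns by \(\ell \)-mers, with an entry equal to \(1\) when the two agree at every non-gap position; this reading is an assumption, since the definition of \(A_{\ell k}\) is not restated in this appendix, and the function names are hypothetical.

```python
from itertools import combinations, product
from math import comb


def mapping_matrix(l, k, b):
    """Assumed 0/1 matrix: rows are gapped k-mers, columns are l-mers over b symbols."""
    lmers = list(product(range(b), repeat=l))
    rows = []
    for positions in combinations(range(l), k):
        for symbols in product(range(b), repeat=k):
            # Entry is 1 when the l-mer carries the given symbols at the non-gap positions.
            rows.append([int(all(u[i] == s for i, s in zip(positions, symbols)))
                         for u in lmers])
    return rows


def row_and_column_sums(A):
    """Row sums (entries of A J) and column sums (entries of A^T applied to all-ones)."""
    row_sums = [sum(row) for row in A]
    col_sums = [sum(col) for col in zip(*A)]
    return row_sums, col_sums


def counting_facts_hold(l, k, b):
    """Check M = binom(l, k) b^k rows, row sums b^(l - k), column sums binom(l, k)."""
    A = mapping_matrix(l, k, b)
    row_sums, col_sums = row_and_column_sums(A)
    return (len(A) == comb(l, k) * b ** k
            and all(r == b ** (l - k) for r in row_sums)
            and all(c == comb(l, k) for c in col_sums))


print(counting_facts_hold(4, 2, 3))
```

The check covers only the two counting steps; it does not construct \(W_{\ell k}\) or verify the eigen-decomposition used in the display.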

About this article

Cite this article

Ghandi, M., Mohammad-Noori, M. & Beer, M.A. Robust \(k\)-mer frequency estimation using gapped \(k\)-mers. J. Math. Biol. 69, 469–500 (2014). https://doi.org/10.1007/s00285-013-0705-3

  • DOI: https://doi.org/10.1007/s00285-013-0705-3
