Robust $k$-mer frequency estimation using gapped $k$-mers

Ghandi, Mahmoud; Mohammad-Noori, Morteza; Beer, Michael A.

doi:10.1007/s00285-013-0705-3

Published: 17 July 2013

Volume 69, pages 469–500, (2014)

Journal of Mathematical Biology

Abstract

Oligomers of fixed length, $k$, commonly known as $k$-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. $k$-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of $k$-mers as sequence features is that as $k$ is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific $k$-mer becomes very small, and rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using $k$-mers will be susceptible to noisy estimation of $k$-mer frequencies once $k$ becomes large. Because all molecular DNA interactions have limited spatial extent, gapped $k$-mers often carry the relevant biological signal. Here we use gapped $k$-mer counts to more robustly estimate the ungapped $k$-mer frequencies, by deriving an equation for the minimum norm estimate of $k$-mer frequencies given an observed set of gapped $k$-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the $k$-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome.

Acknowledgments

We thank the reviewers for their comments and suggestions, which significantly improved the manuscript. We also thank the users of the math.stackexchange.com online community, specifically users Joriki and Siva, for their useful comments, which helped us develop the proof. Dongwon Lee graciously provided the processed CTCF sequence data. The research of M.M. was supported in part by a grant from IPM (No. CS1390-4-07), and M.B. was supported by the Searle Scholars Program and in part by NIH grant NS062972.

Author information

Mahmoud Ghandi
Present address: Broad Institute, Cambridge, MA, 02142, USA

Authors and Affiliations

Department of Biomedical Engineering and McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, 21205, USA
Mahmoud Ghandi & Michael A. Beer
School of Mathematics, Statistics and Computer Science, University of Tehran, Tehran, Iran
Morteza Mohammad-Noori
School of Computer Science, Institute for Research in Fundamental Sciences, P.O. Box: 19395-5746, Tehran, Iran
Morteza Mohammad-Noori

Corresponding author

Correspondence to Michael A. Beer.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (r 7 KB)

Appendix: Proofs of some of the Propositions

Proof of Proposition 1

The first identity is proved as follows

$$\begin{aligned}&\left( {\begin{array}{c}k\\ p\end{array}}\right) \left( {\begin{array}{c}p\\ t\end{array}}\right) \left( {\begin{array}{c}k-p\\ n-t\end{array}}\right) = \left( {\begin{array}{c}k\\ t\end{array}}\right) \left( {\begin{array}{c}k-t\\ p-t\end{array}}\right) \left( {\begin{array}{c}k-p\\ n-t\end{array}}\right) =\left( {\begin{array}{c}k\\ t\end{array}}\right) \left( {\begin{array}{c}k-t\\ k-p\end{array}}\right) \left( {\begin{array}{c}k-p\\ k-n-p+t\end{array}}\right) \\&\quad =\left( {\begin{array}{c}k\\ t\end{array}}\right) \left( {\begin{array}{c}k-t\\ k-n\end{array}}\right) \left( {\begin{array}{c}k-n\\ p-t\end{array}}\right) =\left( {\begin{array}{c}k\\ t\end{array}}\right) \left( {\begin{array}{c}k-t\\ n-t\end{array}}\right) \left( {\begin{array}{c}k-n\\ p-t\end{array}}\right) =\left( {\begin{array}{c}k\\ n\end{array}}\right) \left( {\begin{array}{c}n\\ t\end{array}}\right) \left( {\begin{array}{c}k-n\\ p-t\end{array}}\right) . \end{aligned}$$
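The chain of equalities above can be sanity-checked by brute force. The following is a minimal sketch, not part of the original proof; the helper names `binom`, `identity_holds` and `check_all` are hypothetical, and binomial coefficients with out-of-range arguments are assumed to be zero (the usual combinatorial convention).

```python
from math import comb


def binom(a: int, b: int) -> int:
    """Binomial coefficient, taken to be 0 unless 0 <= b <= a (assumed convention)."""
    if a < 0 or b < 0 or b > a:
        return 0
    return comb(a, b)


def identity_holds(k: int, n: int, p: int, t: int) -> bool:
    """Compare the first and last expressions of the chain of equalities."""
    lhs = binom(k, p) * binom(p, t) * binom(k - p, n - t)
    rhs = binom(k, n) * binom(n, t) * binom(k - n, p - t)
    return lhs == rhs


def check_all(max_k: int = 8) -> bool:
    """Exhaustively test all 0 <= t <= min(n, p) and n, p <= k <= max_k."""
    return all(
        identity_holds(k, n, p, t)
        for k in range(max_k + 1)
        for n in range(k + 1)
        for p in range(k + 1)
        for t in range(min(n, p) + 1)
    )


print(check_all())
```

Both sides count pairs of subsets of a $k$-set with sizes $p$ and $n$ that intersect in exactly $t$ elements, which is why the check is expected to pass for every admissible parameter choice.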

To prove (16), denote its left side by $\tau _{p,k,\ell }$. Using Pascal's rule $\left( {\begin{array}{c}k-n\\ p-t\end{array}}\right) =\left( {\begin{array}{c}k-n-1\\ p-t-1\end{array}}\right) +\left( {\begin{array}{c}k-n-1\\ p-t\end{array}}\right) $, we obtain

$$\begin{aligned} \tau _{p,k,\ell }&= \tau _{p-1,k-1,\ell }+\sum _{t=0}^k (-1)^{k-t}\left( \begin{array}{l}\ell \\ k\end{array}\right) \left( \begin{array}{l}k\\ t\end{array}\right) \left( \begin{array}{c}-1\\ p-t-1\end{array}\right) x^t\\&\quad +\tau _{p,k-1,\ell }+\sum _{t=0}^k (-1)^{k-t}\left( \begin{array}{l}\ell \\ k\end{array}\right) \left( \begin{array}{l}k\\ t\end{array}\right) \left( \begin{array}{c}-1\\ p-t \end{array}\right) x^t\\&= \tau _{p-1,k-1,\ell }+\tau _{p,k-1,\ell }+\sum _{t=0}^k (-1)^{k-t}\left( \begin{array}{l}\ell \\ k\end{array}\right) \left( \begin{array}{l}k\\ t\end{array}\right) \left( \left( \begin{array}{c}-1\\ p-t\end{array}\right) \!+\! \left( \begin{array}{c}-1\\ p-t-1\end{array}\right) \right) x^t\\&= \tau _{p-1,k-1,\ell }+\tau _{p,k-1,\ell }+\sum _{t=0}^k (-1)^{k-t}\left( \begin{array}{l}\ell \\ k\end{array}\right) \left( \begin{array}{l}k\\ t\end{array}\right) \left( \begin{array}{c}0\\ p-t\end{array}\right) x^t\\&= \tau _{p-1,k-1,\ell }+\tau _{p,k-1,\ell }+(-1)^{k-p} \left( \begin{array}{l}\ell \\ k\end{array}\right) \left( \begin{array}{l}k\\ p\end{array}\right) x^p. \end{aligned}$$

Then by replacing $k$ by $k^{\prime }$ in $\tau _{p,k,\ell }-\tau _{p,k-1,\ell }=(-1)^{k-p}\left( {\begin{array}{c}\ell \\ p\end{array}}\right) \left( {\begin{array}{c}\ell -p\\ \,k-p\end{array}}\right) x^{p}+\tau _{p-1,k-1,\ell }$ and summing up over $k^{\prime }$, $0\le k^{\prime }\le k$, we get

$$\begin{aligned} \tau _{p,k,\ell }=\sum _{k^{\prime }=0}^k (-1)^{k^{\prime }-p}\left( \begin{array}{l}\ell \\ p\end{array}\right) \left( \begin{array}{l}\ell -p \\ k^{\prime }-p\end{array}\right) x^{p}+\sum _{k^{\prime }=0}^k \tau _{p-1,k^{\prime }-1,\ell }, \end{aligned}$$

whence

$$\begin{aligned} \tau _{p,k,\ell }=\left( \begin{array}{l}\ell \\ p\end{array}\right) \left( \begin{array}{l}k-\ell \\ k-p\end{array}\right) x^p+\sum _{k^{\prime }=0}^k \tau _{p-1,k^{\prime }-1,\ell }. \end{aligned}$$

(34)
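The step leading to (34) replaces the alternating sum $\sum _{k^{\prime }=0}^{k}(-1)^{k^{\prime }-p}\binom{\ell -p}{k^{\prime }-p}$ by the single coefficient $\binom{k-\ell }{k-p}$, whose upper index may be negative. The sketch below checks this numerically under the assumption that binomial coefficients with a negative upper index follow the standard falling-factorial definition; `gbinom`, `alternating_sum` and `closed_form` are illustrative names, not notation from the paper.

```python
from math import factorial


def gbinom(r: int, j: int) -> int:
    """Generalized binomial C(r, j) = r (r-1) ... (r-j+1) / j! for any integer r.

    Returns 0 for j < 0 (assumed convention).
    """
    if j < 0:
        return 0
    num = 1
    for i in range(j):
        num *= r - i
    # A product of j consecutive integers is divisible by j!, so this is exact.
    return num // factorial(j)


def alternating_sum(p: int, k: int, ell: int) -> int:
    """Sum over k' = 0..k of (-1)^(k'-p) C(ell - p, k' - p)."""
    total = 0
    for kp in range(k + 1):
        sign = -1 if (kp - p) % 2 else 1
        total += sign * gbinom(ell - p, kp - p)
    return total


def closed_form(p: int, k: int, ell: int) -> int:
    """The coefficient C(k - ell, k - p) appearing in (34)."""
    return gbinom(k - ell, k - p)


def check(max_ell: int = 9) -> bool:
    return all(
        alternating_sum(p, k, ell) == closed_form(p, k, ell)
        for ell in range(max_ell + 1)
        for k in range(ell + 1)
        for p in range(k + 1)
    )


print(check())
```

For example, with $p=1$, $k=3$, $\ell =5$ the sum is $1-4+6=3$, and $\binom{-2}{2}=(-2)(-3)/2=3$.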

Now we prove the identity $\tau _{p,k,\ell }={\small \left( \begin{array}{l}k-\ell \\ k-p\end{array}\right) }\sum _{n=0}^{p} {\small \left( \begin{array}{l}\ell \\ n\end{array}\right) }x^n$ for $p=0,1,\cdots ,k$ by bounded induction on $p$. For $p=0$, using (34) we obtain $ \tau _{0,k,\ell }={\small \left( \begin{array}{c}k-\ell \\ k\end{array}\right) }$ which proves the required identity in this case. Now let $0<p\le k$ and suppose that the result is true for $p-1$. We prove the validity of the identity for $p$ by using the induction hypothesis and (34), as follows:

$$\begin{aligned} \tau _{p,k,\ell }&= \left( \begin{array}{l}k-\ell \\ k-p\end{array}\right) \left( \begin{array}{l}\ell \\ p\end{array}\right) x^p+\sum _{k^{\prime }=0}^k \tau _{p-1,k^{\prime }-1,\ell }\\&= \left( \begin{array}{l}k-\ell \\ k-p\end{array}\right) \left( \begin{array}{l}\ell \\ p\end{array}\right) x^p+\sum _{k^{\prime }=0}^k \left( \begin{array}{l}k^{\prime }-\ell -1\\ k^{\prime }-p\end{array}\right) \sum _{n=0}^{p-1} \left( \begin{array}{l}\ell \\ n\end{array}\right) x^n\\&= \left( \begin{array}{l}k-\ell \\ k-p\end{array}\right) \left( \begin{array}{l}\ell \\ p\end{array}\right) x^p+ \left( \begin{array}{l}k-\ell \\ k-p\end{array}\right) \sum _{n=0}^{p-1} \left( \begin{array}{l}\ell \\ n\end{array}\right) x^n\\&= \left( \begin{array}{l}k-\ell \\ k-p\end{array}\right) \sum _{n=0}^{p} \left( \begin{array}{l}\ell \\ n\end{array}\right) x^n. \end{aligned}$$

$\square $

Proof of Proposition 2

(i)
It is enough to prove that the identity
$$\begin{aligned} \sum _{y\in M_{\ell k}(u)} \nu (y,v^{\prime })=\left( \begin{array}{l}\ell -n\\ \ell -k\end{array}\right) \nu (u,v^{\prime }) \end{aligned}$$
(35)
holds for any $u\in U_\ell $. The nonzero summands of the summation on the left are obtained from the words $y\in V_{\ell k}$ in which all symbols $g$ appear in positions $G_{v^{\prime }}$ (Using $|G_{v^{\prime }}|=\ell -n$, we conclude that there are $\left( {\begin{array}{c}\ell -n\\ \ell -k\end{array}}\right) $ such summands). On the other hand, since $y\in M_{\ell k}(u)$, it is easily seen that for any such $y$ we have $\nu (y,v^{\prime })=\nu (u,v^{\prime })$. Thus the summation is simplified to $\left( {\begin{array}{c}\ell -n\\ \ell -k\end{array}}\right) \nu (u,v^{\prime })$ as required.
(ii)
It is enough to prove that the identity
$$\begin{aligned} \sum _{u\in M^{\prime }_{\ell k}(v)} \nu (u,v^{\prime })=b^{\ell -k}\nu (v,v^{\prime }) \end{aligned}$$
(36)
holds for any $v\in V_{\ell k}$. We prove this in two cases: Case (a) Suppose that $G_{v}\subseteq G_{v^{\prime }}$. Then for any $u\in M^{\prime }_{\ell k}(v)$ we have
$$\begin{aligned} \nu (u,v^{\prime })=\prod _{i\in \overline{G}_{v^{\prime }}}\nu (u_i,v^{\prime }_i)=\prod _{i\in \overline{G}_{v^{\prime }}}\nu (v_i,v^{\prime }_i)=\nu (v,v^{\prime }) \end{aligned}$$
and since there are $b^{\ell -k}$ such words $u$, the equation (36) thus follows. Case (b) Suppose that $G_{v}\not \subseteq G_{v^{\prime }}$, consequently $G_v {\setminus } G_{v^{\prime }}\ne \emptyset $. Now for any $i\in G_v {\setminus } G_{v^{\prime }}$ we have $\nu (v_i,v^{\prime }_i)=0$, thus the right side of (36) is $0$. We prove that the left side is also $0$ as follows. The nonzero summands in the summation are obtained from elements $u\in X$ where the subset $X\subseteq U_\ell $ is given by
$$\begin{aligned} X=\{u\in U_\ell : u_i\le v^{\prime }_i \mathrm{\,\, for \,\,} i\in G_{v}{\setminus } G_{v^{\prime }} \mathrm{\,\,\quad and\,\,} u_i=v_i \mathrm{\,\, for \,\,} i\in \overline{G}_v \}. \end{aligned}$$
Thus we obtain
$$\begin{aligned} \sum _{u\in M^{\prime }_{\ell k}(v)} \nu (u,v^{\prime })&= \sum _{u\in X} \nu (u,v^{\prime })\\&= \sum _{u\in X} \prod _{i=1}^\ell \nu (u_i,v^{\prime }_i)\\&= \sum _{u\in X} \left( \prod _{i\in \overline{G}_v} \nu (u_i,v^{\prime }_i) \prod _{i\in G_v \setminus G_{v^{\prime }}} \nu (u_i,v^{\prime }_i) \prod _{i\in G_v \cap G_{v^{\prime }}} \nu (u_i,v^{\prime }_i)\right) \\&= \sum _{u\in X} \left( \prod _{i\in \overline{G}_v} \nu (v_i,v^{\prime }_i) \prod _{i\in G_v \setminus G_{v^{\prime }}} \nu (u_i,v^{\prime }_i) \prod _{i\in G_v \cap G_{v^{\prime }}} \nu (u_i,g)\right) \\&= \prod _{i\in \overline{G}_v} \nu (v_i,v^{\prime }_i) \prod _{i\in G_v \setminus G_{v^{\prime }}} \sum _{u_i=0}^{v^{\prime }_i}\nu (u_i,v^{\prime }_i) \prod _{i\in G_v \cap G_{v^{\prime }}} \sum _{u_i=0}^{b-1}\nu (u_i,g)\\&= b^{|G_v \cap G_{v^{\prime }}|}\prod _{i\in \overline{G}_v} \nu (v_i,v^{\prime }_i) \prod _{i\in G_v \setminus G_{v^{\prime }}} \sum _{u_i=0}^{v^{\prime }_i}\nu (u_i,v^{\prime }_i)\\&= 0. \end{aligned}$$
The last identity holds because for any $i\in G_v{\setminus } G_{v^{\prime }}$ we have $\sum _{u_i=0}^{v^{\prime }_i}\nu (u_i,v^{\prime }_i)=\sum _{u_i=0}^{v^{\prime }_i-1}1-v^{\prime }_i=0$.
(iii), (iv)
These are immediate consequences of parts (i) and (ii). $\square $
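Identities (35) and (36) can be checked exhaustively for small parameters. The sketch below relies on assumptions inferred from the proofs rather than stated in this appendix: the single-position function is taken to be $\nu (a,g)=1$, $\nu (g,c)=0$ for $c\ne g$, and for non-gap symbols $\nu (a,c)=1$ if $a<c$, $-c$ if $a=c$, and $0$ otherwise; $M_{\ell k}(u)$ is taken to be the set of gapped $k$-mers matching $u$, and $M^{\prime }_{\ell k}(v)$ the set of $\ell $-mers matching $v$. All function names are hypothetical.

```python
from itertools import combinations, product
from math import comb

G = "g"  # gap symbol


def nu(a, c):
    """Single-position nu(a, c), as inferred from the proofs (an assumption)."""
    if c == G:
        return 1
    if a == G:
        return 0
    if a < c:
        return 1
    return -c if a == c else 0


def nu_word(w, vp):
    """nu of two words: product of the position-wise values."""
    out = 1
    for a, c in zip(w, vp):
        out *= nu(a, c)
    return out


def gapped_words(ell, m, letters):
    """All length-ell words with exactly m non-gap symbols drawn from `letters`."""
    for pos in combinations(range(ell), m):
        for vals in product(letters, repeat=m):
            w = [G] * ell
            for i, a in zip(pos, vals):
                w[i] = a
            yield tuple(w)


def matches(u, v):
    """True if the full word u agrees with the gapped word v at all non-gap positions."""
    return all(c == G or a == c for a, c in zip(u, v))


def check(ell=3, k=2, b=3):
    full = list(product(range(b), repeat=ell))        # U_ell
    gapped = list(gapped_words(ell, k, range(b)))     # V_{ell k}
    for n in range(k + 1):
        for vp in gapped_words(ell, n, range(1, b)):  # V'_{ell n}
            for u in full:                            # identity (35)
                lhs = sum(nu_word(y, vp) for y in gapped if matches(u, y))
                if lhs != comb(ell - n, ell - k) * nu_word(u, vp):
                    return False
            for v in gapped:                          # identity (36)
                lhs = sum(nu_word(u, vp) for u in full if matches(u, v))
                if lhs != b ** (ell - k) * nu_word(v, vp):
                    return False
    return True


print(check())
```

The cancellation used in case (b) is visible directly in `nu`: for a fixed $c\ge 1$, the values $\nu (0,c),\ldots ,\nu (c,c)$ are $c$ ones followed by $-c$, so they sum to zero.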

Proof of Proposition 3

(i)
If $i\in P$ and $v_i>0$ then
$$\begin{aligned} \phi _i(u,v)&= \sum _{j=v_i}^{b-1} \frac{\nu (v_i,j)\nu (u_i,j)}{j(j+1)}\\&= \sum _{j=v_i}^{b-1} \frac{\nu (v_i,j)^2}{j(j+1)}\\&= \frac{v_i^2}{v_i(v_i+1)}+\sum _{j=v_i+1}^{b-1} \frac{1}{j(j+1)}\\&= \frac{v_i}{v_i+1}+\bigg (\frac{1}{v_i+1}-\frac{1}{b}\bigg )\\&= \frac{b-1}{b}. \end{aligned}$$
If $i\in P$ and $v_i=0$, then given $j\ge 1$, we have $\nu (u_i,j)=\nu (v_i,j)=1$, hence,
$$\begin{aligned} \phi _i(u,v)&= \sum _{j=1}^{b-1} \frac{1}{j(j+1)}\\&= \frac{b-1}{b}. \end{aligned}$$
The case $i\in Q$ is done similarly.
(ii)
Without loss of generality suppose that $v=v_1\cdots v_k g^{\ell -k}$. Let $P=\{1,\ldots ,p\}$ and $Q=\{p+1,\ldots ,k\}$ and for a subset $\overline{G}\in {\small \left( \begin{array}{l}\{1,\ldots ,\ell \}\\ \ell -n\end{array}\right) }$ let $X^{\prime }_{\ell n}(\overline{G})=\{v^{\prime }\in V^{\prime }_{\ell n}: \overline{G}_{v^{\prime }}=\overline{G}\}$. Denote the left side of (17) by $S$. Since the nonzero summands in (17) are obtained from the words $v^{\prime }\in V^{\prime }_{\ell n}$ which have $g$’s at the $\ell -k$ rightmost positions, we may just consider words $v^{\prime }=v^{\prime }_1 \cdots v^{\prime }_k g^{\ell -k}$ with $|v^{\prime }_1 \cdots v^{\prime }_k|_g=k-n$. Then we have
$$\begin{aligned} S&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2\in {Q \atopwithdelims ()n-t}} \sum _{v^{\prime }\in X^{\prime }_{\ell n}(T_1\cup T_2)}\frac{\nu (v,v^{\prime }) \nu (u,v^{\prime })}{{\displaystyle \prod \nolimits _{i \in \overline{G}_{v^{\prime }}}v^{\prime }_i(v^{\prime }_i+1)}} \\&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2 \in {Q \atopwithdelims ()n-t}}\sum _{v^{\prime }\in X^{\prime }_{\ell n}(T_1 \cup T_2)}\frac{ {\displaystyle \prod \nolimits _{i \in \overline{G}_{v^{\prime }}}\nu (v_i,v^{\prime }_i) \nu (u_i,v^{\prime }_i)}}{{\displaystyle \prod \nolimits _{i \in \overline{G}_{v^{\prime }}}v^{\prime }_i(v^{\prime }_i+1)}}\\&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2 \in {Q \atopwithdelims ()n-t}}\sum _{v^{\prime }\in X^{\prime }_{\ell n}(T_1 \cup T_2)} {\displaystyle \prod _{i \in T_1\cup T_2} \frac{ \nu (v_i,v^{\prime }_i) \nu (u_i,v^{\prime }_i)}{v^{\prime }_i(v^{\prime }_i+1)}}\\&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2\in {Q \atopwithdelims ()n-t}}\prod _{i \in T_1\cup T_2} \bigg ( \sum _{v^{\prime }_i=\max \{1,v_i\}}^{b-1} \frac{\nu (v_i,v^{\prime }_i) \nu (u_i,v^{\prime }_i)}{v^{\prime }_i(v^{\prime }_i+1)} \bigg )\\&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2\in {Q \atopwithdelims ()n-t}}\prod _{i \in T_1\cup T_2}\phi _i(u,v)\\&= \sum _{t=0}^n \sum _{T_1\in {P \atopwithdelims ()t}}\sum _{T_2\in {Q \atopwithdelims ()n-t}}\left( \frac{b-1}{b}\right) ^{t} \left( \frac{-1}{b}\right) ^{n-t} ~~\hbox {(by part (i))}~~\\&= \frac{1}{b^n}\sum _{t=0}^n (-1)^{n-t}\left( \begin{array}{l}p\\ t\end{array}\right) \left( \begin{array}{l}k-p\\ n-t\end{array}\right) (b-1)^t. \end{aligned}$$

$\square $
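Both parts of this proof lend themselves to an exact numerical check with rational arithmetic. The sketch below is illustrative only and assumes the same single-position function as inferred from the proofs ($\nu (a,c)=1$ if $a<c$, $-c$ if $a=c$, $0$ if $a>c$, and $\nu (a,g)=1$). It treats $u$ and $v$ as length-$k$ words over $\{0,\ldots ,b-1\}$, ignoring the trailing gap positions, which contribute a factor of $1$. The names `phi`, `s_bruteforce` and `s_closed` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def nu(a, c):
    """nu(a, c) for non-gap symbols a, c, as inferred from the proofs (an assumption)."""
    if a < c:
        return 1
    return -c if a == c else 0


def phi(ui, vi, b):
    """phi_i(u, v) = sum over j = max(1, v_i)..b-1 of nu(v_i, j) nu(u_i, j) / (j (j + 1))."""
    return sum(
        (Fraction(nu(vi, j) * nu(ui, j), j * (j + 1)) for j in range(max(1, vi), b)),
        Fraction(0),
    )


def s_bruteforce(u, v, n, b):
    """Sum over all v' with n non-gap symbols (values 1..b-1) among the k positions."""
    k = len(v)
    total = Fraction(0)
    for pos in combinations(range(k), n):
        for vals in product(range(1, b), repeat=n):
            term = Fraction(1)
            for i, c in zip(pos, vals):
                term *= Fraction(nu(v[i], c) * nu(u[i], c), c * (c + 1))
            total += term
    return total


def s_closed(u, v, n, b):
    """Closed form from the last line of the proof; p counts matching positions."""
    k = len(v)
    p = sum(a == c for a, c in zip(u, v))
    numerator = sum(
        (-1) ** (n - t) * comb(p, t) * comb(k - p, n - t) * (b - 1) ** t
        for t in range(n + 1)
    )
    return Fraction(numerator, b ** n)


def check(k=3, b=3):
    words = list(product(range(b), repeat=k))
    part_i = all(
        phi(a, c, b) == (Fraction(b - 1, b) if a == c else Fraction(-1, b))
        for a in range(b)
        for c in range(b)
    )
    part_ii = all(
        s_bruteforce(u, v, n, b) == s_closed(u, v, n, b)
        for u in words
        for v in words
        for n in range(k + 1)
    )
    return part_i and part_ii


print(check())
```

Part (i) corresponds to `phi` returning $(b-1)/b$ on matching positions and $-1/b$ on mismatching ones; part (ii) then follows because the sum over $v^{\prime }$ factorizes position by position.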

Proof of Proposition 5

(i)
These are immediate consequences of Definition 3, except the last equation. To calculate the norm of $x_{v^{\prime }}$, without loss of generality let $\overline{G}_{v^{\prime }}=\{1,\ldots ,n\}$, hence $v^{\prime }=v^{\prime }_1\cdots v^{\prime }_n g^{\ell -n}$. The nonzero elements $\nu (w,v^{\prime })$ of $x_{v^{\prime }}$ are then obtained from the words $w=r s \in V_{\ell k}$ with $|s|=\ell -n$, $|s|_g=\ell -k$, $r=r_1 \cdots r_n$ and $0\le r_i\le v^{\prime }_i$ for $i=1,\ldots ,n$. Moreover, the value $\nu (w,v^{\prime })$ is independent of $s$ and there are ${\small \left( \begin{array}{l}\ell -n\\ \ell -k\end{array}\right) }b^{k-n}$ choices for $s$. We thus obtain
$$\begin{aligned} ||x_{v^{\prime }}||^2&= \left( \begin{array}{l}\ell -n\\ \ell -k\end{array}\right) b^{k-n}\prod _{i=1}^n \sum _{r_i=0}^{v^{\prime }_i}\nu (r_i,v^{\prime }_i)^2\\&= \left( \begin{array}{l}\ell -n\\ \ell -k\end{array}\right) b^{k-n}\prod _{i=1}^n (v^{\prime }_i+{v^{\prime }_i}^2), \end{aligned}$$
as required.
(ii)
The first equation is proved easily considering Remark 2. To prove the second one, observe that
$$\begin{aligned} Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top }=\Delta _{\ell k} E \Lambda ^{-1} E^{\top } \Delta _{\ell k}^{\top }, \end{aligned}$$
in which $D=E \Lambda ^{-1} E^{\top }$ is a diagonal matrix of the form $D=\mathrm{diag}((d_{v^{\prime }})_{v^{\prime } \in V^{\prime }_{\ell ,\le k}})$ with entries $d_{v^{\prime }}=\frac{1}{\lambda _{|\overline{G}_{v^{\prime }}|}||x_{v^{\prime }}||^2}$, that is,
$$\begin{aligned} d_{v^{\prime }}=\frac{1}{\left( \begin{array}{l}\ell -|\overline{G}_{v^{\prime }}|\\ \ell -k\end{array}\right) ^2 b^{\ell -|\overline{G}_{v^{\prime }}|}{\displaystyle \prod \nolimits _{i\in \overline{G}_{v^{\prime }}}(v^{\prime }_i+{v^{\prime }_i}^2)}}. \end{aligned}$$
(iii)
Since the columns of $Q_{\ell k}$ are normalized orthogonal eigenvectors, the first identity holds. To prove the second one, using the notation of Remark 2, we begin by claiming
$$\begin{aligned} A_{\ell k}^{\top } N_{\ell k}=0. \end{aligned}$$
In fact, if $y\in {\ker }(A_{\ell k}A_{\ell k}^{\top })$ then from $A_{\ell k}A_{\ell k}^{\top }y=0$ we obtain $y^{\top }A_{\ell k}A_{\ell k}^{\top }y=0$ and $||A_{\ell k}^{\top }y||=0$, thus $A_{\ell k}^{\top }y=0$, which proves the claim. Now, by using the decomposition $P_{\ell k}=[Q_{\ell k}\,\, N_{\ell k}]$ and the equation $P_{\ell k} P_{\ell k}^{\top }=I$, we obtain $Q_{\ell k}Q_{\ell k}^{\top }+N_{\ell k}N_{\ell k}^{\top }=I$, hence $A_{\ell k}^{\top } Q_{\ell k} Q_{\ell k}^{\top }=A_{\ell k}^{\top }-A_{\ell k}^{\top } N_{\ell k} N_{\ell k}^{\top }$, which yields $A_{\ell k}^{\top } Q_{\ell k} Q_{\ell k}^{\top }=A_{\ell k}^{\top }$ using the above claim. Transposing both sides yields the result.

$\square $

Proof of Proposition 6

First note that

$$\begin{aligned} W_{\ell k}A_{\ell k}W_{\ell k}&= (A_{\ell k}^{\top }Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top })A_{\ell k} (A_{\ell k}^{\top }Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top })\\&= A_{\ell k}^{\top }Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top }(A_{\ell k} A_{\ell k}^{\top })Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top }\\&= A_{\ell k}^{\top }Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top }(Q_{\ell k}\Lambda Q_{\ell k}^{\top })Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top }\\&= A_{\ell k}^{\top }Q_{\ell k}(\Lambda ^{-1}Q_{\ell k}^{\top } Q_{\ell k}\Lambda ) (Q_{\ell k}^{\top } Q_{\ell k})\Lambda ^{-1}Q_{\ell k}^{\top }\\&= A_{\ell k}^{\top }Q_{\ell k} \Lambda ^{-1}Q_{\ell k}^{\top }\\&= W_{\ell k}. \end{aligned}$$

Second, we have

$$\begin{aligned} A_{\ell k}W_{\ell k}A_{\ell k}&= A_{\ell k}(A_{\ell k}^{\top }Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top })A_{\ell k}\\&= (A_{\ell k}A_{\ell k}^{\top })Q_{\ell k}\Lambda ^{-1}Q_{\ell k}^{\top }A_{\ell k}\\&= Q_{\ell k}(\Lambda Q_{\ell k}^{\top } Q_{\ell k}\Lambda ^{-1})Q_{\ell k}^{\top }A_{\ell k}\\&= Q_{\ell k} Q_{\ell k}^{\top }A_{\ell k}\\&= A_{\ell k} ~~\hbox {(by Proposition 5 (iii)).}~~ \end{aligned}$$

Finally, observe that $W_{\ell k}A_{\ell k}$ and $A_{\ell k}W_{\ell k}$ are real symmetric matrices. It is concluded that $W_{\ell k}$ is the Moore–Penrose pseudo-inverse of $A_{\ell k}$.
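The construction $W=A^{\top }Q\Lambda ^{-1}Q^{\top }$ can be illustrated on a toy matrix with exact arithmetic. The sketch below is not the gapped $k$-mer matrix $A_{\ell k}$: the $3\times 2$ matrix `A` and its eigenpairs are a hand-picked example, and all function names are hypothetical. It uses the fact that, for orthogonal but unnormalized eigenvectors $x_i$ of $AA^{\top }$ with nonzero eigenvalues $\lambda _i$, $Q\Lambda ^{-1}Q^{\top }=\sum _i x_i x_i^{\top }/(\lambda _i ||x_i||^2)$, which avoids square roots.

```python
from fractions import Fraction


def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def transpose(X):
    return [list(row) for row in zip(*X)]


def pinv_from_eigs(A, eigpairs):
    """W = A^T (sum_i x_i x_i^T / (lambda_i ||x_i||^2)) over nonzero eigenpairs of A A^T."""
    m = len(A)
    D = [[Fraction(0)] * m for _ in range(m)]
    for lam, x in eigpairs:
        norm2 = sum(c * c for c in x)
        for r in range(m):
            for s in range(m):
                D[r][s] += Fraction(x[r] * x[s], lam * norm2)
    return matmul(transpose(A), D)


def is_pseudoinverse(A, W):
    """Check the four Penrose conditions exactly."""
    AW, WA = matmul(A, W), matmul(W, A)
    return (
        matmul(AW, A) == A
        and matmul(WA, W) == W
        and transpose(AW) == AW
        and transpose(WA) == WA
    )


# Toy example: A A^T has eigenvalues 2, 1 and 0, so it is rank deficient.
A = [[1, 0], [1, 0], [0, 1]]
EIGPAIRS = [(2, [1, 1, 0]), (1, [0, 0, 1])]  # nonzero eigenpairs of A A^T

# Confirm that the supplied vectors really are eigenvectors of A A^T.
AAT = matmul(A, transpose(A))
for lam, x in EIGPAIRS:
    assert [sum(row[j] * x[j] for j in range(3)) for row in AAT] == [lam * c for c in x]

W = pinv_from_eigs(A, EIGPAIRS)
print(W)
print(is_pseudoinverse(A, W))
```

Dropping the zero eigenvalue from `EIGPAIRS` mirrors the restriction to the columns of $Q_{\ell k}$ in the proof: the kernel of $AA^{\top }$ contributes nothing because $A^{\top }$ annihilates it.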

To prove the last statement, we show that the sum of the elements in each row of the matrix $W_{\ell k}A_{\ell k}$ is equal to $1$. The same result for the columns follows from the symmetry of $W_{\ell k}A_{\ell k}$. Let $J=[1,1,\ldots ,1]^{\top }$ and $F=[1,0,\ldots ,0]^{\top }$. Note that $J=z^{\ell 0}_{gg \cdots g}$ and $F=x^{\ell k0}_{gg \cdots g}$, and that the row sums of $W_{\ell k}A_{\ell k}$ are the entries of $W_{\ell k}A_{\ell k}J$. Given $M=\left( {\begin{array}{c}\ell \\ k\end{array}}\right) b^{k}$, the number of all possible gapped $k$-mers, we have

$$\begin{aligned} W_{\ell k}A_{\ell k}J&= A_{\ell k}^{\top }Q_{\ell k} \Lambda ^{-1} Q_{\ell k}^{\top } A_{\ell k} z^{\ell 0}_{gg \cdots g} \\&= A_{\ell k}^{\top }Q_{\ell k} \Lambda ^{-1} Q_{\ell k}^{\top } b^{\ell -k} x^{\ell k0}_{gg \cdots g} ~~\hbox {(by Proposition 2 (ii)),}~~ \\&= A_{\ell k}^{\top }Q_{\ell k} \Lambda ^{-1} \frac{M}{\sqrt{M}} b^{\ell -k} F ~~\hbox {(using } Q^{\top } x^{\ell k0}_{gg \cdots g}=\sqrt{M}[1,0,0,..,0]),~~\\&= A_{\ell k}^{\top }Q_{\ell k} \frac{1}{\left( {\begin{array}{c}\ell \\ k\end{array}}\right) b^{\ell -k}} \frac{M}{\sqrt{M}} b^{\ell -k} F\\&= A_{\ell k}^{\top }\frac{1}{\sqrt{M}} \frac{1}{\left( {\begin{array}{c}\ell \\ k\end{array}}\right) b^{\ell -k}} \frac{M}{\sqrt{M}} b^{\ell -k} x^{\ell k0}_{gg \cdots g}\\&= \left( {\begin{array}{c}\ell \\ k\end{array}}\right) \frac{1}{\sqrt{M}} \frac{1}{\left( {\begin{array}{c}\ell \\ k\end{array}}\right) b^{\ell -k}} \frac{M}{\sqrt{M}} b^{\ell -k} z^{\ell 0}_{gg \cdots g}~~\hbox {(by Proposition 2 (i)),}~~ \\&= z^{\ell 0}_{gg \cdots g}\\&= [1,1,\ldots , 1]^{\top }. \end{aligned}$$

$\square $

About this article

Cite this article

Ghandi, M., Mohammad-Noori, M. & Beer, M.A. Robust $k$-mer frequency estimation using gapped $k$-mers. J. Math. Biol. 69, 469–500 (2014). https://doi.org/10.1007/s00285-013-0705-3

Received: 04 December 2012
Revised: 09 June 2013
Published: 17 July 2013
Issue Date: August 2014
DOI: https://doi.org/10.1007/s00285-013-0705-3
