Abstract
On a finite sequence of binary (0-1) trials we define a random variable enumerating patterns whose length is subject to certain constraints. For sequences of independent and identically distributed binary trials, exact probability mass functions are established in closed form by means of combinatorial analysis. An explicit expression for the mean value of this random variable is obtained. The results on the probability mass functions are extended to sequences of exchangeable binary trials. An application in Information theory, concerning the counting of a class of run-length-limited binary sequences, is provided as a direct byproduct of our study. Numerical examples further illustrate the results.
1 Introduction and Preliminaries
Pattern counting statistics defined on sequences of binary (zero, 0 - one, 1) random variables (RVs), along with their exact/approximate/limiting probability distributions, have been extensively studied because of their importance and usefulness as theoretical models in many research areas of science and engineering involving intrinsic uncertainty. Such areas include applied probability and statistics (e.g. hypothesis testing and statistical control of stochastic processes), engineering (e.g. system reliability, quality control and start-up demonstration tests of mechanical/electrical/electronic systems), frequency analysis and risk management for the occurrence of critical events in physical sciences (e.g. seismology, meteorology and hydrology), biology (e.g. population genetics and DNA/RNA sequence homology), computer science (e.g. encoding/decoding, storage and transmission of digital information) and stochastic financial engineering (e.g. insurance and risk analysis of financial data).
The counting of patterns is carried out according to certain enumerating schemes. Usually these schemes specify whether overlapping counting is allowed and whether counting restarts from scratch once a pattern of a certain form or size has been enumerated. The uncertainty may be modeled by several probabilistic models, often suggested by the applications. Usually such a model generates a binary sequence whose elements are either independent (memoryless source) or dependent in some way (source with memory), e.g. exchangeable or Markov dependent. The methods used to derive exact/approximate/limiting, marginal/joint/conditional probability distributions include combinatorial analysis, generating functions, the finite Markov chain imbedding technique, recursive schemes as well as normal, Poisson and large deviation approximations.
The literature on patterns and run-related statistics is very rich. The books of Balakrishnan and Koutras (2002), Fu and Lou (2003) and Glaz et al. (2001) cover most of the earlier and recent literature. Current works on the subject include, among others, those of Dafnis and Makri (2022), Dafnis et al. (2021), Eryilmaz (2019), Gera (2018, 2021), Makri and Psillakis (2019) and Zhao et al. (2022).
To fix notation, let \(\{X_{i}\}_{i=1}^{n}\) be a sequence of random variables defined over the alphabet \(A=\{0,1\}\), i.e. \(X_{i}=x_{i}\in A\), and \(S_{n}\), \(n>0\), a RV representing the total number of 1s in \(\{X_{i}\}_{i=1}^{n}\), with \(S_{n}=s\), \(s\in \{0,1,\ldots ,n\}\). That is, \(S_{n}=\sum _{i=1}^{n}X_{i}\).
Next, we introduce a discrete non-negative RV (statistic) counting 0-1 patterns of constrained length. The introduced RV depends on the length of the 0-1 sequence and on two flexible parameters which impose constraints on the length of the 0-1 patterns it enumerates. Consequently, this RV on the one hand generalizes and covers as particular cases RVs recently studied by several researchers, and on the other hand provides flexibility, additional perspective and applications not offered by the other RVs. One such important and direct application is counting finite 0-1 sequences subject to certain constraints. These sequences are used in Information theory and in particular in noiseless coding/decoding schemes implemented by many digital communication and recording systems.
Definition 1
We define as a run of 0 s or a 0-run a (sub)sequence of \(\{X_{i}\}_{i=1}^{n}\) consisting of consecutive 0 s, the number of which is referred to as its length. \(\diamondsuit \)
Definition 2
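For non-negative integer numbers n, d, k, \(0\le d\le k\le n\), \(n>0\), we define on \(\{X_{i}\}_{i=1}^{n}\), the RV \(X_{n,d,k}\) enumerating in n 0-1 trials the (total) number of occurrences of (d, k)-constrained patterns, in short [d, k; n], of the following forms: (f1) \(\underbrace{00\ldots 0}_{\le k}1\), (f2) \(1\underbrace{00\ldots 0}_{\ge d,\ \le k}1\), (f3) \(1\underbrace{00\ldots 0}_{\le k}\).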
Patterns of (f1) and (f3) forms may only occur at the beginning and at the end of the sequence, respectively. The counting of [d, k; n] is considered in the overlapping sense. That is, a 1 in the sequence can contribute toward counting two patterns; the one that ends with the occurrence of it and the next one which starts with it. \(\diamondsuit \)
In other words, \(X_{n,d,k}\) counts in \(\{X_{i}\}_{i=1}^{n}\): (f1) a pattern possibly occurring at the beginning of the sequence, consisting of a 0-run of leading 0s of length at most k followed by a 1, (f2) patterns consisting of two successive 1s separated (interrupted) by a 0-run of length at least d and at most k, (f3) a pattern possibly occurring at the end of the sequence, consisting of a 1 followed by a 0-run of trailing 0s of length at most k.
Remark 1
Definition 2 implies that: (a) If there is no 1 in the sequence, i.e. \(X_{i}=0\), \(i=1,2,\ldots , n\), then \(X_{n,d,k}=0\), \(0\le d\le k\le n\). (b) If there is no 0 in the sequence, i.e. \(X_{i}=1\), \(i=1,2,\ldots , n\), then \(X_{n,d,k}=n+1\), for \(d=0\), \(k=0,1,\ldots ,n\). \(\diamondsuit \)
Example 1
As an illustration, let us assume an experiment where \(n=25\) 0-1 trials, numbered from 1 to 25, result in the following sequence of outcomes: 0100010000111011001001100. Then, e.g. \(X_{25,0,1}=6\), \(X_{25,0,2}=9\), \(X_{25,1,1}=2\), \(X_{25,1,2}=5\) and \(X_{25,2,4}=6\). For instance, the 5 formed [1, 2; 25] patterns are: 1, 2; 13, 14, 15; 16, 17, 18, 19; 19, 20, 21, 22; 23, 24, 25. For a complete chart of the values of \(X_{25,d,k}\), for \(0\le d\le k\le 4\), computed by the forthcoming Eqs. (1)–(3), see Table 1. \(\diamondsuit \)
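The counts quoted in Example 1 can be reproduced directly from the verbal description of the forms (f1), (f2) and (f3). The following is a minimal sketch, not the indicator setup (1)–(3) of the paper; the function name `count_dk_patterns` and the string representation of the sequence are assumptions made for illustration.

```python
def count_dk_patterns(bits, d, k):
    """Count [d, k; n] patterns in a 0-1 string (overlapping counting)."""
    ones = [i for i, b in enumerate(bits) if b == "1"]
    if not ones:                                  # no 1 at all: no pattern can occur
        return 0
    leading = ones[0]                             # length of the leading 0-run
    trailing = len(bits) - 1 - ones[-1]           # length of the trailing 0-run
    gaps = [b - a - 1 for a, b in zip(ones, ones[1:])]   # 0-runs between successive 1s
    count = int(leading <= k)                     # (f1): leading 0-run of length <= k, then a 1
    count += sum(d <= g <= k for g in gaps)       # (f2): 1, 0-run of length in [d, k], 1
    count += int(trailing <= k)                   # (f3): a 1, then a trailing 0-run of length <= k
    return count


seq = "0100010000111011001001100"                 # sequence of Example 1, n = 25
for d, k in [(0, 1), (0, 2), (1, 1), (1, 2), (2, 4)]:
    print(f"X_25,{d},{k} =", count_dk_patterns(seq, d, k))
```

Because each 1 may close one pattern and open the next, the internal gaps are tested independently of one another, which is what the overlapping counting of Definition 2 amounts to.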
Throughout the article, for integers n, m, \({n\atopwithdelims ()m}\) denotes the extended binomial coefficient (see, e.g. Feller 1968, pp. 50, 63), \(\lfloor x\rfloor \) stands for the greatest integer less than or equal to x, \(\delta _{i,j}\) denotes the Kronecker delta function for integer arguments i and j, such that \(\delta _{i,j}=1\), \(i=j\); 0, otherwise. \(\mid A\mid \) denotes the cardinality of a set A and \(I_{A}(\omega )\) is the indicator function of a set A, \(A\subseteq \Omega \), such that \(\forall \omega \in \Omega \), \(I_{A}(\omega )=1\), \(\omega \in A\); 0, otherwise. Also, we apply the convention \(\sum _{i=\alpha }^{\beta }=0\) (\(\prod _{i=\alpha }^{\beta }=1\)), for \(\alpha >\beta \), that is an empty sum (product) is to be interpreted as zero (unity).
\(X_{n,d,k}\), \(0\le d\le k\le n\), takes values in a finite set of non-negative integers, denoted by \(\Lambda _{n,d,k}\) in the sequel.
Next, we formally define \(X_{n,d,k}\), \(0\le d\le k\le n\), on a 0-1 sequence of RVs \(X_{1}, X_{2},\ldots ,X_{n}\), \(n>0\), as follows:
where, for \(k=n-1,n\),
and for \(k\le n-2\),
The setup (1)–(3) holds for any 0-1 sequence \(\{X_{i}\}_{i=1}^{n}\) and it is the main tool to derive closed expressions for the expected (mean) value of \(X_{n,d,k}\). It is also useful to determine numeric values, empirical frequencies and moments of \(X_{n,d,k}\) in simulation studies of \(X_{n,d,k}\), e.g. in estimation and hypothesis testing problems, and in studies concerning applications such as processing of computer data files.
\(X_{n,d,k}\) covers as particular cases RVs referring to enumeration of some forms among (f1), (f2) and (f3). More specifically, let \(M_{n,d,k}\); \(N_{n,d,k}^{(F)}\) and \(N_{n,d,k}^{(R)}\), \(0\le d\le k\le n\), denote RVs counting patterns of (f2); (f1), (f2) and (f2), (f3) forms, respectively, in a 0-1 sequence of length n. Let \((\bar{f1})\) \(\underbrace{00\ldots 0}_{>k}1\) and \((\bar{f3})\) \(1\underbrace{00\ldots 0}_{>k}\) be patterns occurring only in the left end and the right end of the sequence, respectively. Then it holds:
For instance, on the \(0-1\) sequence of Example 1 we have, by (4)–(6), \(M_{25,1,2}=X_{25,1,2}-2=3\) and \(N_{25,1,2}^{(F)}=N_{25,1,2}^{(R)}=X_{25,1,2}-1=4\).
\(N_{n,d,k}^{(F)}\), \(N_{n,d,k}^{(R)}\), \(M_{n,d,k}\), variants of them as well as the particular cases \(M_{n,0,k}\), \(M_{n,d,d}\), \(M_{n,d,n-2}\) and \(M_{n,0,0}\) defined on \(0-1\) sequences of several internal structures have been extensively studied in the literature. Applications of these special cases of \(X_{n,d,k}\), defined on \(0-1\) sequences of independent and dependent elements, were found in hypothesis testing, urn models, record models and random permutations, system reliability, start-up demonstration tests, biomedical engineering and statistical control of stochastic processes. Indicative contributions on the subject and its applications, including overlapping/non-overlapping counting of patterns related to some of the prementioned RVs as well as to other type RVs, e.g. waiting time RVs, are among others the works of Dafnis et al. (2012), Eryilmaz and Zuo (2010), Gera (2021), Holst (2009), Kumar and Upadhye (2019), Ling (1988), Makri and Psillakis (2012, 2013, 2017), Sen and Goyal (2004) and Stefanov and Szpankowski (2007).
The rest of the paper is organized as follows. In Section 2 we establish our main results on sequences of independent and identically distributed (IID) binary trials with a common probability of 1s p, \(0<p<1\). More specifically, in Section 2.1 we obtain via combinatorial analysis exact closed form expressions, in terms of binomial coefficients, of the joint probability mass function (JPMF) of \(X_{n,d,k}\) and \(S_{n}\), of the conditional probability mass function (CPMF) of \(X_{n,d,k}\) given \(S_{n}\) and of the marginal probability mass function (PMF) of \(X_{n,d,k}\). A simple explicit expression of \(E(X_{n,d,k})\) is derived using the setup (1)–(3). In Section 2.2 we consider symmetric (\(p=1/2\)) IID sequences and we define two numbers that are potentially useful in data analysis. In Section 3 the exact formulae of Section 2.1 referring to JPMF, CPMF and PMF are extended to exchangeable (EXCH) or symmetrically dependent binary sequences. An application in Information theory is developed in Section 4. In this application an important class of run-length-limited sequences used in communication and data storage applications, namely the (d, k)-constrained sequences, is considered and Theorem 1 (for \(p=1/2\)) is directly applied to enumerate such sequences. Numerical examples, computed and presented in Sections 2–4, further illustrate our theoretical results. Section 5 ends the article with a brief discussion of its findings and some implications for future study and potential applications of (d, k)-constrained patterns.
2 Main Results - Independent and Identical 0-1 Trials
In this section we establish our main results. More specifically, in Section 2.1 we consider IID 0-1 sequences, \(0<p<1\), and in Section 2.2 we restrict our study to symmetric, \(p=1/2\), IID 0-1 sequences. IID \(0-1\) sequences are of particular importance in studies of applied probability and statistics because of the simplicity of their internal structure and their help in understanding the notion of randomness.
2.1 IID 0-1 Sequences, \(0<p<1\)
We consider a finite \(0-1\) sequence \(\{X_{i}\}_{i=1}^{n}\) of length n, \(n>0\), of IID RVs with a common probability of 1s p, \(0<p<1\); i.e. \(p=P(X_{i}=1)\), \(q=P(X_{i}=0)\), \(i=1,2,\ldots ,n\), \(p+q=1\). For \(0\le d\le k\le n\), \(n>0\), we denote as \(f_{n,d,k}(x;p)=P(X_{n,d,k}=x)\), \(g_{n,d,k}(x,s;p)=P(X_{n,d,k}=x, S_{n}=s)\) and \(h_{n,d,k}(x,s)=P(X_{n,d,k}=x\mid S_{n}=s)\)
the PMF of \(X_{n,d,k}\), the JPMF of \(X_{n,d,k}\) and \(S_{n}\) and the CPMF of \(X_{n,d,k}\) given \(S_{n}\), respectively, and
The exact distributions \(f_{n,d,k}(x;p)\), \(g_{n,d,k}(x,s;p)\) and \(h_{n,d,k}(x,s)\) are established via enumerative combinatorics. To that end we provide proper combinatorial numbers.
Lemma 1
The number of allocations of \(\alpha \) indistinguishable balls into \(\beta \) distinguishable urns with each of the \(m_{j}\) specified urns having capacity (i.e. can be occupied by at most) \(\gamma _{j}\), \(j=1,2,3\), balls is given by
with \(\delta _{1}=\lfloor \frac{\alpha }{\gamma _{1}+1}\rfloor \), \( \delta _{2}=\lfloor \frac{\alpha -(\gamma _{1}+1)j_{1}}{\gamma _{2}+1}\rfloor \), \( \delta _{3}=\lfloor \frac{\alpha -(\gamma _{1}+1)j_{1}-(\gamma _{2}+1)j_{2}}{\gamma _{3}+1}\rfloor \), \(m_{j}\ge 0\), \(\gamma _{j}\ge 0\), \(j=1,2,3\), \(0\le \sum _{j=1}^{3}m_{j}\le \beta \). Equivalently, \( H_{m_{1},m_{2},m_{3}}(\alpha ,\beta ,\gamma _{1},\gamma _{2},\gamma _{3}) \) gives the number of integer solutions of the equation \(x_{1}+x_{2}+\ldots +x_{\beta }=\alpha \) such that \(0\le x_{j}\le \gamma _{1}\), \(1\le j\le m_{1}\), \(0\le x_{j}\le \gamma _{2}\), \(m_{1}+1\le j\le m_{1}+m_{2}\), \(0\le x_{j}\le \gamma _{3}\), \(m_{1}+m_{2}+1\le j\le m_{1}+m_{2}+m_{3}\).
Proof
It follows by expanding the generating function
of \(H_{m_{1},m_{2},m_{3}}(\alpha ,\beta ,\gamma _{1},\gamma _{2},\gamma _{3})\). \( \diamondsuit \)
Corollary 1
(Makri et al. 2007) The coefficient
is the number of allocations of \(\alpha \) indistinguishable balls into \(\beta \) distinguishable urns where each of r, \(0\le r\le \beta \), specified urns is occupied by at most \(\gamma \), \(\gamma \ge 0\), balls. \(\diamondsuit \)
Remark 2
By Corollary 1, for \(\gamma =0\) we get
which is the number of allocations of \(\alpha \) indistinguishable balls into \(\beta -r\) distinguishable urns with no restrictions. \(\diamondsuit \)
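The counting numbers of Lemma 1 and Corollary 1 can be checked numerically on small cases. The sketch below rests on two assumptions that are standard facts rather than quotations from the paper: the generating function of the allocations is the product of one factor \(1+t+\cdots +t^{\gamma }\) per capacity-limited urn and one factor \((1-t)^{-1}\) per unrestricted urn, and the case of r urns with a common capacity \(\gamma \) admits the usual inclusion-exclusion sum. All function names are hypothetical.

```python
from itertools import product
from math import comb


def allocations_gf(alpha, capacities, free_urns):
    """Coefficient of t**alpha in prod_c (1 + t + ... + t**c) * (1 - t)**(-free_urns)."""
    coeffs = [1] + [0] * alpha                    # polynomial truncated at degree alpha
    for cap in capacities:                        # one factor per capacity-limited urn
        coeffs = [sum(coeffs[t - m] for m in range(min(cap, t) + 1))
                  for t in range(alpha + 1)]
    for _ in range(free_urns):                    # multiplying by 1/(1 - t) is a running sum
        for t in range(1, alpha + 1):
            coeffs[t] += coeffs[t - 1]
    return coeffs[alpha]


def allocations_incl_excl(alpha, beta, r, gamma):
    """Inclusion-exclusion count when r of the beta >= 1 urns hold at most gamma balls."""
    return sum((-1) ** j * comb(r, j) * comb(alpha - (gamma + 1) * j + beta - 1, beta - 1)
               for j in range(alpha // (gamma + 1) + 1))


def allocations_brute(alpha, capacities, free_urns):
    """Direct listing, usable only for small alpha and few urns."""
    limits = list(capacities) + [alpha] * free_urns
    return sum(1 for x in product(*(range(min(c, alpha) + 1) for c in limits))
               if sum(x) == alpha)


# 5 balls, 4 urns, 2 of them with capacity 1
print(allocations_gf(5, [1, 1], 2), allocations_incl_excl(5, 4, 2, 1))
```

For \(\gamma =0\) the restricted urns stay empty, so the inclusion-exclusion sum collapses to the unrestricted count over the remaining \(\beta -r\) urns, as stated in Remark 2.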
We now state our probabilistic results.
Theorem 1
The JPMF \(g_{n,d,k}(x,s;p)\) is given by
with \(\alpha _{1}=n-s-(s+1-x)(k+1)\) and \(\alpha _{2}=n-s-(x-i)d-(s+1-j-x)(k+1)\).
Proof
For \(s\ge 1\), \(d>0\), we first observe that the length of the 0-run in a pattern of the form \(1\underbrace{00\ldots 0}_{\ge 1}1\) may be greater than \(d-1\) and less than \(k+1\) or less than d or greater than k. For \(x\in \Lambda _{n,d,k}\), we obtain the probability of the event \((X_{n,d,k}=x, S_{n}=s)\) by visualizing the problem as a model of allocation of \(n-s\) balls (0s) into the \(s+1\) urns, formed by the s 1s, in the following way. Each of i, \(i=0,1,2\), of them (the external ones created before the first and after the last 1), chosen in \({2\atopwithdelims ()i}\) ways, receives at most k 0s, each of \(x-i\) of them, chosen in \({s-1\atopwithdelims ()x-i}\) ways, receives at least d and at most k 0s, each of j, \(0\le j\le s-1-(x-i)\) of them, chosen in \({s-1-(x-i)\atopwithdelims ()j}\) ways, receives no more than \(d-1\) 0s and each of the remaining \(s+1-x-j\) urns receives at least \(k+1\) 0s. Noting that all sequences in \((X_{n,d,k}=x, S_{n}=s)\) have probability \(\pi _{n}(s)=p^{s}(1-p)^{n-s}\) and summing with respect to i and j we get the result by the multiplication rule and Lemma 1. The case \(s\ge 1\), \(d=0\), follows in a similar way by using Corollary 1, while the case \(s=0\), \(d\ge 0\) is straightforward. \(\diamondsuit \)
Corollary 2
The CPMF \(h_{n,d,k}(x,s)\), is given by
with \(\alpha _{1}\) and \(\alpha _{2}\) as in Theorem 1.
Theorem 2
The PMF \(f_{n,d,k}(x;p)\), \(x\in \Lambda _{n,d,k}\), is given by
Proof
Summing the JPMF \(g_{n,d,k}(x,s;p)\), with respect to s, \(s\ge 1\), and noting that \(P(X_{n,d,k}=x, S_{n}=0)=q^{n}\delta _{x,0}\), the result follows. \(\diamondsuit \)
Using setup (1)–(3), we next provide a simple explicit formula for the mean of \(X_{n,d,k}\).
Theorem 3
The mean value \(E(X_{n,d,k};p)\), is given by
Proof
By means of Eq. (1) we have that \(E(X_{n,d,k})=\sum _{j=1}^{n}E(Y_{j})\). Then, for \(k\le n-2\), by Eq. (3) and the independence of the sequence \(\{X_{i}\}_{i=1}^{n}\) it is clear that \(E(Y_{j})=pq^{j-1}\), \(j=1,2,\ldots ,d+1\), \(E(Y_{j})=pq^{d}\), \(j=d+2,d+3,\ldots , k+1\), \(E(Y_{j})=pq^{d}-pq^{k+1}\), \(j=k+2,\ldots ,n-1\) and \(E(Y_{n})=1-q^{k+1}+pq^{d}-pq^{k+1}\). The result then follows after some algebraic manipulations. For \(k=n-1,n\), \(E(X_{n,d,k})\) is obtained, by means of (2), in a similar way. \(\diamondsuit \)
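The algebra referred to in the proof can be written out for the case \(k\le n-2\). The following is a worked check obtained by adding the terms \(E(Y_{j})\) listed above, using \(\sum _{j=1}^{d+1}pq^{j-1}=1-q^{d+1}\); it is a reconstruction from those terms and is not quoted from the displayed formula of Theorem 3.

\[
\begin{aligned}
E(X_{n,d,k}) &= \sum _{j=1}^{d+1}pq^{j-1}+(k-d)pq^{d}+(n-k-1)(pq^{d}-pq^{k+1})+1-q^{k+1}\\
&= 2-q^{d+1}-q^{k+1}+(n-d-1)pq^{d}-(n-k-1)pq^{k+1}.
\end{aligned}
\]

For \(n=4\), \(d=1\), \(k=2\) and \(p=1/2\) this gives \(2-1/4-1/8+1/2-1/16=33/16=2.0625\), which agrees with the mean number of occurrences reported in Example 3.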
Example 2
Table 1 shows the observed numbers \(X_{25,d,k}=x_{25,d,k}\) of [d, k; 25] patterns occurring in the sequence 0100010000111011001001100 of Example 1 and the corresponding right-tailed p-values p-v\(=P(X_{25,d,k}\ge x_{25,d,k}\mid S_{25}=10)\), for \(0\le d\le k\le 4\). The depicted p-v’s, computed via Corollary 2, imply that the null hypothesis of randomness of the sequence is not rejected at the usual significance levels: the smallest p-v over \(0\le d\le k\le 4\) is \(25.05\%\), so rejecting the hypothesis would entail a type-I error probability of at least that size. \(\diamondsuit \)
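A conditional p-value of this kind can be computed without the closed form of Corollary 2 by carrying out the allocation argument of the proof of Theorem 1 numerically: the \(n-s\) 0s are distributed over the \(s+1\) urns formed by the s 1s, and each urn is scored according to whether it completes a pattern. The sketch below is such a dynamic-programming count, offered as an illustration under that assumption; the function names are hypothetical and the p-values of Table 1 are not reproduced here.

```python
from collections import Counter
from fractions import Fraction
from math import comb


def conditional_pmf(n, s, d, k):
    """Return {x: P(X = x | S_n = s)} for s >= 1 by allocating n - s 0s to s + 1 urns."""
    zeros = n - s
    states = Counter({(0, 0): 1})                 # (0s placed so far, patterns so far) -> ways
    for urn in range(s + 1):
        external = urn in (0, s)                  # urn before the first 1 or after the last 1
        nxt = Counter()
        for (used, x), ways in states.items():
            for m in range(zeros - used + 1):     # m 0s go into this urn
                hit = (m <= k) if external else (d <= m <= k)
                nxt[(used + m, x + int(hit))] += ways
        states = nxt
    total = comb(n, s)                            # equally likely sequences with s 1s
    return {x: Fraction(ways, total)
            for (used, x), ways in sorted(states.items()) if used == zeros}


def right_tail_p_value(n, s, d, k, observed):
    """P(X >= observed | S_n = s)."""
    return sum(p for x, p in conditional_pmf(n, s, d, k).items() if x >= observed)


# Example 1: n = 25, s = 10 and, e.g., the observed value X_{25,1,2} = 5
print(float(right_tail_p_value(25, 10, 1, 2, 5)))
```

Exact fractions are used so that the conditional probabilities sum to one without rounding error; converting to `float` is left to the caller.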
2.2 Symmetric IID 0-1 Sequences, \(p=1/2\)
In this section we consider a symmetric (\(p=1/2\)) finite IID 0-1 sequence of length n, \(n>0\), for which we obtain our results.
Since \(X_{n,d,k}\) is defined on the sample space \(\Omega =\{0,1\}^{n}=\{\omega =\omega _{1}\omega _{2}\ldots \omega _{n}:\omega _{i}\in \{0,1\}, 1\le i\le n\}\) with \(\mid \Omega \mid =2^{n}\) and for \(p=1/2\) all \(2^{n}\) 0-1 sequences of length n are equiprobable, the classical definition of probability implies
\[f_{n,d,k}(x;1/2)=\frac{Q_{n,d,k}(x)}{2^{n}},\quad x\in \Lambda _{n,d,k}. \qquad (8)\]
\(Q_{n,d,k}(x)\) is the number of 0-1 sequences of length n with exactly x, \(x\in \Lambda _{n,d,k}\), [d, k; n] patterns among all \(2^{n}\) equiprobable 0-1 sequences of length n. More specifically,
\[Q_{n,d,k}(x)=\mid \Gamma _{n,d,k}(x)\mid , \qquad (9)\]
where
\[\Gamma _{n,d,k}(x)=\{\omega \in \Omega : X_{n,d,k}(\omega )=x\}, \qquad (10)\]
i.e. \(\Gamma _{n,d,k}(x)\) is the set of sequences with exactly x [d, k; n] patterns, so that
\[Q_{n,d,k}(x)=\sum _{\omega \in \Omega }I_{\Gamma _{n,d,k}(x)}(\omega ), \qquad (11)\]
with the values of \(X_{n,d,k}(\omega )\), \(\omega \in \Omega \), determined via (1)–(3).
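Let \(R_{n,d,k}\) be the (total) number of occurrences of all [d, k; n] patterns in all \(2^{n}\) equiprobable 0-1 sequences of length n. \(R_{n,d,k}\) can be defined as
\[R_{n,d,k}=\sum _{\omega \in \Omega }X_{n,d,k}(\omega )=\sum _{x\in \Lambda _{n,d,k}}xQ_{n,d,k}(x). \qquad (12)\]
Accordingly, since it holds
\[E(X_{n,d,k};1/2)=\sum _{x\in \Lambda _{n,d,k}}xf_{n,d,k}(x;1/2),\]
it follows that
\[R_{n,d,k}=2^{n}E(X_{n,d,k};1/2). \qquad (13)\]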
Consequently, by (7) and (13), we get the explicit expression
Therefore, \(R_{n,d,k}\) can explicitly and efficiently be computed by (14) even by using an ordinary hand calculator.
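As a numerical illustration, \(R_{n,d,k}\) can be obtained as \(2^{n}\) times the mean for \(p=1/2\), with the mean accumulated term by term from the expectations \(E(Y_{j})\) listed in the proof of Theorem 3. The sketch below covers only the case \(0\le d\le k\le n-2\) treated explicitly in that proof; the function names are assumptions, and the result can be compared with the values 33 and 229264 quoted in Examples 3 and 5.

```python
from fractions import Fraction


def mean_patterns(n, d, k, p):
    """E(X_{n,d,k}) for IID trials, summed term by term; covers 0 <= d <= k <= n - 2 only."""
    if not 0 <= d <= k <= n - 2:
        raise ValueError("this sketch covers only 0 <= d <= k <= n - 2")
    p = Fraction(p)
    q = 1 - p
    total = Fraction(0)
    for j in range(1, n + 1):
        if j <= d + 1:                            # only a leading pattern can end at j
            total += p * q ** (j - 1)
        elif j <= k + 1:
            total += p * q ** d
        else:                                     # j = k + 2, ..., n
            total += p * q ** d - p * q ** (k + 1)
    total += 1 - q ** (k + 1)                     # extra part of E(Y_n): the trailing pattern
    return total


def total_occurrences(n, d, k):
    """R_{n,d,k} = 2**n * E(X_{n,d,k}; 1/2)."""
    return 2 ** n * mean_patterns(n, d, k, Fraction(1, 2))


print(total_occurrences(4, 1, 2), total_occurrences(16, 2, 10))
```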
The numbers \(Q_{n,d,k}(x)\) and \(R_{n,d,k}\) may be useful in digital communication and data storage applications. For analogous numbers defined for several other run counting RVs as well as for their importance in the aforementioned application areas, see e.g. Sinha and Sinha (2009, 2012) and Makri and Psillakis (2011).
Equations (9)–(12) imply that, to determine the numbers \(Q_{n,d,k}(x)\) and \(R_{n,d,k}\) empirically, we have to generate by computer all \(2^{n}\) equiprobable 0-1 sequences of length n and then count on them. This procedure, i.e. first listing and then counting, although possible and useful in several applications, does not determine \(Q_{n,d,k}(x)\) and \(R_{n,d,k}\) theoretically. The combinatorial analysis followed here does, since it counts arrangements without listing them.
Example 3
In order to clarify \(\Lambda _{n,d,k}\), \(\Gamma _{n,d,k}(x)\), \(Q_{n,d,k}(x)\), \(x\in \Lambda _{n,d,k}\) and \(R_{n,d,k}\), let us consider \(n=4\), \(d=1\) and \(k=2\). Table 2 depicts them. From the first column of the table we see that \(\Lambda _{n,d,k}=\{0,1,2,3\}\). Furthermore, for instance, the second line of the table implies that among all \(2^{4}=16\) equiprobable 0-1 sequences of length 4, there are \(Q_{4,1,2}(1)=2\) sequences which contain exactly 1 [1, 2; 4]; i.e. \(\Gamma _{4,1,2}(1)=\{0001,1000\}\). Moreover, the number of occurrences of all [1, 2; 4] in all 16 equiprobable 0-1 sequences of length 4 is \(R_{4,1,2}=33\). That is, in all 16 equiprobable 0-1 sequences of length 4, [1, 2; 4] patterns occur 33 times, so that the mean number of their occurrence is 2.0625.
In addition, in Table 3 we compute by (8) and (14) and present \(Q_{n,d,k}(x)\), \(x\in \Lambda _{n,d,k}\) and \(R_{n,d,k}\), for \(n=4\), \(0\le d\le k\le n\). The entries of the table show the entire distribution of these numbers for \(n=4\) which has been chosen small because of space limitation. Similar results can be easily computed for any values of n, d and k. \(\diamondsuit \)
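For small n the listing-and-counting procedure mentioned before Example 3 is easy to carry out. The sketch below enumerates all \(2^{n}\) words and tabulates \(Q_{n,d,k}(x)\) and \(R_{n,d,k}\) by brute force; it illustrates the definitions and is not the closed-form route of (8) and (14). The helper names are assumptions.

```python
from collections import Counter
from itertools import product


def count_patterns(word, d, k):
    """X_{n,d,k} for a 0-1 string, read off the 0-runs delimited by the 1s."""
    runs = word.split("1")                        # leading run, internal runs, trailing run
    if len(runs) == 1:                            # no 1 in the word
        return 0
    return (int(len(runs[0]) <= k) + int(len(runs[-1]) <= k)
            + sum(d <= len(r) <= k for r in runs[1:-1]))


def census(n, d, k):
    """Return (Q, R): Q[x] = number of words with exactly x patterns, R = total occurrences."""
    q = Counter(count_patterns("".join(bits), d, k) for bits in product("01", repeat=n))
    r = sum(x * m for x, m in q.items())
    return dict(sorted(q.items())), r


print(census(4, 1, 2))
```

The cost grows as \(2^{n}\), so this check is practical only for short sequences.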
3 Extended Results - Exchangeable 0-1 Trials
In this section we derive the JPMF of \(X_{n,d,k}\) and \(S_{n}\), the CPMF of \(X_{n,d,k}\) given \(S_{n}\) and the PMF of \(X_{n,d,k}\) defined on an exchangeable sequence. To that end, the corresponding results of Section 2.1 are extended in a simple way to study \(X_{n,d,k}\) under the assumption of exchangeable (symmetrically dependent) 0-1 trials, which is weaker than the IID assumption.
Exchangeability assumes that the joint distribution of the 0-1 \(X_{i}\)s is symmetric. For such sequences, for any \(n>0\) and any vector \((x_{1},x_{2},\ldots ,x_{n})\), \(x_{i}\in \{0,1\}\), it holds
\[P(X_{1}=x_{1},X_{2}=x_{2},\ldots ,X_{n}=x_{n})=P(X_{\alpha _{1}}=x_{1},X_{\alpha _{2}}=x_{2},\ldots ,X_{\alpha _{n}}=x_{n})\]
for any permutation \((\alpha _{1},\alpha _{2},\ldots ,\alpha _{n})\) of the index set \(\{1,2,\ldots ,n\}\).
Accordingly, the key point (see Makri and Psillakis 2013, 2019) that makes feasible an extension from IID to EXCH sequences is that under exchangeability all finite 0-1 sequences with the same length and the same number of 1s are equally likely. That is, because of exchangeability any 0-1 \(\{X_{i}\}_{i=1}^{n}\) with s 1s and \(n-s\) 0s has probability \(p_{n}(s)\) of occurrence, which in the case of IID trials becomes \(\pi _{n}(s)=p^{s}q^{n-s}\) (for a detailed proof see pp. 4–6 of Makri and Psillakis 2019).
Consequently, the JPMF of \(X_{n,d,k}\) and \(S_{n}\), and the marginal PMF of \(X_{n,d,k}\) defined on an EXCH sequence \(\{X_{i}\}_{i=1}^{n}\) with a given \(p_{n}(s)\), are directly given by Theorems 1 and 2 by replacing \(\pi _{n}(s)\) and \(\pi _{n}(0)\) with \(p_{n}(s)\) and \(p_{n}(0)\), respectively. Therefore, Corollary 2 provides the CPMF of \(X_{n,d,k}\) given \(S_{n}\), defined on an EXCH sequence with a given \(p_{n}(s)\), since the factor \(p_{n}(s)\) cancels in the ratio \(P(X_{n,d,k}=x, S_{n}=s)/P(S_{n}=s)\), where \(P(S_{n}=s)={n\atopwithdelims ()s}p_{n}(s)\).
Example 4
To further clarify the previous results, we next compute and present in Table 4 some numerics concerning PMFs and means of \(X_{n,d,k}\) defined on IID (Case I) and EXCH (Case II) sequences for \(n=5\) and \(d=0\), \(k=1\); \(d=1\), \(k=1,2,3\); \(d=2\), \(k=3\). The value \(n=5\) was chosen small so that the required calculations can also be carried out by a pocket calculator and thus to gain insight in the respective formulae. The sequences in Table 4 are as follows:
- Case I: An IID sequence with a common probability of 1s \(p=1/2\). Equivalently, this case corresponds to an EXCH sequence with \(p_{5}(s)=1/2^{5}\), \(0\le s\le 5\).
- Case II: An EXCH sequence with \(p_{5}(s)=\prod _{j=0}^{s-1}(4+j)\prod _{j=0}^{4-s}(4+j)/\prod _{j=0}^{4}(8+j)\), \(0\le s\le 5\). This case concerns a Polya-Eggenberger urn model with initially 4 white (1) and 4 black (0) balls and 1 additional ball of the same color (white or black) returned to the urn along with the ball drawn from the urn (cf. Johnson and Kotz 1977, pp. 176–178). \(\diamondsuit \)
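The two cases can be explored numerically by weighting every 0-1 word of length 5 with the probability \(p_{5}(s)\) attached to its number of 1s. The sketch below does this by exhaustive listing rather than through Theorems 1 and 2, and it does not reproduce the entries of Table 4; the function names and the default urn parameters (4 white, 4 black, 1 added ball, as in Case II) are written as assumptions of the illustration.

```python
from fractions import Fraction
from itertools import product
from math import prod


def polya_prob(n, s, white=4, black=4, added=1):
    """Probability of one specific sequence with s 1s under the Polya-Eggenberger urn."""
    num = (prod(white + added * j for j in range(s))
           * prod(black + added * j for j in range(n - s)))
    den = prod(white + black + added * j for j in range(n))
    return Fraction(num, den)


def iid_prob(n, s, p=Fraction(1, 2)):
    """Probability of one specific sequence with s 1s under IID trials."""
    return p ** s * (1 - p) ** (n - s)


def count_patterns(word, d, k):
    """X_{n,d,k} for a 0-1 string, read off the 0-runs delimited by the 1s."""
    runs = word.split("1")
    if len(runs) == 1:
        return 0
    return (int(len(runs[0]) <= k) + int(len(runs[-1]) <= k)
            + sum(d <= len(r) <= k for r in runs[1:-1]))


def pmf_and_mean(n, d, k, seq_prob):
    """PMF and mean of X_{n,d,k} when each word has probability seq_prob(n, number of 1s)."""
    pmf = {}
    for bits in product("01", repeat=n):
        word = "".join(bits)
        x = count_patterns(word, d, k)
        pmf[x] = pmf.get(x, 0) + seq_prob(n, word.count("1"))
    mean = sum(x * pr for x, pr in pmf.items())
    return dict(sorted(pmf.items())), mean


for name, model in [("Case I", iid_prob), ("Case II", polya_prob)]:
    print(name, pmf_and_mean(5, 1, 2, model))
```

Only the weight function changes between the two cases, which mirrors the replacement of \(\pi _{n}(s)\) by \(p_{n}(s)\) described above.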
4 An Application in Information Theory
Closely connected with [d, k; n] patterns are the (d, k)-constrained sequences of length n, \(n>0\). These sequences have been extensively used in digital communication, recording and storage of information since Shannon’s (1948) era, see e.g. Immink (2004).
Definition 3
We say that a sequence \(\{x_{i}\}_{i=1}^{n}\), \(x_{i}\in \{0,1\}\), \(n>0\), is a (d, k)-constrained or (d, k)-limited sequence and it is denoted as (d, k; n), \(0\le d<k\le n\), if it satisfies simultaneously the following two conditions:
- (c1) every 0-run has length at most k (k-constraint).
- (c2) two successive 1s are separated by a 0-run of length at least d (d-constraint).
A (0, n; n) sequence is an unconstrained 0-1 sequence. The set of all (d, k; n) sequences defines a (d, k)-code, with code words of length n, and is denoted as \(\Delta _{n}(d,k)\) with \(N_{n}(d,k)= |\Delta _{n}(d,k)|\). \(\diamondsuit \)
Remark 3
Notice that \(X_{n,d,k}\) enumerates [d, k; n] constrained patterns in an unconstrained (0, n; n) sequence and \(N_{n}(d,k)\) counts (d, k; n) constrained (among all \(2^{n}\)) sequences. \(\diamondsuit \)
The problem of determining \(N_{n}(d,k)\) goes back to Shannon. It is an interesting one since, besides its own merit as a counting problem, \(N_{n}(d,k)\) is used in the enumeration coding/decoding system of Tang and Bahl (1970) for noiseless encoding/decoding of m unconstrained source bits (binary digits, i.e. 0-1) to n (d, k)-constrained channel bits, such that
The exact calculation of \(N_{n}(d,k)\) has been given via a variety of methods, e.g. using a finite state sequential machine, recursion relations and generating function method. See, e.g. Shannon (1948), Franaszek (1970), Tang and Bahl (1970), Jacquet and Szpankowski (2006).
Complementary to the prementioned approaches, we next provide for the first time an alternative solution by easily determining the number \(N_{n}(d,k)\) as a direct byproduct of Theorem 1 for \(p=1/2\).
Corollary 3
For \(0\le d<k\le n\), \(N_{n}(d,k)\) is given by
Proof
By Definition 3 we observe that, for \(s\ge 1\), an element of the event \((X_{n,d,k}=s+1, S_{n}=s)\) is a (d, k)-constrained sequence with s 1s and \(n-s\) 0 s each having probability \(\pi _{n}(s)\). Consequently, for \(p=1/2\), the number of such sequences equals \(2^{n}P(X_{n,d,k}=s+1, S_{n}=s)=2^{n}g_{n,d,k}(s+1,s;1/2)\). Summing with respect to s and noting that for \(n=k\) a sequence with no 1s is a (d, k)-constrained sequence too, we get the result. \(\diamondsuit \)
Example 5
To have a view of the numbers \(N_{n}(d,k)\) as well as of the sets \(\Delta _{n}(d,k)\) we compute and depict a complete chart of them in Table 5 for \(n=4\), \(0\le d<k\le n\). For instance, let us consider \(d=1\), \(k=2\). Then, the corresponding entries of the table indicate that among all 16 0-1 sequences of length 4, i.e. the elements of \(\Delta _{4}(0,4)\), there are only 5 (1, 2; 4) sequences, i.e. \(\Delta _{4}(1,2)=\{0010, 0100, 0101, 1001, 1010\}\).
Furthermore, to get a sense of the magnitude of \(N_{n}(d,k)\) compared to \(2^{n}\) and \(R_{n,d,k}\), even for a moderate n, we consider for instance \((d,k)=(2,10)\) and \(n=16\). The (2, 10) code is employed by commercial CDs and DVDs. Then we have: \(2^{16}=65536\), \(R_{16,2,10}=229264\) and \(N_{16}(2,10)=566\). As expected, \(N_{16}(2,10)\) is a small fraction of both \(2^{16}\) and \(R_{16,2,10}\); namely \(N_{16}(2,10)/2^{16}\simeq 0.0086\), \(N_{16}(2,10)/R_{16,2,10}\simeq 0.0025\). The first fraction implies that among all 65536 available words of length 16 only 566 are proper (2, 10)-code words of length 16. Accordingly, using for instance the enumeration coding/decoding system of Tang and Bahl (1970), the \(2^{9}=512\) unconstrained (0, 9; 9) 0-1 source words of length 9 can be coded by the first 512 (2, 10; 16) 0-1 code words of length 16, ordered according to their decimal representation, and vice versa. Some of the remaining \(N_{16}(2,10)-2^{9}=54\) (2, 10)-code words might be used as special patterns for checking or error detection procedures, as they obey the specified (2, 10) constraints.
Next, in order to give an even more clear view of the prementioned comments we return to the case \(n=4\), \(d=1\) and \(k=2\). Consequently, using again the prementioned enumerating coding/decoding scheme the \(2^{2}=4\) unconstrained (0, 2; 2) 0-1 source words of length 2, 00, 01, 10, 11, are coded as (1, 2; 4) 0-1 constrained code words of length 4, 0010, 0100, 0101, 1001, respectively, and vice versa. The code word 1010, i.e. the unused one of the set \(\Delta _{4}(1,2)\), might be used for error checking of the coding/decoding procedure. \(\diamondsuit \)
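The sets \(\Delta _{n}(d,k)\) and the source-to-code-word assignment of Example 5 can be reproduced for small n by listing. The sketch below checks conditions (c1) and (c2) of Definition 3 directly and then pairs the source words with the first code words in increasing binary order. It uses a plain lookup table, which is a stand-in chosen for illustration and not the enumeration algorithm of Tang and Bahl (1970), nor the closed form of Corollary 3; all function names are hypothetical.

```python
from itertools import product


def is_dk_constrained(word, d, k):
    """Check (c1) and (c2) of Definition 3 on a 0-1 string."""
    runs = word.split("1")                        # leading, internal and trailing 0-runs
    return (all(len(r) <= k for r in runs)        # (c1): every 0-run has length <= k
            and all(len(r) >= d for r in runs[1:-1]))   # (c2): successive 1s >= d zeros apart


def code_words(n, d, k):
    """All (d, k; n) sequences, in increasing order of their binary value."""
    return [w for w in ("".join(b) for b in product("01", repeat=n))
            if is_dk_constrained(w, d, k)]


def lookup_tables(m, n, d, k):
    """Pair the 2**m source words with the first 2**m code words (plain table lookup)."""
    words = code_words(n, d, k)
    if 2 ** m > len(words):
        raise ValueError("not enough (d, k; n) code words for m source bits")
    encode = {format(i, f"0{m}b"): words[i] for i in range(2 ** m)}
    decode = {code: source for source, code in encode.items()}
    return encode, decode


print(code_words(4, 1, 2))
print(len(code_words(16, 2, 10)))
print(lookup_tables(2, 4, 1, 2)[0])
```

A table of this kind needs memory proportional to the number of source words, which is the practical reason enumeration schemes are preferred for longer code words.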
5 Conclusions and Future Study
In this article we introduced and studied the discrete non-negative statistic \(X_{n,d,k}\) counting on a finite 0-1 sequence, \(\{X_{i}\}_{i=1}^{n}\), [d, k; n] constrained patterns. Exact JPMF of \(X_{n,d,k}\) and the number of 1 s \(S_{n}\), CPMF of \(X_{n,d,k}\) given \(S_{n}\) and PMF of \(X_{n,d,k}\) are obtained, via combinatorial analysis, in closed forms containing binomial coefficients when the statistics are defined on IID and EXCH sequences. An explicit expression of \(E(X_{n,d,k})\) for IID sequences is derived. As an application of the present study, the number \(N_{n}(d,k)\) of (d, k; n) constrained sequences used in coding/decoding schemes in Information theory is directly obtained.
Future work, including some potential applications of [d, k; n] patterns, might address the following:
- (a) Statistical estimation and inference procedures concerning the probability of 1s p in IID 0-1 sequences. These procedures can be implemented by simulation along with numerical techniques.
- (b) Determination of exact distributions of \(X_{n,d,k}\) or \(X_{n,d,k}\) given \(S_{n}\) when the 0-1 sequences are independent but not necessarily identically distributed (INID), Markov dependent (MRKV) or partially exchangeable (PEXCH). As an application, the distribution of these statistics defined, e.g. on a sequence of MRKV 0-1 trials, can be used to determine the chance that a stochastic process remains in statistical control, i.e. inside an acceptable zone of interest. The distributions are also useful when these statistics serve as test statistics in hypothesis testing of model sequences. The model sequences may be EXCH, INID, MRKV or PEXCH as possible alternatives to the IID sequence of the null hypothesis.
- (c) Analysis of real-life data files (binary or converted to binary), e.g. computer and communication networks, financial market indices, multimedia applications and social media networks and files. The analysis, based on \(X_{n,d,k}\) and \(N_{n}(d,k)\) as well as on \(Q_{n,d,k}(x)\), \(x\in \Lambda _{n,d,k}\) and \(R_{n,d,k}\), might be helpful in understanding whether the distribution of 0s and 1s in such data files is random or whether some patterns are formed. In the latter case a probabilistic prediction of their future behavior might be feasible and useful.
Finally, an interesting subject to work with is the study of limiting/approximate distributions of \(X_{n,d,k}\) defined on IID and MRKV \(0-1\) trials. One might examine if a Poisson and/or a compound Poisson approximation are appropriate for IID and MRKV trials. For the latter case, the papers of Erhardsson (1999) as well as Erhardsson (2000) that improves earlier results of Geske et al. (1995) could prove to be particularly helpful.
To that end, one might define an appropriate embedded Markov chain on a proper state space via which the under study RV can be expressed. Then applying theoretical results, derived by the same authors, one might get a compound Poisson approximation to the distribution of \(X_{n,d,k}\) and a bound for the total variation distance between the two distributions.
Availability of Data and Material
Not applicable.
References
Balakrishnan N, Koutras MV (2002) Runs and scans with applications. Wiley, New York
Dafnis SD, Makri FS (2022) Weak runs in sequences of binary trials. Metrika 85:573–603
Dafnis SD, Makri FS, Koutras MV (2021) Generalizations of runs and patterns distributions for sequences of binary trials. Methodol Comput Appl Probab 23:165–185
Dafnis SD, Philippou AN, Antzoulakos DL (2012) Distributions of patterns of two successes separated by a string of \(k-2\) failures. Stat Pap 53:323–344
Erhardsson T (1999) Compound Poisson approximation for Markov chains using Stein’s method. Ann Probab 27:565–596
Erhardsson T (2000) Compound Poisson approximation for counts of rare patterns in Markov Chains and extreme sojourns in birth-death chains. Ann Appl Probab 10:573–591
Eryilmaz S (2019) Statistical inference for a class of start-up demonstration tests. J Qual Technol 51:314–324
Eryilmaz S, Zuo M (2010) Constrained \((k, d)\)-out-of-\(n\) systems. Int J Syst Sci 41:679–685
Feller W (1968) An introduction to probability theory and its applications, vol I, 3rd edn. Wiley, New York
Franaszek PA (1970) Sequence-state methods for run-length-limited coding. IBM J Res Dev 14:376–383
Fu JC, Lou WYW (2003) Distribution theory of runs and patterns and its applications: a finite Markov chain imbedding approach. World Scientific, River Edge
Gera A (2018) Simultaneous demonstration tests involving sparse failures. Statist Probab Lett 135:26–31
Gera A (2021) From runs to patterns. Commun Stat Simul Comput 50:4300–4314
Geske MX, Godbole AP, Schaffner AA, Scolnick AM, Wallstrom GL (1995) Compound Poisson approximations for word patterns under Markovian hypotheses. J Appl Probab 32:877–892
Glaz J, Nauss J, Wallenstein S (2001) Scan statistics. Springer, New York
Holst L (2009) On consecutive records in certain binary sequences. J Appl Probab 46:1201–1208
Immink KAS (2004) Codes for mass data storage systems, 2nd edn. Shannon Foundation Publishers, Eindhoven, The Netherlands
Jacquet P, Szpankowski W (2006) On \((d,k)\) sequences not containing a given word. In: IEEE International Symposium on Information Theory (ISIT), Seattle, Jul 2006, pp 1486–1489
Johnson N, Kotz S (1977) Urn models and their applications. John Wiley, New York
Kumar AN, Upadhye NS (2019) Generalizations of distributions related to \((k_{1}, k_{2})\)-runs. Metrika 82:249–268
Ling KD (1988) On binomial distributions of order \(k\). Statist Probab Lett 6:247–250
Makri FS, Philippou AN, Psillakis ZM (2007) Success run statistics defined on an urn model. Adv Appl Probab 39:991–1019
Makri FS, Psillakis ZM (2011) On success runs of a fixed length in Bernoulli sequences: exact and asymptotic results. Comput Math with Appl 61:761–772
Makri FS, Psillakis ZM (2012) Counting certain binary strings. J Stat Plan Inference 142:908–924
Makri FS, Psillakis ZM (2013) Exact distributions of constrained \((k, l)\) strings of failures between subsequent successes. Stat Pap 54:783–806
Makri FS, Psillakis ZM (2017) On limited length binary strings with an application in statistical control. The Open Statistics & Probability Journal 8:1–6
Makri FS, Psillakis ZM (2019) On the exact distributions of pattern statistics for a sequence of binary trials: a combinatorial approach. In: Glaz J, Koutras MV (eds) Handbook of Scan Statistics, pp 1–20. https://doi.org/10.1007/978-1-4614-8414-1_48-1
Sen K, Goyal B (2004) Distributions of patterns of two failures separated by success runs of length \(k\). J Korean Stat Soc 33:35–58
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623–656
Sinha K, Sinha BP (2009) On the distribution of ones in binary strings. Comput Math with Appl 58:1816–1829
Sinha K, Sinha BP (2012) Energy efficient communication: understanding the distribution of runs in binary strings. In: 1st International Conference on Recent Advances in Information Technology (Rait-2012), pp 177–181
Stefanov VT, Szpankowski W (2007) Waiting time distributions for patterns occurrence in a constrained sequence. Discret Math Theor Comput Sci 9:305–320
Tang DT, Bahl LR (1970) Block codes for a class of constrained noiseless channels. Inf Control 17:436–461
Zhao X, Song Y, Lv Z (2022) Distributions of \((k_{1}, k_{2},\ldots, k_{m})\)-runs with multi-state trials. Methodol Comput Appl Probab 24:2689–2702
Acknowledgements
The authors wish to thank the anonymous reviewers for their thorough reading and for their useful comments and suggestions, which helped to improve the article.
Funding
Open access funding provided by HEAL-Link Greece. No funding was obtained for this study.
Author information
Contributions
Frosso S. Makri and Zaharias M. Psillakis wrote the whole manuscript and both reviewed it.
Ethics declarations
Competing Interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Makri, F.S., Psillakis, Z.M. Distribution of Patterns of Constrained Length in Binary Sequences. Methodol Comput Appl Probab 25, 90 (2023). https://doi.org/10.1007/s11009-023-10068-5
DOI: https://doi.org/10.1007/s11009-023-10068-5
Keywords
- Constrained binary patterns and codes
- Runs
- Exact distributions
- Independent and identical trials
- Exchangeable trials
- Combinatorial analysis