Japanese Journal of Statistics and Data Science, Volume 2, Issue 2, pp 591–613

# An improved analysis of least squares superposition codes with Bernoulli dictionary

• Yoshinari Takeishi
• Jun’ichi Takeuchi
Original Paper Information Theory and Statistics

## Abstract

For the additive white Gaussian noise channel with average power constraint, sparse superposition codes (or sparse regression codes), proposed by Barron and Joseph in 2010, achieve the capacity. While the codewords of the original sparse superposition codes are made with a dictionary matrix drawn from a Gaussian distribution, we consider the case that it is drawn from a Bernoulli distribution. We show an improved upper bound on its block error probability with least squares decoding, which is considerably simpler and tighter than our previous result in 2014.

## Keywords

Channel coding theorem, Euler–Maclaurin formula, Exponential error bounds, Gaussian channel, Sparse superposition codes

## 1 Introduction

We analyze the error probability of sparse superposition codes (Barron and Joseph 2010a, b; Joseph and Barron 2012, 2014; Venkataramanan et al. 2019) with Bernoulli dictionary and least squares decoding, which ignores computational complexity. In this paper, we improve the upper bound of the error probability shown in Takeishi et al. (2014). The obtained bound is tighter and is in a simpler form than the previous result.

In the field of information theory, after the invention of Turbo codes, which achieve communication rates near the channel capacity C with practical decoders, many efforts have been made to find error correction codes achieving C with efficient decoding algorithms. In 2009, Polar codes were invented by Arikan (2009); they are the first codes shown to achieve C with efficient decoders. Today, at least three kinds of error correction codes are known to achieve C with efficient decoders: Polar codes, spatially coupled low density parity check (LDPC) codes (Kudekar et al. 2011), and sparse superposition codes (Barron and Joseph 2010a, b). Here, “achieving channel capacity C” means that the decoding error probability converges to zero for any rate $$R < C$$ as the code length n goes to infinity (R is the communication rate of the code). Among these codes, sparse superposition codes proposed by Barron and Joseph are designed for the Additive White Gaussian Noise (AWGN) channel and have been shown to achieve the capacity (Joseph and Barron 2014).

In the coding of sparse superposition codes, we prepare a real-valued matrix X (called the dictionary), where the number of its rows is smaller than that of its columns. Then, we map a message to be sent to a sparse vector $$\beta$$ and create a codeword c for the message as
\begin{aligned} c=X\beta . \end{aligned}
When this codeword is input to an AWGN channel, the received signal is
\begin{aligned} y=X\beta + \epsilon , \end{aligned}
where $$\epsilon$$ is an i.i.d. Gaussian noise vector. Hence, decoding of sparse superposition codes is done by estimating $$\beta$$ based on y and X, where the dictionary X is shared by the sender and the receiver. This decoding process is analogous to reconstruction in compressed sensing and sparse learning, and our analysis is related to these subjects.
In the original sparse superposition codes, entries of the dictionary are independent samples of a Gaussian distribution with mean zero. Using this Gaussian dictionary, the distribution of each element of codewords is a Gaussian distribution with mean zero, which is the optimal input distribution for the AWGN channel. (See, for example, Cover and Thomas (2006) and Shannon (1948).) This is one of the key elements of sparse superposition codes, and the error probability with least squares decoding is shown to be
\begin{aligned} O\left( \exp \left\{ - (d (C-R)^2 -\log n/ n) n \right\} \right) , \end{aligned}
(1)
where d is a certain positive constant, n is the code length, R is the transmission rate, and C is the channel capacity (Joseph and Barron 2012). The bound (1) is exponentially small in n when R satisfies
\begin{aligned} |C-R| = \varOmega ((\log n)^{1/2}/ n^{1/2}). \end{aligned}
(2)
This is a strong result, but it is difficult to realize the Gaussian dictionary in a real device, since a Gaussian random variable can take arbitrarily large or small values. Takeishi et al. (2014) studied the case in which the entries of the dictionary are samples of the unbiased Bernoulli distribution. Namely, each entry of the dictionary takes only the values $$+1$$ or $$-1$$, each with probability 1/2. This is much simpler than the Gaussian dictionary, yet the distribution of each element of the codewords, which is a binomial distribution, is close to the Gaussian when the number of columns of the dictionary is large. In fact, they proved that the error probability with Bernoulli dictionary and least squares decoding is
\begin{aligned} O\left( \exp \left\{ - (d(C-R)^2 -\log n /n^{1/4}) n \right\} \right) . \end{aligned}
(3)
Although the above bound is worse than (1), it is exponentially small in n when R satisfies
\begin{aligned} |C-R| = \varOmega ((\log n)^{1/2}/ n^{1/8}). \end{aligned}
(4)
To show the above bound, Takeishi et al. (2014) analyzed the difference between the probabilities of a certain event in the cases where binomial and Gaussian distributions are respectively employed. For this task, evaluation of the sectional measurement is one of the important factors. However, Takeishi et al. (2014) employed a naive approximation, which turned out to be fairly loose. Concretely, we have found that employing the Euler–Maclaurin formula produces the best result to date. Then the above bound (3) is refined as
\begin{aligned} O\left( \exp \left\{ - (d(C-R)^2 -1 /n^{1/2}) n \right\} \right) . \end{aligned}
(5)
Comparing the above bound to (3), $$\log n /n^{1/4}$$ is reduced to $$1 /n^{1/2}$$. Further, we have refined the proofs of some lemmas used in Takeishi et al. (2014) into tighter and considerably simpler forms. Consequently, the condition (4) is improved to
\begin{aligned} |C-R| =\varOmega (1/ n^{1/4}). \end{aligned}
(6)
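As a quick check of how (6) follows from (5), here is a sketch that takes the constants displayed in (5) at face value; the symbol c below is introduced only for this note. The exponent in (5) is positive exactly when
\begin{aligned} d(C-R)^2> \frac{1}{n^{1/2}} \quad \Longleftrightarrow \quad |C-R| > \frac{1}{\sqrt{d}\, n^{1/4}}. \end{aligned}
Moreover, if $$|C-R|\ge c/n^{1/4}$$ with a constant $$c>1/\sqrt{d}$$, then $$1/n^{1/2}\le (C-R)^2/c^2$$ and hence
\begin{aligned} d(C-R)^2-\frac{1}{n^{1/2}} \ge \left( 1-\frac{1}{dc^2}\right) d(C-R)^2, \end{aligned}
so the bound still decays exponentially in $$n(C-R)^2$$. The same computation applied to (3) gives the threshold of order $$(\log n)^{1/2}/n^{1/8}$$ appearing in (4).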
In this paper, we treat the least squares decoder, which is optimal in terms of error probability but computationally intractable. Efficient decoding algorithms have also been studied (Barbier and Krzakala 2014, 2017; Barron and Cho 2012; Cho and Barron 2013; Joseph and Barron 2014; Rush et al. 2017; Rush and Venkataramanan 2018). For the efficient decoding algorithm of Joseph and Barron (2014), the block error probability is
\begin{aligned} O\left( \exp \left\{ -d (C_n-R)^2n \right\} \right) , \end{aligned}
(7)
where $$R< C_n < C$$ and
\begin{aligned} (C-C_n)/C = O(\log \log n/\log n). \end{aligned}
(8)
Although the above bound is exponentially small, there is a considerable gap between $$C_n$$ and C at practical code lengths. It is still an open problem to show that sparse superposition codes with Bernoulli dictionary achieve the capacity with efficient algorithms.

We review the sparse superposition codes in Sect. 2. In Sect. 3, we show the new upper bound of the error probability with Bernoulli dictionary. Section 4 provides proofs of some lemmas used in Sect. 3.

## 2 Sparse superposition codes

In this section, we review the sparse superposition codes and show the performance of Gaussian dictionary with the least squares estimator.

In the following, ‘$$\log$$’ denotes the logarithm of base 2 and ‘$$\ln$$’ denotes the natural logarithm. The Gaussian distribution with mean $$\mu$$ and variance $$\sigma ^2$$ is denoted by $$N(\mu , \sigma ^2)$$.

### 2.1 Problem setting

We consider communication via the AWGN channel. Assume that a message is a K bit string $$u\in \{0,1\}^K$$ and that it is generated from the uniform distribution on $$\{0,1 \}^K$$. We use a real-valued vector $$c \in \mathfrak {R}^n$$ as a codeword to send a message. The codeword c is corrupted by the Gaussian noise in the channel. Namely, letting $$Y \in \mathfrak {R}^n$$ be the output of the channel, we have
\begin{aligned} Y=c+\epsilon , \end{aligned}
where $$\epsilon$$ is a real vector of length n whose coordinates are independently distributed as $$N(0,\sigma ^2)$$. The power of c is defined as $$(1/n)\sum _{i=1}^n c_i^2$$ and it is constrained to be at most P on average. We also define the signal-to-noise ratio as $$v=P/\sigma ^2$$.
We consider the task of estimating the message u based on Y and X. Let $$\hat{u}$$ be the estimate of u. We call the event $$\hat{u}\ne u$$ “block error”. Further, we define the transmission rate R as K / n. It is desired that we transmit messages at large R with sufficiently small block error probability. It is well known that at all rates less than
\begin{aligned} C=\frac{1}{2}\log (1+v)\ \mathrm{(bit/transmission)}, \end{aligned}
we can transmit messages with arbitrarily small block error probability for sufficiently large n.

### 2.2 Coding

We state the coding method of sparse superposition codes. First, we map a message u into a coefficient vector $$\beta \in \{0,1\}^N$$ by a one to one function. The vector $$\beta$$ is split into L sections of size M and each section has one nonzero element and the other elements are all zero. Then the codeword c is formed as
\begin{aligned} c=X\beta =\beta _1 X_1+\beta _2 X_2+\cdots +\beta _N X_N, \end{aligned}
where X is an $$n\times N$$ matrix (dictionary) and $$X_j$$ is the jth column vector of X. Thus, c is a superposition of L column vectors of X, with exactly one column selected from each section. We illustrate an example of the coding method in Fig. 1.

In this paper, we set all nonzero elements to 1. On the other hand, for efficient decoding algorithms such as the adaptive successive decoder proposed in Joseph and Barron (2014), the nonzero elements decay exponentially across sections. We do not treat that case here.

In the original paper by Joseph and Barron (2012), each element of the dictionary X is independently drawn from N(0, P / L). This distribution is optimal for the random coding argument used to prove the channel coding theorem for the AWGN channel with average power constraint P (Shannon 1948). In this paper, in contrast, we analyze the case in which each entry of the dictionary is independently drawn as the following random variable:
\begin{aligned} X_{ij}= \left\{ \begin{array}{ll} -\sqrt{P/L}\ &{} (\mathrm{with\ probability}\ 1/2)\\ \sqrt{P/L}\ &{} (\mathrm{otherwise}) \end{array} \right. \end{aligned}
The parameters L, M, and N are selected so as to satisfy the following. The number of messages is $$2^K$$ according to our problem setting about u, and the number of codewords is $$M^L$$ according to the way of making $$\beta$$. Thus, we arrange $$2^K=M^L$$, equivalently, $$K = L\log M$$. According to the original paper by Joseph and Barron (2012), the value of M is set to be $$L^a$$ and the parameter a is referred to as section size rate. Then we have $$K=aL\log L$$ and $$n=(aL\log L)/R$$.
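The coding step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy sizes, and the representation of a message as one index per section are all assumptions made for the example.

```python
import math
import random

def make_bernoulli_dictionary(n, L, M, P, rng):
    """n x (L*M) matrix with i.i.d. entries +/- sqrt(P/L), each with probability 1/2."""
    a = math.sqrt(P / L)
    return [[a if rng.random() < 0.5 else -a for _ in range(L * M)] for _ in range(n)]

def message_to_beta(indices, M):
    """indices[s] in {0, ..., M-1} selects the single nonzero entry of section s."""
    beta = [0] * (len(indices) * M)
    for s, j in enumerate(indices):
        beta[s * M + j] = 1
    return beta

def encode(X, beta):
    """Codeword c = X beta, i.e. the sum of the L selected columns."""
    return [sum(x for x, b in zip(row, beta) if b) for row in X]

rng = random.Random(0)
L, M, n, P = 4, 8, 16, 2.0          # toy sizes, chosen only for illustration
X = make_bernoulli_dictionary(n, L, M, P, rng)
beta = message_to_beta([3, 0, 7, 5], M)
c = encode(X, beta)
K = L * math.log2(M)                # K = L log M message bits
R = K / n                           # rate in bits per transmission
power = sum(ci * ci for ci in c) / n  # empirical power; its expectation is P
```

Each coordinate of `c` is a sum of L independent values $$\pm \sqrt{P/L}$$, so it follows a shifted and scaled binomial distribution, which is the object compared with the Gaussian in the analysis.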

### 2.3 Decoding

We analyze the least squares estimator, which minimizes the error probability when computational complexity is ignored. From the received word Y and knowledge of the dictionary X, we estimate the original message u, or equivalently, the corresponding $$\beta$$.

Define a set $${{\mathcal {B}}}$$ as
\begin{aligned} {{\mathcal {B}}}=\{ \beta \in \{0,1\}^N| \beta \ \mathrm{has\ one\ }1\ \mathrm{in\ each\ section} \}. \end{aligned}
Then the least squares decoder $$\hat{\beta }$$ is denoted as
\begin{aligned} \hat{\beta }=\arg \min _{\beta \in {{\mathcal {B}}}}\Vert Y-X\beta \Vert ^2, \end{aligned}
where $$\Vert \cdot \Vert$$ denotes the Euclidean norm.

Let $$\beta ^*$$ denote the true $$\beta$$; then the event $$\hat{\beta }\ne \beta ^*$$ corresponds to the block error. Let #mistakes denote the number of sections in which the position of the nonzero element in $$\hat{\beta }$$ is different from that in $$\beta ^*$$. Define the error event $$\mathcal{E}_{\alpha _0}=\{\#{\text {mistakes}}\ge \alpha _0 L\},$$ that is, the event that the decoder makes mistakes in at least a fraction $$\alpha _0$$ of the sections. The fraction $$\alpha =\#{\text {mistakes}}/L$$ is called the section error rate.
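For very small L and M, the least squares decoder and the section error rate can be evaluated by exhaustive search. The following sketch does so. The helper names and the toy parameters are assumptions made for illustration, and the search over all $$M^L$$ candidates is exactly the computational cost that the analysis ignores.

```python
import itertools
import math
import random

def codeword(X, choice, M):
    """Sum of the selected columns; choice[s] is the index chosen in section s."""
    return [sum(row[s * M + j] for s, j in enumerate(choice)) for row in X]

def ls_decode(Y, X, L, M):
    """Exhaustive least squares over all M**L allowed choices (toy sizes only)."""
    best, best_cost = None, math.inf
    for choice in itertools.product(range(M), repeat=L):
        cost = sum((y - c) ** 2 for y, c in zip(Y, codeword(X, choice, M)))
        if cost < best_cost:
            best, best_cost = choice, cost
    return best, best_cost

def section_error_rate(sent, decoded):
    """Fraction of sections whose selected index differs (#mistakes / L)."""
    return sum(a != b for a, b in zip(sent, decoded)) / len(sent)

rng = random.Random(1)
L, M, n, P, sigma = 3, 4, 12, 3.0, 0.5   # illustrative values, not from the paper
a = math.sqrt(P / L)
X = [[a if rng.random() < 0.5 else -a for _ in range(L * M)] for _ in range(n)]
sent = (2, 0, 3)
Y = [c + rng.gauss(0.0, sigma) for c in codeword(X, sent, M)]
decoded, cost = ls_decode(Y, X, L, M)
alpha = section_error_rate(sent, decoded)  # the event E_{alpha_0} is alpha >= alpha_0
```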

### 2.4 Performance

It is proved in Joseph and Barron (2012) that given $$0<\alpha _0\le 1$$, the probability of the event $$\mathcal{E}_{\alpha _0}$$ is exponentially small in n. The following theorem (Proposition 1 in Joseph and Barron (2012)) provides an upper bound on the probability of the event $${{\mathcal {E}}}_{\alpha _0}$$, where
\begin{aligned} w_v=\frac{v}{[4(1+v)^2]\sqrt{1+(1/4)v^3/(1+v)}} \end{aligned}
and $$g(x)=\sqrt{1+4x^2}-1$$. It follows that
\begin{aligned} g(x)\ge \min \{\sqrt{2}x,x^2\}\ \ \mathrm{for\ all }\ \ x\ge 0. \end{aligned}
The definition of $$a_{v,L}$$ in the statement is given later as (12).

### Theorem 1

(Joseph and Barron 2012) Suppose that each entry of X is independently drawn from N(0, P / L). Assume $$M=L^a$$, where $$a\ge a_{v,L}$$, and the rate R is less than the capacity C. Then
\begin{aligned} \Pr [{{\mathcal {E}}}_{\alpha _0}]={\text {e}}^{-nE(\alpha _0,R)} \end{aligned}
with $$E(\alpha _0,R)\ge h(\alpha _0,C-R)-(\ln (2L))/n$$, where
\begin{aligned} h(\alpha ,\varDelta )=\min \left\{ \alpha w_v \varDelta ,\frac{1}{4}g\left( \frac{\varDelta }{2\sqrt{v}}\right) \right\} \end{aligned}
is evaluated at $$\alpha =\alpha _0$$ and $$\varDelta =C-R$$.
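The quantities in Theorem 1 are elementary to evaluate numerically. The sketch below implements $$w_v$$, g, h and the lower bound on the exponent as stated above, with rates in nats. The function names and the sample arguments are assumptions of this illustration, and the condition $$a\ge a_{v,L}$$ is not checked.

```python
import math

def w_v(v):
    """w_v = v / ([4 (1+v)^2] sqrt(1 + (1/4) v^3 / (1+v)))."""
    return v / (4 * (1 + v) ** 2 * math.sqrt(1 + 0.25 * v ** 3 / (1 + v)))

def g(x):
    """g(x) = sqrt(1 + 4 x^2) - 1."""
    return math.sqrt(1 + 4 * x * x) - 1

def h(alpha, delta, v):
    """h(alpha, Delta) = min{ alpha w_v Delta, (1/4) g(Delta / (2 sqrt(v))) }."""
    return min(alpha * w_v(v) * delta, 0.25 * g(delta / (2 * math.sqrt(v))))

def exponent_lower_bound(alpha0, R, v, L, n):
    """Lower bound h(alpha0, C - R) - ln(2L)/n on E(alpha0, R); R and C in nats."""
    C = 0.5 * math.log(1 + v)
    return h(alpha0, C - R, v) - math.log(2 * L) / n

# Illustrative call with arbitrary sample values (not taken from the paper).
bound = exponent_lower_bound(alpha0=0.1, R=0.2, v=1.0, L=64, n=2000)
```

The bound is informative only when it is positive, which requires n to be large relative to $$\ln (2L)/h(\alpha _0,C-R)$$.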

### Remark

In this theorem, the unit of R and C is nat/transmission. Then, since $$n=(aL\ln L)/R$$, L is bounded by nR / a when $$\ln L\ge 1$$.

As noted in Joseph and Barron (2012), in order to bound the block error probability, we can use composition with an outer Reed-Solomon (RS) code (Reed and Solomon 1960) of rate near one. If $$R_{\text {outer}}=1-\delta$$ is the rate of an RS code, with $$0<\delta <1$$, then section error rates less than $$\delta /2$$ can be corrected. Thus, through concatenation with an outer RS code, we get a code with rate $$(1-2\alpha _0)R$$ and block error probability less than or equal to $$\Pr [{{\mathcal {E}}}_{\alpha _0}]$$. Arrange as $$R=C-\varDelta$$ and $$\alpha _0=\varDelta$$, with $$\varDelta >0$$. Then the overall rate $$(1-2\varDelta )(C-\varDelta )$$ continues to have a drop from capacity of order $$\varDelta$$. The composite code has block error probability of order $$\exp \{-nd\varDelta ^2\}$$, where d is a positive constant.
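To make the order of the rate drop explicit, the overall rate can be expanded directly (a short worked step added for illustration):
\begin{aligned} C-(1-2\varDelta )(C-\varDelta ) = (1+2C)\varDelta -2\varDelta ^2 \le (1+2C)\varDelta , \end{aligned}
so the loss relative to capacity is at most a constant multiple of $$\varDelta$$.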

To prove Theorem 1, we evaluate the probability of the event $$E_l=\{\#{\text {mistakes}}=l\}$$ for $$l=1,2,\ldots ,L$$. The probability $$\Pr [E_l]$$ is used to evaluate
\begin{aligned} \Pr [{{\mathcal {E}}}_{\alpha _0}]=\sum _{l\ge \alpha _0 L}\Pr [E_l]. \end{aligned}
We introduce the function $$C_\alpha =(1/2)\ln (1+\alpha v)$$ for $$0\le \alpha \le 1$$. It equals the channel capacity C when $$\alpha =1$$. Then $$C_{\alpha }-\alpha C$$ is a nonnegative function which equals 0 when $$\alpha$$ is 0 or 1 and is strictly positive in between. Thus the quantity $$C_{\alpha }-\alpha R$$ is larger than $$\alpha (C-R)$$, which is positive when $$R<C$$.
For a positive $$\varDelta$$ and $$\rho \in [-1,1]$$, we define a quantity $$D(\varDelta ,1-\rho ^2)$$ as
\begin{aligned} D(\varDelta ,1-\rho ^2)=\max _{\lambda \ge 0} \Bigl \{ \lambda \varDelta + \frac{1}{2}\ln (1-\lambda ^2(1-\rho ^2)) \Bigr \} \end{aligned}
(9)
and $$D_1(\varDelta ,1-\rho ^2)$$ as
\begin{aligned} D_1(\varDelta ,1-\rho ^2)=\max _{0\le \lambda \le 1} \Bigl \{ \lambda \varDelta + \frac{1}{2}\ln (1-\lambda ^2(1-\rho ^2)) \Bigr \}. \end{aligned}
(10)
Note that these quantities are nonnegative.
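Both maximizations can be evaluated in closed form, since the objective is concave in $$\lambda$$ on $$\lambda ^2(1-\rho ^2)<1$$. Setting its derivative to zero gives the quadratic $$\varDelta c\lambda ^2+c\lambda -\varDelta =0$$ with $$c=1-\rho ^2$$. The sketch below uses this stationary point for D and clips it to [0, 1] for $$D_1$$. The function names are assumptions of this illustration, and it requires $$\varDelta >0$$ and $$0<c\le 1$$.

```python
import math

def objective(lam, delta, c):
    """lam * Delta + (1/2) ln(1 - lam^2 c), with c = 1 - rho^2."""
    return lam * delta + 0.5 * math.log(1 - lam * lam * c)

def lam_star(delta, c):
    """Positive root of Delta*c*lam^2 + c*lam - Delta = 0 (unconstrained maximizer)."""
    return (math.sqrt(1 + 4 * delta * delta / c) - 1) / (2 * delta)

def D(delta, c):
    """Maximum over lam >= 0, as in (9)."""
    return objective(lam_star(delta, c), delta, c)

def D1(delta, c):
    """Maximum over 0 <= lam <= 1, as in (10); concavity justifies the clipping."""
    return objective(min(lam_star(delta, c), 1.0), delta, c)
```

Since the objective vanishes at $$\lambda =0$$, both values are nonnegative, and $$D_1\le D$$ with equality whenever the unconstrained maximizer is at most 1.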

The following lemma (Lemma 4 in Joseph and Barron (2012)) provides an upper bound on $$\Pr [E_l]$$.

### Lemma 2

(Joseph and Barron 2012) Suppose that each entry of X is independently drawn from N(0, P/L). Let a positive integer $$l\le L$$ be given and let $$\alpha =l/L$$. Then, $$\Pr [E_l]$$ is bounded by the minimum for $$t_{\alpha }$$ in the interval $$[0,C_{\alpha }-\alpha R]$$ of $${\text {err}}_{\text {Gauss}}(\alpha )$$, where
\begin{aligned} {\text {err}}_{\text {Gauss}}(\alpha )&={_LC_{\alpha L}} \exp \{ -nD_1(\varDelta _{\alpha },1-\rho _1^2) \}\nonumber \\&\quad +\exp \{ -nD(t_{\alpha },1-\rho _2^2) \} \end{aligned}
(11)
with $$\varDelta _{\alpha }=C_{\alpha }-\alpha R-t_{\alpha }$$, $$1-\rho _1^2=\alpha (1-\alpha )v/(1+\alpha v)$$, and $$1-\rho _2^2=\alpha ^2 v/(1+\alpha ^2 v)$$.
To make (11) exponentially small, it is sufficient that the section size rate a is larger than
\begin{aligned} a_{v,L}=\max _{\alpha \in \{\frac{1}{L},\frac{2}{L},\ldots , 1-\frac{1}{L}\}} \frac{R\ln {}_LC_{L\alpha }}{D_1(C_{\alpha }-\alpha C,1-\rho _1^2)L\ln L}. \end{aligned}
(12)
The quantity $$a_{v,L}$$ converges to a finite value as L goes to infinity (see Lemma 5 in Joseph and Barron (2012)).

## 3 Main results

In this section, we analyze the performance of sparse superposition codes with Bernoulli dictionary. The result stated here is an improvement of the result in Takeishi et al. (2014), where they used the same code. We improve the upper bound of the error probability by refining some lemmas in Takeishi et al. (2014) to Lemmas 6, 7, and 8 in this paper.

First, we state the main theorem in this paper.

### Theorem 3

Suppose that each entry of X is independent equiprobable $$\pm \sqrt{P/L}$$. Assume $$M=L^a$$, where $$a\ge a_{v,L}$$, and the rate R is less than the capacity C. Then,
\begin{aligned} \Pr [{{\mathcal {E}}}_{\alpha _0}]={\text {e}}^{-nE(\alpha _0,R)} \end{aligned}
with
\begin{aligned} E(\alpha _0,R)\ge h(\alpha _0,C-R)-(\ln (2L))/n -\iota (L), \end{aligned}
where $$\iota (L)=\max \{\iota _1,\iota _2\}$$, which are defined in Lemma 4.

### Remark

This theorem is the counterpart of Theorem 1 for the Bernoulli dictionary, and its error exponent is worse than that in Theorem 1 by $$\iota (L)$$, which turns out to be $$O(1/\sqrt{L})$$ by Lemma 4. The theorem has the same form as Theorem 5 in Takeishi et al. (2014); however, $$\iota (L)$$ converges to zero more rapidly than in that previous result, where $$\iota (L)$$ was $$O(\sqrt{\ln L}/L^{1/4})$$, as detailed later.

To prove Theorem 3, we use the following lemma, which is the counterpart of Lemma 2 in this case. The definitions of $$\iota _1$$ and $$\iota _2$$ in Theorem 3 are given in the following lemma.

### Lemma 4

Suppose that each entry of X is independently equiprobable $$\pm \sqrt{P/L}$$. Let $$\alpha _0$$ be a certain real number in (0, 1] and $$\alpha = l/L$$. Then, for every $$L \ge 2$$ and for all l such that $$\alpha _0 \le \alpha \le 1$$, $$\Pr [E_l]$$ is bounded by the minimum for $$t_{\alpha }$$ in the interval $$[0,C_{\alpha }-\alpha R]$$ of $${\text {err}}_{\text {Ber}}(\alpha )$$, where
\begin{aligned} {\text {err}}_{\text {Ber}}(\alpha )&={_LC_{\alpha L}} \exp \{ -n(D_1(\varDelta _{\alpha },1-\rho _1^2)-\iota _1) \} \\&\quad +\exp \{ -n(D(t_{\alpha },1-\rho _2^2)-\iota _2) \} \end{aligned}
with $$\varDelta _{\alpha }=C_{\alpha }-\alpha R-t_{\alpha }$$, $$1-\rho _1^2=\alpha (1-\alpha )v/(1+\alpha v)$$, and $$1-\rho _2^2=\alpha ^2 v/(1+\alpha ^2 v)$$. The variables $$\iota _1$$ and $$\iota _2$$ are defined by the following series of equations
\begin{aligned} \iota _1&=\ln ((1+\iota _3)(1+\max \{\iota _4,\iota _5\}))\\ \iota _2&=\phi (L)+\ln \left( 1+\frac{2\eta }{L} \right) \end{aligned}
where
\begin{aligned} 1+\iota _3&=\max _{\alpha _0 L\le l \le L} \left( {\text {e}}^{\phi (l)}\left( 1+\frac{\eta (1+v)}{l}\right) \right) \\ 1+\iota _4&= \max _{\alpha _0 L\le l \le L-\sqrt{L}} \left( {\text {e}}^{\phi (l)+\phi (L-l)}\left( 1+\frac{\eta }{l}\right) \left( 1+\frac{\eta }{L-l}\right) \right) \\ 1+\iota _5&=\max _{ L-\sqrt{L}\le l \le L-1} \left( \frac{{\text {e}}^{\phi (l)}}{\sqrt{1-1/\sqrt{L}}} \left( 1+\frac{\eta }{l}\right) \right) , \end{aligned}
$$\eta = \sqrt{9/(8\pi e)}$$, and the function $$\phi$$ is defined in Lemma 5.

### Remark

The function $$\phi$$ is O(1 / L) by Lemma 5. Thus we have $$\iota _1=O(1/\sqrt{L})$$ and $$\iota _2=O(1/L)$$. So $$\iota =\iota (L)$$ in Theorem 3 is $$O(1/\sqrt{L})$$.

To prove this lemma, we evaluate the difference between a binomial distribution and a Gaussian distribution with identical mean and variance. We do this in two steps. The first step is to evaluate the ratio of the probability mass function of the binomial distribution to the probability density function of the Gaussian. The following lemma is given in Takeishi et al. (2014) for this purpose, where $$N(x| \mu ,\sigma ^2)$$ denotes the density function of the distribution $$N(\mu ,\sigma ^2)$$.

### Lemma 5

(Takeishi et al. 2014) For any natural number l,
\begin{aligned} \max _{k\in \{0,1,\ldots ,l\}}\frac{{}_lC_k (1/2)^l}{N(k| l/2,l/4)} \le \exp \{\phi (l)\} \end{aligned}
holds, where
\begin{aligned}&\phi (l)=\inf _{\zeta \in (0,1/2)}\phi _{\zeta }(l),\\&\phi _{\zeta }(l) \\&\quad =\max \left\{ \left( \frac{3}{16}c_{\zeta }^2 +\frac{1}{12}\right) \frac{1}{l},\ -\frac{4\zeta ^4}{3}l+\ln \frac{l}{2}+\frac{1}{12l},\right. \left. -\left( \ln 2-\frac{1}{2}\right) l+\frac{1}{2}\ln \frac{\pi l}{2} \right\} \end{aligned}
and $$c_{\zeta }=1/(1+2\zeta )^2+1/(1-2\zeta )^2$$. In particular, for any $$l \ge 1000$$, it follows that $$\phi (l) \le 5/l.$$
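Lemma 5 can be checked numerically for small l. The sketch below evaluates $$\phi _{\zeta }(l)$$ as stated above, approximates the infimum over $$\zeta$$ by a minimum over a finite grid (which can only overestimate $$\phi (l)$$), and compares it with the largest log-ratio of the binomial mass function to the Gaussian density. The function names and the grid size are assumptions of this illustration.

```python
import math

def phi_zeta(l, zeta):
    """phi_zeta(l): the maximum of the three terms in Lemma 5."""
    c = 1 / (1 + 2 * zeta) ** 2 + 1 / (1 - 2 * zeta) ** 2
    return max(
        (3 / 16 * c * c + 1 / 12) / l,
        -4 * zeta ** 4 / 3 * l + math.log(l / 2) + 1 / (12 * l),
        -(math.log(2) - 0.5) * l + 0.5 * math.log(math.pi * l / 2),
    )

def phi(l, grid=2000):
    """Grid minimum over zeta in (0, 1/2); an upper approximation of the infimum."""
    return min(phi_zeta(l, 0.5 * i / grid) for i in range(1, grid))

def max_log_ratio(l):
    """max over k of ln[ C(l,k) 2^{-l} / N(k | l/2, l/4) ], computed in the log domain."""
    best = -math.inf
    for k in range(l + 1):
        log_pmf = (math.lgamma(l + 1) - math.lgamma(k + 1)
                   - math.lgamma(l - k + 1) - l * math.log(2))
        log_pdf = -0.5 * math.log(2 * math.pi * l / 4) - (k - l / 2) ** 2 / (l / 2)
        best = max(best, log_pmf - log_pdf)
    return best
```

Calling `max_log_ratio(l)` and `phi(l)` for a few values of l shows how much slack the bound leaves; at k = 0 the log-ratio coincides with the third term of $$\phi _{\zeta }(l)$$.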

The second step is to evaluate the error incurred by replacing a summation over a discrete random variable with an integral over the corresponding continuous random variable. A natural way to relate the summation to the integral is the sectional measurement, in which the summation is viewed as a Riemann-sum approximation of the integral. Takeishi et al. (2014) evaluated the error in the sectional measurement by their Lemmas 8, 9, and 10. In this paper, we improve these lemmas. The following lemma is an improvement of Lemma 8 in Takeishi et al. (2014).

### Lemma 6

For a natural number n, let $$h=2/\sqrt{n}$$ and $$x_k=h(k-n/2)$$ ($$k=0,1,\ldots ,n$$). For $$\mu \in \mathfrak {R}$$ and $$s>0$$, define
\begin{aligned} I_d&=h\sum _{k=0}^n \exp \left\{ -\frac{s^2}{2}(x_k-\mu )^2\right\} ,\\ I_c&=\int _{-\infty }^{\infty } \exp \left\{ -\frac{s^2}{2}(x-\mu )^2\right\} {\text {d}}x. \end{aligned}
Then, we have
\begin{aligned} I_d\le \left( 1+\frac{\eta s^2}{n}\right) I_c, \end{aligned}
where $$\eta =\sqrt{9/(8\pi e)}\le 0.37$$.
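The inequality of Lemma 6 is easy to probe numerically, using $$I_c=\sqrt{2\pi }/s$$. The following sketch computes $$I_d$$ directly from its definition; the function names and the sample parameter triples are assumptions of this illustration, and a handful of test points is of course no substitute for the proof.

```python
import math

ETA = math.sqrt(9 / (8 * math.pi * math.e))   # eta in Lemma 6, about 0.363

def riemann_sum(n, mu, s):
    """I_d = h * sum_k exp(-(s^2/2)(x_k - mu)^2) with h = 2/sqrt(n), x_k = h(k - n/2)."""
    h = 2 / math.sqrt(n)
    return h * sum(math.exp(-0.5 * s * s * (h * (k - n / 2) - mu) ** 2)
                   for k in range(n + 1))

def gaussian_integral(s):
    """I_c = integral over the real line of exp(-(s^2/2)(x - mu)^2) = sqrt(2 pi)/s."""
    return math.sqrt(2 * math.pi) / s

def lemma6_holds(n, mu, s):
    """Check I_d <= (1 + eta s^2 / n) I_c for one parameter triple."""
    return riemann_sum(n, mu, s) <= (1 + ETA * s * s / n) * gaussian_integral(s)
```

The correction factor $$1+\eta s^2/n$$ matters most when the grid spacing h is coarse relative to the width 1/s of the integrand, that is, when $$s^2/n$$ is large.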

Further, by reconsidering the proof and using Lemma 6, we also improve Lemmas 9 and 10 in Takeishi et al. (2014). The following lemmas are improvements of Lemmas 9 and 10 in Takeishi et al. (2014), respectively.

### Lemma 7

For a natural number n, define $$h=2/\sqrt{n}$$ and $${{\mathcal {X}}}=\{h(k-n/2)\mid k=0,1,\ldots ,n\}$$. Further, for a 2-dimensional real vector $$\mathbf{x}=(x_1,x_2)^T$$ and a strictly positive definite $$2\times 2$$ matrix A, define
\begin{aligned} I_d=\int _{-\infty }^{\infty }h\sum _{x_1\in {{\mathcal {X}}}} \exp \left\{ -\frac{\mathbf{x}^TA\mathbf{x}}{2}\right\} {\text {d}}x_2 \end{aligned}
and
\begin{aligned} I_c=\int _{-\infty }^{\infty }\int _{-\infty }^{\infty } \exp \left\{ -\frac{\mathbf{x}^TA\mathbf{x}}{2}\right\} {\text {d}}x_1{\text {d}}x_2. \end{aligned}
Then, we have
\begin{aligned} I_d\le \left( 1+\frac{\eta A_{11}}{n}\right) I_c, \end{aligned}
where $$\eta =\sqrt{9/(8\pi e)}\le 0.37$$ and $$A_{11}$$ is the (1,1) element of the matrix A.

### Lemma 8

For natural numbers n and $$n'$$, define $$\mathcal{X}_1=\{h_1(k-n/2)\mid k=0,1,\ldots ,n\}$$ and $$\mathcal{X}_2=\{h_2(k-n'/2)\mid k=0,1,\ldots ,n'\}$$, where $$h_1=2/\sqrt{n}$$ and $$h_2=2/\sqrt{n'}$$. Further, for a 3-dimensional real vector $$\mathbf{x}=(x_1,x_2,x_3)^T$$ and a strictly positive definite $$3\times 3$$ matrix A, define
\begin{aligned} I_d=\int _{-\infty }^{\infty } h_1h_2\sum _{x_1\in \mathcal{X}_1}\sum _{x_2\in {{\mathcal {X}}}_2} \exp \left\{ -\frac{\mathbf{x}^TA\mathbf{x}}{2}\right\} {\text {d}}x_3 \end{aligned}
and
\begin{aligned} I_c=\int _{\mathfrak {R}^3} \exp \left\{ -\frac{\mathbf{x}^TA\mathbf{x}}{2}\right\} {\text {d}}{} \mathbf{x}. \end{aligned}
Then, we have
\begin{aligned} I_d\le \left( 1+\frac{\eta A_{11}}{n}\right) \left( 1+\frac{\eta A_{22}}{n'}\right) I_c, \end{aligned}
where $$\eta =\sqrt{9/(8\pi e)}\le 0.37$$ and $$A_{ij}$$ is the (i, j) element of the matrix A.

### 3.1 Proof of Lemma 4

We prove Lemma 4 along the lines of the proof of Lemma 6 in Takeishi et al. (2014), which is based on Lemma 4 in Joseph and Barron (2012).

We evaluate the probability of the event $$E_l$$. The random variables are the dictionary $$X=(X_1,X_2,\ldots ,X_N)$$ and the noise $$\epsilon$$.

For $$\beta \in {{\mathcal {B}}}$$, let $$S(\beta )=\{j| \beta _j=1\}$$ denote the set of indices j for which $$\beta _j$$ is nonzero. Further, let $${{\mathcal {A}}}=\{S(\beta )| \beta \in {{\mathcal {B}}}\}$$ denote the set of allowed subsets of terms. Let $$\beta ^*$$ denote $$\beta$$ which is sent, and let $$S^*=S(\beta ^*)$$. Furthermore, for $$S\in {{\mathcal {A}}}$$, let $$X_S=\sum _{j\in S}X_j$$. For the occurrence of $$E_l$$, there must be an $$S\in {{\mathcal {A}}}$$ which differs from $$S^*$$ in an amount l and which has $$\Vert Y-X_S\Vert ^2 \le \Vert Y-X_{S^*}\Vert ^2$$. Let S denote a subset which differs from $$S^*$$ in an amount l. Here, we define T(S) as
\begin{aligned} T(S)=\frac{1}{2}\left[ \frac{| Y-X_{S} |^2}{\sigma ^2} -\frac{| Y-X_{S^*}| ^2}{\sigma ^2} \right] , \end{aligned}
where for a vector x of length n, $$|x|^2$$ denotes $$(1/n)\sum _{i=1}^n x_i^2$$. Then $$T(S)\le 0$$ is equivalent to $$\Vert Y-X_S\Vert ^2\le \Vert Y-X_{S^*}\Vert ^2$$. The subsets S and $$S^*$$ have an intersection $$S_1=S\cap S^*$$ of size $$L-l$$ and a difference $$S_2=S\setminus S^*$$ ($$=S\setminus S_1$$) of size l. Note that $$X_{S^*}$$ and Y are independent of $$X_{S_2}$$.
We use the decomposition $$T(S)=\widetilde{T}(S)+T^*$$, where
\begin{aligned} \widetilde{T}(S)=\frac{1}{2} \left[ \frac{| Y-X_{S} |^2}{\sigma ^2} -\frac{| Y-(1-\alpha )X_{S^*}| ^2}{\sigma ^2+\alpha ^2P} \right] \end{aligned}
and
\begin{aligned} T^*=\frac{1}{2} \left[ \frac{| Y-(1-\alpha )X_{S^*} |^2}{\sigma ^2+\alpha ^2P} -\frac{| Y-X_{S^*}| ^2}{\sigma ^2} \right] . \end{aligned}
For a positive $${\tilde{t}}=t_{\alpha }$$, let $${\tilde{E}}_l$$ denote the event that there is an $$S\in {{\mathcal {A}}}$$ which differs from $$S^*$$ in an amount l and $$\widetilde{T}(S)\le {\tilde{t}}$$. Similarly, for a negative $$t^*=-t_{\alpha }$$, let $$E_l^*$$ denote the corresponding event that $$T^*\le t^*$$. Then we have
\begin{aligned} \Pr [E_l] \le \Pr [E_l^*] + \Pr [{\tilde{E}}_l]. \end{aligned}
First, we evaluate $$\Pr [E_l^*]$$. We use Markov’s inequality for $${\text {e}}^{-n\lambda T^*}$$ as in Joseph and Barron (2012) with a parameter $$0 \le \lambda <1/\sqrt{1-\rho _2^2}=\sqrt{1+1/(\alpha ^2v)}$$. Then we have
\begin{aligned} \Pr [E_l^*]\le {\text {e}}^{n\lambda t^*} \mathbb {E}_{Y,X_{S^*}}{\text {e}}^{-n\lambda T^*}. \end{aligned}
Here, we write down the expectation $$\mathbb {E}_{Y,X_{S^*}}{\text {e}}^{-n\lambda T^*}$$ and apply Lemma 5 in this paper as in Takeishi et al. (2014). Then we have for $$\mathbf{x}=(x_1,x_2)^T$$
\begin{aligned} \Pr [E_l^*]\le {\text {e}}^{n\lambda t^*} \left( \frac{{\text {e}}^{\phi (L)}}{2\pi } \int _{-\infty }^{\infty }h_1\sum _{x_1\in {{\mathcal {X}}}_1} {\text {e}}^{-\mathbf{x}^T A \mathbf{x}/2}{\text {d}}x_2\right) ^n \end{aligned}
where $$h_1=2/\sqrt{L}$$, $${{\mathcal {X}}}_1=\{h_1(k-L/2)| k=0,1,\ldots ,L\}$$, and $$A=I-\lambda B$$ with the identity matrix I and
\begin{aligned} B=(1-\rho _2^2)\begin{pmatrix} -1 &{} \frac{1}{\alpha \sqrt{v}} \\ \frac{1}{\alpha \sqrt{v}} &{} 1 \end{pmatrix}. \end{aligned}
Then applying Lemma 7, we have
\begin{aligned}&\frac{{\text {e}}^{\phi (L)}}{2\pi } \int _{-\infty }^{\infty }h_1\sum _{x_1\in {{\mathcal {X}}}_1} {\text {e}}^{-\mathbf{x}^T A \mathbf{x}/2}{\text {d}}x_2\\&\quad \le \frac{{\text {e}}^{\phi (L)}}{2\pi } \left( 1+\frac{\eta A_{11}}{L}\right) \int _{-\infty }^{\infty }\int _{-\infty }^{\infty } {\text {e}}^{-\mathbf{x}^T A \mathbf{x}/2}{\text {d}}x_1 {\text {d}}x_2\\&\quad = \left( 1+\frac{\eta A_{11}}{L}\right) \frac{{\text {e}}^{\phi (L)}}{\sqrt{ 1-\lambda ^2(1-\rho _2^2)}}. \end{aligned}
Here, using
\begin{aligned} A_{11}=1+\lambda (1-\rho _2^2) \le 1+\sqrt{1-\rho _2^2}\le 2, \end{aligned}
we have
\begin{aligned} \frac{{\text {e}}^{\phi (L)}}{2\pi } \int _{-\infty }^{\infty }h_1\sum _{x_1\in \mathcal{X}_1} {\text {e}}^{-\mathbf{x}^T A \mathbf{x}/2}{\text {d}}x_2 \le \frac{{\text {e}}^{\iota _2}}{\sqrt{ 1-\lambda ^2(1-\rho _2^2)}}. \end{aligned}
Then we have
\begin{aligned} \Pr [E_l^*]&\le {\text {e}}^{n\lambda t^*} \left( \frac{{\text {e}}^{\iota _2}}{\sqrt{ 1-\lambda ^2(1-\rho _2^2)}}\right) ^n\nonumber \\&=\exp \{-n(\lambda t_{\alpha }+(1/2)\ln (1-\lambda ^2(1-\rho _2^2))-\iota _2)\}. \end{aligned}
(13)
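The closed form $$1/\sqrt{1-\lambda ^2(1-\rho _2^2)}$$ used above rests on the two-dimensional Gaussian integral $$\int \int {\text {e}}^{-\mathbf{x}^TA\mathbf{x}/2}{\text {d}}x_1{\text {d}}x_2=2\pi /\sqrt{\det A}$$ together with the identity $$\det A=1-\lambda ^2(1-\rho _2^2)$$ for $$A=I-\lambda B$$. A small numerical check of that identity follows; the function name and the sample values are assumptions of this illustration.

```python
import math

def det_and_closed_form(alpha, v, lam):
    """Return (det(I - lam*B), 1 - lam^2 (1 - rho_2^2)) for the matrix B defined above."""
    c = alpha ** 2 * v / (1 + alpha ** 2 * v)     # c = 1 - rho_2^2
    off = c / (alpha * math.sqrt(v))              # off-diagonal entry of B
    a11 = 1 + lam * c                             # A_11 = 1 - lam * (-c)
    a22 = 1 - lam * c                             # A_22 = 1 - lam * c
    a12 = -lam * off                              # A_12 = A_21
    return a11 * a22 - a12 * a12, 1 - lam * lam * c

det, closed = det_and_closed_form(alpha=0.3, v=2.0, lam=0.5)
```

The determinant is positive exactly when $$\lambda ^2(1-\rho _2^2)<1$$, which is the range of $$\lambda$$ allowed in the Markov step.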
Second, we evaluate $$\Pr [\tilde{E}_l]$$. Note that the indicator of the event $$\tilde{E}_l$$ satisfies the following inequality
\begin{aligned} 1_{\tilde{E}_l} \le \sum _{S_1}\Bigl (\sum _{S_2} {\text {e}}^{-n(\tilde{T}(S)-\tilde{t})} \Bigr )^\lambda , \end{aligned}
where $$\lambda$$ is an arbitrary number in [0, 1]. Taking the expectation of both sides of the inequality, and applying Jensen’s inequality to the right side, we have
\begin{aligned} \Pr [\tilde{E}_l]\le \sum _{S_1}\mathbb {E}_{Y,X_{S^*}} {\text {e}}^{-n\lambda (\widetilde{T}_1(S_1)-\tilde{t})} \Bigl (\sum _{S_2} \mathbb {E}_{X_{S_2}} {\text {e}}^{-n\widetilde{T}_2(S)} \Bigr ) ^{\lambda }, \end{aligned}
(14)
where
\begin{aligned} \widetilde{T}_1(S_1)&=\frac{1}{2} \left[ \frac{| Y-X_{S_1} |^2}{\sigma ^2+\alpha P} -\frac{| Y-(1-\alpha )X_{S^*}| ^2}{\sigma ^2+\alpha ^2P} \right] ,\\ \widetilde{T}_2(S)&=\frac{1}{2}\left[ \frac{| Y-X_{S} |^2}{\sigma ^2} -\frac{| Y-X_{S_1}| ^2}{\sigma ^2+\alpha P} \right] , \end{aligned}
which satisfy $$\tilde{T}(S)=\tilde{T}_1(S_1)+\tilde{T}_2(S)$$. Here, we have used the fact that $$\tilde{T}_1(S_1)$$ is independent of $$X_{S_2}$$. This derivation follows the corresponding discussion for $$\Pr \{E_l\}$$ on p. 2547 of Joseph and Barron (2012).
As for $$\widetilde{T}_2(S)$$, recalling $$C_\alpha =(1/2)\ln (1+\alpha v)$$ we can write
\begin{aligned} {\text {e}}^{-n\widetilde{T}_2(S)} = \frac{p^h_{Y|X_S}(Y|X_S)}{p^{(c)}_{Y|X_{S_1}}(Y|X_{S_1})}{\text {e}}^{-n C_\alpha }, \end{aligned}
where $$p^{(c)}_{Y| X_{S_1}}$$ is the conditional probability density function of Y given $$X_{S_1}$$ in case $$X_{ij}\sim N(0,P/L)$$, and $$p^h_{Y|X_S}$$ denotes the conditional probability density function of Y given $$X_{S}$$ under the hypothesis that $$X_S$$ was sent. Hence we have
\begin{aligned} {\text {e}}^{-n\widetilde{T}_2(S)} = \frac{p_{Y|X_{S_1}}(Y|X_{S_1})}{p^{(c)}_{Y|X_{S_1}}(Y|X_{S_1})} \frac{p^h_{Y|X_S}(Y|X_S)}{p_{Y|X_{S_1}}(Y|X_{S_1})} {\text {e}}^{-n C_\alpha }. \end{aligned}
Since $$(X_{S_1},Y)$$ is independent of $$X_{S_2}$$, we have
\begin{aligned} \mathbb {E}_{X_{S_2}}{\text {e}}^{-n\widetilde{T}_2(S)} = {\text {e}}^{-n C_\alpha } \frac{p_{Y|X_{S_1}}(Y|X_{S_1})}{p^{(c)}_{Y|X_{S_1}}(Y|X_{S_1})} \mathbb {E}_{X_{S_2}} \frac{p^h_{Y|X_S}(Y|X_S)}{p_{Y|X_{S_1}}(Y|X_{S_1})}. \end{aligned}
(15)
As for the expectation in the last factor of (15), we have
\begin{aligned} \mathbb {E}_{X_{S_2}} \frac{p^h_{Y|X_S}(Y|X_S)}{p_{Y|X_{S_1}}(Y|X_{S_1})} = \sum _{X_{S_2}} \frac{p^h_{Y|X_S}(Y|X_S)p(X_{S_2})}{p_{Y|X_{S_1}}(Y|X_{S_1})}, \end{aligned}
where $$p(X_{S_2})$$ denotes the probability mass function of $$X_{S_2}$$. Since $$X_S=X_{S_1}\cup X_{S_2}$$, and since $$p_{Y|X_{S_1}}(Y|X_{S_1})=p^h_{Y|X_{S_1}}(Y|X_{S_1})$$ (because $$S_1=S^* \cap S$$), we have
\begin{aligned} \mathbb {E}_{X_{S_2}} \frac{p^h_{Y|X_S}(Y|X_S)}{p_{Y|X_{S_1}}(Y|X_{S_1})}&= \sum _{X_{S_2}} \frac{p^h_{Y,X_{S_2}|X_{S_1}}(Y,X_{S_2}|X_{S_1})}{p^h_{Y|X_{S_1}}(Y|X_{S_1})}\\&= \sum _{X_{S_2}}p^h_{X_{S_2}|Y,X_{S_1}}(X_{S_2}|Y,X_{S_1})=1. \end{aligned}
Note that the idea of this analysis is the same as that of the corresponding evaluation in Joseph and Barron (2012). Hence from (15), we have
\begin{aligned} \mathbb {E}_{X_{S_2}} {\text {e}}^{-n\widetilde{T}_2(S)}= \frac{P_{Y| X_{S_1}}(Y| X_{S_1})}{P^{(c)}_{Y| X_{S_1}}(Y| X_{S_1})} {\text {e}}^{-nC_{\alpha }}. \end{aligned}
(16)
To evaluate the right side of (16), we will prove that $$P_{Y|X_{S_1}}(Y|X_{S_1})$$ is nearly bounded by $$P^{(c)}_{Y|X_{S_1}}(Y| X_{S_1})$$ uniformly for all Y and $$X_{S_1}$$. Here, we define $$Y'=Y-X_{S_1}$$ and define $$P_{Y'}$$ as the probability density function of each coordinate of $$Y'$$ and $$P_{Y'}^{(c)}$$ as $$P_{Y'}$$ in case $$X_{ij}\sim N(0,P/L)$$. Then we have
\begin{aligned} \frac{P_{Y| X_{S_1}}(Y| X_{S_1})}{P_{Y| X_{S_1}}^{(c)}(Y| X_{S_1})} =\prod _{i=1}^n \frac{P_{Y'}(Y'_i)}{P^{(c)}_{Y'}(Y'_i)}. \end{aligned}
(17)
Define a set $${{\mathcal {X}}}_2=\{h_2(k-l/2)| k=0,1,\ldots ,l\}$$ with $$h_2=2/\sqrt{l}$$. Note that $$Y'=X_{S^*-S_1}+\epsilon$$. Hence, $$P_{Y'}(Y_i')$$ is the convolution of $$N(0,\sigma ^2)$$ and the density of unbiased binomial distribution of size l. Then, by applying Lemma 5, we have
\begin{aligned} P_{Y'}(Y'_i)\le \frac{{\text {e}}^{\phi (l)}h_2}{2\pi \sqrt{\sigma ^2}} \sum _{w_2\in {{\mathcal {X}}}_2} \exp \left\{ -\frac{a_2(w_2-a_3Y'_i)^2+a_4{Y'_i}^2}{2}\right\} , \end{aligned}
where $$a_2=1+\alpha v$$, $$a_3=\sqrt{\alpha v/\sigma ^2}/a_2$$ and $$a_4=1/(\sigma ^2a_2)=(\sigma ^2+\alpha P)^{-1}$$.
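The constants $$a_2$$, $$a_3$$ and $$a_4$$ arise from completing the square in $$w_2$$. The following is a minimal numerical sketch of that identity, not part of the original proof. It assumes that $$v=P/\sigma ^2$$ (which is consistent with the two expressions for $$a_4$$ above) and that the signal part of $$Y'_i$$ equals $$\sqrt{\alpha P}\,w_2$$; the function names and the sample values are hypothetical.

```python
import math

def square_completion_gap(w2, y, alpha, P, sigma2):
    """Left side minus right side of
    w2^2 + (y - sqrt(alpha*P)*w2)^2 / sigma2 = a2*(w2 - a3*y)^2 + a4*y^2."""
    v = P / sigma2  # assumption: v is the signal-to-noise ratio P / sigma^2
    a2 = 1.0 + alpha * v
    a3 = math.sqrt(alpha * v / sigma2) / a2
    a4 = 1.0 / (sigma2 * a2)
    lhs = w2 ** 2 + (y - math.sqrt(alpha * P) * w2) ** 2 / sigma2
    rhs = a2 * (w2 - a3 * y) ** 2 + a4 * y ** 2
    return lhs - rhs

def a4_both_forms(alpha, P, sigma2):
    """Return a4 written as 1/(sigma2*a2) and as 1/(sigma2 + alpha*P)."""
    a2 = 1.0 + alpha * P / sigma2
    return 1.0 / (sigma2 * a2), 1.0 / (sigma2 + alpha * P)

if __name__ == "__main__":
    print(square_completion_gap(0.7, -1.3, 0.4, 2.0, 1.5))  # close to zero
    print(a4_both_forms(0.4, 2.0, 1.5))                     # two equal values
```

Under these assumptions the gap vanishes up to rounding, which is why the exponent in the bound on $$P_{Y'}(Y'_i)$$ separates into a term in $$w_2$$ and a term in $$Y'_i$$ alone.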
Using Lemma 6, we have
\begin{aligned}&h_2 \sum _{w_2\in {{\mathcal {X}}}_2} \exp \left\{ -\frac{a_2(w_2-a_3Y'_i)^2+a_4{Y'_i}^2}{2}\right\} \\&\quad \le \left( 1+\frac{\eta a_2}{l}\right) \int _{-\infty }^{\infty } \exp \left\{ -\frac{a_2(w_2-a_3Y'_i)^2+a_4{Y'_i}^2}{2}\right\} dw_2. \end{aligned}
Thus, we have
\begin{aligned} P_{Y'}(Y'_i)\le (1+\iota _3) P_{Y'}^{(c)}(Y'_i). \end{aligned}
(18)
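As an informal illustration of (18), the ratio $$P_{Y'}(y)/P^{(c)}_{Y'}(y)$$ can be evaluated directly for small parameters. The sketch below assumes that each dictionary entry is $$\pm \sqrt{P/L}$$ with equal probability, so that the signal part of one coordinate is $$\sqrt{\alpha P}\,w_2$$ with $$w_2\in {{\mathcal {X}}}_2$$; the parameter values and function names are hypothetical, and the code does not evaluate $$\iota _3$$ itself.

```python
import math

def log_binom_pmf(k, l):
    """Logarithm of C(l, k) / 2^l, the unbiased binomial probability."""
    return (math.lgamma(l + 1) - math.lgamma(k + 1)
            - math.lgamma(l - k + 1) - l * math.log(2.0))

def density_ratio(y, l, L, P, sigma2):
    """Ratio of the Bernoulli-dictionary density of one coordinate of Y'
    to the Gaussian-dictionary density N(0, sigma2 + alpha*P) at y."""
    alpha = l / L
    h2 = 2.0 / math.sqrt(l)
    scale = math.sqrt(alpha * P)  # assumption: signal = scale * w2
    p = 0.0
    for k in range(l + 1):
        w2 = h2 * (k - l / 2.0)
        gauss = (math.exp(-(y - scale * w2) ** 2 / (2.0 * sigma2))
                 / math.sqrt(2.0 * math.pi * sigma2))
        p += math.exp(log_binom_pmf(k, l)) * gauss
    var_c = sigma2 + alpha * P
    p_c = math.exp(-y * y / (2.0 * var_c)) / math.sqrt(2.0 * math.pi * var_c)
    return p / p_c

if __name__ == "__main__":
    for y in (0.0, 1.0, 2.0, 4.0):
        print(y, density_ratio(y, l=100, L=200, P=1.0, sigma2=1.0))
```

For these sample values the ratio stays close to one over the range shown, in line with a factor of the form $$1+\iota _3$$ with small $$\iota _3$$.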
From (16), (17) and (18), we have
\begin{aligned} \sum _{S_2} \mathbb {E}_{X_{S_2}} {\text {e}}^{-n\widetilde{T}_2(S)}\le \sum _{S_2} (1+\iota _3)^n {\text {e}}^{-nC_{\alpha }} \le (1+\iota _3)^n {\text {e}}^{-n(C_{\alpha } - \alpha R)}. \end{aligned}
(19)
The last inequality follows from
\begin{aligned} \ln \sum _{S_2}1 \le \ln M^l = l\ln M = \alpha L \ln M = n \alpha R. \end{aligned}
From (14) and (19), we have
\begin{aligned} \Pr [\tilde{E}_l]\le (1+\iota _3)^n \sum _{S_1}{\mathbb E}_{Y,X_{S^*}}{\text {e}}^{-n\lambda \tilde{T}_1(S_1)} {\text {e}}^{-n\lambda \varDelta _{\alpha }}. \end{aligned}
(20)
Here, recall that $$\varDelta _{\alpha }=C_{\alpha }-\alpha R-t_{\alpha }$$.
To evaluate the right side of (20), we consider two cases: (i) $$l \le L - \sqrt{L}$$ and (ii) $$l > L-\sqrt{L}$$. First, we consider case (i) $$l \le L - \sqrt{L}$$. In this case, $$l'=L-l$$ is at least $$\sqrt{L}$$. According to Takeishi et al. (2014), for $$0 \le \lambda \le 1$$ and $$\mathbf{x}=(x_1,x_2,x_3)^T$$, we have
\begin{aligned}&{{\mathbb {E}}}_{Y,X_{S^*}}{\text {e}}^{-n\lambda \tilde{T}_1(S_1)}\\&\quad \le \left( \frac{{\text {e}}^{\phi (l)+\phi (l')}}{(2\pi )^{3/2}} \int _{-\infty }^{\infty }h_2h_3\sum _{x_1\in {{\mathcal {X}}}_2} \sum _{x_2\in {{\mathcal {X}}}_3} {\text {e}}^{-\mathbf{x}^T \tilde{A} \mathbf{x}/2}{\text {d}}x_3\right) ^n, \end{aligned}
where $$h_3=2/\sqrt{l'}$$, $${{\mathcal {X}}}_3=\{h_3(k'-l'/2)| k'=0,1,\ldots ,l'\}$$, and $$\tilde{A}=I-\lambda \tilde{B}$$ with the identity matrix I and
\begin{aligned} \tilde{B}= \left( \begin{array}{ccc} \frac{\alpha v}{1+\alpha v}- \frac{\alpha ^3v}{1+\alpha ^2 v} &{} -\frac{ \alpha ^2\sqrt{\alpha (1-\alpha )}v}{1+\alpha ^2 v} &{} \frac{\sqrt{\alpha v}}{1+\alpha v} -\frac{\alpha \sqrt{\alpha v}}{1+\alpha ^2 v}\\ -\frac{ \alpha ^2\sqrt{\alpha (1-\alpha )}v}{1+\alpha ^2 v}&{} \frac{\alpha ^2(1-\alpha )v}{1+\alpha ^2 v} &{} -\frac{ \alpha \sqrt{(1-\alpha )v}}{1+\alpha ^2 v}\\ \frac{\sqrt{\alpha v}}{1+\alpha v} -\frac{\alpha \sqrt{\alpha v}}{1+\alpha ^2 v}&{} -\frac{ \alpha \sqrt{(1-\alpha )v}}{1+\alpha ^2 v}&{} \frac{1}{1+\alpha v} -\frac{1}{1+\alpha ^2 v}\\ \end{array} \right) . \end{aligned}
Applying Lemma 8, we have
\begin{aligned}&\frac{{\text {e}}^{\phi (l)+\phi (l')}}{(2\pi )^{3/2}} \int _{-\infty }^{\infty }h_2h_3\sum _{x_1\in {{\mathcal {X}}}_2} \sum _{x_2\in {{\mathcal {X}}}_3} {\text {e}}^{-\mathbf{x}^T \tilde{A} \mathbf{x}/2}{\text {d}}x_3\\&\quad \le \frac{{\text {e}}^{\phi (l)+\phi (l')}}{(2\pi )^{3/2}} \left( 1+\frac{\eta \tilde{A}_{11}}{l}\right) \left( 1+\frac{\eta \tilde{A}_{22}}{L-l}\right) \int _{\mathfrak {R}^3} {\text {e}}^{-\mathbf{x}^T \tilde{A} \mathbf{x}/2}{\text {d}}{} \mathbf{x}\\&\quad \le \left( 1+\frac{\eta \tilde{A}_{11}}{l}\right) \left( 1+\frac{\eta \tilde{A}_{22}}{L-l}\right) \frac{{\text {e}}^{\phi (l)+\phi (l')}}{\sqrt{ 1-\lambda ^2(1-\rho _1^2)}}\\&\quad \le \frac{1+\iota _4}{\sqrt{ 1-\lambda ^2(1-\rho _1^2)}}, \end{aligned}
where we used $$\tilde{A}_{11}\le 1$$ and $$\tilde{A}_{22}\le 1$$. Thus, we have
\begin{aligned} {{\mathbb {E}}}_{Y,X_{S^*}}{\text {e}}^{-n\lambda \tilde{T}_1(S_1)} \le \left( \frac{1+\iota _4}{\sqrt{ 1-\lambda ^2(1-\rho _1^2)}}\right) ^n. \end{aligned}
(21)
Now we consider the case (ii) $$l > L-\sqrt{L}$$. Since $$l'=L-l$$ can be small in this case, we cannot use the same method as in case (i). Instead, we calculate the expectation $${\mathbb E}_{Y,X_{S^*}}{\text {e}}^{-n\lambda \tilde{T}_1(S_1)}$$ explicitly and evaluate it by using the fact that $$l'$$ is small. The detailed evaluation is given in Takeishi et al. (2014), from p. 2744r, l. -16 to p. 2745l, l. -28. We improve the step of that evaluation which applies Lemma 8 of that paper by using Lemma 6 of this paper instead. Namely, the quantity
\begin{aligned} 1+\iota _5'={\text {e}}^{\kappa h_2(B_4+h_2)/\tilde{a}_{33}} +\frac{\sqrt{\kappa /\tilde{a}_{33}}}{B_4\xi _4'} {\text {e}}^{-\xi _4'B_4^2/2+\kappa h_2^2/\tilde{a}_{33}} \end{aligned}
is replaced by $$1+\eta \tilde{A}_{11}/l$$, where $$B_4$$, $$\xi _4'$$ and $$\kappa$$ are defined in Takeishi et al. (2014). Using $$\tilde{A}_{11}\le 1$$, we have
\begin{aligned} {{\mathbb {E}}}_{Y,X_{S^*}}{\text {e}}^{-n\lambda \tilde{T}_1(S_1)} \le \left( \frac{1+\iota _5}{\sqrt{ 1-\lambda ^2(1-\rho _1^2)}}\right) ^n. \end{aligned}
(22)
From (20), (21) and (22), we have
\begin{aligned} \Pr [\tilde{E}_l]&\le (1+\iota _3)^n{}_LC_{\alpha L}\frac{(1+\max (\iota _4,\iota _5))^n}{ (1-\lambda ^2(1-\rho _1^2))^{n/2}}{\text {e}}^{-n\lambda \varDelta _{\alpha }}\\&={}_LC_{\alpha L} {\text {e}}^{-n(\lambda \varDelta _{\alpha }+(1/2)\ln (1-\lambda ^2(1-\rho _1^2))-\iota _1)} \end{aligned}
where $$\iota _1=\ln ((1+\iota _3)(1+\max (\iota _4,\iota _5)))$$. Minimizing the right side over $$0\le \lambda \le 1$$, we have
\begin{aligned} \Pr [\tilde{E}_l]\le {}_LC_{\alpha L} \exp \{-n(D_1(\varDelta _{\alpha },1-\rho _1^2)- \iota _1)\}. \end{aligned}
(23)
Thus, (13) and (23) yield the desired bound.

## 4 Proofs of lemmas

In this section, we prove the lemmas used in Sect. 3.

### 4.1 Proof of Lemma 6

We prove Lemma 6 by making use of the Euler-Maclaurin formula (see, for example, Bourbaki 1986), which has several variants. Among those, we employ the one stated as Theorem 1 in Osada (2019). In the statement below, $$b_k$$ is the Bernoulli number ($$b_0=1$$, $$b_1=-1/2$$, $$b_2=1/6$$, ...) and $$B_n(x)$$ is the Bernoulli polynomial defined by
\begin{aligned} B_n(x) = \sum _{k=0}^n {}_n C_k b_{n-k}x^k. \end{aligned}
Note that, in Osada (2019) the residual term is not given in the statement but in the proof.

### Theorem 9

(the Euler-Maclaurin formula) Let f(x) be a class $$C^{2m+2}$$ function over $$[a,b] \subset \mathfrak {R}$$. Let $$\delta = (b-a)/(n+2)$$, and $$y_k = a+ (k+1)\delta$$ ($$k=-1,0,1,\ldots ,n$$). Note that $$y_{-1}=a$$ and $$y_{n+1}=b$$. Then
\begin{aligned}&\delta \Bigl ( \frac{1}{2}f(a) +\sum _{k=0}^{n}f(y_k)+ \frac{1}{2} f(b) \Bigr ) -\int _a^bf(x){\text {d}}x\nonumber \\&\quad = \sum _{j=1}^{m+1}\frac{b_{2j}\delta ^{2j}}{(2j)\text{! }} \Bigl ( f^{(2j-1)}(b)-f^{(2j-1)}(a) \Bigr ) +R_{m+1} \end{aligned}
(24)
holds, where
\begin{aligned} R_{m+1}&= -\delta ^{2m+2}\sum _{k=-1}^{n}J_{k,m+1}, \\ J_{k,m+1}&= \frac{1}{(2m+2)\text{! }} \int _0^\delta B_{2m+2}\Bigl (\frac{t}{\delta }\Bigr ) f^{(2m+2)}(y_k+t){\text {d}}t. \end{aligned}

For the proof, see Osada (2019) for example. We use this theorem with $$m=0$$, which yields the tightest order result (Lemma 7) for our goal. Further, for $$m=0$$ we can easily obtain some generalization of the Euler-Maclaurin formula as the following lemma, with which we can optimize the constant factor of the upper bound in Lemma 7.

### Lemma 10

(an extension of the Euler-Maclaurin formula with $$m=0$$) For an arbitrary real number $${\bar{b}}_2$$, define $${\bar{B}}_2(t)={\bar{b}}_2-t + t^2$$. Let f(x) be a class $$C^{2}$$ function over $$[a,b] \subset \mathfrak {R}$$. Let $$\delta = (b-a)/(n+2)$$, and $$y_k = a+ (k+1)\delta$$ ($$k=-1,0,1,\ldots ,n$$). Note that $$y_{-1}=a$$ and $$y_{n+1}=b$$. Then, for all $${\bar{b}}_2$$,
\begin{aligned}&\delta \Bigl ( \frac{1}{2}f(a) +\sum _{k=0}^{n}f(y_k)+ \frac{1}{2} f(b) \Bigr ) -\int _a^bf(x){\text {d}}x\nonumber \\&\quad = \frac{{\bar{b}}_{2}\delta ^{2}}{2} \Bigl ( f^{(1)}(b)-f^{(1)}(a) \Bigr ) -\delta ^{2}\sum _{k=-1}^{n}{\bar{J}}_{k,1} \end{aligned}
(25)
holds, where
\begin{aligned} {\bar{J}}_{k,1} = \frac{1}{2} \int _0^\delta {\bar{B}}_{2}\Bigl (\frac{t}{\delta }\Bigr ) f^{(2)}(y_k+t)dt. \end{aligned}

### Proof

Note that $${\bar{B}}_2(0)={\bar{B}}_2(1)$$ and $${\bar{B}}_2'(t)=2B_1(t)=2t-1$$ hold. Integrating by parts twice, we have
\begin{aligned} {\bar{J}}_{k,1}&=\frac{{\bar{b}}_2(f'(y_{k+1})-f'(y_k))}{2} -\frac{1}{\delta }\int _0^\delta B_1\Bigl ( \frac{t}{\delta }\Bigr ) f'(y_k +t)dt\\&=\frac{{\bar{b}}_2(f'(y_{k+1})-f'(y_k))}{2} -\frac{f(y_{k+1})+f(y_k)}{2\delta }\\&\quad + \frac{1}{\delta ^2}\int _{y_k}^{y_{k+1}} f(t)dt. \end{aligned}
Summing the leftmost and rightmost sides of the above from $$k=-1$$ to $$k=n$$, we have
\begin{aligned} \sum _{k=-1}^{n} {\bar{J}}_{k,1}&= \frac{{\bar{b}}_2(f'(y_{n+1})-f'(y_{-1}))}{2}\\&\quad - \frac{1}{\delta } \Bigl ( \frac{1}{2}f(a) +\sum _{k=0}^{n}f(y_k)+ \frac{1}{2} f(b) \Bigr ) + \frac{1}{\delta ^2} \int _a^b f(t)dt, \end{aligned}
which yields (25). $$\square$$
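The identity (25) can also be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the test function $$f(x)=\sin x$$ on $$[0,2]$$ with $$n=6$$, evaluates $${\bar{J}}_{k,1}$$ by a midpoint rule, and compares both sides of (25) for several values of $${\bar{b}}_2$$. The function names `lhs_25` and `rhs_25` and the parameter `sub` are hypothetical.

```python
import math

def lhs_25(f, F, a, b, n):
    """Left side of (25): delta * (trapezoid-type sum) minus the integral.
    F is an antiderivative of f."""
    delta = (b - a) / (n + 2)
    inner = sum(f(a + (k + 1) * delta) for k in range(n + 1))
    return delta * (0.5 * f(a) + inner + 0.5 * f(b)) - (F(b) - F(a))

def rhs_25(df, d2f, a, b, n, b2_bar, sub=1000):
    """Right side of (25), with each J-bar_{k,1} computed by a midpoint rule
    on `sub` subintervals. df and d2f are the first and second derivatives."""
    delta = (b - a) / (n + 2)
    step = delta / sub
    total_j = 0.0
    for k in range(-1, n + 1):
        y_k = a + (k + 1) * delta
        acc = 0.0
        for i in range(sub):
            t = (i + 0.5) * step
            u = t / delta
            acc += (b2_bar - u + u * u) * d2f(y_k + t)
        total_j += 0.5 * acc * step
    return 0.5 * b2_bar * delta ** 2 * (df(b) - df(a)) - delta ** 2 * total_j

if __name__ == "__main__":
    left = lhs_25(math.sin, lambda x: -math.cos(x), 0.0, 2.0, 6)
    for b2_bar in (1.0 / 6.0, 1.0 / 8.0, -0.3):
        right = rhs_25(math.cos, lambda x: -math.sin(x), 0.0, 2.0, 6, b2_bar)
        print(b2_bar, left, right)
```

The two sides agree up to quadrature error for every $${\bar{b}}_2$$ tried, reflecting the fact that the $${\bar{b}}_2$$-dependent parts of the two terms on the right side of (25) cancel.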

### Remark 1

This proof is based on the proof of Theorem 9 given in Osada (2019).

### Remark 2

In particular, with $${\bar{b}}_2=b_2$$, (25) reduces to the Euler-Maclaurin formula with $$m=0$$.

### Remark 3

In the formula in Bourbaki (1986), the residual term is given as
\begin{aligned} -\delta ^{2m+3}\sum _{k=-1}^{n} \frac{1}{(2m+3)\text{! }} \int _0^\delta B_{2m+3}\Bigl (\frac{t}{\delta }\Bigr ) f^{(2m+3)}(y_k+t)dt, \end{aligned}
which has $$f^{(2m+3)}(y_k+t)$$ rather than $$f^{(2m+2)}(y_k+t)$$. If we use the formula in Bourbaki (1986), we have
\begin{aligned} I_d/I_c-1=O(s^3), \end{aligned}
which is of worse order in s than Lemma 6. This is due to the higher-order derivative of f(x), where f(x) is defined below. Note that applying integration by parts to the residual term of the formula in Bourbaki (1986) yields the same term as $$R_{m+1}$$ in (24).

Now, we can prove Lemma 6.

### Proof of Lemma 6

We define $$f(x)=\exp \{-s^2(x-\mu )^2/2\}$$. Further we define
\begin{aligned} I_d'&= I_d + \frac{f(x_{-1})+f(x_{n+1})}{2}h\\&= h\left[ \frac{1}{2}f(x_{-1})+\sum _{k=0}^n f(x_k)+\frac{1}{2}f(x_{n+1})\right] ,\\ I_c'&= \int _{x_{-1}}^{x_{n+1}}f(x){\text {d}}x, \end{aligned}
where we defined $$x_{-1}=x_0-h$$ and $$x_{n+1}=x_n+h$$. Since f is positive, $$I_d\le I_d'$$ and $$I_c'\le I_c$$, so we have $$I_d-I_c\le I_d'-I_c'.$$
Now, we evaluate $$I_d'-I_c'$$ according to the extended Euler-Maclaurin formula (25), letting $$a=-h(n/2+1)$$ and $$b=h(n/2+1)$$, which means $$y_k = x_k$$, $$\delta = h$$, and $$2/\sqrt{n}=(b-a)/(n+2)$$. Hence we have
\begin{aligned} I'_d - I'_c= h^2{\bar{b}}_2 \frac{f'(x_{n+1})-f'(x_{-1})}{2} -h^2 \sum _{k=-1}^{n} {\bar{J}}_{k,1}. \end{aligned}
Here, we have
\begin{aligned} \left| \sum _{k=-1}^{n} {\bar{J}}_{k,1}\right| \le \sum _{k=-1}^{n} \frac{1}{2}\int _0^h \left| \frac{t^2}{h^2}-\frac{t}{h}+{\bar{b}}_2 \right| |f''(x_k +t)|dt. \end{aligned}
Noting that
\begin{aligned} \min _{{\bar{b}}_2 \in \mathfrak {R}}\max _{0\le x \le 1}|x^2-x+{\bar{b}}_2|=\frac{1}{8}, \end{aligned}
we can evaluate the above as
\begin{aligned} \left| \sum _{k=-1}^{n} {\bar{J}}_{k,1}\right|&\le \sum _{k=-1}^{n} \frac{1}{16}\int _0^h |f''(x_k +t)|dt\\&=\frac{1}{16}\int _{x_{-1}}^{x_{n+1}} |f''(t)|dt. \end{aligned}
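The min-max value $$1/8$$ used above is attained at $$\bar{b}_2=1/8$$, whereas the Bernoulli number $$b_2=1/6$$ gives the larger value $$1/6$$. A small grid-based sketch of this comparison follows; the function name and the grid size are hypothetical choices made for illustration.

```python
def sup_abs_shifted_parabola(b2_bar, grid=10001):
    """Approximate max over 0 <= x <= 1 of |x^2 - x + b2_bar| on a uniform grid
    that contains the candidate maximizers x = 0, 1/2 and 1."""
    best = 0.0
    for i in range(grid):
        x = i / (grid - 1)
        best = max(best, abs(x * x - x + b2_bar))
    return best

if __name__ == "__main__":
    for b2_bar in (1.0 / 8.0, 1.0 / 6.0, 0.1):
        print(b2_bar, sup_abs_shifted_parabola(b2_bar))
```

Since $$x^2-x$$ ranges over $$[-1/4,0]$$ on $$[0,1]$$, shifting by $$1/8$$ centers this range at zero, which is why no other shift can do better.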
Thus, we have
\begin{aligned} I'_d - I'_c \le \frac{h^2}{16}\left[ |f'(x_{n+1})-f'(x_{-1})| +\int _{x_{-1}}^{x_{n+1}} |f''(t)|dt\right] . \end{aligned}
Now, we evaluate
\begin{aligned} f'(x)=-s^2(x-\mu )\exp \left\{ -\frac{s^2(x-\mu )^2}{2}\right\} . \end{aligned}
We can find $$\max _{x\in \mathfrak {R}} f'(x)=-\min _{x\in \mathfrak {R}} f'(x)=s/\sqrt{e}$$, and the total variation of $$f'(x)$$ from $$x=-\infty$$ to $$x=\infty$$ is $$4s/\sqrt{e}$$, which is an upper bound on $$\int _{x_{-1}}^{x_{n+1}}|f''(t)|dt.$$
Thus, we have
\begin{aligned} I_d-I_c\le I_d'-I_c'\le \frac{h^2}{16} \left[ \frac{s}{\sqrt{e}} +\frac{s}{\sqrt{e}} +\frac{4s}{\sqrt{e}}\right] =\frac{3sh^2}{8\sqrt{e}}. \end{aligned}
Recalling $$h=2/\sqrt{n}$$ and noting $$I_c=\sqrt{2\pi }/s$$, we have
\begin{aligned} I_d\le \left( 1+\frac{3s^2}{2\sqrt{2\pi e}n}\right) I_c= \left( 1+\frac{\eta s^2}{n}\right) I_c, \end{aligned}
where we have defined $$\eta =\sqrt{9/(8\pi e)}$$. $$\square$$
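The resulting inequality can be spot-checked numerically. The sketch below assumes the grid $$x_k=h(k-n/2)$$, $$k=0,1,\ldots ,n$$, with $$h=2/\sqrt{n}$$ (consistent with the choice of a and b in the proof) and $$I_d=h\sum _{k=0}^n f(x_k)$$; the function name `lemma6_sides` and the sample parameters are hypothetical.

```python
import math

# eta as defined at the end of the proof
ETA = math.sqrt(9.0 / (8.0 * math.pi * math.e))

def lemma6_sides(s, mu, n):
    """Return (I_d, (1 + eta*s^2/n) * I_c) for f(x) = exp(-s^2 (x-mu)^2 / 2)."""
    h = 2.0 / math.sqrt(n)
    i_d = h * sum(
        math.exp(-s * s * (h * (k - n / 2.0) - mu) ** 2 / 2.0)
        for k in range(n + 1)
    )
    i_c = math.sqrt(2.0 * math.pi) / s
    return i_d, (1.0 + ETA * s * s / n) * i_c

if __name__ == "__main__":
    for s in (0.5, 1.0, 2.0, 5.0):
        for n in (4, 25, 100):
            i_d, bound = lemma6_sides(s, 0.3, n)
            print(s, n, i_d, bound, i_d <= bound)
```

A check of this kind does not replace the proof, but it is a quick way to catch a wrong constant or a misplaced factor of h.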

### 4.2 Proof of Lemma 7

We prove Lemma 7. Recall the definition of $$I_d$$ in this Lemma,
\begin{aligned} I_d=\int _{-\infty }^{\infty }h\sum _{x_1\in {{\mathcal {X}}}} \exp \left\{ -\frac{\mathbf{x}^TA\mathbf{x}}{2}\right\} {\text {d}}x_2. \end{aligned}
We evaluate the summation about $$x_1$$ in the above by using Lemma 6. We can see
\begin{aligned} \mathbf{x}^TA\mathbf{x}&=A_{11} x_{1}^2 + (A_{12}+A_{21})x_{1}x_{2}+A_{22} x_{2}^2 \\&=A_{11} (x_{1}+a x_2)^2 + b x_{2}^2, \end{aligned}
where a and b are certain constants depending on A. Thus, we have
\begin{aligned} I_d&=\int _{-\infty }^{\infty }h\sum _{x_1\in {{\mathcal {X}}}} \exp \left\{ -\frac{A_{11} (x_{1}+a x_2)^2+b x_{2}^2}{2}\right\} {\text {d}}x_2 \\&\le \left( 1+\frac{\eta A_{11}}{n}\right) \int _{-\infty }^{\infty }\int _{-\infty }^{\infty } \exp \left\{ -\frac{\mathbf{x}^TA\mathbf{x}}{2}\right\} {\text {d}}x_1 {\text {d}}x_2\\&=\left( 1+\frac{\eta A_{11}}{n}\right) I_c, \end{aligned}
which proves the lemma.
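Lemma 7 can be spot-checked in the same spirit. The sketch below assumes that A is a symmetric positive definite $$2\times 2$$ matrix, that $${{\mathcal {X}}}=\{h(k-n/2)| k=0,1,\ldots ,n\}$$ with $$h=2/\sqrt{n}$$, and that $$I_c=2\pi /\sqrt{\det A}$$. For convenience it integrates over $$x_2$$ in closed form before summing over $$x_1$$, which is a different order of operations from the proof above and is used here only for the numerical check; all names are hypothetical.

```python
import math

ETA = math.sqrt(9.0 / (8.0 * math.pi * math.e))

def lemma7_sides(a11, a12, a22, n):
    """Return (I_d, (1 + eta*A11/n) * I_c) for A = [[a11, a12], [a12, a22]]."""
    det = a11 * a22 - a12 * a12
    if a11 <= 0.0 or det <= 0.0:
        raise ValueError("A must be positive definite")
    h = 2.0 / math.sqrt(n)
    # Integrating exp(-x^T A x / 2) over x2 leaves a Gaussian in x1 whose
    # precision is the Schur complement a11 - a12^2 / a22.
    schur = a11 - a12 * a12 / a22
    grid_sum = sum(
        math.exp(-schur * (h * (k - n / 2.0)) ** 2 / 2.0) for k in range(n + 1)
    )
    i_d = math.sqrt(2.0 * math.pi / a22) * h * grid_sum
    i_c = 2.0 * math.pi / math.sqrt(det)
    return i_d, (1.0 + ETA * a11 / n) * i_c

if __name__ == "__main__":
    for args in ((1.0, 0.3, 0.8, 50), (2.0, -0.9, 1.0, 9)):
        i_d, bound = lemma7_sides(*args)
        print(args, i_d, bound, i_d <= bound)
```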

### 4.3 Proof of Lemma 8

We prove Lemma 8. Recall the definition of $$I_d$$ in this Lemma,
\begin{aligned} I_d=\int _{-\infty }^{\infty } h_1h_2\sum _{x_1\in \mathcal{X}_1}\sum _{x_2\in {{\mathcal {X}}}_2} \exp \left\{ -\frac{\mathbf{x}^TA\mathbf{x}}{2}\right\} {\text {d}}x_3. \end{aligned}
We evaluate the summation over $$x_1$$ and $$x_2$$ in the above by using Lemma 6. We can see
\begin{aligned} \mathbf{x}^TA\mathbf{x}=A_{11} (x_{1}+a x_2+b x_3)^2 + c(x_2,x_3), \end{aligned}
where a and b are certain constants depending on A, and c is a quadratic function of $$x_2$$ and $$x_3$$. Thus, we have
\begin{aligned}&h_1\sum _{x_1\in {{\mathcal {X}}}_1} \exp \left\{ -\frac{A_{11} (x_{1}+a x_2+b x_3)^2 + c(x_2,x_3)}{2}\right\} \\&\quad \le \left( 1+\frac{\eta A_{11}}{n}\right) \int _{-\infty }^{\infty } \exp \left\{ -\frac{\mathbf{x}^TA\mathbf{x}}{2}\right\} {\text {d}}x_1. \end{aligned}
Using the above inequality, we can bound $$I_d$$ by
\begin{aligned}&\left( 1+\frac{\eta A_{11}}{n}\right) \int _{-\infty }^{\infty }\int _{-\infty }^{\infty } h_2\sum _{x_2\in {{\mathcal {X}}}_2} \exp \left\{ -\frac{\mathbf{x}^TA\mathbf{x}}{2}\right\} {\text {d}}x_1 {\text {d}}x_3. \end{aligned}
(26)
In the same way, we can find
\begin{aligned} \mathbf{x}^TA\mathbf{x}=A_{22} (x_{2}+a' x_1+b' x_3)^2 + c'(x_1,x_3), \end{aligned}
where $$a'$$ and $$b'$$ are certain constants depending on A, and $$c'$$ is a quadratic function of $$x_1$$ and $$x_3$$. Thus, we have
\begin{aligned}&h_2\sum _{x_2\in {{\mathcal {X}}}_2} \exp \left\{ -\frac{A_{22} (x_{2}+a' x_1+b' x_3)^2 + c'(x_1,x_3)}{2}\right\} \nonumber \\&\quad \le \left( 1+\frac{\eta A_{22}}{n'}\right) \int _{-\infty }^{\infty } \exp \left\{ -\frac{\mathbf{x}^TA\mathbf{x}}{2}\right\} {\text {d}}x_2. \end{aligned}
(27)
According to (26) and (27), $$I_d$$ is bounded by
\begin{aligned} \left( 1+\frac{\eta A_{11}}{n}\right) \left( 1+\frac{\eta A_{22}}{n'}\right) I_c, \end{aligned}
which proves the lemma.

## Notes

### Acknowledgements

The authors thank Professor Andrew R. Barron for his valuable comments. This research was partially supported by JSPS KAKENHI Grant numbers JP16K12496 and JP18H03291.

## References

1. Arikan, E. (2009). Channel polarization. IEEE Transactions Information Theory, 55(7), 3051–3073.
2. Barbier, J., & Krzakala, F. (2014). Replica analysis and approximate message passing decoder for superposition codes. In Proc. 2014 IEEE int. symp. inf. theory, Honolulu, HI, USA, June 29–July 4, pp. 1494–1498.
3. Barbier, J., & Krzakala, F. (2017). Approximate message-passing decoder and capacity-achieving sparse superposition codes. IEEE Transactions Information Theory, 63(8), 4894–4927.
4. Barron, A. R., & Cho, S. (2012). High-rate sparse superposition codes with iteratively optimal estimates. In Proc. 2012 IEEE int. symp. inf. theory, Boston, MA, USA, July 1–6, pp. 120–124.
5. Barron, A. R., & Joseph, A. (2010a). Least squares superposition coding of moderate dictionary size, reliable at rates up to channel capacity. In Proc. 2010 IEEE int. symp. inf. theory, Austin, Texas, USA, June 13–18, pp. 275–279.
6. Barron, A. R., & Joseph, A. (2010b). Towards fast reliable communication at rates near capacity with Gaussian noise. In Proc. 2010 IEEE int. symp. inf. theory, Austin, Texas, USA, June 13–18, pp. 315–319.
7. Berrou, C., Glavieux, A., & Thitimajshima, P. (1993). Near Shannon limit error-correcting coding: turbo codes. In Proc. Int. Conf. Commun, Geneva, Switzerland, May, pp. 1064–1070.
8. Bourbaki, N. (1986). Functions of a real variable, (Japanese translation). Tokyo: TokyoTosho.
9. Cho, S., & Barron, A. R. (2013). Approximate iterative Bayes optimal estimates for high-rate sparse superposition codes. In The sixth workshop on information-theoretic methods in science and engineering.
10. Cover, T. M., & Thomas, J. A. (2006). Elements of information theory. New York: Wiley-Interscience.
11. Joseph, A., & Barron, A. R. (2012). Least squares superposition codes of moderate dictionary size are reliable at rates up to capacity. IEEE Transactions Information Theory, 58(5), 2541–2557.
12. Joseph, A., & Barron, A. R. (2014). Fast sparse superposition codes have near exponential error probability for $$R < \mathcal {C}$$. IEEE Transactions Information Theory, 60(2), 919–942.
13. Kudekar, S., Richardson, T. J., & Urbanke, R. (2011). Threshold saturation via spatial coupling: why convolutional LDPC ensembles perform so well over the BEC. IEEE Transactions Information Theory, 57(2), 803–834.
14. Rush, C., Greig, A., & Venkataramanan, R. (2017). Capacity achieving sparse regression codes via approximate message passing decoding. IEEE Transactions Information Theory, 63(3), 1476–1500.
15. Rush, C., & Venkataramanan, R. (2018). Finite sample analysis of approximate message passing. IEEE Transactions Information Theory, 64(11), 7264–7286.
16. Reed, I. S., & Solomon, G. (1960). Polynomial codes over certain finite fields. Journal of SIAM, 8, 300–304.
17. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.
18. Takeishi, Y., Kawakita, M., & Takeuchi, J. (2014). Least squares superposition codes with Bernoulli dictionary are still reliable at rates up to capacity. IEEE Transactions Information Theory, 60(5), 2737–2750.
19. Venkataramanan, R., Tatikonda, S., & Barron, A. R. (2019). Sparse regression codes. Foundations and Trends in Communications and Information Theory, 15(1–2), 85–283.
20. Osada, N. (2008). A story: numerical analysis part 3; the Euler-MacLaurin formula. Rikei Heno Suugaku, July 2008. (in Japanese) http://www.lab.twcu.ac.jp/~osada/rikei/rikei2008-7.pdf. Accessed 8 June 2019