# An improved analysis of least squares superposition codes with bernoulli dictionary


## Abstract

For the additive white Gaussian noise channel with average power constraint, sparse superposition codes (or sparse regression codes), proposed by Barron and Joseph in 2010, achieve the capacity. While the codewords of the original sparse superposition codes are made with a dictionary matrix drawn from a Gaussian distribution, we consider the case in which it is drawn from a Bernoulli distribution. We show an improved upper bound on the block error probability under least squares decoding, which is considerably simpler and tighter than our previous bound from 2014.

## Keywords

Channel coding theorem · Euler–Maclaurin formula · Exponential error bounds · Gaussian channel · Sparse superposition codes

## 1 Introduction

We analyze the error probability of sparse superposition codes (Barron and Joseph 2010a, b; Joseph and Barron 2012, 2014; Venkataramanan et al. 2019) with Bernoulli dictionary and least squares decoding, which ignores computational complexity. In this paper, we improve the upper bound of the error probability shown in Takeishi et al. (2014). The obtained bound is tighter and is in a simpler form than the previous result.

In the field of information theory, after the invention of Turbo codes, which achieve communication rates near the channel capacity *C* with practical decoders, many efforts have been made to find error correction codes achieving *C* with efficient decoding algorithms. In 2009, Polar codes were invented by Arikan (2009); they were the first codes shown to achieve *C* with efficient decoders. Nowadays, at least three families of error correction codes are known to achieve *C* with efficient decoders: Polar codes (2009), spatially coupled low density parity check (LDPC) codes (Kudekar et al. 2011), and sparse superposition codes (2010). Here, “achieving channel capacity *C*” means that the decoding error probability converges to zero at any \(R < C\) as the code length *n* goes to infinity, where *R* is the communication rate of the code. Among these codes, the sparse superposition codes proposed by Barron and Joseph are designed for the Additive White Gaussian Noise (AWGN) channel and are shown to achieve its capacity (Joseph and Barron 2014).
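As a concrete reference point for the quantities *C* and *R* above, the capacity of the AWGN channel with signal-to-noise ratio \(P/\sigma ^2\) is \(C=(1/2)\log (1+P/\sigma ^2)\) (Shannon 1948; Cover and Thomas 2006). A minimal sketch (the function names are ours, not from the paper):

```python
import math

def awgn_capacity_bits(snr: float) -> float:
    """Shannon capacity C = (1/2) log2(1 + snr) of the AWGN channel, in bits per transmission."""
    return 0.5 * math.log2(1.0 + snr)

def awgn_capacity_nats(snr: float) -> float:
    """The same capacity expressed in nats per transmission."""
    return 0.5 * math.log(1.0 + snr)
```

For instance, at a signal-to-noise ratio of 3 the capacity is exactly 1 bit per transmission.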

In sparse superposition codes, we prepare a matrix *X* (called a dictionary), where the number of its rows is smaller than the number of its columns. Then, we map a message to be sent to a sparse vector \(\beta \) and create a codeword *c* for the message as \(c=X\beta \). The decoder estimates the message from the received word *y* and *X*, where the dictionary *X* is shared by the sender and the receiver. This decoding process is analogous to reconstruction in compressed sensing and sparse learning, and our analysis is related to these subjects.

In the case of the *Gaussian dictionary*, the distribution of each element of the codewords is a Gaussian distribution with mean zero, which is the optimal input distribution for the AWGN channel (see Cover and Thomas (2006) and Shannon (1948), for example). This is one of the key elements of sparse superposition codes, and the error probability with least squares decoding is shown by Joseph and Barron (2012) to be at most

\(\exp \{-nd(C-R)^2\},\)  (1)

where *d* is a certain positive constant, *n* is the code length, *R* is the transmission rate, and *C* is the channel capacity. The bound (1) is exponentially small in *n* when *R* satisfies \(R<C\). A corresponding bound, also exponentially small in *n* when *R* satisfies \(R<C\), holds with the Bernoulli dictionary (Takeishi et al. 2014), which allows rates close to *C* for practical code lengths. It is still an open problem to show that sparse superposition codes with Bernoulli dictionary achieve the capacity with efficient algorithms.

We review the sparse superposition codes in Sect. 2. In Sect. 3, we show the new upper bound of the error probability with Bernoulli dictionary. Section 4 provides proofs of some lemmas used in Sect. 3.

## 2 Sparse superposition codes

In this section, we review the sparse superposition codes and show the performance of Gaussian dictionary with the least squares estimator.

In the following, ‘\(\log \)’ denotes the logarithm of base 2 and ‘\(\ln \)’ denotes the natural logarithm. The Gaussian distribution with mean \(\mu \) and variance \(\sigma ^2\) is denoted by \(N(\mu , \sigma ^2)\).

### 2.1 Problem setting

Assume that a message to be sent is a *K*-bit string \(u\in \{0,1\}^K\) and that it is generated from the uniform distribution on \(\{0,1 \}^K\). We use a real-valued vector \(c \in \mathfrak {R}^n\) as a codeword to send a message. The codeword *c* is polluted by Gaussian noise in the channel. Namely, letting \(Y \in \mathfrak {R}^n\) be the output of the channel, we have \(Y=c+\epsilon \), where \(\epsilon \) is a noise vector of length *n* and each coordinate is independently subject to \(N(0,\sigma ^2)\). The power of *c* is defined as \((1/n)\sum _{i=1}^n c_i^2\) and it is constrained to be not more than *P* on average. We also define the signal-to-noise ratio as \(v=P/\sigma ^2\).

The receiver estimates *u* based on *Y* and *X*. Let \(\hat{u}\) denote the estimate of *u*. We call the event \(\hat{u}\ne u\) a “block error”. Further, we define the transmission rate *R* as *K*/*n*. It is desired that we transmit messages at large *R* with sufficiently small block error probability. It is well known that at all rates less than the capacity *C*, the block error probability can be made exponentially small in *n*.
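The channel model of this subsection can be simulated directly. Below is a minimal sketch (assuming NumPy; the function names are ours) of the map \(Y=c+\epsilon \) and of the empirical power \((1/n)\sum _{i=1}^n c_i^2\):

```python
import numpy as np

def transmit(c: np.ndarray, sigma2: float, rng) -> np.ndarray:
    """AWGN channel: Y = c + eps, with eps_i i.i.d. N(0, sigma2)."""
    return c + rng.normal(0.0, np.sqrt(sigma2), size=c.shape)

def power(c: np.ndarray) -> float:
    """Empirical power (1/n) * sum_i c_i^2; the constraint is that this is at most P on average."""
    return float(np.mean(c ** 2))
```

The empirical mean and variance of `Y - c` then match \(N(0,\sigma ^2)\) up to sampling error.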

### 2.2 Coding

First, we transform the message *u* into a coefficient vector \(\beta \in \{0,1\}^N\) by a one-to-one function. The vector \(\beta \) is split into *L* sections of size *M*; each section has one nonzero element, and the other elements are all zero. Then the codeword *c* is formed as \(c=X\beta \), where *X* is an \(n\times N\) matrix (dictionary) and \(X_j\) is the *j*th column vector of *X*. Thus, *c* is a superposition of *L* column vectors of *X*, with exactly one column selected from each section. We illustrate an example of the coding method in Fig. 1.

In this paper, we set all nonzero elements to 1. In contrast, for efficient decoding algorithms such as the adaptive successive decoder proposed in Joseph and Barron (2014), the nonzero elements decay exponentially across sections. However, we do not treat that case here.

In the original sparse superposition codes, each entry of *X* is independently drawn from *N*(0, *P*/*L*). This distribution is optimal for the random coding argument used to prove the channel coding theorem for the AWGN channel with average power constraint *P* (Shannon 1948). In this paper, in contrast, we analyze the case in which each entry of the dictionary is independently drawn as equiprobable \(\pm \sqrt{P/L}\).

The parameters *L*, *M*, and *N* are selected so as to satisfy the following. The number of messages is \(2^K\) according to our problem setting about *u*, and the number of codewords is \(M^L\) according to the way \(\beta \) is formed. Thus, we arrange \(2^K=M^L\), equivalently, \(K = L\log M\). Following the original paper by Joseph and Barron (2012), the value of *M* is set to \(L^a\), where the parameter *a* is referred to as the section size rate. Then we have \(K=aL\log L\) and \(n=(aL\log L)/R\).

### 2.3 Decoding

We analyze the least squares estimator, which minimizes the error probability when computational complexity is ignored. From the received word *Y* and knowledge of the dictionary *X*, we estimate the original message *u*, or equivalently, the corresponding \(\beta \).

Let \(\beta ^*\) denote the true \(\beta \); then the event \(\hat{\beta }\ne \beta ^* \) corresponds to a block error. Let #mistakes denote the number of sections in which the position of the nonzero element in \(\hat{\beta }\) differs from that in \(\beta ^*\). Define the error event \( \mathcal{E}_{\alpha _0}=\{\#{\text {mistakes}}\ge \alpha _0 L\}, \) i.e., that the decoder errs in at least an \(\alpha _0\) fraction of the sections. The fraction \(\alpha =\#{\text {mistakes}}/L\) is called the section error rate.
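For illustration, exhaustive least squares decoding and the section error rate can be sketched as follows (a toy implementation, feasible only for tiny *L* and *M*; the function names are ours):

```python
import itertools
import numpy as np

def ls_decode(Y: np.ndarray, X: np.ndarray, L: int, M: int) -> tuple:
    """Exhaustive least squares decoding over all M^L candidate betas:
    return the section choices minimizing ||Y - X beta||^2."""
    best, best_val = None, np.inf
    for sections in itertools.product(range(M), repeat=L):
        c = np.sum([X[:, s * M + j] for s, j in enumerate(sections)], axis=0)
        val = float(np.sum((Y - c) ** 2))
        if val < best_val:
            best, best_val = sections, val
    return best

def section_error_rate(est, true) -> float:
    """The fraction alpha = #mistakes / L of sections decoded incorrectly."""
    return sum(e != t for e, t in zip(est, true)) / len(true)
```

Its cost grows as \(M^L\); the efficient decoders cited in the introduction avoid this exhaustive search.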

### 2.4 Performance

For the Gaussian dictionary, the error probability is exponentially small in *n*. The following theorem (Proposition 1 in Joseph and Barron (2012)) provides an upper bound on the probability of the event \({{\mathcal {E}}}_{\alpha _0}\).

### Theorem 1

*Suppose that each entry of X is independently drawn from* \(N(0,P/L)\). *Assume* \(M=L^a\), *where* \(a\ge a_{v,L}\), *and the rate R is less than the capacity C. Then* \(\Pr [{{\mathcal {E}}}_{\alpha _0}]\le \exp \{-nE(\alpha _0,R)\}\) *with* \(E(\alpha _0,R)\ge h(\alpha _0,C-R)-(\ln (2L))/n\), *where h is evaluated at* \(\alpha =\alpha _0\) *and* \(\varDelta =C-R\).

### Remark

In this theorem, the unit of *R* and *C* is nat/transmission. Then, since \(n=(aL\ln L)/R\), *L* is bounded by *nR* / *a* when \(\ln L\ge 1\).

As noted in Joseph and Barron (2012), in order to bound the block error probability, we can use composition with an outer Reed-Solomon (RS) code (Reed and Solomon 1960) of rate near one. If \(R_{\text {outer}}=1-\delta \) is the rate of an RS code, with \(0<\delta <1\), then section error rates less than \(\delta /2\) can be corrected. Thus, through concatenation with an outer RS code, we get a code with rate \((1-2\alpha _0)R\) and block error probability less than or equal to \(\Pr [{{\mathcal {E}}}_{\alpha _0}]\). Arrange \(R=C-\varDelta \) and \(\alpha _0=\varDelta \), with \(\varDelta >0\). Then the overall rate \((1-2\varDelta )(C-\varDelta )\) continues to have a drop from capacity of order \(\varDelta \). The composite code has block error probability of order \(\exp \{-nd\varDelta ^2\}\), where *d* is a positive constant.
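The rate bookkeeping in this remark is easy to check numerically. A small sketch (the function name is ours), assuming the arrangement \(R=C-\varDelta \), \(\alpha _0=\varDelta \), and an outer RS code of rate \(1-2\varDelta \):

```python
def composite_rate(C: float, delta: float) -> float:
    """Overall rate (1 - 2*delta) * (C - delta) after concatenating the inner code
    of rate R = C - delta with an outer RS code of rate 1 - 2*delta."""
    return (1.0 - 2.0 * delta) * (C - delta)
```

For example, with \(C=1\) and \(\varDelta =0.01\) the drop from capacity is \(1-0.98\times 0.99=0.0298\), i.e. of order \(\varDelta \).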

Here \(C_{\alpha }\) denotes \((1/2)\ln (1+\alpha v)\), which equals *C* when \(\alpha =1\). Then \(C_{\alpha }-\alpha C\) is a nonnegative function which equals 0 when \(\alpha \) is 0 or 1 and is strictly positive in between. Thus the quantity \(C_{\alpha }-\alpha R\) is larger than \(\alpha (C-R)\), which is positive when \(R<C\).
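The nonnegativity of \(C_{\alpha }-\alpha C\) follows from the concavity of \(\alpha \mapsto \ln (1+\alpha v)\): the curve lies above the chord \(\alpha C\) joining its values at \(\alpha =0\) and \(\alpha =1\). A quick numerical check, assuming \(C_{\alpha }=(1/2)\ln (1+\alpha v)\) as in Joseph and Barron (2012):

```python
import math

def C_alpha(alpha: float, v: float) -> float:
    """C_alpha = (1/2) ln(1 + alpha * v) in nats; equals the capacity C at alpha = 1."""
    return 0.5 * math.log(1.0 + alpha * v)

def gap(alpha: float, v: float) -> float:
    """The quantity C_alpha - alpha * C, zero at alpha in {0, 1} and positive in between."""
    return C_alpha(alpha, v) - alpha * C_alpha(1.0, v)
```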

The following lemma (Lemma 4 in Joseph and Barron (2012)) provides an upper bound on \(\Pr [E_l]\).

### Lemma 2

*Suppose that each entry of X is independently drawn from* \(N(0,P/L)\). *Let a positive integer* \(l\le L\) *be given and let* \(\alpha =l/L\). *Then* \(\Pr [E_l]\) *is bounded by the minimum for* \(t_{\alpha }\) *in the interval* \([0,C_{\alpha }-\alpha R]\) *of* \({\text {err}}_{\text {Gauss}}(\alpha )\), *with* \(\varDelta _{\alpha }=C_{\alpha }-\alpha R-t_{\alpha }\), \(1-\rho _1^2=\alpha (1-\alpha )v/(1+\alpha v)\), *and* \(1-\rho _2^2=\alpha ^2 v/(1+\alpha ^2 v)\).

It suffices that the section size rate *a* is larger than \(a_{v,L}\), which converges to a constant as *L* goes to infinity (see Lemma 5 in Joseph and Barron (2012)).

## 3 Main results

In this section, we analyze the performance of sparse superposition codes with Bernoulli dictionary. The result stated here improves the result in Takeishi et al. (2014), which used the same code. We improve the upper bound of the error probability by refining some lemmas in Takeishi et al. (2014) into Lemmas 6, 7, and 8 of this paper.

First, we state the main theorem in this paper.

### Theorem 3

*Suppose that each entry of X is independently equiprobable* \(\pm \sqrt{P/L}\). *Assume* \(M=L^a\), *where* \(a\ge a_{v,L}\), *and the rate R is less than the capacity C. Then* \(\Pr [{{\mathcal {E}}}_{\alpha _0}]\le \exp \{-nE(\alpha _0,R)\}\) *with* \(E(\alpha _0,R)\ge h(\alpha _0,C-R)-(\ln (2L))/n-\iota (L)\), *where* \(\iota (L)=\max \{\iota _1,\iota _2\}\) *and* \(\iota _1,\iota _2\) *are defined in* Lemma 4.

### Remark

This theorem is the counterpart of Theorem 1 in the Bernoulli dictionary case; the error exponent is worse than that in Theorem 1 by \(\iota (L)\), which turns out to be \(O(1/\sqrt{L})\) by Lemma 4. The theorem has the same form as Theorem 5 in Takeishi et al. (2014); however, \(\iota (L)\) converges to zero more rapidly than in that previous result, where \(\iota (L)\) was \(O(\sqrt{\ln L}/L^{1/4})\), as detailed later.

To prove Theorem 3, we use the following lemma, which corresponds, in this case, to Lemma 2. The definitions of \(\iota _1\) and \(\iota _2\) used in Theorem 3 are given in this lemma.

### Lemma 4

*Suppose that each entry of X is independently equiprobable* \(\pm \sqrt{P/L}\). *Let* \(\alpha _0\) *be a certain real number in* (0, 1] *and* \(\alpha = l/L\). *Then, for every* \(L \ge 2\) *and for all l such that* \(\alpha _0 \le \alpha \le 1\), \(\Pr [E_l]\) *is bounded by the minimum for* \(t_{\alpha }\) *in the interval* \([0,C_{\alpha }-\alpha R]\) *of* \({\text {err}}_{\text {Ber}}(\alpha )\), *with* \(\varDelta _{\alpha }=C_{\alpha }-\alpha R-t_{\alpha }\), \(1-\rho _1^2=\alpha (1-\alpha )v/(1+\alpha v)\), *and* \(1-\rho _2^2=\alpha ^2 v/(1+\alpha ^2 v)\). *The variables* \(\iota _1\) *and* \(\iota _2\) *are defined by a series of equations involving the function* \(\phi \) *defined in* Lemma 5.

### Remark

The function \(\phi \) is *O*(1/*L*) by Lemma 5. Thus we have \(\iota _1=O(1/\sqrt{L})\) and \(\iota _2=O(1/L)\), so \(\iota =\iota (L)\) in Theorem 3 is \(O(1/\sqrt{L})\).

To prove this lemma, we evaluate the difference between a binomial distribution and the Gaussian distribution with identical mean and variance. We do this in two steps. The first step evaluates the ratio of the probability mass function of the binomial distribution to the probability density function of the Gaussian. The following lemma, given in Takeishi et al. (2014), serves this purpose, where \(N(x| \mu ,\sigma ^2)\) denotes the density function of the distribution \(N(\mu ,\sigma ^2)\).
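To give a feel for this first step, the ratio of the binomial(*n*, 1/2) mass function to the matched Gaussian density can be computed directly; near the mean it deviates from 1 by only \(O(1/n)\). A sketch with our own function names (Lemma 5 itself bounds the ratio uniformly):

```python
import math

def binom_pmf(n: int, k: int) -> float:
    """Probability mass function of Binomial(n, 1/2) at k."""
    return math.comb(n, k) * 0.5 ** n

def gauss_pdf(x: float, mu: float, var: float) -> float:
    """Gaussian density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def ratio(n: int, k: int) -> float:
    """Ratio of the binomial pmf to the Gaussian density with the same
    mean n/2 and variance n/4 (grid spacing 1)."""
    return binom_pmf(n, k) / gauss_pdf(k, n / 2.0, n / 4.0)
```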

### Lemma 5

*For every positive integer l*,

The second step evaluates the error in replacing a summation over a discrete random variable with an integral over the corresponding continuous random variable. A feasible way is to approximate the summation by the integral via a Riemann-sum (sectional measurement) argument. In the previous work, this error was evaluated by Lemmas 8, 9, and 10 in Takeishi et al. (2014). In this paper, we improve these lemmas. The following lemma is an improvement of Lemma 8 in Takeishi et al. (2014).
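To illustrate this second step, the sum over the grid of Lemma 6, \(x_k=h(k-n/2)\) with \(h=2/\sqrt{n}\), multiplied by the spacing *h*, can be compared with the integral of the Gaussian density over the real line (a numerical sketch with our own function name):

```python
import math

def riemann_minus_integral(n: int, mu: float, s: float) -> float:
    """h * sum_k N(x_k | mu, s) over the grid x_k = h*(k - n/2) with h = 2/sqrt(n),
    minus the integral of the density over the real line (which is 1)."""
    h = 2.0 / math.sqrt(n)
    def f(x: float) -> float:
        return math.exp(-(x - mu) ** 2 / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)
    return h * sum(f(h * (k - n / 2.0)) for k in range(n + 1)) - 1.0
```

The grid covers \([-\sqrt{n},\sqrt{n}]\), so for a density concentrated well inside this interval the discrepancy is tiny and shrinks as *n* grows.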

### Lemma 6

*For a positive integer n*, let \(h=2/\sqrt{n}\) and \(x_k=h(k-n/2)\) (\(k=0,1,\ldots ,n\)). For \(\mu \in \mathfrak {R}\) and \(s>0\), define

Further, by reconsidering the proofs and using Lemma 6, we also improve Lemmas 9 and 10 in Takeishi et al. (2014); the following two lemmas are the respective improvements.

### Lemma 7

*For a positive integer n*, define \(h=2/\sqrt{n}\) and \({{\mathcal {X}}}=\{h(k-n/2)\mid k=0,1,\ldots ,n\}\). Further, for a 2-dimensional real vector \(\mathbf{x}=(x_1,x_2)^T\) and a strictly positive definite \(2\times 2\) matrix *A*, define the following, where \(a_{ij}\) denotes the (*i*, *j*) element of matrix *A*.

### Lemma 8

*For positive integers n and* \(n'\), define \(\mathcal{X}_1=\{h_1(k-n/2)\mid k=0,1,\ldots ,n\}\) and \(\mathcal{X}_2=\{h_2(k-n'/2)\mid k=0,1,\ldots ,n'\}\), where \(h_1=2/\sqrt{n}\) and \(h_2=2/\sqrt{n'}\). Further, for a 3-dimensional real vector \(\mathbf{x}=(x_1,x_2,x_3)^T\) and a strictly positive definite \(3\times 3\) matrix *A*, define the following, where \(a_{ij}\) denotes the (*i*, *j*) element of matrix *A*.

### 3.1 Proof of Lemma 4

We prove Lemma 4 along the lines of the proof of Lemma 6 in Takeishi et al. (2014), which is based on Lemma 4 in Joseph and Barron (2012).

We evaluate the probability of the event \(E_l\). The random variables are the dictionary \(X=(X_1,X_2,\ldots ,X_N)\) and the noise \(\epsilon \).

For \(\beta \in {{\mathcal {B}}}\), let \(S(\beta )\) denote the set of indices *j* for which \(\beta _j\) is nonzero. Further, let \( {{\mathcal {A}}}=\{S(\beta )| \beta \in {{\mathcal {B}}}\} \) denote the set of allowed subsets of terms. Let \(\beta ^*\) denote the \(\beta \) which is sent, and let \(S^*=S(\beta ^*)\). Furthermore, for \(S\in {{\mathcal {A}}}\), let \(X_S=\sum _{j\in S}X_j\). For the occurrence of \(E_l\), there must be an \(S\in {{\mathcal {A}}}\) which differs from \(S^*\) in an amount *l* and which has \(\Vert Y-X_S\Vert ^2 \le \Vert Y-X_{S^*}\Vert ^2\). Let *S* denote a subset which differs from \(S^*\) in an amount *l*. Here, we define *T*(*S*) as \(T(S)=|Y-X_S|^2-|Y-X_{S^*}|^2\), where for a vector *x* of length *n*, \(|x|^2\) denotes \((1/n)\sum _{i=1}^n x_i^2\). Then \(T(S)\le 0\) is equivalent to \(\Vert Y-X_S\Vert ^2\le \Vert Y-X_{S^*}\Vert ^2\). The subsets *S* and \(S^*\) have an intersection \(S_1=S\cap S^*\) of size \(L-l\) and a difference \(S_2=S\setminus S^*\) (\(=S\setminus S_1\)) of size *l*. Note that \(X_{S^*}\) and *Y* are independent of \(X_{S_2}\).

Let \(\widetilde{E}_l\) denote the event that there exists an *S* which differs from \(S^*\) in an amount *l* and \(\widetilde{T}(S)\le {\tilde{t}}\). Similarly, for a negative \(t^*=-t_{\alpha }\), let \(E_l^*\) denote the corresponding event that \(T^*\le t^*\). Then we have

where \(p^{(c)}_{Y|X_{S_1}}\) denotes the conditional probability density function of *Y* given \(X_{S_1}\) in case \(X_{ij}\sim N(0,P/L)\), and \(p^h_{Y|X_S}\) denotes the conditional probability density function of *Y* given \(X_{S}\) under the hypothesis that \(X_S\) was sent. Hence we have

an expression determined by the joint distribution of *Y* and \(X_{S_1}\). Here, we define \(Y'=Y-X_{S_1}\) and define \(P_{Y'}\) as the probability density function of each coordinate of \(Y'\) and \(P_{Y'}^{(c)}\) as \(P_{Y'}\) in case \(X_{ij}\sim N(0,P/L)\). Then we have

a sum over the *l* terms in \(S_2\). Then, by applying Lemma 5, we obtain the bounds needed to complete the proof.

## 4 Proofs of lemmas

In this section, we prove the lemmas used in Sect. 3.

### 4.1 Proof of Lemma 6

### Theorem 9

*Let f(x) be a class* \(C^{2m+2}\) *function over* \([a,b] \subset \mathfrak {R}\). *Let* \(\delta = (b-a)/(n+2)\) *and* \(y_k = a+ (k+1)\delta \) (\(k=-1,0,1,\ldots ,n\)); *note that* \(y_{-1}=a\) *and* \(y_{n+1}=b\). *Then the Euler–Maclaurin formula holds on this grid, with the remainder term denoted by* \(R_{m+1}\).

For the proof, see Osada (2008), for example. We use this theorem with \(m=0\), which yields the tightest-order result (Lemma 7) for our goal. Further, for \(m=0\) we can easily obtain a generalization of the Euler-Maclaurin formula as the following lemma, with which we can optimize the constant factor of the upper bound in Lemma 7.
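For intuition, the standard equally spaced form of the Euler-Maclaurin formula at \(m=0\) says the trapezoidal sum differs from the integral by \((h^2/12)(f'(b)-f'(a))\) plus a remainder of higher order. A numerical sketch of this standard form, with our own function names (Theorem 9 itself uses a slightly shifted grid):

```python
import math

def trapezoid(f, a: float, b: float, n: int) -> float:
    """Composite trapezoidal rule with n subintervals of width h = (b - a)/n."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + k * h) for k in range(1, n))
    return h * s

def em_corrected(f, df, a: float, b: float, n: int) -> float:
    """Trapezoidal rule plus the first Euler-Maclaurin correction term
    -(h^2/12) * (f'(b) - f'(a)), i.e. the formula at m = 0."""
    h = (b - a) / n
    return trapezoid(f, a, b, n) - h ** 2 / 12.0 * (df(b) - df(a))
```

For \(f=\exp \) on [0, 1] with \(n=10\), the correction reduces the quadrature error from about \(10^{-3}\) to below \(10^{-5}\).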

### Lemma 10

*Let f(x) be a class* \(C^{2}\) *function over* \([a,b] \subset \mathfrak {R}\). *Let* \(\delta = (b-a)/(n+2)\) *and* \(y_k = a+ (k+1)\delta \) (\(k=-1,0,1,\ldots ,n\)); *note that* \(y_{-1}=a\) *and* \(y_{n+1}=b\). *Then, for all* \({\bar{b}}_2\),

### Proof

### Remark 2

In particular, with \({\bar{b}}_2=b_2\), (25) reduces to the Euler-Maclaurin formula with \(m=0\).

### Remark 3

This bound has a worse dependence on *s* than Lemma 6, due to the higher-order derivatives of *f*(*x*) defined below. Note that if we apply integration by parts to the residual term of the formula in Bourbaki (1986), it yields the same term as \(R_{m+1}\) in (24).

Now, we can prove Lemma 6.

### Proof of Lemma 6

### 4.2 Proof of Lemma 7

where *a* and *b* are certain constants depending on *A*. Thus, we have

### 4.3 Proof of Lemma 8

where *a* and *b* are certain constants depending on *A*, and *c* is a quadratic function of \(x_2\) and \(x_3\). Thus, we have

where \(a'\) and \(b'\) are certain constants depending on *A*, and \(c'\) is a quadratic function of \(x_1\) and \(x_3\). Thus, we have

## Notes

### Acknowledgements

The authors thank Professor Andrew R. Barron for his valuable comments. This research was partially supported by JSPS KAKENHI Grant numbers JP16K12496 and JP18H03291.

## References

- Arikan, E. (2009). Channel polarization. *IEEE Transactions on Information Theory*, *55*(7), 3051–3073.
- Barbier, J., & Krzakala, F. (2014). Replica analysis and approximate message passing decoder for superposition codes. In *Proc. 2014 IEEE Int. Symp. Inf. Theory*, Honolulu, HI, USA, June 29–July 4, pp. 1494–1498.
- Barbier, J., & Krzakala, F. (2017). Approximate message-passing decoder and capacity-achieving sparse superposition codes. *IEEE Transactions on Information Theory*, *63*(8), 4894–4927.
- Barron, A. R., & Cho, S. (2012). High-rate sparse superposition codes with iteratively optimal estimates. In *Proc. 2012 IEEE Int. Symp. Inf. Theory*, Boston, MA, USA, July 1–6, pp. 120–124.
- Barron, A. R., & Joseph, A. (2010a). Least squares superposition coding of moderate dictionary size, reliable at rates up to channel capacity. In *Proc. 2010 IEEE Int. Symp. Inf. Theory*, Austin, TX, USA, June 13–18, pp. 275–279.
- Barron, A. R., & Joseph, A. (2010b). Towards fast reliable communication at rates near capacity with Gaussian noise. In *Proc. 2010 IEEE Int. Symp. Inf. Theory*, Austin, TX, USA, June 13–18, pp. 315–319.
- Berrou, C., Glavieux, A., & Thitimajshima, P. (1993). Near Shannon limit error-correcting coding: Turbo codes. In *Proc. Int. Conf. Commun.*, Geneva, Switzerland, May, pp. 1064–1070.
- Bourbaki, N. (1986). *Functions of a real variable (Japanese translation)*. Tokyo: Tokyo Tosho.
- Cho, S., & Barron, A. R. (2013). Approximate iterative Bayes optimal estimates for high-rate sparse superposition codes. In *The Sixth Workshop on Information-Theoretic Methods in Science and Engineering*.
- Cover, T. M., & Thomas, J. A. (2006). *Elements of information theory*. New York: Wiley-Interscience.
- Joseph, A., & Barron, A. R. (2012). Least squares superposition codes of moderate dictionary size are reliable at rates up to capacity. *IEEE Transactions on Information Theory*, *58*(5), 2541–2557.
- Joseph, A., & Barron, A. R. (2014). Fast sparse superposition codes have near exponential error probability for \(R < \mathcal {C}\). *IEEE Transactions on Information Theory*, *60*(2), 919–942.
- Kudekar, S., Richardson, T. J., & Urbanke, R. (2011). Threshold saturation via spatial coupling: Why convolutional LDPC ensembles perform so well over the BEC. *IEEE Transactions on Information Theory*, *57*(2), 803–834.
- Osada, N. (2008). A story: Numerical analysis part 3; the Euler-Maclaurin formula. *Rikei Heno Suugaku*, July 2008 (in Japanese). http://www.lab.twcu.ac.jp/~osada/rikei/rikei2008-7.pdf. Accessed 8 June 2019.
- Reed, I. S., & Solomon, G. (1960). Polynomial codes over certain finite fields. *Journal of SIAM*, *8*, 300–304.
- Rush, C., Greig, A., & Venkataramanan, R. (2017). Capacity-achieving sparse regression codes via approximate message passing decoding. *IEEE Transactions on Information Theory*, *63*(3), 1476–1500.
- Rush, C., & Venkataramanan, R. (2018). Finite sample analysis of approximate message passing. *IEEE Transactions on Information Theory*, *64*(11), 7264–7286.
- Shannon, C. E. (1948). A mathematical theory of communication. *Bell System Technical Journal*, *27*, 379–423.
- Takeishi, Y., Kawakita, M., & Takeuchi, J. (2014). Least squares superposition codes with Bernoulli dictionary are still reliable at rates up to capacity. *IEEE Transactions on Information Theory*, *60*(5), 2737–2750.
- Venkataramanan, R., Tatikonda, S., & Barron, A. R. (2019). Sparse regression codes. *Foundations and Trends in Communications and Information Theory*, *15*(1–2), 85–283. https://doi.org/10.1561/0100000092.