1 Introduction

As Shannon showed, the scalar additive Gaussian noise channel subject to an average power constraint achieves capacity if the input distribution is Gaussian as well. This result carries over to complex circularly symmetric Gaussian vector channels, as Telatar showed in [1]. His general model applies in particular to multiple-input multiple-output (MIMO) transmission systems. However, due to the unbounded support of the normal distribution, this input is not realizable in practice. Thus, peak power constraints of different types have been imposed to avoid unbounded power requirements at the transmitter. An interesting implication is that the capacity-achieving input distribution then becomes discrete with finite support. Many challenging questions arise in this context. A good overview of previous research on this topic is given in [2]. The discreteness of the capacity-achieving distribution was shown in [3] for the real and in [4] for the complex Gaussian channel. By considering conditionally Gaussian vector channels whose inputs are constrained to some bounded set \({{\mathcal S}\subset{\mathbb{R}}^n}\), reference [5] generalizes a number of previous papers on the subject. Under certain conditions on \({{\mathcal S}}\) the capacity-achieving distribution is discrete, which includes the previously mentioned channels as special cases. Non-coherent additive white Gaussian noise channels are investigated in [6], where it is shown that the optimum distribution is discrete. The same conclusion was obtained for general fading channels in [7] and for Rician fading channels in [8]. Related topics are discussed in the following two references. In [9] a characterization of the optimum number of mass points is given. Reference [10] investigates the optimum constellation of M equiprobable complex signals for an additive Gaussian channel under average power constraints such that the error probability is minimized.

Summarizing the above, for practical purposes it is sufficient to investigate signaling constellations with a maximum number M of mass points. In this context, the following general problem of utmost interest arises. Given a closed and bounded subset \({{\mathcal S}\subset{\mathbb{R}}^n}\) of possible signaling points, determine a discrete input distribution consisting of at most M support points \({x_1,\ldots,x_M\in{\mathcal S}}\) and probabilities \(P(X=x_i)=p_i\), \(1\leq i\leq M\), which maximizes the mutual information between channel input and output, and is thus capacity-achieving in the set of discrete distributions over \({{\mathcal S}}\) with at most M support points. Note that the optimum solution may exhibit \(p_i=0\) for some \(i\in\{1,\ldots,M\}\), such that the number of effectively used points may be less than M. For the special case of conditionally Gaussian channels and an unrestricted number of signaling points, a partial answer is given in [2]. However, in general this seems to be a hard problem.

In this paper, we confine ourselves to a large, finite constellation set and ask how to select both a small subset of prescribed cardinality and the associated probability measure such that the mutual information is maximized, continuing some of the topics in [11]. As analytical results are extremely difficult to obtain, it is of interest to find algorithms that provide good input distributions and probabilities. The main purpose of using only a small number of signaling points is to simplify the receiver design and the corresponding decoding algorithms.

The classical Blahut-Arimoto algorithm computes in an elegant way the capacity of a discrete memoryless channel [12, 13]. In the context of our distribution selection, we extend the algorithm to the considered case of discrete input and continuous output. We prove convergence of the extended algorithm by means of an optimality criterion. As the standard Blahut-Arimoto algorithm is computationally quite complex, several enhancements have been proposed. The most important ones, the natural-gradient-based algorithm and the accelerated Blahut-Arimoto algorithm, are presented in [14]. They converge significantly faster to the capacity-achieving distribution and can be extended to our system model. References [15–17] suggest another interesting enhancement, the iterated Blahut-Arimoto algorithm. It was proposed for discrete memoryless channels, and we extend it to a discrete-input continuous-output channel.

The material in this correspondence is organized as follows. First, we introduce the precise system model and the problem description in Sect. 2. In Sect. 3 we solve the subset selection by utilizing semidefinite programming and two different relaxation techniques. The probability measure optimization is considered in Sect. 4. Two different approaches are analyzed. The first one is a successive subset probability measure selection, while the second one updates both chosen signaling points and probability measure simultaneously. Numerical results are presented in Sect. 5. The paper concludes with a short summary and outlook in Sect. 6.

2 System model and prerequisites

We consider a channel with discrete input and continuous output. Random variable \(\mathbf{X}\) denotes the channel input with finite input alphabet \({{\mathcal X}}\) of possible signaling points \(\mathbf{x}_1,\ldots,\mathbf{x}_M\in{\mathbb{R}}^n\). These signaling points are used by the transmitter according to a certain input distribution \(\mathbf{p}=(p_1,\ldots,p_M)\). The channel output \(\mathbf{Y}\) is randomly distorted by noise. The distribution of \(\mathbf{Y}\) given input \(\mathbf{X}=\mathbf{x}_i\) is assumed to have Lebesgue density

$$ f(\mathbf{y}\,|\,\mathbf{x}_i)=f_i(\mathbf{y}), \quad \mathbf{y}\in{\mathbb{R}}^n. $$

An example of such a channel is the additive channel \(\mathbf{Y}=\mathbf{X}+\mathbf{n}\) with \(f_i(\mathbf{y})=g(\mathbf{y}-\mathbf{x}_i)\), where g denotes the noise density. Further, the transition probability of the input \(\mathbf{x}_i\) given output \(\mathbf{y}\) is denoted by \(P(\mathbf{x}_i|\mathbf{y})\). The set of all \(P(\mathbf{x}_i|\mathbf{y})\) is denoted by \({{\mathcal{P}}}\).

As outlined in the introduction, a challenging task is to choose a fixed-size subset from the whole finite support, together with the corresponding distribution, such that the channel is utilized optimally in terms of maximizing the mutual information between the channel input and the channel output. More precisely, the task is to find a subset \({{\mathcal X}'\subset{\mathcal X}}\) of cardinality K for some fixed number K and the associated probability measure \(\mathbf{p}=(p_1,\ldots,p_M)\) with \(p_i=0\) for all i such that \(\mathbf{x}_i\in{\mathcal X}\setminus{\mathcal X}'\).

The following proposition, shown in [18], gives a necessary and sufficient condition for a distribution to be capacity-achieving. It will be used to prove the optimality of our converging algorithm in Sect. 4.

Proposition 1

Input distribution \(\mathbf{p}^*\) is capacity-achieving if and only if

$$ D\left(f_i \,\Big\|\, \sum_{j=1}^M p_j^*f_j\right)=\hbox{const}, $$

for all i such that \(p_i^*>0\), where \(D(f \| g)=\int f\log\frac{f}{g}\) denotes the Kullback-Leibler divergence between densities f and g. Furthermore, if \(H(f_i)=-\int f_i(\mathbf{y})\log f_i(\mathbf{y})\,{\rm d}\mathbf{y}\) is independent of i, then \(\mathbf{p}^*\) is capacity-achieving if and only if

$$ \int f_i(\mathbf{y})\log\left(\sum_{j=1}^M p_j^*f_j(\mathbf{y})\right) {\rm d}\mathbf{y}=\hbox{const} $$

for all i such that \(p_i^*>0\).
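
To make this criterion concrete, the following Python sketch (our illustration, not part of the original development) checks the condition of Proposition 1 numerically for a scalar Gaussian channel; the mass points, probabilities, noise level and integration grid are arbitrary example choices.

```python
import numpy as np

# Scalar Gaussian example: f_i(y) is the N(x_i, sigma^2) density
x = np.array([-1.0, 0.0, 1.0])        # candidate mass points
p = np.array([0.3, 0.4, 0.3])         # input distribution to test
sigma = 0.5

y = np.linspace(-6, 6, 4001)          # integration grid
dy = y[1] - y[0]
f = np.exp(-(y - x[:, None])**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
f0 = p @ f                            # mixture density sum_j p_j f_j

# Proposition 1: these divergences must coincide for all i with p_i > 0
kl = np.sum(f * np.log(f / f0), axis=1) * dy
print("D(f_i || f0):", kl)            # equal entries (up to grid error) <=> optimal
```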

3 Selecting the subset

The problem of choosing both a fixed-size subset and the corresponding distribution is rather complex. Therefore, we first consider a simplified approach by adding the assumption of a uniform distribution on the subset. The task then reduces to deciding which signaling points should be included in the subset and which should not, i.e., to finding a subset \({{\mathcal X}'\subset{\mathcal X}}\) of cardinality K such that \(\mathbf{X}\) with distribution

$$ p_i= \left\{\begin{array}{ll} 1/K & \hbox{if } \mathbf{x}_i \in {\mathcal X}'\\ 0 & \hbox{else}\end{array}\right. $$

is capacity-achieving in the set of uniform distributions on \({{\mathcal X}}\) with at most K support points.

In what follows, we consider two different criteria to find the best subset \({{\mathcal X}'}\) of given size K.

The capacity-achieving subset: The most intuitive approach is to find the subset that maximizes the mutual information between the channel input and the channel output,

$$ \max_{{\mathcal X}'\subset {\mathcal X}}\; \sum_{\mathbf{x}_i\in{\mathcal X}'}\int p_i f(\mathbf{y} \,|\, \mathbf{x}_i) \log \frac{f(\mathbf{y} \,|\, \mathbf{x}_i)} {\sum_{\mathbf{x}_j\in{\mathcal X}'}p_jf(\mathbf{y} \,|\, \mathbf{x}_j)}\, {\rm d}\mathbf{y}. $$
(1)

Using the assumption that \(p_i=\frac{1}{K}\), problem (1) is equivalent to solving

$$ \max_{{\mathcal X}'\subset {\mathcal X}}\; \sum_{\mathbf{x}_i\in{\mathcal X}'}\int f(\mathbf{y} \,|\, \mathbf{x}_i) \log \frac{f(\mathbf{y} \,|\, \mathbf{x}_i)} {\sum_{\mathbf{x}_j\in{\mathcal X}'}f(\mathbf{y} \,|\, \mathbf{x}_j)}\, {\rm d}\mathbf{y}. $$
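
To make this objective concrete, the following sketch (ours; the scalar Gaussian channel, the grid and all parameter values are example assumptions) evaluates the uniform-input mutual information of a candidate subset by grid integration; it can serve as the score when comparing subsets.

```python
import numpy as np

def uniform_mi(points, sigma=0.5, grid=np.linspace(-8, 8, 4001)):
    """Mutual information of a uniform input over `points` for a scalar
    additive Gaussian channel, evaluated by grid integration."""
    dy = grid[1] - grid[0]
    pts = np.asarray(points, dtype=float)
    f = np.exp(-(grid - pts[:, None])**2 / (2 * sigma**2)) \
        / (sigma * np.sqrt(2 * np.pi))
    f0 = f.mean(axis=0)                         # mixture with p_i = 1/K
    return np.mean(np.sum(f * np.log(f / f0), axis=1)) * dy

print(uniform_mi([-1.0, 0.0, 1.0]))             # e.g. a three-point subset
```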

The cutoff-rate-maximizing subset: The second criterion we use is the maximization of the cutoff rate, which is a lower bound on the channel capacity [19]. Figure 1 depicts the gap between the cutoff rate and the channel capacity for the setup given in Sect. 5, showing that the lower bound performs well. The cutoff-rate-maximizing subset for our channel model is obtained from

$$ \max_{{\mathcal X}'\subset {\mathcal X}} -\log \int \left[ \sum_{\mathbf{x}_i\in{\mathcal X}'} p_i \sqrt{f(\mathbf{y} \,|\, \mathbf{x}_i)} \right]^2 {\rm d}\mathbf{y}. $$

Using again that \(p_i=\frac{1}{K}\) and that the logarithm is monotone, the above problem transforms to

$$ \min_{{\mathcal X}'\subset {\mathcal X}} \int \left[ \sum_{\mathbf{x}_i\in{\mathcal X}'} \sqrt{f(\mathbf{y} \,|\, \mathbf{x}_i)} \right]^2 {\rm d}\mathbf{y}. $$
(2)

In what follows, we show that this problem can be approached by semidefinite programming. We use problem (2) to generate a selection of subsets and then decide on the one maximizing the mutual information. We will see that this yields a very good approximation to the solution of problem (1).

Fig. 1 Gap between the cutoff rate and the channel capacity

3.1 Subset selection using semidefinite programming

We assume the power consumption of signaling point i to be \(w_i\) and introduce a binary selection formulation to solve (2). As the highest cutoff rate does not always lead to the highest mutual information, we use the cutoff-rate approach to first find distributions with relatively high mutual information, and then choose the best one among them. The methodological approach is as follows. Problem (2) can be transformed into a constrained binary quadratic minimization problem; see also [20], where a similar approach was applied to increase the cutoff rate in discrete memoryless channels. We obtain the problem

$$ \begin{aligned} &\min_{\mathbf{b} \in \{ 0,1 \}^M} \;\mathbf{b}^T \mathbf{A} \mathbf{b} \\ &\hbox{s.t.} \quad {\bf 1}^T_M\mathbf{b} =K, \\ &\qquad\; \mathbf{w}^T\mathbf{b}+b_{M+1}=W, \end{aligned} $$

where the ij-th entry of \(\mathbf{A}\) is \(\int{\sqrt{f_i(\mathbf{y})f_j(\mathbf{y})}\, {\rm d}\mathbf{y}}\). Vector \(\mathbf{b}\) is a binary vector in which a “1” indicates that the corresponding symbol is included in the subset and a “0” that it is excluded. Further, the slack variable \(b_{M+1}\geq 0\) transforms the power constraint \(\mathbf{w}^T\mathbf{b}\leq W\) with \(\mathbf{w}=(w_1,\ldots,w_M)^T\) into \(\mathbf{w}^T\mathbf{b}+b_{M+1}=W\). For the derivation of the new objective, assume that \({{\mathcal X}'=\{\mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_K}\}}\). Then, it holds that

$$ \begin{aligned} \int \left[ \sum_{\mathbf{x}_i\in{\mathcal{X}}'} \sqrt{f(\mathbf{y}\,|\,\mathbf{x}_i)} \right]^2 {\rm d}\mathbf{y} &=\int \left[\sum_{\mathbf{x}_i\in{\mathcal X}'} \sqrt{f_i(\mathbf{y})} \right]^2 {\rm d}\mathbf{y} \\ &=\int \left[ \sqrt{f_{i_1}(\mathbf{y})} +\cdots + \sqrt{f_{i_K}(\mathbf{y})} \right]^2 {\rm d}\mathbf{y} =\int \sum_{i,j:\, \mathbf{x}_i, \mathbf{x}_j \in {\mathcal X}'} \sqrt{f_{i}(\mathbf{y})} \sqrt{f_{j}(\mathbf{y})}\, {\rm d}\mathbf{y} \\ &=\sum_{i,j:\, \mathbf{x}_i, \mathbf{x}_j \in {\mathcal X}'} \int \sqrt{f_{i}(\mathbf{y})} \sqrt{f_{j}(\mathbf{y})}\, {\rm d}\mathbf{y} =\sum_{i,j} b_i b_j \int \sqrt{f_{i}(\mathbf{y})} \sqrt{f_{j}(\mathbf{y})}\, {\rm d}\mathbf{y} \\ &=\mathbf{b}^T \mathbf{A} \mathbf{b}, \end{aligned} $$

thus, we obtain the new objective.
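
For additive Gaussian noise with common covariance \(\Sigma\), the entries of \(\mathbf{A}\) are Bhattacharyya coefficients with the closed form \(\int\sqrt{f_i(\mathbf{y})f_j(\mathbf{y})}\,{\rm d}\mathbf{y}=\exp\left(-\frac{1}{8}(\mathbf{x}_i-\mathbf{x}_j)^T\Sigma^{-1}(\mathbf{x}_i-\mathbf{x}_j)\right)\). The paper does not spell this out, so the following sketch is our illustration of how \(\mathbf{A}\) can be assembled in the Gaussian case; the constellation and covariance mirror the setup of Sect. 5 with an example ρ.

```python
import numpy as np

def bhattacharyya_matrix(X, Sigma):
    """A[i, j]: integral of sqrt(f_i f_j) for additive Gaussian noise with
    covariance Sigma, via exp(-(x_i - x_j)^T Sigma^{-1} (x_i - x_j) / 8)."""
    Sinv = np.linalg.inv(Sigma)
    diff = X[:, None, :] - X[None, :, :]        # all pairwise differences
    quad = np.einsum('ijk,kl,ijl->ij', diff, Sinv, diff)
    return np.exp(-quad / 8.0)

# 64-QAM points on [-3, 3]^2 with correlated noise
g = np.linspace(-3, 3, 8)
X = np.array([(a, b) for a in g for b in g])
A = bhattacharyya_matrix(X, np.array([[1.0, 0.3], [0.3, 1.0]]))
```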

To make the optimization problem homogeneous and thus suitable for a semidefinite formulation, we introduce another vector \(\mathbf{s}=(s_1,\ldots,s_n)^T\) with n = M + 2 and

$$ \begin{aligned} \mathbf{b} &= s_n (s_1,\ldots,s_M)^T, \\ b_{M+1} &=s_n s_{M+1}, \end{aligned} $$

where \(s_n\in\{-1,1\}\). The values of the \(s_i\) are tied to \(s_n\): if \(s_n=1\), then \(s_i\in\{0,1\}\); if \(s_n=-1\), then \(s_i\in\{-1,0\}\). It follows that the above problem is equivalent to

$$ \begin{aligned} \min_{\mathbf{s}} &\quad \mathbf{s}^T\mathbf{B}\mathbf{s}\\ \hbox{s.t.}&\quad s_is_n=s_i^2\quad\forall i, \\ &\quad s_n^2=1, \\ &\quad s_n\,{\bf 1}_M^T\,[s_1,\ldots,s_M]^T=K, \\ &\quad s_n\,\mathbf{w}^T\,[s_1,\ldots,s_M]^T+s_n s_{M+1}=W, \\ \end{aligned} $$
(3)

with

$$ \mathbf{B}=\left[ \begin{array}{cc} \mathbf{A} & {\bf 0}_{M \times 2} \\ {\bf 0}_{2 \times M}& {\bf 0}_{2 \times 2}\\ \end{array}\right]. $$

Using the substitution \(\mathbf{S}=\mathbf{s}\mathbf{s}^T\), the above optimization problem can be written as

$$ \begin{aligned} \hat{\mathbf{S}}=\hbox{argmin}_{\mathbf{S}\succeq{\bf 0}} &\quad\hbox{trace}(\mathbf{B}\mathbf{S})\\ \hbox{s.t.} &\quad S_{ii}=S_{in}\;\forall i, \quad S_{nn}=1, \quad\sum_{i=1}^M S_{ni}=K,\\ &\quad \sum_{i=1}^M w_i S_{ni} + S_{n,M+1}=W, \quad\hbox{rank}(\mathbf{S})=1.\\ \end{aligned} $$
(4)

The semidefinite programming (SDP) relaxation of (4), obtained by dropping the rank constraint, can be solved efficiently in polynomial time [21]. If the resulting matrix \(\hat{\mathbf{S}}\) has rank one, the relaxation is tight. Otherwise, special techniques are required to convert the SDP relaxation solution back to the solution of a binary quadratic problem, see, e.g., [22, 23]. Using any of these relaxation techniques, we obtain a set of estimators \(\{\tilde{\mathbf{s}}\}\). Before remapping an estimator back to the vector \(\mathbf{b}\), we have to check whether it fulfills the average power constraint, and among all feasible candidates we pick the one that maximizes the mutual information. Originally, \(\tilde{\mathbf{s}}^{(i)}\) fulfills the average power constraint. However, after the quantization in Step 11, the average power is very likely to exceed the budget. This is why the recheck in Step 12 is necessary even though the average power constraint is already part of the relaxation. The resulting vector \(\tilde{\mathbf{s}}\) leads to the approximate solution of the original binary vector \(\mathbf{b}\). This successive subset selection procedure is summarized in Algorithm 1. Note that the metric used to choose \(\tilde{\mathbf{s}}\) in Algorithm 1 differs from [20]. The traditional choice is

$$ \tilde{\mathbf{s}}=\hbox{argmin}_{\tilde{\mathbf{s}}^{(i)}}\; \tilde{\mathbf{s}}^{(i)T}\mathbf{B}\,\tilde{\mathbf{s}}^{(i)}, $$

which directly maximizes the cutoff rate, where \(\tilde{\mathbf{s}}^{(i)}\) denotes the candidate obtained in the i-th randomization loop. However, there is a gap between the cutoff rate and the channel capacity. To guarantee the best result among the randomizations, we directly calculate the capacity for each candidate \(\tilde{\mathbf{s}}\) in Step 15. Further, Step 15 helps to reduce the computational load, since different randomization loops sometimes result in the same or an equivalent selection due to a symmetric noise distribution; selections that have already occurred need not be evaluated again.
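
The relaxation-plus-rounding step can be sketched compactly with the cvxpy modeling package (our tooling choice; the paper does not prescribe a solver). Gaussian randomization followed by quantizing to the K largest entries is one common rounding heuristic and stands in for Steps 10–12 of Algorithm 1; all names are ours.

```python
import numpy as np
import cvxpy as cp

def sdp_candidates(A, w, K, W, loops=30, seed=0):
    """Solve the SDP relaxation of (4), then round by Gaussian randomization.
    A: Bhattacharyya matrix, w: power per point, K: subset size, W: budget."""
    M = A.shape[0]
    n = M + 2                                  # slack + homogenization variable
    B = np.zeros((n, n))
    B[:M, :M] = A
    S = cp.Variable((n, n), PSD=True)          # rank-one constraint dropped
    cons = [S[n-1, n-1] == 1,
            cp.diag(S) == S[:, n-1],           # encodes s_i * s_n = s_i^2
            cp.sum(S[n-1, :M]) == K,
            w @ S[n-1, :M] + S[n-1, M] == W]
    cp.Problem(cp.Minimize(cp.trace(B @ S)), cons).solve()
    rng = np.random.default_rng(seed)
    candidates = []
    for _ in range(loops):
        s = rng.multivariate_normal(np.zeros(n), S.value)  # randomization
        scores = s[:M] * np.sign(s[n-1])       # align signs with s_n
        b = np.zeros(M)
        b[np.argsort(scores)[-K:]] = 1         # quantize to the K largest
        if w @ b <= W:                         # recheck the power constraint
            candidates.append(b)
    return candidates
```

Each surviving candidate would then be scored by its mutual information, as in Step 15 of Algorithm 1.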

Algorithm 1 Subset selection algorithm

3.2 Heuristic improvement based on weights of signaling points

The solution can be further improved by the following heuristic method. For readability, we describe this sub-algorithm separately from Algorithm 1. Consider the vector \(\tilde{\mathbf{s}}^{(i)}\) in Step 10 of Algorithm 1, which is not yet quantized, and sort its first M entries in descending order. In Algorithm 1, the K largest entries are chosen directly. This selection is not optimal in general: the order is based on the weight of each signaling point, and a higher weight merely indicates a higher chance of being a better point with respect to our objective of maximizing the cutoff rate. As exhaustive search is too complex for our scenarios, we use a simple heuristic that reduces the computational complexity of an optimal exhaustive search. In this method, the unquantized first M entries of \(\tilde{\mathbf{s}}^{(i)}\) are kept in a buffer by adding \(\tilde{\mathbf{s}}^{(i)}_{un}=[\tilde{s}^{(i)}_1,\ldots, \tilde{s}^{(i)}_M]\) between Steps 10 and 11. In Step 18, we obtain \(\tilde{\mathbf{s}}\) and the paired \(\tilde{\mathbf{s}}_{un}=\tilde{\mathbf{s}}_{un}^{(i)}\), which we divide into four segments in descending order:

  • the largest \(K-K_1\) entries,

  • the second largest \(K_1\) entries,

  • the third largest \(K_2\) entries and

  • the smallest \(M-K-K_2\) entries

where \(K_1, K_2\) are integers with \(0\leq K_1\leq K\) and \(0\leq K_2\leq M-K\). The first segment is included in the final subset, while the fourth is excluded. From the second and third segments, we choose the \(K_1\) entries that maximize the mutual information to obtain the new subset. Figure 2 illustrates how this method works. Simulations show that applying this heuristic improves the solution of Algorithm 1 even for rather small \(K_1\) and \(K_2\), so its computational cost is low. Note that the actual choice of \(K_1\) and \(K_2\) is a trade-off between complexity and performance. We denote the subset selection algorithm (Algorithm 1) endowed with the above heuristic improvement by SSA. For the sake of clarity, Table 1 briefly summarizes the algorithms introduced in this work.
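
A sketch of this segment search under our own naming; `mi_of_subset` is a hypothetical callback that evaluates the mutual information of a candidate index set, e.g., by grid integration as in Sect. 3.

```python
import numpy as np
from itertools import combinations

def refine_subset(scores, K, K1, K2, mi_of_subset):
    """Keep the top K-K1 points, drop the bottom M-K-K2, and search the
    middle K1+K2 candidates for the best K1 completion (Sect. 3.2).
    `scores` are the unquantized first M entries of s_tilde."""
    order = np.argsort(scores)[::-1]            # descending by weight
    fixed = list(order[:K - K1])                # first segment: always kept
    middle = list(order[K - K1:K + K2])         # segments two and three
    best, best_mi = None, -np.inf
    for extra in combinations(middle, K1):      # exhaustive only in the middle
        cand = fixed + list(extra)
        mi = mi_of_subset(cand)
        if mi > best_mi:
            best, best_mi = cand, mi
    return best, best_mi
```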

Fig. 2 Heuristic method to improve the subset selection

Table 1 Overview of algorithm names and abbreviations

4 Selecting both signaling constellation and input distribution

In the previous section, we investigated the task of finding an optimum signaling constellation in a bounded set under the assumption of a uniform distribution on the selected subset. Our numerical evaluations in Sect. 5 show that the corresponding mutual information is quite close to the capacity of the channel using the full signaling constellation. However, the uniform distribution used for the obtained set is, of course, suboptimal, and we now approach the question of optimizing both the set of signaling points and the corresponding input distribution. We investigate this task by two different procedures, which are compared in Sect. 5. First, the subset is chosen according to Sect. 3 and the input distribution is improved using an idea based on the Blahut-Arimoto algorithm. Second, we improve the selected subsets and probabilities simultaneously.

4.1 Successive subset and distribution selection

The approach described in Sect. 3 gives a subset of signaling points which is capacity-achieving in the set of uniform distributions with at most K support points. An open question is whether the chosen subset can be utilized better by a non-uniform distribution. This question is answered in the following.

The idea is based on the classical Blahut-Arimoto algorithm, which computes in an elegant way the capacity of a discrete memoryless channel [12, 13]. We extend the algorithm to the case considered here of discrete input and continuous output. Extending Theorem 1 in [12] to our system model, we obtain the following proposition.

Proposition 2

With

$$ I_0(\mathbf{p},\,{{\mathcal{P}}},\,\mathbf{f})=\sum_{i=1}^M\int p_i f_i \log \frac{P(\mathbf{x}_i|\mathbf{y})} {p_i}\, {\rm d} \mathbf{y} $$

the following holds.

  1.

    The channel capacity is given by

    $$ C=\max_{\mathbf{p}}I(\mathbf{p},\mathbf{f})= \max_{\mathbf{p}}\max_{{{\mathcal{P}}}}I_0(\mathbf{p},{{\mathcal{P}}},\,\mathbf{f}). $$
  2.

    For fixed \(\mathbf{p}\), \(I_0(\mathbf{p},{\mathcal{P}},\mathbf{f})\) is maximized by

    $$ P(\mathbf{x}_i|\mathbf{y})=\frac{p_i f_i} {\sum_{j=1}^M p_j f_j}. $$
    (5)
  3.

    For fixed \({\mathcal{P}}\), \(I_0(\mathbf{p},{\mathcal{P}},\mathbf{f})\) is maximized by

    $$ p_i=\frac{\exp\left(\int f_i \log P(\mathbf{x}_i|\mathbf{y})\, {\rm d} \mathbf{y}\right)}{\sum_{j=1}^M \exp\left(\int f_j \log P(\mathbf{x}_j|\mathbf{y})\, {\rm d} \mathbf{y}\right)}. $$
    (6)

Inserting (5) into (6) gives

$$ p_i=\frac{p_i\exp\left(\int f_i \log \frac{f_i} {\sum_{j=1}^M p_j f_j}\, {\rm d}\mathbf{y}\right)}{\sum_{j=1}^M p_j\exp\left(\int f_j \log \frac{f_j}{\sum_{k=1}^M p_k f_k}\, {\rm d} \mathbf{y}\right)} =\frac{p_i\exp(D(f_i\|f_0))}{\sum_{j=1}^M p_j\exp(D(f_j\|f_0))}, $$
(7)

where \(f_0=\sum_{i=1}^M p_i f_i\). This yields the idea for the updating algorithm. The probabilities \(\mathbf{p}^{(k)}\) of iteration \(k\geq 1\) are obtained by

$$ p_i^{(k)}=\frac{1} {N^{(k)}} p_i^{(k-1)} \exp \left(D(f_i||f_0^{(k-1)})\right), \quad i=1,\ldots,M, $$

where \(N^{(k)}\) is the normalizing factor

$$ N^{(k)}=\sum_{j=1}^M p_j^{(k-1)} \exp \left(D(f_j||f_0^{(k-1)})\right), $$

and

$$ f_0^{(k)}=\sum_{j=1}^M p_j^{(k)} f_j. $$

Note that \(f_0^{(k)}\) changes over the iterations only through the probabilities \(p_i^{(k)}\), not through the fixed densities \(f_i\). Starting from a distribution \(\mathbf{p}^{(1)}\) with all components strictly positive, this algorithm converges, see [12]. Let \(\hat{\mathbf{p}}\) denote the limit. From (7) it follows that

$$ \exp(D(f_i\|\hat{f}_0))=\sum_{j=1}^M \hat{p}_j\exp(D(f_j\|\hat{f}_0)), \quad \hbox{for all } i \hbox{ such that } \hat{p}_i>0, $$

with \(\hat{f}_0=\sum_{j=1}^M \hat{p}_j f_j\), which implies

$$ D\left(f_i \,\Big\|\, \sum_{j=1}^M \hat{p}_j f_j\right)=\hbox{const}. $$

Thus, by Proposition 1, which provides a necessary and sufficient criterion for optimality, \(\hat{\mathbf{p}}\) is capacity-achieving.
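
The update rule translates directly into code. The following sketch is our implementation of the extended iteration (7), with the densities pre-sampled on an integration grid; the tolerance and iteration cap are arbitrary choices.

```python
import numpy as np

def blahut_arimoto(f, dy, tol=1e-9, max_iter=10000):
    """Extended Blahut-Arimoto iteration (7) for discrete input and
    continuous output. f: (M, G) array of densities f_i sampled on an
    integration grid with spacing dy. Returns the limiting distribution."""
    M = f.shape[0]
    p = np.full(M, 1.0 / M)                          # strictly positive start
    for _ in range(max_iter):
        f0 = p @ f                                   # mixture sum_j p_j f_j
        ratio = np.where(f > 0, f / np.maximum(f0, 1e-300), 1.0)
        D = np.sum(f * np.log(ratio), axis=1) * dy   # D(f_i || f_0)
        p_new = p * np.exp(D)
        p_new /= p_new.sum()                         # normalization N^(k)
        if np.max(np.abs(p_new - p)) < tol:
            break
        p = p_new
    return p_new
```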

As the standard Blahut-Arimoto algorithm is computationally quite complex, several enhancements have been proposed. The most important ones, the natural-gradient-based algorithm and the accelerated Blahut-Arimoto algorithm, are presented in [14]. They converge significantly faster to the capacity-achieving distribution and can be extended to our system model.

The accelerated Blahut-Arimoto algorithm updates \(\mathbf{p}^{(k)}\), \(k\geq 1\), according to

$$ p_i^{(k)}=\frac{1} {N^{(k)}} p_i^{(k-1)} \exp \left(\mu^{(k)}D(f_i||f_0^{(k-1)})\right), \quad i=1,\ldots,M, $$

where \(\mu^{(k)}\) denotes an accelerating step size. As suggested in [14], the factor \(\mu^{(k)}\) is adjusted by

$$ \mu^{(k+1)}=\frac{D\left(\mathbf{p}^{(k)}\,\|\,\mathbf{p}^{(k-1)}\right)} {D\left(f_0^{(k)}\,\|\,f_0^{(k-1)}\right)} =\frac{\sum_{j=1}^M p_j^{(k)}\log \left(p_j^{(k)}/p_j^{(k-1)}\right)} {\int \sum_{j=1}^M p_j^{(k)}f_j \log \frac{\sum_{j=1}^M p_j^{(k)}f_j}{\sum_{j=1}^M p_j^{(k-1)}f_j}\, {\rm d}\mathbf{y}} $$

for k > 1, with initial value \(\mu^{(1)} = 1\).
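
In code, the accelerated iteration only replaces \(\exp(D_i)\) by \(\exp(\mu^{(k)}D_i)\) in the loop above and recomputes μ from the last two iterates; the following sketch of the step-size update is ours, using the same grid discretization.

```python
import numpy as np

def step_size(p_new, p_old, f, dy):
    """mu^(k+1) = D(p^(k) || p^(k-1)) / D(f0^(k) || f0^(k-1)); both
    divergences are evaluated on the integration grid."""
    num = np.sum(p_new * np.log(np.where(p_new > 0, p_new / p_old, 1.0)))
    f0_new, f0_old = p_new @ f, p_old @ f
    ratio = np.where(f0_new > 0, f0_new / np.maximum(f0_old, 1e-300), 1.0)
    den = np.sum(f0_new * np.log(ratio)) * dy
    return num / den
```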

The natural-gradient-based algorithm uses the update rule

$$ p_i^{(k)}=p_i^{(k-1)}\left[1+\mu^{(k)}\left(D\left(f_i\,\|\,f_0^{(k-1)}\right)-I_L^{(k-1)}\right)\right],\quad \hbox{for all } k\geq 1, $$

where \(I_L^{(k-1)}\) denotes the mutual information achieved by \(\mathbf{p}^{(k-1)}\) and \(\mu^{(k)}\) is the same step size as in the accelerated Blahut-Arimoto algorithm.

The accelerated Blahut-Arimoto algorithm converges at least linearly for a fixed step size μ, and the convergence speed increases with μ. This holds since

$$ \lim_{k\to \infty} \frac{\| \mathbf{p}^{(k+1)}-\mathbf{p}^* \|} {\| \mathbf{p}^{(k)}-\mathbf{p}^* \|} \leq\frac{c} {\mu} $$

for some constant c, where \(\mathbf{p}^*\) denotes the optimum distribution, as was shown in [14]. Furthermore, it can be shown that the accelerated Blahut-Arimoto algorithm converges superlinearly for properly chosen step sizes. However, obtaining the appropriate step sizes is computationally very complex [14]. Explicit results are difficult to obtain for the natural-gradient-based algorithm; reference [14] suggests, however, that its convergence properties are similar to those of the accelerated algorithm. The convergence speed of the ordinary Blahut-Arimoto algorithm corresponds to μ = 1, and thus the proposed heuristics clearly converge faster.

The performance of the above algorithms can be improved by applying a heuristic method related to the one introduced in Sect. 3.2 and then running Blahut-Arimoto for every newly combined subset. This improves the performance, since a subset that is better under a uniform distribution is not always better in the non-uniform case. In addition, the parameters \(K_1\) and \(K_2\) should be chosen rather small to keep the additional Blahut-Arimoto runs in each step affordable.

References [15–17] suggest another interesting enhancement, the iterated Blahut-Arimoto algorithm. Though it was proposed for discrete memoryless channels, it can be extended to a discrete-input continuous-output channel. We abbreviate the successive subset selection obtained by our subset selection algorithm followed by the iterated Blahut-Arimoto algorithm by SSA_IBA, see also Table 1. This algorithm is motivated by the fact that capacity-achieving distributions usually assign non-zero probabilities to only a small number of inputs, especially when the input alphabet is very large. Thus, considerable effort can be saved by eliminating inputs that will end up with probability zero. Thereby the algorithm operates on a subset of the whole input alphabet. The algorithm starts from an input alphabet that consists of only two symbols. The alphabet grows by one symbol at every iteration until it includes all symbols with non-zero probabilities. At every iteration, a Blahut-Arimoto algorithm is used to compute the capacity relative to a partial input alphabet. This approach is discussed in more detail in the following subsection.

The ordinary Blahut-Arimoto algorithm is slow when the input alphabet is large. The IBA utilizes the Blahut-Arimoto algorithm only for small sets of signaling points and thus converges faster.

4.2 Simultaneous subset and distribution selection

As shown in Fig. 3 for the scenario given in Sect. 5, the channel capacity achieved by the iterated Blahut-Arimoto algorithm increases quickly at first, while the partial alphabet size is small. Hence, by truncating the algorithm, i.e., by terminating it once the alphabet size reaches the required subset size, we obtain an approximate solution for the non-uniform distribution over the subset. Let \({C_{\mathcal{X}'}}\) denote the capacity of the discrete-input continuous-output channel induced when all symbols outside the subset \(\mathcal{X}'\) of \(\mathcal{X}\) are assigned probability zero. Our truncation of the iterated Blahut-Arimoto algorithm (TIBA) for our system model is based on the following steps, which are also summarized in Algorithm 2 (a code sketch follows the list):

  1.

    Determine a pair \(\{i,j\}\subset\mathcal{X}\) such that

    $$ C_{\left\{i,j\right\}}=D(f_i||f_0)=D(f_j||f_0) $$

    is maximized over all choices of i and j where

    $$ f_0=p_i f_i+p_j f_j $$

    in this case. Define \(\mathcal{X}' = \left\{i,j\right\}\) and \({C' = C_{\mathcal{X}'}}\).

  2.

    If \(\mathcal{X}' = \mathcal{X}\), then C = C′ and the algorithm terminates. Otherwise, for all \(m\in \mathcal{X} \backslash \mathcal{X}'\), compute \(D(f_m\|f_0)\). If all computed values are smaller than or equal to C′, then C = C′ and the algorithm can be terminated at this point.

  3.

    Add the symbol m that maximizes \(D(f_m\|f_0)\) in Step 2 to the set \(\mathcal{X}'\). Recompute \({C' = C_{\mathcal{X}'}}\) using the Blahut-Arimoto algorithm and update \(f_0\) with the new \(\mathcal{X}'\). Return to Step 2.
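
A sketch of the loop above under our naming (`ba` is any of the Blahut-Arimoto routines from Sect. 4.1, returning the distribution for a partial alphabet; the partial capacity is evaluated as \(\sum_i p_i D(f_i\|f_0)\)):

```python
import numpy as np
from itertools import combinations

def _kl(fi, f0, dy):
    ratio = np.where(fi > 0, fi / np.maximum(f0, 1e-300), 1.0)
    return np.sum(fi * np.log(ratio)) * dy

def tiba(f, dy, K, ba):
    """Truncated iterated Blahut-Arimoto. f: (M, G) densities on a grid
    with spacing dy. The active set grows by one symbol per round and the
    loop stops once it reaches the target size K (the truncation)."""
    M = f.shape[0]

    def partial_capacity(idx):
        p = ba(f[idx], dy)
        f0 = p @ f[idx]
        return p, sum(p[m] * _kl(f[idx][m], f0, dy) for m in range(len(idx)))

    # Step 1: exhaustive search over all two-symbol initializations
    active = max((list(pair) for pair in combinations(range(M), 2)),
                 key=lambda idx: partial_capacity(idx)[1])
    p, C = partial_capacity(active)
    while len(active) < min(K, M):
        f0 = p @ f[active]
        rest = [m for m in range(M) if m not in active]
        D = {m: _kl(f[m], f0, dy) for m in rest}
        m_star = max(D, key=D.get)
        if D[m_star] <= C:                # Step 2: already globally optimal
            break
        active.append(m_star)             # Step 3: add the best new symbol
        p, C = partial_capacity(active)   # recompute C' and f_0
    return active, p
```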

Fig. 3 Increase in mutual information with the iterated Blahut-Arimoto algorithm

Algorithm 2 Truncation of Iterated Blahut-Arimoto Algorithm (TIBA)

The function BA mentioned in the algorithms is a placeholder for the iterated or non-iterated Blahut-Arimoto algorithm, implemented with either the accelerated Blahut-Arimoto algorithm or the natural-gradient-based algorithm. The actual choice depends on the size of K.

To further enhance the performance of the truncation, we exploit the following ideas. Zero-probability inputs as characterized above should be eliminated to allow other signaling points to be included. To exclude zero-probability inputs more efficiently, we set a threshold and exclude an input if its probability satisfies \(p_i<\max\left\{10^{-3},\frac{1}{k^2}\right\}\), where k is the current subset size, i.e., the number of already included points. The choice of the threshold for \(p_i\) is a balance between resolution and speed; in our examples, \(\max\left\{10^{-3},\frac{1}{k^2}\right\}\) worked quite well. The cardinality of \(\mathcal{X}'\) increases by one during each iteration. Nevertheless, when low-probability points are excluded as described above, the cardinality of \(\mathcal{X}'\) may also shrink while the mutual information improves. Therefore, we postpone the truncation of the iterated algorithm until the cardinality of \(\mathcal{X}'\) is equal to \(K + K_1\) for some \(K_1 > 0\). Instead of directly choosing \(\mathcal{X}'\) of size K, we obtain it by reduction from a larger set of size \(K + K_2\), excluding the \(K_2\) least probable points. The explained steps are summarized in Algorithm 3.
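
A minimal sketch of the pruning rule, assuming `active` holds the indices of the currently included points and `p` their probabilities (names and the renormalization step are ours):

```python
def prune(active, p, k):
    """Exclude inputs whose probability is below max(1e-3, 1/k^2),
    the threshold suggested in the text; k is the current subset size."""
    thr = max(1e-3, 1.0 / k**2)
    kept = [(i, pi) for i, pi in zip(active, p) if pi >= thr]
    idx = [i for i, _ in kept]
    q = [pi for _, pi in kept]
    total = sum(q)
    return idx, [pi / total for pi in q]   # renormalize the survivors
```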

Algorithm 3 Sub Algorithm for TIBA

The initialization with two input symbols can be done as outlined in Algorithm 4, which describes an efficient way to apply exhaustive search to all two-point combinations.

Algorithm 4 Initialization of Iterated Blahut-Arimoto Algorithm

5 Simulation results

For our simulations, we use the following scenario. We aim to choose K = 16 signaling points from an M-QAM constellation with M = 64. The 64-QAM points in the square \([-3,3]^2\) are chosen as the initial constellation. We consider 2-dimensional Gaussian noise with covariance matrix \(\left(\begin{array}{cc}1 & \rho \\ \rho & 1\end{array}\right)\) with varying parameter ρ.
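
For concreteness, the setup can be generated as follows (our sketch; the seed, sample size and the value of ρ are arbitrary choices):

```python
import numpy as np

g = np.linspace(-3, 3, 8)                          # 8 amplitude levels per axis
points = np.array([(a, b) for a in g for b in g])  # 64-QAM grid on [-3, 3]^2

rho = 0.3                                          # example correlation value
Sigma = np.array([[1.0, rho], [rho, 1.0]])         # noise covariance
rng = np.random.default_rng(1)
noise = rng.multivariate_normal(np.zeros(2), Sigma, size=10000)
received = points[rng.integers(64, size=10000)] + noise   # uniform symbols
```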

5.1 Reducing the number of signaling points

As outlined above, using a uniform distribution over the selected subset provides a lower bound on the mutual information achievable with the capacity-achieving non-uniform distribution.

5.1.1 The subset selection algorithm with heuristic improvement: SSA

SSA depends on the SDP relaxation technique chosen for solving (4). Figure 4 compares different SDP relaxation techniques. The blue curve depicts the channel capacity obtained by applying the dominant eigenvector approximation, see [23], to the SDP relaxation. The green and red curves are obtained by randomization with 30 loops, i.e., an estimator of \(\mathbf{s}\) is obtained after each randomization loop. In our examples, we observed that 30 loops come close to maximizing the mutual information. The red curve aims at maximizing the cutoff rate, while the green one directly maximizes the mutual information. As can be seen in the figure, and as expected, maximizing the mutual information with randomization always performs best, although, interestingly, the difference is not very large, especially for small and large SNR values.

Fig. 4 SDP relaxation (Step 1 in Algorithm 1) comparison over SNR. The dominant eigenvector approximation (blue) vs. randomization with 30 loops aiming at maximizing the cutoff rate (green) and the capacity (red)

The resulting signaling schemes found by SSA depend on the total power constraint. Figure 5 shows the mutual information vs. the total power constraint per dimension for four different values of the correlation parameter ρ. The curves are monotonically increasing in the total power, since the constraint becomes looser. As expected, it can be observed from the plot that each curve converges to its maximum, which equals the channel capacity without a total power constraint.

Fig. 5 Mutual information vs. the total power constraint for different correlation parameters ρ

We find in examples that the performance of a channel exploiting only a small set of signaling points with a uniform distribution is remarkably good. For low SNR values, the achieved mutual information is quite close to the capacity of the full set of signaling points. This is depicted in Fig. 6. Only for large SNR values is the capacity gap rather large. The capacity for the optimal case (light blue) was obtained using the Blahut-Arimoto algorithm. The blue, green and red curves show the mutual information for subsets of size 16, 32 and 64, respectively, obtained by SSA. Interestingly, for small SNR values, a smaller subset achieves a higher mutual information than a larger one. Note that we only consider uniform distributions so far. For the full problem it is trivial that a larger subset size achieves a higher capacity; however, as can be seen here, this does not hold in the uniform case.

Fig. 6 Comparison of mutual information for different numbers of signaling points

5.2 Selecting both signaling points and probability measure

So far, we only considered a uniform distribution over the signaling points. The channel can be exploited even better by choosing a more appropriate probability distribution. In Sect. 4, we discussed two algorithms for finding a better probability distribution.

5.2.1 The successive subset and distribution selection: SSA_IBA

In Fig. 7 we investigate the constellation of the capacity-achieving distribution for the full set of signaling points using SSA_IBA with K = M = 64. The red points depict the optimal signaling points; the numbers give the optimal probabilities multiplied by 1,000. The tiny blue dots are the signaling points assigned probability zero by the algorithm, i.e., they are not chosen as signaling points. Interestingly, the obtained distribution uses only 28 signaling points. This shows that by restricting ourselves to a smaller set of signaling points, we can still achieve a remarkably high mutual information. Figure 9 shows the improvement due to the heuristic method introduced in Sect. 4.1. The parameters were set to \(K_1 = 8\) and \(K_2 = 2\).

Fig. 7 The optimum input distribution

5.2.2 Simultaneous subset and distribution selection: TIBA

To show the performance of TIBA, we consider the following two plots. Figure 10 compares TIBA to the channel capacity. As can be seen, TIBA performs quite well. The question arises whether TIBA finds the best possible subset and probability distribution, or whether exhaustive search can yield a result even closer to the full-set capacity. As exhaustive search is computationally too complex, we include three simple measures in the following pseudo exhaustive search, thereby reducing the run-time enormously. As only 28 signaling points are assigned a non-zero probability in the optimum probability measure, see Fig. 7, we consider only those and thus choose 16 out of 28 rather than 16 out of 64. Moreover, we fix the four vertices (3, 3)′, (−3, 3)′, (−3, −3)′, (3, −3)′ in our choice, as they are assigned the highest probabilities in the optimum distribution; the search thus further reduces to 12 out of 24 with the four vertices fixed. We also exploit symmetry in the pseudo exhaustive search, which excludes many repeated combinations. Finally, the extended Blahut-Arimoto algorithm converges very fast in the first several iterations, which implies that we can discard a combination if the capacity achieved after a few iterations is still too low compared to other combinations; thereby we waste no time computing the exact capacities of uninteresting combinations. Taking all these considerations into account, we obtain our pseudo exhaustive search, which, for the given scenario, yields the results shown in Fig. 11. Running TIBA gives exactly the same points and probabilities, i.e., our algorithm yields the best possible result.

5.2.3 Comparison of SSA, SSA_IBA and TIBA

As can be seen in Fig. 8, TIBA performs best, while the uniform distribution over the subset always gives the smallest mutual information. The intuitive, sequential approach of first finding a subset and then improving the corresponding probabilities still yields a reasonably high mutual information, quite close to that of the more sophisticated TIBA.

Fig. 8 Mutual information comparison for different subset selection algorithms

Fig. 9 Comparison of mutual information for different numbers of signaling points

Fig. 10 Comparison of TIBA with the channel capacity

Fig. 11 Comparison of TIBA with pseudo exhaustive search

6 Conclusions

In this paper, we investigated the challenging problem of finding a reduced, optimal set of signaling points and a corresponding capacity-achieving probability measure. By first assuming a uniform distribution on the selected signaling points, we obtained a lower bound for the original problem. We considered both mutual-information and cutoff-rate maximizing subset selection and solved these problems by forming a semidefinite programming problem and applying two different relaxation techniques. A heuristic improvement based on the weights of signaling points improves this lower bound, which is close to the channel capacity of the full set of signaling points. Two different ways to tackle the full problem of selecting a subset and a distribution were introduced. Building on a Blahut-Arimoto algorithm and a subsequent heuristic improvement, the uniform distribution obtained in the first step is replaced by a more sophisticated one. Using a simultaneous update of both the chosen signaling points and the probability measure, we obtained the so-called TIBA algorithm, which performs best among the proposed methods and gives remarkable results compared to exhaustive search.

Our numerical results show that using only a small subset can indeed achieve a very high mutual information, even compared to the large full input set. This approach greatly simplifies the receiver design while maintaining a high transmission rate over the channel.

An interesting open problem for future research is the analysis of the choice of K relative to M, e.g., the ratio K/M, as well as analytical results supporting our numerical conclusions.