1 Introduction

Linear dimension reduction is commonly used for preprocessing of high-dimensional data in complicated learning frameworks to compress and weight important data features. In contrast to nonlinear approaches, the use of orthogonal projections is computationally cheap, since it corresponds to a simple matrix multiplication. Conventional approaches apply specific projections that preserve essential information and complexity within a more compact representation. The projector is usually selected by optimizing distinct objectives, such as information preservation of the sample variance or of pairwise relative distances. Widely used orthogonal projections for dimension reduction are variants of principal component analysis (PCA) that maximize the variance of the projected data [37]. Preservation of relative pairwise distances asks for a near-isometric embedding, and random projections guarantee such embeddings with high probability, cf. [5, 15] and see also [1, 6, 12, 27, 30, 35]. The use of random projections is especially favorable for large, high-dimensional data [48], since the computational complexity is just \(O(dkm)\), e.g., using the construction in [1], with \(d, k \in \mathbb {N}\) being the original and lower dimensions and \(m \in \mathbb {N}\) the number of samples. In contrast, PCA needs \(O(d^2m)+O(d^3)\) operations [24]. Moreover, tasks that do not have all data available at once, e.g., data streaming, ask for dimension reduction methods that are independent of the data.

In the present manuscript, we study orthogonal projections regarding the interplay between

  (O1) preservation of variance,

  (O2) preservation of pairwise relative distances,

aiming for a sufficient lower-dimensional data representation. We shall consider the Euclidean distance exclusively since it is most widely used in applications, especially for error estimation. On manifolds, the geodesic distance is locally equivalent to the Euclidean distance. The two objectives (O1) and (O2) are directly addressed by PCA (O1) and random projections (O2). We achieve the following goals: First, we clarify mathematically and numerically that the two objectives are competing, i.e., PCA and random projections preserve different kinds of information. Depending on the objectives, we discuss beneficial choices of orthogonal projections and numerically find a balancing projector for a given data set. Finally, we define a general framework of augmented target (AT) loss functions for deep neural networks that integrate information about target characteristics via features and projections. We observe that our proposed methodology can increase the accuracy in two deep learning problems.

In contrast to conventional approaches, we study the joint behavior of the two objectives with respect to the entire set of orthogonal projectors. By analyzing the correlation between the variance and pairwise relative distances of projected data, we observe that (O1) and (O2) are competing and usually cannot be reached at the same time. In classification experiments with support vector machines and shallow neural networks, we investigate heuristic choices of projections applied to the input features.

In view of learning frameworks, we utilize features and projections on target data. The class of augmented target loss functions incorporates suitable transformations and projections that provide beneficial representations of the target space. It is applied in two supervised deep learning problems dealing with real-world data.

The first experiment is a clinical image segmentation problem in optical coherence tomography (OCT) data of the human retina. Related principles of dimension reduction for other clinical classification problems in OCT have already been successfully applied in [9]. In the second experiment, we aim to categorize musical instruments based on their spectrogram; see [19] for related results. Our utilized augmented target loss functions can increase the accuracy in both experiments.

The outline is as follows. In Sect. 2, we address the analysis of the competing objectives and Theorem 2.5 yields the asymptotic correlation between variance and pairwise relative distances of projected data. Section 3 prepares for the numerical investigations by recalling t-designs as considered in [10], enabling subsequent numerics. Heuristic investigations on projected input used in a straightforward classification task are presented in Sect. 4. Our framework of augmented target loss functions as modified standard loss functions for deep learning is introduced in Sect. 5. Finally, in Sects. 6 and 7 we present classification experiments on OCT images and musical instruments using aligned augmented target loss functions.

2 Dimension Reduction with Orthogonal Projections

To reduce the dimension of a high-dimensional data set \(x=\{x_i\}_{i=1}^m\subset \mathbb {R}^d\), we map x onto a lower-dimensional affine linear subspace \(\bar{x}+V\), where \(\bar{x}:=\frac{1}{m}\sum _{i=1}^m x_i\) is the sample mean and V is a k-dimensional linear subspace of \(\mathbb {R}^d\) with \(k < d\). This mapping is performed by an orthogonal projector \(p\in \mathcal {G}_{k,d}\), where

$$\begin{aligned} \mathcal {G}_{k,d}:= \{p\in \mathbb {R}^{d\times d} : p^2=p,\; p^\top =p,\; {\text {rank}}(p)=k\} \end{aligned}$$

denotes the Grassmannian, so that the lower-dimensional data representation is

$$\begin{aligned} \{\bar{x}+p(x_i-\bar{x})\}_{i=1}^m \subset \bar{x}+V, \end{aligned}$$

with \({{\,\mathrm{range}\,}}(p)=V\). A suitable choice of p within \(\mathcal {G}_{k,d}\) depends on further objectives, i.e., which kind of information preservation shall be favored for subsequent analysis tasks. In the following, we consider two objectives associated with popular choices of orthogonal projectors for dimension reduction, in particular, random projectors and PCA. We will first observe that the two objectives are competing, especially in high dimensions, and then discuss consequences.
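The definition of \(\mathcal {G}_{k,d}\) and the projected representation above can be illustrated with a minimal NumPy sketch. The helper names `random_projector` and `project_data` are hypothetical, and the data are synthetic. Building the projector from the QR factorization of a Gaussian matrix is one common way to obtain a projector onto a uniformly distributed k-dimensional subspace; it is assumed here for illustration only.

```python
import numpy as np

def random_projector(k, d, rng):
    """Return a symmetric idempotent d x d matrix of rank k (an element of G_{k,d})."""
    # The columns of q form an orthonormal basis of a random k-dimensional subspace V.
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q @ q.T

def project_data(x, p):
    """Rows of x are the samples x_i; returns x_bar + p (x_i - x_bar) row by row."""
    x_bar = x.mean(axis=0)
    # p is symmetric, so multiplying the centered rows from the right applies p to each sample.
    return x_bar + (x - x_bar) @ p

rng = np.random.default_rng(0)
d, k, m = 6, 2, 20                      # illustrative sizes, not taken from the text
x = rng.standard_normal((m, d))         # synthetic data set
p = random_projector(k, d, rng)
y = project_data(x, p)                  # lower-dimensional representation in x_bar + V
```

The centered rows of `y` span a subspace of dimension at most k, which is the sense in which the representation is lower-dimensional even though it is still stored with d coordinates.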

2.1 Objective (O1)

The total sample variance \({{\,\mathrm{tvar}\,}}(x)\) of \(x=\{x_i\}_{i=1}^m\subset \mathbb {R}^d\) is the sum of the corrected variances along each dimension:

$$\begin{aligned} {{\,\mathrm{tvar}\,}}(x) :=\frac{1}{m-1}\sum _{i=1}^m \Vert x_i-\bar{x}\Vert ^2. \end{aligned}$$

PCA aims to construct \(p\in \mathcal {G}_{k,d}\), such that the total sample variance of (2.1) is maximized among all projectors in \(\mathcal {G}_{k,d}\). For other equivalent optimality criteria, we refer to [49].

The total sample variance of \(px=\{px_i\}_{i=1}^m\subset V\) coincides with the one of (2.1) and satisfies

$$\begin{aligned} {{\,\mathrm{tvar}\,}}(px)\le {{\,\mathrm{tvar}\,}}(x) \end{aligned}$$

for all \(p \in \mathcal {G}_{k,d}\). Thus, PCA achieves optimal variance preservation. The total variance (2.2) can also be expressed via pairwise absolute distances:

$$\begin{aligned} {{\,\mathrm{tvar}\,}}(x) =\frac{1}{m(m-1)} \sum _{i<j} \left\| x_i - x_j\right\| ^2 . \end{aligned}$$

Equally, it holds that

$$\begin{aligned} {{\,\mathrm{tvar}\,}}(px) = \frac{1}{m(m-1)} \sum _{i<j} \left\| p(x_i) - p(x_j)\right\| ^2 , \end{aligned}$$

which reveals that PCA maximizes the sample mean of the projected pairwise absolute distances.
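The two expressions for the total sample variance and the optimality of PCA can be checked numerically. The following sketch uses synthetic data and hypothetical helper names; the PCA projector is obtained from the singular value decomposition of the centered data, which is one standard way to compute it.

```python
import numpy as np

def tvar(x):
    """Total sample variance: sum of corrected variances along each dimension."""
    xc = x - x.mean(axis=0)
    return (xc ** 2).sum() / (x.shape[0] - 1)

def tvar_pairwise(x):
    """Same quantity, expressed through squared pairwise distances over i < j."""
    m = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]
    sq = (diff ** 2).sum(axis=-1)
    return np.triu(sq, k=1).sum() / (m * (m - 1))

def pca_projector(x, k):
    """Projector onto the span of the first k principal directions."""
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    v = vt[:k].T                        # d x k, orthonormal columns
    return v @ v.T

rng = np.random.default_rng(1)
x = rng.standard_normal((30, 8)) * np.arange(1, 9)   # synthetic, anisotropic data
p_pca = pca_projector(x, 3)
print(tvar(x), tvar_pairwise(x), tvar(x @ p_pca))
```

Since p is symmetric, the rows of `x @ p_pca` are the projected samples \(px_i\).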

2.2 Objective (O2)

In contrast to pairwise absolute distances, the Johnson–Lindenstrauss lemma targets the global property of preservation of pairwise relative distances:

Lemma 2.1

(Johnson–Lindenstrauss, cf. [15, 35]). For any \(0< \epsilon < 1\), any \(k \le d, m \in \mathbb {N}\), with

$$\begin{aligned} \frac{4 \log (m)}{\epsilon ^2/2 -\epsilon ^3/3}\le k, \end{aligned}$$

and any set \(\{x_i\}_{i=1}^m\subset \mathbb {R}^d\), there is a projector \(p\in \mathcal {G}_{k,d}\) such that

$$\begin{aligned} (1 - \epsilon ) \left\| x_i- x_j\right\| ^2 \le \tfrac{d}{k}\left\| p(x_i) - p(x_j)\right\| ^2 \le (1 + \epsilon ) \left\| x_i- x_j\right\| ^2 \end{aligned}$$

holds for all \(i<j\).

For small \(\epsilon > 0\), the projector p in Lemma 2.1 yields that all of the \(\frac{m(m-1)}{2}\) pairwise relative distances

$$\begin{aligned} \left\{ \frac{d}{k}\frac{\Vert p(x_i) - p(x_j)\Vert ^2}{\Vert x_i - x_j\Vert ^2}:i<j\right\} \end{aligned}$$

are close to 1, i.e., the projection p preserves all scaled pairwise relative distances well. A good choice of p in Lemma 2.1 is based on random projectors \(P\sim \lambda _{k,d}\), where \(\lambda _{k,d}\) denotes the unique orthogonally invariant probability measure on \(\mathcal {G}_{k,d}\). The following theorem is essentially proved by following the lines of the proof of Lemma 2.1 in [15] after replacing the constant 4 with \((2 + \tau )2\) in the respective bound on k.

Theorem 2.2

For any \(0< \epsilon < 1\), any \(k \le d, m \in \mathbb {N}\) and any \(0<\tau \) with

$$\begin{aligned} \frac{(2 + \tau ) 2\log (m)}{\epsilon ^2/2 - \epsilon ^3/3} \le k, \end{aligned}$$

and any set \(\{x_i\}_{i=1}^m\subset \mathbb {R}^d\), the random projector \(P\sim \lambda _{k,d}\) satisfies

$$\begin{aligned} \left\{ \frac{d}{k}\frac{\Vert P(x_i) -P(x_j)\Vert ^2}{\Vert x_i - x_j\Vert ^2}:i<j\right\} \subset [1-\epsilon ,1+\epsilon ] \end{aligned}$$

with probability at least \(1 - \tfrac{1}{m^{\tau }} + \tfrac{1}{m^{\tau +1}}\).

The theorem states that the preservation property of pairwise relative distances is achieved with high probability when choosing a random projection according to \(\lambda _{k,d}\). Note that the random choice is completely independent of the actual data set.
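As a rough illustration of Theorem 2.2, the sketch below evaluates the bound on k and the probability stated in the theorem, and computes the scaled pairwise relative distances for one sampled projector. All function names are hypothetical, the data are synthetic, and the QR-based sampler is an assumption about how to draw from \(\lambda _{k,d}\) in practice.

```python
import numpy as np

def k_lower_bound(m, eps, tau):
    """Left-hand side of the condition on k in Theorem 2.2."""
    return (2 + tau) * 2 * np.log(m) / (eps ** 2 / 2 - eps ** 3 / 3)

def success_probability(m, tau):
    """Probability bound stated in Theorem 2.2."""
    return 1 - m ** (-tau) + m ** (-(tau + 1))

def random_projector(k, d, rng):
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q @ q.T

def scaled_relative_distances(x, p, k):
    """(d/k) * ||p(x_i - x_j)||^2 / ||x_i - x_j||^2 for all pairs i < j."""
    m, d = x.shape
    i, j = np.triu_indices(m, 1)
    diff = x[i] - x[j]
    return (d / k) * ((diff @ p) ** 2).sum(axis=1) / (diff ** 2).sum(axis=1)

rng = np.random.default_rng(2)
m, d, k = 30, 200, 50                   # illustrative sizes
x = rng.standard_normal((m, d))
ratios = scaled_relative_distances(x, random_projector(k, d, rng), k)
print(ratios.min(), ratios.max(), k_lower_bound(m, 0.5, 1.0))
```

Comparing the printed range with the printed bound on k gives a feel for how conservative the theoretical requirement is for a given data set.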

2.3 Competing Objectives

If a projector p satisfies the near-isometry property (2.5), then

$$\begin{aligned} (1 - \epsilon ) \tfrac{k}{d} {{\,\mathrm{tvar}\,}}(x) \le {{\,\mathrm{tvar}\,}}(px) \le (1 + \epsilon ) \tfrac{k}{d} {{\,\mathrm{tvar}\,}}(x), \end{aligned}$$

so that the total variance of the projected data px is not preserved for \(k<d\). In particular, with high probability a random projector \(P\sim \lambda _{k,d}\) does not suit the objective of maximizing the total variance, and we even observe \(\mathbb {E}{{\,\mathrm{tvar}\,}}(Px) = \frac{k}{d}{{\,\mathrm{tvar}\,}}(x)\); see (A.2) in the Appendix. PCA does not guarantee any local geometric property, and distances between pairs of points can be arbitrarily distorted [1]; see [39] for more robust PCA. The preservation of larger distances is favored since PCA maximizes (2.4) among all \(p \in \mathcal {G}_{k,d}\) and \(\Vert p(x_i)-p(x_j)\Vert \le \Vert x_i - x_j \Vert \) holds for all \(i<j\). Close but distinct points could even be projected onto a single point, which violates the preservation of pairwise relative distances; see Fig. 1.

Fig. 1

A trivial example of PCA distorting smaller distances. Choosing the first principal component, PCA projects the two-dimensional data points * onto the line spanned by the first eigendirection (− −). The Euclidean distances of the points lying on the diagonal are preserved, whereas the two points with the smaller distance are projected onto a single point (the origin)

Fig. 2

Competing properties: 10,000 random projections \(p \sim \lambda _{k,50}\) versus PCA (\(*\)), plotted with respect to \({{\,\mathrm{tvar}\,}}(px)\), \(\mathcal {M}(p,x)\) and \(\mathcal {V}(p,x)\). The normally distributed fixed data set x has total variance \({{\,\mathrm{tvar}\,}}(x) = 49.5\). Random projections cluster around their expectation values (2.10), (2.11) and (2.12), marked by \(+\)

To more quantitatively understand the relation between the two competing objectives, we consider the sample mean and the uncorrected sample variance of the pairwise relative distances (2.6):

$$\begin{aligned} \mathcal {M}(p,x)&:= \frac{2}{m(m-1)} \sum _{i<j}\frac{d}{k}\frac{\Vert p(x_i - x_j)\Vert ^2}{\Vert x_i-x_j\Vert ^2}, \end{aligned}$$
$$\begin{aligned} \mathcal {V}(p,x)&:=\frac{2}{m(m-1)}\sum _{i<j}\frac{d^2}{k^2}\frac{\Vert p(x_i - x_j)\Vert ^4}{\Vert x_i-x_j\Vert ^4} - \mathcal {M}(p,x)^2. \end{aligned}$$

Recall that good preservation of the relative pairwise distances in (2.6) asks for \(\mathcal {M}(p,x)\) being close to 1 and the variance \(\mathcal {V}(p,x)\) being small. In the following, we analyze \({{\,\mathrm{tvar}\,}}(px)\), \(\mathcal {M}(p,x)\) and \(\mathcal {V}(p,x)\) and their expectations for random \(P\sim \lambda _{k,d}\).
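The two statistics \(\mathcal {M}(p,x)\) and \(\mathcal {V}(p,x)\) are the mean and the uncorrected variance of the scaled ratios over all pairs \(i<j\). A minimal sketch, with a hypothetical function name and synthetic data:

```python
import numpy as np

def relative_distance_stats(p, x, k):
    """Return (M(p,x), V(p,x)) for a rank-k projector p and data rows x_i."""
    m, d = x.shape
    i, j = np.triu_indices(m, 1)        # all index pairs with i < j
    diff = x[i] - x[j]
    ratios = (d / k) * ((diff @ p) ** 2).sum(axis=1) / (diff ** 2).sum(axis=1)
    mean = ratios.mean()                           # (2 / (m(m-1))) * sum over i < j
    variance = (ratios ** 2).mean() - mean ** 2    # uncorrected sample variance
    return mean, variance

rng = np.random.default_rng(3)
x = rng.standard_normal((100, 50))      # synthetic data, sizes chosen for illustration
q, _ = np.linalg.qr(rng.standard_normal((50, 10)))
M_rand, V_rand = relative_distance_stats(q @ q.T, x, 10)
print(M_rand, V_rand)
```

For the trivial case \(k=d\) with p the identity, every ratio equals 1, so the function returns mean 1 and variance 0.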

In Fig. 2, we see a simple numerical experiment, where we first create an independent, normally distributed fixed data set \(\{x_i\}_{i=1}^{m}\) with \(x_i \in \mathbb {R}^d\) for \(i=1, \dots , m\) and \(m = 100\), \(d = 50\). We then compute PCA, for \(k = 10, 20, 30, 40\), as well as \(n = 10{,}000\) random projections p distributed according to \(\lambda _{k,50}\). In Fig. 2a–d, we can see that the more k differs from d, the more PCA and random projections differ concerning \({{\,\mathrm{tvar}\,}}(px)\) and \(\mathcal {M}(p,x)\). Those differences may lead to diverse behavior in subsequent data analysis. Moreover, we compare \(\mathcal {M}(p,x)\) and \(\mathcal {V}(p,x)\) in Fig. 2e–h for the different k. We can see that, again, when k is much smaller than d, random projections and PCA differ more concerning the variance of pairwise relative distances \(\mathcal {V}(p,x)\). For \(k=10\), the variance for PCA is higher in comparison with random projections (Fig. 2e); for \(k = 40\) vice versa (Fig. 2h). Note that the theoretical bounds on k stated in Theorem 2.2 are much higher than the dimensions k used in the experiments, but the projections still preserve relative pairwise distances very well. In [7], similar observations were made in empirical experiments with image and text data.

The amount of variance kept in the principal components comparing real-world and random data has been experimentally studied, e.g., in [29] and [46]. Both studies determine that the difference occurs mainly in the first principal component.

Remark 2.3

In the numerical example, we compare random projections and PCA directly, serving as the corresponding projections to the objectives (O1) and (O2). We observe that even for not so high-dimensional (\(d = 50\)) data x and \(k\ll d/2\), PCA severely loses information in terms of total variance, i.e., more than \(50 \%\) for \(k=10\), and more importantly loses much more information on pairwise relative distances than random projections. If both types of information are of interest, pairwise relative distances and high total variance, one should therefore favor random projections over PCA for \(k \ll d/2\) to balance the two objectives (O1) and (O2) and vice versa. Note that with a large amount of data one might still want to favor random projectors since their construction is computationally much cheaper and independent from the data. On the other hand, if objective (O2) is negligible, e.g., tasks with very noisy data, then PCA would be the favorable choice for all k.

Information of data can be quantified and expressed in different ways. One crucial part in dimension reduction is the decision of what kind of information shall be kept, which depends on several parameters including the quality of the data and the analysis task. Variants of PCA, focusing on the preservation of variance, have been widely used in real-world problems with great success, especially in denoising, when the preservation of all pairwise relative distances may be counterproductive, e.g., in dMRI imaging [51] and color filter array images [56]. Drawbacks are the necessity for all data being available from the start and the high computational costs. For very high-dimensional and large data sets, the computation of PCA is often not feasible. Besides the huge benefit of data independence and low computational cost when using random projections, the near-isometry property often allows one to establish that the solution found in the low-dimensional space is a good approximation to the solution in the original space [1, 34].

Algorithms in machine learning often need or benefit from sufficient estimates of pairwise distances, e.g., approximate nearest-neighbor problems, supervised classification [27] and subspace clustering [26]. In [32], algorithmic applications of near-isometry embeddings have been introduced. In [7], random projections have been successfully applied to noisy and noiseless text and image data. The experimental studies include the comparison of preservation of pairwise distances between random projections and PCA. The results coincide with our observations that for \(k > d/2\) PCA is able to preserve the pairwise distances sufficiently, whereas for \(k < d/2\) PCA distorts them. The smaller k is, the worse the distortion, whereas random projections still preserve similarities well for very small k, while being computationally much cheaper than PCA. One should point out again that favoring preservation of pairwise distances relies on the accuracy of the original distances.

PCA and random projections are orthogonal projections favoring two different aims. We want to study, in the context of the whole set of orthogonal projections, whether the two objectives (O1) and (O2) can be reached at the same time. We will see that the objectives are competing, and therefore we suggest a balancing projector for tasks that benefit from both objectives.

Fig. 3

For \(x=\{x_i\}_{i=1}^{10}\subset \mathbb {R}^d\) with independent, normally distributed entries, we independently sample 10,000 random projectors p from \(\lambda _{10,d}\) and plot \(\mathcal {M}(p,x)\) versus \({{\,\mathrm{tvar}\,}}(px)\). The expectation values with respect to \(P \sim \lambda \) are marked with \(+\). The correlation is already 0.9916 for \(d = 50\) and grows further when d increases, namely with values 0.9961, 0.9985, 0.9996 for \(d = 100,200,500\)

2.4 Covariances and Correlation Between Competing Objectives

For further mathematical analysis, we first introduce a more general class of probability measures on \(\mathcal {G}_{k,d}\) that resemble \(\lambda _{k,d}\) sufficiently well:

Definition 2.4

A Borel probability measure \(\lambda \) on \(\mathcal {G}_{k,d}\) is called a cubature measure of strength t if

$$\begin{aligned} \int _{\mathcal {G}_{k,d}} f(p)\mathrm {d}\lambda _{k,d}(p) = \int _{\mathcal {G}_{k,d}} f(p)\mathrm {d}\lambda (p),\quad \text {for all } f\in {{\,\mathrm{Pol}\,}}_t(\mathbb {R}^{d^2}), \end{aligned}$$

where \({{\,\mathrm{Pol}\,}}_t(\mathbb {R}^{d^2})\) denotes the set of multivariate polynomials of total degree at most t in \(d^2\) variables.

Existence of cubature measures is studied, for instance, in [17]. For random P, we now determine the expectation values for our three quantities of interest: \({{\,\mathrm{tvar}\,}}(Px)\), \(\mathcal {M}(P,x)\) and \(\mathcal {V}(P,x)\). If \(P\sim \lambda \) and \(\lambda \) is a cubature measure of strength at least 2, the identities (A.2) and (A.3) in the Appendix and a short calculation yield

$$\begin{aligned} \mathbb {E}{{\,\mathrm{tvar}\,}}(Px)&= \tfrac{k}{d}{{\,\mathrm{tvar}\,}}(x), \end{aligned}$$
$$\begin{aligned} \mathbb {E}\mathcal {M}(P,x)&=1, \end{aligned}$$
$$\begin{aligned} \mathbb {E}\mathcal {V}(P,x)&= a_{k,d} \Big (1-\tfrac{4}{m^2(m-1)^2}\sum _{\begin{array}{c} i<j\\ l<r \end{array}}\left\langle \tfrac{x_i - x_j}{\Vert x_i-x_j\Vert },\tfrac{x_l- x_r}{\Vert x_l-x_r\Vert }\right\rangle ^2 \Big ), \end{aligned}$$

where \(a_{k,d} = \tfrac{2d(d-k)}{k(d-1)(d+2)}\). The expected sample variance in (2.12) satisfies

$$\begin{aligned} \mathbb {E}\mathcal {V}(P,x) \le a_{k,d} \longrightarrow \frac{2}{k},\quad \text {for}\quad d\rightarrow \infty . \end{aligned}$$

This asymptotic bound relates to Theorem 2.2 and alludes to a near-isometry property of the type (2.7) for k sufficiently large.
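The expectation \(\mathbb {E}{{\,\mathrm{tvar}\,}}(Px) = \tfrac{k}{d}{{\,\mathrm{tvar}\,}}(x)\) and the limit \(a_{k,d}\rightarrow 2/k\) can be checked with a small Monte Carlo sketch. The sampler, the sample sizes and the helper names are assumptions made for illustration; the estimate only approximates the expectation.

```python
import numpy as np

def a_kd(k, d):
    """Constant a_{k,d} = 2d(d-k) / (k(d-1)(d+2)) from the text."""
    return 2 * d * (d - k) / (k * (d - 1) * (d + 2))

def tvar(x):
    xc = x - x.mean(axis=0)
    return (xc ** 2).sum() / (x.shape[0] - 1)

def random_projector(k, d, rng):
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q @ q.T

rng = np.random.default_rng(4)
m, d, k, n = 15, 20, 5, 2000            # illustrative sizes
x = rng.standard_normal((m, d))         # fixed synthetic data set
samples = np.array([tvar(x @ random_projector(k, d, rng)) for _ in range(n)])
mc_mean = samples.mean()                # Monte Carlo estimate of E tvar(Px)
expected = k / d * tvar(x)
print(mc_mean, expected, a_kd(k, d), 2 / k)
```

Increasing d while keeping k fixed moves `a_kd(k, d)` toward `2 / k`, in line with the asymptotic bound above.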

The following theorem provides a lower bound for random P on the population correlation

$$\begin{aligned} {{\,\mathrm{Corr}\,}}(\mathcal {M}(P,x), {{\,\mathrm{tvar}\,}}(Px)) = \frac{{{\,\mathrm{Cov}\,}}(\mathcal {M}(P,x), {{\,\mathrm{tvar}\,}}(Px))}{\sqrt{{{\,\mathrm{Var}\,}}(\mathcal {M}(P,x))} \sqrt{{{\,\mathrm{Var}\,}}({{\,\mathrm{tvar}\,}}(Px))}}. \end{aligned}$$

Theorem 2.5

Let \(x=\{x_i\}_{i=1}^m\subset \mathbb {R}^d\) be pairwise different and let \(P\sim \lambda \), with \(\lambda \) being a cubature measure of strength at least 2. For \(d \ge \tfrac{m(m-1)}{2}\), the correlation (2.13) is bounded from below by

$$\begin{aligned} \tfrac{\min _{i\ne j} \left\| x_i - x_j \right\| ^2}{\max _{i\ne j} \left\| x_i - x_j \right\| ^2} - \tfrac{m(m-1)}{2d} \cdot \tfrac{\max _{i\ne j} \left\| x_i - x_j \right\| ^2}{\min _{i\ne j} \left\| x_i - x_j \right\| ^2}. \end{aligned}$$

If \(\{x_i\}_{i=1}^m\subset \mathbb {R}^d\) are random points, whose entries are independent, identically distributed with finite 4-th moments that are uniformly bounded in d, then (2.14) converges towards 1 in probability for \(d\rightarrow \infty \).

The strong correlation for large dimensions d in the second part of Theorem 2.5 suggests that increasing \({{\,\mathrm{tvar}\,}}(Px)\) may also lead to increasing \(\mathcal {M}(P,x)\); see Fig. 3 for illustration. Thus, large projected total variance \({{\,\mathrm{tvar}\,}}(Px)\) and the preservation of scaled pairwise distances, i.e., \(\mathcal {M}(P,x)\) being close to 1, are competing properties. As discussed in Sect. 2.3, the choice of which kind of information is favorable to preserve depends on the data and the task, e.g., denoising (O1) and nearest-neighbor classification (O2). PCA and random projections are extreme in preserving either (O1) or (O2). We will heuristically study the behavior of orthogonal projections balancing both objectives in the next section and will state a numerical experiment where a balancing projector yields the highest classification accuracy.
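The correlation discussed here can be estimated empirically in the spirit of Fig. 3. The sketch below samples random projectors, computes \({{\,\mathrm{tvar}\,}}(Px)\) and \(\mathcal {M}(P,x)\) for each, and compares the sample correlation with the lower bound of Theorem 2.5. It uses synthetic Gaussian data and fewer projectors than the figure; all names are hypothetical.

```python
import numpy as np

def random_projector(k, d, rng):
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q @ q.T

def tvar_and_mean_ratio(x, p, k):
    """Return (tvar(px), M(p,x)), both computed from pairwise differences."""
    m, d = x.shape
    i, j = np.triu_indices(m, 1)
    diff = x[i] - x[j]
    proj_sq = ((diff @ p) ** 2).sum(axis=1)
    orig_sq = (diff ** 2).sum(axis=1)
    return proj_sq.sum() / (m * (m - 1)), (d / k) * (proj_sq / orig_sq).mean()

def correlation_lower_bound(x):
    """Lower bound of Theorem 2.5 (stated there for d >= m(m-1)/2)."""
    m, d = x.shape
    i, j = np.triu_indices(m, 1)
    sq = ((x[i] - x[j]) ** 2).sum(axis=1)
    ratio = sq.min() / sq.max()
    return ratio - m * (m - 1) / (2 * d) / ratio

rng = np.random.default_rng(5)
m, d, k, n = 10, 50, 10, 2000
x = rng.standard_normal((m, d))
values = np.array([tvar_and_mean_ratio(x, random_projector(k, d, rng), k)
                   for _ in range(n)])
corr = np.corrcoef(values[:, 0], values[:, 1])[0, 1]
bound = correlation_lower_bound(x)
print(corr, bound)
```

For moderate d the bound can be far below the observed correlation; it becomes informative only as d grows relative to \(m(m-1)/2\).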

Remark 2.6

The second part of Theorem 2.5 relates to the well-known fact that random vectors in high dimensions are almost orthogonal [4], and standard concentration of measure arguments may lead to more quantitative statements, cf. [52].

3 Preparations for Numerical Experiments

For the numerical experiments, we need finite sets of projectors that represent the overall space well, i.e., cover \(\mathcal {G}_{k,d}\) properly.

3.1 Optimal Covering Sequences

Let the covering radius of a set \(\{p_l\}_{l=1}^n\subset \mathcal {G}_{k,d}\) be denoted by

$$\begin{aligned} \varrho (\{p_l\}_{l=1}^n):=\sup _{p\in \mathcal {G}_{k,d}} \min _{1\le l\le n} \Vert p-p_l\Vert _{{{\,\mathrm{F}\,}}}, \end{aligned}$$

where \(\Vert \cdot \Vert _{{{\,\mathrm{F}\,}}}\) is the Frobenius norm. The smaller the covering radius, the better the set \(\{p_l\}_{l=1}^n\) represents the entire space \(\mathcal {G}_{k,d}\), i.e., there are smaller holes and the points \(\{p_l\}_{l=1}^n\) are better distributed within \(\mathcal {G}_{k,d}\). Following Lemma 2.1, we can connect finite sets of projections and their covering radius to the near-isometry property:

Lemma 3.1

Let \(\{p_l\}_{l=1}^{n}\subset \mathcal {G}_{k,d}\) and denote \(\varrho :=\varrho (\{p_l\}_{l=1}^n)\). For any \(0< \epsilon < 1\), any \(m,k,d\in \mathbb {N}\) with

$$\begin{aligned} \frac{4 \log (m)}{\epsilon ^2/2 -\epsilon ^3/3}\le k\le d, \end{aligned}$$

and any \( \{x_i\}_{i=1}^m\subset \mathbb {R}^d\), there is \(l_0\in \{1,\ldots ,n\}\) such that

$$\begin{aligned} (1 - \delta ) \left\| x_i - x_j\right\| ^2 \le \tfrac{d}{k}\left\| p_{l_0}(x_i) - p_{l_0}(x_j)\right\| ^2 \le (1 + \delta ) \left\| x_i -x_j\right\| ^2, \quad i<j, \end{aligned}$$

where \(\delta = \epsilon + 2\varrho \sqrt{\frac{(1 + \epsilon )d }{k}} + \frac{d }{k}\varrho ^2\).


Proof

Given an arbitrary projector \(p\in \mathcal {G}_{k,d}\), there is an index \(l_0 \in \{1, \dots , n\}\) such that

$$\begin{aligned} \Vert p_{l_0}x - px \Vert \le \Vert p_{l_0} - p \Vert _{{{\,\mathrm{F}\,}}} \Vert x\Vert \le \varrho \Vert x\Vert ,\quad x\in \mathbb {R}^d. \end{aligned}$$

From here, standard computations imply Lemma 3.1. We omit the details. \(\square \)
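One way to fill in the omitted computation is sketched here; it is a reconstruction under the stated assumptions, not necessarily the argument the authors have in mind. Let p be the projector provided by Lemma 2.1, let \(l_0\) be chosen with \(\Vert p_{l_0}-p\Vert _{{{\,\mathrm{F}\,}}}\le \varrho \), and write \(v=x_i-x_j\). The triangle inequality together with \(\Vert pv\Vert \le \sqrt{(1+\epsilon )k/d}\,\Vert v\Vert \) gives

$$\begin{aligned} \tfrac{d}{k}\Vert p_{l_0}v\Vert ^2&\le \tfrac{d}{k}\big (\Vert pv\Vert +\varrho \Vert v\Vert \big )^2 = \tfrac{d}{k}\Vert pv\Vert ^2 + 2\varrho \tfrac{d}{k}\Vert pv\Vert \Vert v\Vert + \tfrac{d}{k}\varrho ^2\Vert v\Vert ^2 \\&\le \Big (1+\epsilon + 2\varrho \sqrt{\tfrac{(1+\epsilon )d}{k}} + \tfrac{d}{k}\varrho ^2\Big )\Vert v\Vert ^2 = (1+\delta )\Vert v\Vert ^2. \end{aligned}$$

For the lower bound, note that \(t^2\ge b^2-2bc\) holds for all \(b,c\ge 0\) and all \(t\ge \max (b-c,0)\). Applied to \(t=\Vert p_{l_0}v\Vert \), \(b=\Vert pv\Vert \) and \(c=\varrho \Vert v\Vert \), this yields

$$\begin{aligned} \tfrac{d}{k}\Vert p_{l_0}v\Vert ^2 \ge \tfrac{d}{k}\Vert pv\Vert ^2 - 2\varrho \tfrac{d}{k}\Vert pv\Vert \Vert v\Vert \ge \Big (1-\epsilon - 2\varrho \sqrt{\tfrac{(1+\epsilon )d}{k}}\Big )\Vert v\Vert ^2 \ge (1-\delta )\Vert v\Vert ^2. \end{aligned}$$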

The accuracy of the near-isometry property in (3.2) depends on the covering radius. Therefore, a set \(\{p_l\}_{l=1}^n\subset \mathcal {G}_{k,d}\) with a small covering radius \(\varrho \) is more likely to contain a projector with better preservation of pairwise relative distances. According to [11], it holds that \(\varrho \gtrsim n^{-\frac{1}{k(d-k)}}\), and we shall see next how to achieve this lower bound.

A set of projectors \(\{p_l\}_{l=1}^n\subset \mathcal {G}_{k,d}\) is called a t-design if the associated normalized atomic measure \(\frac{1}{n}\sum _{l=1}^n \delta _{p_l}\) is a cubature measure of strength t (see Definition 2.4); see [44] for general existence results. Any sequence of \(t_i\)-designs \(\{p^i_l\}_{l=1}^{n_i}\subset \mathcal {G}_{k,d}\) with \(t_i\rightarrow \infty \) satisfies

$$\begin{aligned} \varrho _i\asymp t_i^{-1}, \end{aligned}$$

and moreover, the bound \(n_i \gtrsim t_i^{k(d-k)}\) holds, cf. [11, 17]. To relate \(n_i\) to \(\varrho _i\) via \(t_i\), a sequence of \(t_i\)-designs \(\{p^i_l\}_{l=1}^{n_i}\subset \mathcal {G}_{k,d}\) is called a low-cardinality design sequence if \(t_i\rightarrow \infty \) and

$$\begin{aligned} n_i\asymp t_i^{k(d-k)}, \quad i=1,2,\ldots . \end{aligned}$$

For their existence and numerical constructions, we refer to [21] and [10, 11]. According to [11], see also (3.3) and (3.4), any low-cardinality design sequence \(\{p^{i}_l\}_{l=1}^{n_i}\) covers asymptotically optimally, i.e.,

$$\begin{aligned} \varrho _i\asymp n_i^{-\frac{1}{k(d-k)}}. \end{aligned}$$

Benefiting from the covering property, we will use low-cardinality design sequences as a representation of the overall space of orthogonal projectors \(\mathcal {G}_{k,d}\).

3.2 Linear Least Squares Fit

With the linear least squares fit, we can directly gain information about the relation between \(\mathcal {M}(p,x)\) and \({{\,\mathrm{tvar}\,}}(px)\) for a given data set \(x=\{x_i\}_{i=1}^m\subset \mathbb {R}^{d }\) when p varies. Given the two samples

$$\begin{aligned} \{{{\,\mathrm{tvar}\,}}(p_1x),\ldots ,{{\,\mathrm{tvar}\,}}(p_nx)\},\quad \{\mathcal {M}(p_1,x),\ldots ,\mathcal {M}(p_n,x)\}, \end{aligned}$$

the linear least squares fitting provides the best fitting straight line,

$$\begin{aligned} {{\,\mathrm{tvar}\,}}(p_lx) \approx s \cdot \mathcal {M}(p_l,x) + \gamma ,\quad l=1,\ldots ,n, \end{aligned}$$

where s and \(\gamma \) are determined by the sample variances and the sample covariance. If \(\{p_l\}_{l=1}^n\) is a 2-design, then the sample (co)variances coincide with the respective population (co)variances for \(P\sim \lambda _{k,d}\); see Appendix A.3 for further details. It follows that

$$\begin{aligned} s&= \frac{{{\,\mathrm{Cov}\,}}(\mathcal {M}(P,x),{{\,\mathrm{tvar}\,}}(Px))}{{{\,\mathrm{Var}\,}}(\mathcal {M}(P,x))} \quad \text { with }P\sim \lambda _{k,d}, \end{aligned}$$
$$\begin{aligned} \gamma&=\tfrac{k}{d}{{\,\mathrm{tvar}\,}}(x) - s. \end{aligned}$$

The quantities s and \(\gamma \) can be directly computed, where \({{\,\mathrm{tvar}\,}}(x)\) is given by (2.2) and the covariances are stated in Corollary A.1. Note that (3.6) and (3.7) are now independent of the particular choice of \(\{p_l\}_{l=1}^n\).

The correlation between the two samples (3.5) yields additional information about their relation. As before, if \(\{p_l\}_{l=1}^n\) is a 2-design, then the sample correlation coincides with the population correlation (2.13) for \(P\sim \lambda _{k,d}\), cf. Appendix A.3. High correlation for a specific data set x suggests that random projections and PCA preserve competing properties, whose benefits need to be assessed for the specific subsequent task.
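The least squares line can be computed either from the sample (co)variances or with a generic fitting routine. The sketch below does both on synthetic data; it uses randomly sampled projectors as a stand-in for a 2-design, so the sample means only approximate \(\tfrac{k}{d}{{\,\mathrm{tvar}\,}}(x)\) and 1. All names are hypothetical.

```python
import numpy as np

def random_projector(k, d, rng):
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q @ q.T

def tvar_and_mean_ratio(x, p, k):
    m, d = x.shape
    i, j = np.triu_indices(m, 1)
    diff = x[i] - x[j]
    proj_sq = ((diff @ p) ** 2).sum(axis=1)
    orig_sq = (diff ** 2).sum(axis=1)
    return proj_sq.sum() / (m * (m - 1)), (d / k) * (proj_sq / orig_sq).mean()

rng = np.random.default_rng(6)
m, d, k, n = 40, 4, 2, 500
x = rng.standard_normal((m, d)) * np.array([3.0, 2.0, 1.0, 0.5])  # synthetic data
vals = np.array([tvar_and_mean_ratio(x, random_projector(k, d, rng), k)
                 for _ in range(n)])
tv, M = vals[:, 0], vals[:, 1]

# Slope and intercept from sample covariance and variance (M is the regressor).
s = np.cov(M, tv)[0, 1] / np.var(M, ddof=1)
gamma = tv.mean() - s * M.mean()

# The same line from a generic degree-1 polynomial fit.
slope, intercept = np.polyfit(M, tv, 1)
print(s, gamma, slope, intercept)
```

With an exact 2-design in place of the random sample, `tv.mean()` and `M.mean()` would equal their population values, and `gamma` would reduce to the expression in (3.7).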

Fig. 4

Projections \(\{p_l\}_{l=1}^{8475} \subset \mathcal {G}_{2,4}\) from a t-design of strength 14 evaluated on the iris data set \(x \subset \mathbb {R}^{4 \times 150}\)

4 Numerical Experiments in Pattern Recognition

We investigate the impact on classification accuracy when applying specific orthogonal projections to input data. The chosen real-world data yields a straightforward classification task, serving as a toy example for comparing the accuracy of several projected input data in simple learning frameworks. Projectors are chosen from a t-design in view of \({{\,\mathrm{tvar}\,}}(px)\) and \(\mathcal {M}(p,x)\). For all computations made in this section, the ‘Neural Network’ and ‘Statistics and Machine Learning’ toolboxes in MATLAB R2017a are used.

We use the publicly available iris data set from the UCI Repository of Machine Learning Databases, which is suitable for supervised classification learning. It consists of three classes with 50 instances each, where each class refers to a type of iris plant. The instances are described by four features resulting in the input samples \(\{x_i\}_{i=1}^{150}\subset \mathbb {R}^4\) and target samples \(\{y_i\}_{i=1}^{150}\subset \{0,1\}^3\). For comparison, we classify the diverse input data with a support vector machine (SVM) and three-layer neural networks (NN) with 5 and 10 hidden units (HU).

4.1 Choice of Orthogonal Projection

In the experiment, we use projections \(p \in \mathcal {G}_{2,4}\) reducing the original dimension from \(d=4\) to \(k=2\). As a finite representation of the overall space, we use a t-design of strength 14 from a low-cardinality sequence (see Sect. 3.1) consisting of 8475 orthogonal projectors. Note that the dimension reduction in practice takes place by applying \(q\in \mathcal {V}_{k,d}\) with \(q^\top q = p\in \mathcal {G}_{k,d}\), where

$$\begin{aligned} \mathcal {V}_{k,d}:=\{ q\in \mathbb {R}^{k\times d} : qq^\top = I_k\} \end{aligned}$$

denotes the Stiefel manifold. When taking norms, p and q are interchangeable, i.e., \(\Vert q(x)\Vert ^2 = \Vert p(x)\Vert ^2\), for all \(x\in \mathbb {R}^d\). Therefore, we can use w.l.o.g. the theory developed for p.
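The relation between \(q\in \mathcal {V}_{k,d}\) and \(p=q^\top q\) can be verified directly. In this sketch q is built from a QR factorization of a random matrix, which is an assumption made for illustration; any matrix with orthonormal rows would serve.

```python
import numpy as np

rng = np.random.default_rng(7)
d, k = 4, 2
basis, _ = np.linalg.qr(rng.standard_normal((d, k)))  # d x k, orthonormal columns
q = basis.T                          # k x d with q q^T = I_k (element of V_{k,d})
p = q.T @ q                          # d x d orthogonal projector of rank k

x = rng.standard_normal(d)
low_dim = q @ x                      # k coordinates actually stored after reduction
same_norm = np.isclose(np.linalg.norm(q @ x), np.linalg.norm(p @ x))
print(low_dim, same_norm)
```

Storing `q @ x` keeps only k numbers per sample, whereas `p @ x` keeps d numbers describing the same point of the subspace.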

The projections are chosen in a deterministic manner in view of the previously described competing properties. In Fig. 4, the quantities \({{\,\mathrm{tvar}\,}}(px)\) and \(\mathcal {M}(p,x)\) are pairwise plotted for all projectors in \(\{p_l\}_{l=1}^{8475}\). For comparison, we choose the following projections \(p \in \{p_l\}_{l=1}^{8475} \subset \mathcal {G}_{2,4}\); see Fig. 4a for a visualization.

  • \(p_{\times }\) closest to the expected values 1 and \(\tfrac{k}{d} {{\,\mathrm{tvar}\,}}(x)\) (see (2.10) and (2.11)),

  • \(p_{\Diamond }\) preserving \(\mathcal {M}(p,x) \approx 1\) and maximizing \({{\,\mathrm{tvar}\,}}(px)\),

  • preserving \(\mathcal {M}(p,x) \approx 1\) and minimizing \({{\,\mathrm{tvar}\,}}(px)\),

  • \({{\,\mathrm{tvar}\,}}(px) \approx {{\,\mathrm{tvar}\,}}(p_{\Diamond } x)\) and maximizing \(\mathcal {M}(p,x)\),

  • minimal \({{\,\mathrm{tvar}\,}}(px)\),

  • \(p_{*}\) maximal \({{\,\mathrm{tvar}\,}}(px)\) (PCA).
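Two of the selection rules listed above, those for \(p_{\times }\) and \(p_{\Diamond }\), can be sketched as follows. The text does not specify how "\(\approx 1\)" or "closest" are quantified, so the tolerance `tol` and the rescaled squared distance used here are assumptions. Random projectors and synthetic data stand in for the t-design and the iris data.

```python
import numpy as np

def select_cross_and_diamond(tv, M, target_tv, tol=0.05):
    """Return indices (i_cross, i_diamond) into a finite list of projectors.

    tv[l] = tvar(p_l x), M[l] = M(p_l, x); tol is a hypothetical tolerance.
    """
    # p_x: closest to the expected values (target_tv, 1), axes rescaled by their spread.
    dist = ((tv - target_tv) / tv.std()) ** 2 + ((M - 1.0) / M.std()) ** 2
    i_cross = int(np.argmin(dist))
    # p_diamond: among projectors with M within tol of 1, the one with largest tvar.
    band = np.flatnonzero(np.abs(M - 1.0) <= tol)
    i_diamond = int(band[np.argmax(tv[band])])
    return i_cross, i_diamond

def tvar_and_mean_ratio(x, p, k):
    m, d = x.shape
    i, j = np.triu_indices(m, 1)
    diff = x[i] - x[j]
    proj_sq = ((diff @ p) ** 2).sum(axis=1)
    orig_sq = (diff ** 2).sum(axis=1)
    return proj_sq.sum() / (m * (m - 1)), (d / k) * (proj_sq / orig_sq).mean()

rng = np.random.default_rng(8)
m, d, k, n = 60, 4, 2, 3000
x = rng.standard_normal((m, d)) * np.array([2.0, 1.5, 1.0, 0.5])  # synthetic data
vals = []
for _ in range(n):
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    vals.append(tvar_and_mean_ratio(x, q @ q.T, k))
tv, M = np.array(vals).T
total = ((x - x.mean(axis=0)) ** 2).sum() / (m - 1)               # tvar(x)
i_cross, i_diamond = select_cross_and_diamond(tv, M, k / d * total)
print(tv[i_cross], M[i_cross], tv[i_diamond], M[i_diamond])
```

The remaining choices in the list follow the same pattern, with `argmin` or `argmax` applied to `tv` or `M` over the appropriate subset.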

4.2 Results

In Fig. 4b, we see the linear least squares fitting line, computed directly and via the slope and intercept as stated in (3.6) and (3.7). The correlation coefficient (2.13) is 0.98, which suggests that preserving the two properties is highly competing and needs to be balanced.

In Table 1, the classification results of the iris data are presented. We can see that in this comparison the projector \(p_{\Diamond }\), which corresponds to preserving \(\mathcal {M}(p,x) \approx 1\) and maximizing \({{\,\mathrm{tvar}\,}}(px)\), yields the highest and most robust results. It even yields better results than working with the original input data. The projections that preserve \(\mathcal {M}(p,x) \approx 1\) but do not take care of the magnitude of the total variance yield much worse results. On the other hand, the projections that just focus on high total variance still do not yield as high results as the projection \(p_{\Diamond }\) that balances both properties.

Table 1 Classification results of iris data, when using projected input data in support vector machine (SVM) and shallow neural networks (NN)

Remark 4.1

Given a data set x, the projector \(p_{\Diamond }\) is a good choice to balance both objectives (O1) and (O2). It can be computed by directly analyzing \(\{{{\,\mathrm{tvar}\,}}(p_1x),\ldots ,{{\,\mathrm{tvar}\,}}(p_nx)\}\) and \(\{\mathcal {M}(p_1,x),\ldots ,\mathcal {M}(p_n,x)\}\) of a finite covering \(\{p_l\}_{l=1}^n\) of \(\mathcal {G}_{k,d}\). For higher dimensions, an accurate representation of \(\mathcal {G}_{k,d}\), in order to heuristically select \(p_{\Diamond }\), requires large computational costs. The least squares regression line for a 2-design, as stated in (3.6) and (3.7), can be directly computed with low computational cost. This offers helpful information about the interplay between (O1) and (O2).

5 Augmented Target Loss Functions

In the previous section, projectors were applied to input features of shallow neural networks. In more complex architectures, such as deep neural networks, the adaptation of weights can be viewed as optimization of input features, e.g., arising features can be used for transfer learning [54]. Whereas the input data are processed and optimized in each iteration, the target data usually stay unchanged during the whole learning process, serving as a measure of accuracy. The representation of the target data is one key property for successful approximation with neural networks. Here, we will introduce a general class of loss functions, i.e., augmented target (AT) loss functions, that use projections and features to yield beneficial representations of the target space, emphasizing important characteristics.

In optimization problems, additional penalty terms are used for regularization or to enforce other constraints. In deep learning, weight decay (i.e., Tikhonov regularization) is a standard adaptation of the loss function to that effect. Incorporating additional underlying information via features of the output/target data has been studied in diverse settings tailored to particular imaging applications. Perceptual loss functions have been used in [31] for image super-resolution, incorporating the comparison of high-level image features that arise from pretrained convolutional neural networks, i.e., the VGG network [45]. Deep perceptual similarity metrics have been proposed in [20] for generating images, comparing image features instead of the original images. In [28], a similar approach was successfully used for style transfer and super-resolution, adding a network that defines loss functions. Anatomically constrained neural networks (ACNN) have been introduced in [40] and applied to cardiac image enhancement and segmentation. Their loss functions incorporate structural information by using autoencoders to gain features about a lower-dimensional parametrization of the segmentation. Brain segmentation was studied in [22], where information about the desired structure was added to the loss function via an adjacency matrix. It was used for fine-tuning the supervised learned network with unlabeled data, reducing the number of abnormalities in the segmentation.

Information about certain target characteristics can be very powerful and can even replace the need for annotations in some tasks. In [47], label-free learning is approached by using just structural information of the desired output in the loss function instead of annotated target values.

In the following, we will define a general framework of loss functions that add information of target characteristics via features and projections in supervised learning tasks.

5.1 General Framework

Let the training data be input vectors \(\{x_i\}_{i=1}^m\subset \mathbb {R}^r\) with associated target values \(\{y_i\}_{i=1}^m\subset \mathbb {R}^s\). We consider training a neural network

$$\begin{aligned} f_\theta :\mathbb {R}^r\rightarrow \mathbb {R}^{s}, \end{aligned}$$

where \(\theta \in \mathbb {R}^N\) corresponds to the vector of all free parameters of a fixed architecture. In each optimization step for \(\theta \), the network’s output \(\{\hat{y}_i=f_\theta (x_i)\}_{i=1}^m\subset \mathbb {R}^s\) is compared with the targets \(\{y_i\}_{i=1}^m\) via an underlying loss function L.

In contrast to ordinary learning problems with highly accurate target data, complicated learning tasks arising in many real-world problems do not yield sufficient results when optimizing neural networks with standard loss functions L, such as the widely used mean squared error

$$\begin{aligned} L_{{{\,\mathrm{MSE}\,}}}(\{y_i\}_{i=1}^m,\{\hat{y}_i\}_{i=1}^m) := \frac{1}{m}\sum _{i=1}^m \left\| y_i - \hat{y}_i\right\| ^2. \end{aligned}$$
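As a small numerical illustration of \(L_{{{\,\mathrm{MSE}\,}}}\), the following sketch averages the squared Euclidean norm of the residuals over the m samples. The array layout (one sample per row) and the function name are assumptions made for this example. Note that this normalization differs by the factor s from an elementwise mean over all entries, which is the default reduction in some deep learning frameworks.

```python
import numpy as np

def mse_loss(y, y_hat):
    """Mean over the m samples of the squared Euclidean norm ||y_i - y_hat_i||^2."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean(np.sum((y - y_hat) ** 2, axis=1)))

# Two samples in R^2: squared residual norms are 0.25 and 1.0.
y = [[1.0, 0.0], [0.0, 1.0]]
y_hat = [[0.5, 0.0], [0.0, 0.0]]
value = mse_loss(y, y_hat)   # (0.25 + 1.0) / 2 = 0.625
```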

The training data may include important information that is obvious for humans, but poorly represented within the original target data and therefore lacks consideration in the learning process. To overcome this issue, we propose to add information tailored to the particular learning problem represented by additional features of the outputs and targets.

First, we select transformations

$$\begin{aligned} T_j:\mathbb {R}^s\rightarrow \mathbb {R}^t, \quad j=1,\ldots ,d, \end{aligned}$$

to enable error estimation in transformed output/target spaces. Note that the transformations \(T_j\) are not required to be linear. However, they should be piecewise differentiable to enable subsequent optimization of the loss function with gradient methods. We shall allow for additional weighting of the transformations \(T_1, \ldots , T_d\) to facilitate the selection of features for a specific learning problem. The previous sections suggest that orthogonal projections can provide favorable feature combinations, which essentially turns into a weighting procedure.

To enable suitable projections, we stack the d output/target features

$$\begin{aligned} T(y_i):=\begin{pmatrix} T_1 (y_i)^\top \\ \vdots \\ T_d (y_i)^\top \end{pmatrix}\in \mathbb {R}^{d\times t}, \end{aligned}$$

so that applying a projector \(p\in \mathcal {G}_{k,d}\) to each column of \(T(y_i)\) yields \(p(T(y_i))\in \mathbb {R}^{d\times t}\). We now define the augmented target loss function with projections by

$$\begin{aligned} L_{p}\big (\{y_i\},\{\hat{y}_i\}\big ) := L(\{y_i\},\{\hat{y}_i\}) + \alpha \cdot \tilde{L}\big (\{p(T(y_i))\},\{p(T(\hat{y}_i))\}\big ), \end{aligned}$$

where \(\alpha > 0\) and L and \(\tilde{L}\) correspond to conventional loss functions. Clearly, \(L_p\) depends on the choice of \(p\in \mathcal {G}_{k,d}\). The projection \(p(T(y_i))\) weights the previously chosen feature transformations \(T(y_i)\). A standard choice for both L and \(\tilde{L}\) is \(L_{{{\,\mathrm{MSE}\,}}}\), in which case \(L_p\) becomes

$$\begin{aligned} L_{p}\big (\{y_i\},\{\hat{y}_i\}\big ) = \frac{1}{m}\sum _{i=1}^m \left\| y_i - \hat{y}_i\right\| ^2 + \alpha \cdot \frac{1}{m} \sum _{i=1}^m \left\| p(T(y_i)) - p(T (\hat{y}_i)) \right\| _{{{\,\mathrm{F}\,}}}^2. \end{aligned}$$
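The loss \(L_p\) with least squares terms can be sketched numerically as follows. This is an illustration only: the three feature maps, the diagonal projector in \(\mathcal {G}_{2,3}\), the value of \(\alpha \) and the random data are hypothetical choices, and the function names are not taken from any implementation accompanying the paper.

```python
import numpy as np

def stack_features(y, transforms):
    """Return T(y) as a (d, t) array whose j-th row is T_j(y)."""
    return np.stack([transform(y) for transform in transforms], axis=0)

def augmented_target_loss(ys, y_hats, transforms, p, alpha):
    """L_p with both loss terms chosen as mean squared errors."""
    output_term = 0.0
    feature_term = 0.0
    for y, y_hat in zip(ys, y_hats):
        output_term += np.sum((y - y_hat) ** 2)
        # p acts on each column of the stacked feature matrix.
        projected_diff = p @ stack_features(y, transforms) - p @ stack_features(y_hat, transforms)
        feature_term += np.sum(projected_diff ** 2)      # squared Frobenius norm
    m = len(ys)
    return float(output_term / m + alpha * feature_term / m)

# Hypothetical feature maps R^s -> R^s (d = 3, t = s).
transforms = [
    lambda y: y,                       # identity
    lambda y: y - np.roll(y, 1),       # cyclic first difference
    lambda y: np.abs(y - y.mean()),    # deviation from the mean, piecewise differentiable
]
p = np.diag([1.0, 1.0, 0.0])           # a simple projector in G_{2,3}

rng = np.random.default_rng(0)
ys = rng.random((5, 8))                                 # m = 5 targets in R^8
y_hats = ys + 0.1 * rng.standard_normal((5, 8))         # perturbed "network outputs"
loss = augmented_target_loss(ys, y_hats, transforms, p, alpha=0.1)
```

In an actual training setup, the same expression would be written with the tensor operations of the deep learning framework so that gradients with respect to \(\theta \) are available.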

Remark 5.1

For \(k=d\), the projector p is the identity. In this case, the transformations can map onto different spaces, i.e.,

$$\begin{aligned} T_j:\mathbb {R}^s\rightarrow \mathbb {R}^{t_j}, \quad j=1,\ldots ,d, \end{aligned}$$

and we can now write the standard augmented target loss function as

$$\begin{aligned} L_{\text {AT}}\big (\{y_i\},\{\hat{y}_i\}\big ) = \sum _{j=1}^d \alpha _j \cdot L^j \big (\{T_j(y_i)\},\{T_j(\hat{y}_i)\}\big ), \end{aligned}$$

where \(T_1\) corresponds to the identity function, \(L^1,\ldots , L^d\) are common loss functions and \(\alpha _1,\ldots , \alpha _d > 0\) are weighting parameters.

It should be mentioned that \(\alpha \) resembles a regularization parameter. The actual minimization of (5.1) over \(\theta \) is usually performed with Tikhonov-type regularization in many standard deep neural network implementations. The formulation (5.2) adds one further variational step for beneficial output data representation.
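The standard augmented target loss \(L_{\text {AT}}\) of Remark 5.1 can be illustrated with feature maps of different output dimensions \(t_j\). In this sketch, the maps \(T_2\) and \(T_3\), the weights and the use of the mean squared error for every \(L^j\) are assumptions chosen for brevity.

```python
import numpy as np

def mse(a, b):
    """Mean over samples (rows) of the squared Euclidean distance."""
    return float(np.mean(np.sum((a - b) ** 2, axis=1)))

def at_loss(y, y_hat, transforms, losses, alphas):
    """Weighted sum of losses, each evaluated in its own transformed space."""
    return sum(
        alpha * loss(transform(y), transform(y_hat))
        for transform, loss, alpha in zip(transforms, losses, alphas)
    )

transforms = [
    lambda y: y,                              # T_1: identity, R^s -> R^s
    lambda y: y.sum(axis=1, keepdims=True),   # hypothetical T_2: R^s -> R^1
    lambda y: np.diff(y, axis=1),             # hypothetical T_3: R^s -> R^(s-1)
]
losses = [mse, mse, mse]
alphas = [1.0, 0.1, 0.1]                      # hypothetical weights

rng = np.random.default_rng(1)
y = rng.random((4, 6))
y_hat = y + 0.05 * rng.standard_normal((4, 6))
value = at_loss(y, y_hat, transforms, losses, alphas)
```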

Fig. 5 OCT provides cross-sectional visualization of the human retina

Remark 5.2

Our proposed structure with target feature maps \(T_1,\ldots ,T_d\) as in (5.4) relates to multitask learning, which has been successfully used in deep neural networks [13]. It handles multiple learning problems with different outputs at the same time. In contrast to multitask learning, we aim to solve a single problem but also penalize the error in transformed spaces enhancing certain target characteristics.

For the projected feature transformations in the augmented target loss function, it is not possible to identify a balancing projection p heuristically (such as \(p_\Diamond \) in Sect. 4), because the output \(\hat{y}\) changes in each iteration when the loss function is called. In the following clinical numerical experiment, we overcome this issue by choosing random projections in each optimization step and compare them to prior deterministic choices of projections, including PCA.
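Drawing a new random projector in each optimization step can be sketched as follows. The construction via the QR factorization of a Gaussian matrix is one common way to sample a uniformly distributed k-dimensional subspace and is an assumption here, as are the function name and the loop skeleton.

```python
import numpy as np

def sample_projector(d, k, rng):
    """Orthogonal projector onto the column space of a d x k Gaussian matrix."""
    gaussian = rng.standard_normal((d, k))
    q, _ = np.linalg.qr(gaussian)      # q has k orthonormal columns
    return q @ q.T

rng = np.random.default_rng(0)
d, k = 4, 2

for step in range(3):
    p = sample_projector(d, k, rng)    # fresh projector for this optimization step
    # ... evaluate the augmented target loss with p and update theta (omitted)

# Averaged over many draws, the projector approaches (k/d) times the identity,
# so no feature direction is preferred on average.
mean_projector = np.mean([sample_projector(d, k, rng) for _ in range(2000)], axis=0)
```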

6 Application to Clinical Image Data

The first experiment is a clinical problem in retinal image analysis of the human eye, where the disruptions of the so-called photoreceptor layers need to be quantified in optical coherence tomography (OCT) images. The photoreceptors have been identified as the most important retinal biomarker for prediction of vision from OCT in various clinical publications, see e.g., [23]. As OCT technology advances, clinicians are no longer able to look at each OCT slice themselves. (On average, they receive 250 slices per patient and have 3–5 minutes per patient, including the clinical examination.) Therefore, automated classification of, for example, photoreceptor status is necessary for clinical guidance.

6.1 Data and Objective

In this application, OCT images of different retinal diseases (diabetic macular edema and retinal vein occlusion) were provided by the Vienna Reading Center recorded with the Spectralis OCT device (Heidelberg Engineering, Heidelberg, Germany). Each patient’s OCT volume consists of 49 cross sections/slices (\(496 \times 512\) pixels) recorded in an area of \(6 \times 6~\hbox {mm}\) in the center of the human retina, which is the part of the retina responsible for vision. Each of the slices was manually annotated by a trained grader of the reading center. This is a challenging and time-consuming procedure that is not feasible in clinical routine but only in a research setting. The binary pixelwise annotations serve as target values, enabling a supervised learning framework (Fig. 5).

The objective is to accurately detect the photoreceptor layers and their disruptions pixelwise in each OCT slice by training a deep convolutional neural network with a suitable loss function. The learning problem is complicated by potentially inaccurate target annotations, as studies have shown that inconsistencies between trained graders are common, cf. [50]. Moreover, the learning task is unbalanced in the sense that many more slices show no or very few disruptions. We shall observe that optimization with respect to standard loss functions performs poorly with regard to detecting disruptions. The augmented target loss function proposed in the previous section can enhance the detection.

6.2 Convolutional Neural Network Learning

We implemented our experiments using Python 3.6 with PyTorch 1.0.0. A deep convolutional neural network \(f_\theta \) is trained by applying the U-Net architecture reported in [43] with a sigmoid activation function and Tikhonov regularization. A set of 20 OCT volumes (980 slices) from different patients with corresponding annotations is used for training, where four volumes are used for calibration (validation set). Another two independent test volumes were identified for evaluating the results: one without any disruptions in the photoreceptor layers and one with a high number of disruptions.

Each OCT slice is represented by a vector \(x_i\in \mathbb {R}^{r}\) with \(r=496\cdot 512\). The collection \(\{x_i\}_{i=1}^m\) corresponds to all slices from the training volumes, i.e., \(m=20\cdot 49\). Further matching the notation of the previous section, we have \(r=s\) and \(f_\theta :\mathbb {R}^r\rightarrow \mathbb {R}^r\) with binary target vectors \(y_i\in \{0,1\}^r\). We observe that disruptions are not identified reliably when using the least squares loss function (5.1). To overcome this issue, we use the proposed augmented target loss function with least squares losses as stated in (5.3).

To enhance disruptions within the output/target space, we heuristically choose \(d=4\) local features of the original representation. They are derived from convolutions with two edge filters, \(T_1\) (Prewitt) and \(T_2\) (Laplacian of Gaussian), and from two Gaussian high-pass filters, yielding \(T_3\) and \(T_4\). Note that these feature transformations keep the same size, i.e., \(T_j:\mathbb {R}^r\rightarrow \mathbb {R}^r\) for \(j=1,\ldots ,d\). See Fig. 6 for example images.
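Feature maps of this kind can be sketched with plain NumPy as follows. The kernel sizes, the values of `sigma`, the edge padding and the use of the Prewitt gradient magnitude are assumptions for illustration; the filter parameters actually used in the experiment are not specified here.

```python
import numpy as np

def filter2d(img, kernel):
    """Same-size 2-D correlation with edge padding (kernel side lengths must be odd)."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros(img.shape, dtype=float)
    height, width = img.shape
    for a in range(kh):
        for b in range(kw):
            out += kernel[a, b] * padded[a:a + height, b:b + width]
    return out

def gaussian_kernel(sigma):
    radius = int(np.ceil(3 * sigma))
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def log_kernel(sigma):
    radius = int(np.ceil(3 * sigma))
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    kernel = (r2 - 2 * sigma ** 2) / sigma ** 4 * np.exp(-r2 / (2 * sigma ** 2))
    return kernel - kernel.mean()          # zero response on constant images

PREWITT_X = np.array([[-1.0, 0.0, 1.0]] * 3)

def t1_prewitt(img):
    gx, gy = filter2d(img, PREWITT_X), filter2d(img, PREWITT_X.T)
    return np.sqrt(gx ** 2 + gy ** 2)      # gradient magnitude (assumed variant)

def t2_log(img, sigma=1.0):
    return filter2d(img, log_kernel(sigma))

def gaussian_highpass(img, sigma):
    return img - filter2d(img, gaussian_kernel(sigma))

# Hypothetical parameter choices for the two high-pass features T_3 and T_4.
transforms = [t1_prewitt, t2_log,
              lambda img: gaussian_highpass(img, 1.0),
              lambda img: gaussian_highpass(img, 3.0)]

step = np.zeros((16, 16))
step[:, 8:] = 1.0                          # toy binary target with a vertical edge
features = np.stack([T(step) for T in transforms])   # shape (4, 16, 16)
```

All four maps keep the image size, so the stacked features have one row per transformation after flattening each image, as required for applying a projector in \(\mathcal {G}_{k,4}\).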

Fig. 6 Features on output and targets that enhance edges in different ways. It is not obvious which transformations are of most importance; weighting by projections can overcome this issue

We can derive several augmented target loss functions \(L_p\) by choosing different \(p \in \mathcal {G}_{k,d}\). In this experiment, we use the following projections:

  • \(p=I_4\),

  • \(\{p_l\}_{l=1}^{15} \subset \mathcal {G}_{2,4}\), all projections from a t-design of strength 2 (see [10]),

  • \(p_{{{\,\mathrm{PCA}\,}}} \in \mathcal {G}_{2,4}\), projection determined by PCA on the training data,

  • \(p_{\lambda _{2,4}}\), random projection chosen according to \(\lambda _{2,4}\) in each mini-batch.
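For the PCA-based choice, one possible construction is sketched below. It assumes that each sample is a feature vector in \(\mathbb {R}^d\) (for instance, the d feature values at one pixel) and takes the projector onto the k leading eigenvectors of the sample covariance. How the PCA projector was actually computed from the training data is not detailed here, so this layout and the synthetic data are assumptions.

```python
import numpy as np

def pca_projector(features, k):
    """features: (d, n) array with one sample per column; returns a projector in G_{k,d}."""
    centered = features - features.mean(axis=1, keepdims=True)
    covariance = centered @ centered.T / (features.shape[1] - 1)
    _, eigenvectors = np.linalg.eigh(covariance)   # eigenvalues in ascending order
    leading = eigenvectors[:, -k:]                 # k directions of largest variance
    return leading @ leading.T

# Synthetic example: variance concentrated in the first two of d = 4 coordinates.
rng = np.random.default_rng(0)
scales = np.array([10.0, 5.0, 0.1, 0.1])
features = scales[:, None] * rng.standard_normal((4, 5000))
p_pca = pca_projector(features, k=2)
```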

Table 2 Comparison of AUC values for photoreceptors segmentation and disruption detection

6.3 Results

Since the detection problem is highly unbalanced, we use precision/recall curves [16] for evaluating the overall performance of each loss function model. The area under the curve (AUC) was used as a numerical indicator of the success rate [41]. The higher the AUC, the better the classification.
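The area under a precision/recall curve can be computed along the following lines. This is a simplified sketch: it uses one threshold per sample (ties in the scores are not merged), starts the curve at recall 0 with precision 1, and integrates with the trapezoidal rule. These conventions are assumptions and may differ from the evaluation used for Table 2.

```python
import numpy as np

def precision_recall_auc(y_true, scores):
    """Area under the precision/recall curve for binary labels and real-valued scores."""
    y_true = np.asarray(y_true).astype(bool).ravel()
    scores = np.asarray(scores, dtype=float).ravel()
    order = np.argsort(-scores, kind="stable")     # descending score
    hits = y_true[order]
    true_pos = np.cumsum(hits)
    false_pos = np.cumsum(~hits)
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / y_true.sum()               # requires at least one positive label
    precision = np.concatenate([[1.0], precision])
    recall = np.concatenate([[0.0], recall])
    # Trapezoidal rule over recall.
    return float(np.sum(np.diff(recall) * (precision[1:] + precision[:-1]) / 2))

perfect = precision_recall_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
reversed_ranking = precision_recall_auc([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1])
```

A perfect ranking of the positive pixels gives an area of 1, while a ranking that places all negatives first gives a much smaller value.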

The results of the different loss functions on the independent test set are stated in Table 2. Due to the imbalance within the data, the photoreceptor region is identified well, but disruptions are not identified reliably when using the least squares loss function (5.1). For \(\alpha = 0.1\), all proposed augmented target loss functions \(L_p\) clearly increase the success rate of the disruption quantification. Note that all projections are independent of the actual data set, except for PCA, which was computed beforehand on the training data.

Fig. 7 Log-mel spectrograms of the six different instruments. Intensities range from 0 (black) to 1 (yellow) (Color figure online)

The features themselves (i.e., \(p = I_4\)) improve the quantification, and weighting them by projections improves the results even further: using the fixed projection \(p_{12}\) from the t-design sequence \(\{p_l\}_{l=1}^{15}\) on the output/target features yields the highest accuracy for photoreceptors and disruptions. This agrees with the results of the previous sections, which state that, depending on the particular data, there are projections in the overall space that act beneficially. Since such a projection generally cannot be found beforehand, using random projections in each evaluation step of the loss function is easier, feasible in practice, and independent of the data. The computation is efficient, and randomization can have regularization effects that yield more robust results, cf. [34]. In the following, we consider a second classification problem based on spectrograms, where augmented target loss functions with random projections can improve the accuracy.

7 Application to Musical Data

Here, the learning task is a prototypical problem in Music Information Retrieval, namely multi-class classification of musical instruments. In analogy to the MNIST problem in image recognition, this classification problem is commonly used as a basis of comparison for innovative methods, since the ground truth is unambiguous and sufficiently many annotated data are available. The inputs to the neural network are spectrograms of audio signals, which is the standard choice in audio machine learning. Spectrograms are calculated from the time signal using a short-time Fourier transform and taking the squared absolute value of the resulting spectra, thus yielding a vector for each time step and, overall, a two-dimensional array similar to an image, cf. [18].
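The spectrogram computation described above can be sketched as follows. The frame length, hop size, Hann window and the synthetic test tone are assumptions for illustration and are not the settings of the experiment.

```python
import numpy as np

def power_spectrogram(signal, frame_length=256, hop=128):
    """Squared magnitude of a short-time Fourier transform with a Hann window."""
    window = np.hanning(frame_length)
    n_frames = 1 + (len(signal) - frame_length) // hop
    frames = np.stack([
        signal[i * hop: i * hop + frame_length] * window
        for i in range(n_frames)
    ])
    spectra = np.fft.rfft(frames, axis=1)          # one spectrum per time step
    return (np.abs(spectra) ** 2).T                # shape (frequency bins, time steps)

# Synthetic 1 kHz tone sampled at 8 kHz for one second.
sample_rate = 8000
time = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * time)
spec = power_spectrogram(tone)

# Frequency bin b corresponds to b * sample_rate / frame_length Hz,
# so the tone is expected in bin 1000 * 256 / 8000 = 32.
peak_bins = spec.argmax(axis=0)
```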

Reproducible code and more detailed information of our computational experiments can be found in the online repository [25].

7.1 Data and Objective

The publicly available GoodSounds data set [42] contains recordings of single notes and scales played by several single instruments. To gain equally balanced input classes, we restrict the classification problem to six instruments: clarinet, flute, trumpet, violin, alto saxophone and cello. Note that the recordings are monophonic, so that each recording yields one spectrogram that we aim to correctly assign to one of the six instruments.

After removing the silence [3, 38], segments from the raw audio files are transformed into log-mel spectrograms [36], so that we obtain images of time–frequency representations with size \(100 \times 100\). One example spectrogram for each class of instruments is depicted in Fig. 7.

7.2 Convolutional Neural Network Learning

We implemented a fully convolutional neural network \(f_{\theta }: \mathbb {R}^r \rightarrow [0,1]^s\), cf. [33], where \(r = 100 \times 100\) and \(s = 6\), in Python 3.6 using the Keras 2.2.4 framework [14] and trained it on an Nvidia GTX 1080 Ti GPU. The data are split into 140,722 training, 36,000 validation and 36,000 independent test samples. We heuristically choose \(d = 16\) output features arising directly from the particular output class. The transformations \(T_1,\dots , T_{16}\), with \(T_j: \mathbb {R}^6 \rightarrow \mathbb {R}\) for \(j = 1, \dots , 16\), are then given by the inner product of the output/target and the feature vectors. Among others, the features are chosen from the enhanced scheme of taxonomy [53] and from the table of frequencies, harmonics and undertones [57]. We use the proposed augmented target loss function \(L_{p}\) (5.2), where \(L\) corresponds to the categorical cross-entropy loss [55] and \(\tilde{L}\) to the mean squared error as in (5.3). We consider here two choices of p: the identity \(I_{16}\) and random projectors \(p\sim \lambda _{6,16}\) in \(\mathcal {G}_{6,16}\).
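A minimal sketch of such a loss, with feature maps given by inner products, is shown below. For brevity it uses only \(d = 3\) illustrative feature vectors (instrument family indicators) instead of the 16 features of the experiment, a fixed projector in \(\mathcal {G}_{2,3}\), and a hypothetical value of \(\alpha \); none of these are the actual choices of the experiment.

```python
import numpy as np

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    """Mean cross-entropy between one-hot targets y and predicted probabilities y_hat."""
    return float(-np.mean(np.sum(y * np.log(np.clip(y_hat, eps, 1.0)), axis=1)))

def at_loss_classification(y, y_hat, feature_vectors, p, alpha):
    """Cross-entropy plus alpha times the MSE of projected inner-product features.

    y, y_hat: (m, s) arrays; feature_vectors: (d, s) array; p: (d, d) projector.
    """
    t_y = y @ feature_vectors.T               # (m, d): entry (i, j) is <y_i, v_j>
    t_y_hat = y_hat @ feature_vectors.T
    projected_diff = (t_y - t_y_hat) @ p      # p symmetric: rows are p(T(y_i) - T(y_hat_i))
    feature_term = float(np.mean(np.sum(projected_diff ** 2, axis=1)))
    return categorical_cross_entropy(y, y_hat) + alpha * feature_term

# Class order: clarinet, flute, trumpet, violin, alto saxophone, cello.
feature_vectors = np.array([
    [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],   # illustrative feature: woodwind
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],   # illustrative feature: brass
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],   # illustrative feature: strings
])
p = np.diag([1.0, 0.0, 1.0])          # hypothetical projector in G_{2,3}

y = np.eye(6)[[0, 3, 2]]                              # three one-hot targets
logits = np.random.default_rng(0).standard_normal((3, 6))
y_hat = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax outputs
loss = at_loss_classification(y, y_hat, feature_vectors, p, alpha=0.1)
```

With features of this kind, confusing two instruments of the same family is penalized less by the feature term than confusing instruments of different families.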

The deep learning model is sensitive to various hyper-parameters, including \(\alpha \) and p, in addition to conventional parameters, such as the number of convolutional kernels, learning rate and the parameter \(\beta \) for Tikhonov regularization. To find the best choices in a fair trial, we utilize a random hyper-parameter search approach, where we train 60 models and select the three best ones for a more precise search over different \(\alpha \) in the augmented target loss function and \(\beta \) for Tikhonov regularization. This results in 212 models that are evaluated on the training and validation sets. Finally, we select the best model based on the accuracy on the validation set and evaluate it on the independent test set. For comparison, we also evaluate this model with no Tikhonov regularization, i.e., \(\beta = 0\); see Table 3.
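The two-stage search can be sketched as follows. The parameter ranges, the refinement grid and in particular `dummy_validation_accuracy`, which replaces the training and validation of a network by a synthetic score, are placeholders; the sketch does not reproduce the 212 models of the experiment.

```python
import math
import random

def sample_config(rng):
    """Draw one hyper-parameter configuration (all ranges are hypothetical)."""
    return {
        "n_kernels": rng.choice([16, 32, 64]),
        "learning_rate": 10 ** rng.uniform(-4, -2),
        "alpha": 10 ** rng.uniform(-3, 0),
        "beta": 10 ** rng.uniform(-6, -2),
    }

def dummy_validation_accuracy(config):
    """PLACEHOLDER for training a model and returning its validation accuracy."""
    return 1.0 / (1.0 + abs(math.log10(config["alpha"]) + 1.0)
                  + 1e3 * abs(config["beta"] - 1e-4))

rng = random.Random(0)

# Stage 1: random search over 60 configurations, sorted by validation score.
coarse = [(dummy_validation_accuracy(c), c) for c in (sample_config(rng) for _ in range(60))]
coarse.sort(key=lambda trial: trial[0], reverse=True)
top_three = [config for _, config in coarse[:3]]

# Stage 2: finer search over alpha and beta around the three best models.
alphas = [0.01, 0.1, 1.0]
betas = [0.0, 1e-5, 1e-4, 1e-3]
refined = []
for config in top_three:
    for alpha in alphas:
        for beta in betas:
            candidate = {**config, "alpha": alpha, "beta": beta}
            refined.append((dummy_validation_accuracy(candidate), candidate))

best_score, best_config = max(refined, key=lambda trial: trial[0])
```

Only the final `best_config` would then be evaluated on the independent test set, so that the test data play no role in model selection.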

Table 3 Classification results with different parameter choices

7.3 Results

Table 3 shows that using neither regularization nor features gives the poorest results. It seems that adding features with random projections has a regularizing effect and improves the results significantly. As expected, it is important to include Tikhonov regularization on \(\theta \). Further improvement is obtained by adding features via the modified augmented target loss function, with or without additional weighting from projections. All results are very stable and generalize very well from the training set to the independent test set; see [25] for further details.