We start by motivating the estimation of topographic representations. Then, we introduce a generative model for the sources s which covers ICA, TICA and CTA in a unified way, and describe the basic properties of the components in CTA. We then derive an approximation of the likelihood for CTA and propose a method for its optimization.
Motivation for estimating topographic representations
The foremost motivation for estimating topographic representations is visualization. Plotting the components in their topographic arrangement enables us to easily see the interrelationships between them. This is particularly true if the topographic grid is two-dimensional and can thus be plotted on the plane.
A second motivation is that a topography learned from natural inputs such as natural images, natural sound, or text might model cortical representations in the brain. This is based on the hypothesis that, in order to minimize wiring length, neurons which interact with each other should be close to each other; see e.g. Hyvärinen et al. (2009). Minimizing wiring length seems to be important for keeping the volume of the brain manageable, and possibly for speeding up computation as well.
An example is computation of complex cell outputs based on simple cell outputs in primary visual cortex (V1). Simple cells are sensitive to an oriented bar or an edge at a certain location in visual space, while complex cells are otherwise similar, but invariant to local sinusoidal phases of visual stimuli. Computationally, such a conversion can be achieved by pooling the squares of the outputs of the simple cells which have similar orientation and spatial location, but different phases. A topographic representation where simple cells are arranged as observed in V1 could minimize the wiring needed in such a pooling because the pooling is done over nearby cells. Such a minimum-wiring topography was found to emerge from natural images using TICA (Hyvärinen et al. 2001).
Related to minimum wiring, the topography may also enable a simple definition of new, higher-order features. Summation of the features in a topographic neighborhood (possibly after a nonlinearity such as squaring) may in general lead to interesting new features, just as in the case of the simple cell pooling explained above.
The generative model
We begin with the following generative model for the latent source vector s in (1),

\[\mathbf{s} = \boldsymbol{\sigma} \odot \mathbf{z}, \tag{2}\]

where ⊙ denotes element-wise multiplication, and \(\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_d)\) and \(\mathbf{z} = (z_1, \ldots, z_d)\) are statistically independent. The two key points of the generative model (2) are the following:
1. If z is multivariate Gaussian with mean 0 and the elements in σ are positive random variables, which is what we assume in the following, the components in s are super-Gaussian, i.e., sparse (Hyvärinen et al. 2001); this is illustrated numerically in the sketch after this list.
2. By introducing linear correlations in z and/or energy correlations in σ, the components in s will have linear and/or energy correlations. This point will be made more precise in the following.
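To make point 1 concrete, the following minimal sketch checks by Monte Carlo that a Gaussian variable multiplied by an independent positive random variable has positive excess kurtosis, i.e., is super-Gaussian. The distribution chosen for σ here (the absolute value of a Gaussian) is an arbitrary illustration; the model does not prescribe it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

# z is standard Gaussian; sigma is some positive random variable,
# here |N(0, 1)| purely for illustration.
z = rng.standard_normal(n)
sigma = np.abs(rng.standard_normal(n))
s = sigma * z  # one component of the variance-mixture model (2)

# Excess kurtosis E{s^4}/E{s^2}^2 - 3; positive => super-Gaussian.
kurt = np.mean(s**4) / np.mean(s**2)**2 - 3
print(f"excess kurtosis: {kurt:.2f}")  # about 6 for this choice of sigma
```

For any non-constant σ, Jensen's inequality gives \(E\{\sigma^4\} > E\{\sigma^2\}^2\), so the excess kurtosis \(3E\{\sigma^4\}/E\{\sigma^2\}^2 - 3\) is positive.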
A special case of the model in (2) results in ICA:
- Case 1: If all the elements in z and σ are statistically independent, then s is a vector of independent sparse sources, and (2) gives the source model of ICA.
The source model of TICA can also be obtained as a special case:
- Case 2: If all the elements in z are uncorrelated, but the squares of nearby elements in σ are correlated, then s is a vector of sparse sources with energy correlations (and no linear correlations) within a certain neighborhood, and thus (2) gives the source model of TICA.
Here, we introduce the following two further cases:
- Case 3: If nearby elements in z are correlated, but all the elements in σ are statistically independent, then s is a sparse source vector whose elements have linear correlations (and zero or weak energy correlations) within a certain neighborhood.
- Case 4: If nearby elements in z and the squares of nearby elements in σ are correlated, then s is a sparse source vector whose elements have linear and energy correlations within a certain neighborhood, and (2) gives the source model of CTA.
The statistical dependencies of the above four cases for σ and z are summarized in Table 1.
Table 1 Dependencies between pairs of nearby elements in σ and z in the four cases of sources, and the corresponding source models

  Case   Nearby z_i, z_j   Nearby σ_i², σ_j²   Source model
  1      independent       independent         ICA
  2      uncorrelated      correlated          TICA
  3      correlated        independent         (not considered explicitly)
  4      correlated        correlated          CTA
In the following, we concentrate on Case 4 (both energy and linear correlations). We do not explicitly consider Case 3 (linear correlations only), but we will show with simulations below that CTA identifies its sources and estimates the ordering of the components as well. This is natural since the model in Case 4 uses both linear and energy correlations to model topography, while Case 3 uses only linear ones.
Basic properties of the model
We give here basic properties of the CTA generative model (Case 4 above) and discuss the differences from TICA (Case 2). Regarding the mean, the linear correlation, and the energy correlation in the model, the following can be shown in general:
- The mean values of all the components are zero.
- Nearby components \(s_i\) and \(s_j\) are correlated if and only if \(z_i\) and \(z_j\) are linearly correlated. From the property (3), this is proven by

\[\text{cov}(s_i, s_j) = E\{\sigma_i \sigma_j\} E\{z_i z_j\} = E\{\sigma_i \sigma_j\}\,\text{cov}(z_i, z_j). \tag{4}\]

Thus, \(\text{cov}(s_i, s_j)\) is the same as \(\text{cov}(z_i, z_j)\) up to the positive multiplicative factor \(E\{\sigma_i \sigma_j\}\). The linear correlation coefficient of the components has an upper bound (Appendix A).
- The energy correlation for \(s_i\) and \(s_j\) can be computed as

\[\text{cov}(s_i^2, s_j^2) = \text{cov}(\sigma_i^2, \sigma_j^2) E\{z_i^2\} E\{z_j^2\} + 2 E\{\sigma_i^2 \sigma_j^2\} E\{z_i z_j\}^2, \tag{5}\]

where we used the formula valid for Gaussian variables with zero means, \(E\{z_{i}^{2}z_{j}^{2}\}=E\{z_{i}^{2}\}E\{z_{j}^{2}\}+2E\{z_{i}z_{j}\}^{2}\), which is proven by Isserlis' theorem (Isserlis 1918; Michalowicz et al. 2009). From (5), the energy correlation is caused by the energy correlation of σ and the squared linear correlation of z. Thus, to prove that \(\text{cov}(s_{i}^{2},s_{j}^{2})>0\), it is enough to prove that \(\text{cov}(\sigma_{i}^{2},\sigma_{j}^{2})>0\). In the TICA literature (Hyvärinen et al. 2001), \(\text{cov}(\sigma_{i}^{2},\sigma_{j}^{2})\) is conjectured to be positive when each \(\sigma_i\) takes the following form,

\[\sigma_i = \phi_i\biggl(\sum_{k \in N(i)} u_k\biggr), \tag{6}\]

where N(i) is an index set determining a certain neighborhood, \(\phi_i(\cdot)\) denotes a monotonic nonlinear function, and \(u_i\) is a positive random variable. We follow this conjecture. The energy correlation coefficient of the components also has an upper bound (Appendix A).
The same analysis has been done for the TICA generative model (Case 2) in Hyvärinen et al. (2001). In that model, the sources are linearly uncorrelated and, regarding energy correlations, only the first term in (5) is nonzero because the elements in z are statistically independent. Thus, compared to TICA, CTA features linear correlations, and its energy correlations are stronger as well. Both correlation properties can be checked numerically, as in the following sketch.
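The sketch below verifies (4) and (5) by Monte Carlo for a toy instantiation of the model. The particular distributions (correlated Gaussians for z, and σ built from a shared positive variable u in the spirit of the neighborhood construction (6)) are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

# Toy instantiation: two correlated Gaussians z1, z2, and sigmas that
# share a positive variable u to create energy correlations.
rho = 0.5
z1 = rng.standard_normal(n)
z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
u, v1, v2 = np.abs(rng.standard_normal((3, n))) + 0.1
sig1, sig2 = np.sqrt(u + v1), np.sqrt(u + v2)
s1, s2 = sig1 * z1, sig2 * z2  # source model (2) for a pair

def cov(a, b):
    return np.mean(a * b) - np.mean(a) * np.mean(b)

# Equation (4): cov(s_i, s_j) = E{sigma_i sigma_j} cov(z_i, z_j).
print(cov(s1, s2), np.mean(sig1 * sig2) * cov(z1, z2))

# Equation (5): energy correlation from cov(sigma^2) and cov(z)^2.
lhs = cov(s1**2, s2**2)
rhs = (cov(sig1**2, sig2**2) * np.mean(z1**2) * np.mean(z2**2)
       + 2 * np.mean(sig1**2 * sig2**2) * np.mean(z1 * z2)**2)
print(lhs, rhs)  # the two values agree up to sampling error
```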
Probability distribution and its approximation
We derive here a probability distribution for s in order to estimate the CTA generative model. We make the assumption that the precision matrix Λ of z takes a tridiagonal form, and thus the distribution of z is given by

\[p(\mathbf{z}) \propto \exp\Bigl(-\tfrac{1}{2} \mathbf{z}^{\top} \boldsymbol{\Lambda} \mathbf{z}\Bigr), \tag{7}\]

where the boundary of \(z_i\) is ring-like, i.e., \(z_{i \pm d} = z_i\). All the diagonal elements in Λ are 1, the (i, i+1)-th elements are denoted by \(\lambda_i\), and the others are 0. For σ, we suppose that each element is given by
where \(u_i\) and \(v_i\) are independent positive random variables, statistically independent from each other. Such a mixture of \(u_{i-1}\) and \(u_i\) creates energy correlations in the source vector s, while \(v_i\) generates a source-specific variance. By assuming (8), we follow the conjecture in TICA that energy correlations are positive, as in (6). We assume inverse Gamma distributions for u and v,
The \(a_i\) and \(b_i\) are positive scale parameters. If a scale parameter approaches zero, the corresponding variable converges to zero in the sense of distribution. For example, if \(b_i \to 0\) for all i, the \(u_i\) approach zero, which decouples the \(\sigma_i\) from each other. A sketch of the process which generates the sources s and the data x is depicted in Fig. 1.
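The generative process can also be sketched in code. Since Eqs. (8) and (9) are not reproduced above, the concrete choices below (σ_i = sqrt(u_{i-1} + u_i + v_i) as one simple instance of the neighborhood form (6), inverse Gamma variables realized as reciprocals of Gamma variables with arbitrary shapes, and equal off-diagonal entries λ_i = λ) are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 20, 5000

# Precision matrix Lambda of z: unit diagonal, off-diagonal entries
# lambda_i (all equal here), with ring-like boundary as in (7).
lam = -0.45  # chosen so that Lambda stays positive definite
Lam = np.eye(d)
for i in range(d):
    Lam[i, (i + 1) % d] += lam
    Lam[(i + 1) % d, i] += lam

# z ~ N(0, Lambda^{-1}): nearby z_i become linearly correlated.
z = rng.multivariate_normal(np.zeros(d), np.linalg.inv(Lam), size=T)

# Positive variables u, v: inverse Gamma realized as 1/Gamma; the
# shape/scale values are arbitrary illustrative choices.
u = 1.0 / rng.gamma(shape=2.0, scale=1.0, size=(T, d))
v = 1.0 / rng.gamma(shape=2.0, scale=1.0, size=(T, d))

# Assumed instance of (8): sigma_i mixes u_{i-1} and u_i (ring-like
# neighborhood) plus a source-specific v_i.
sigma = np.sqrt(np.roll(u, 1, axis=1) + u + v)

s = sigma * z                    # source model (2)
A = rng.standard_normal((d, d))  # square mixing matrix of x = A s
x = s @ A.T                      # observed data, one row per sample
```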
Inserting (2) into (7) gives the conditional distribution for s given σ,
We show in Appendix B that Eq. (8) transforms (10) into
To obtain the distribution for s, we need to integrate out u and v in (11), using (9) as prior distributions. However, this seems to be intractable. Therefore, we resort to two approximations,

where \(c_i\) and \(d_i\) are two unknown positive scaling parameters which do not depend on u and v. The above approximations are similar to what has been done for TICA (Hyvärinen et al. 2001, Eq. (3.7)). Below, we analyze the implications of these approximations. With (12) and (13), an approximation of (11) is
where we dropped terms not depending on s, v, or u. The additional parameters \(c_i\) from (12) do not affect the functional form of the approximation. The parameters \(d_i\) from (13) and \(\lambda_i\) occur only as a product; we thus replace them by the new parameter \(\varrho_i = \lambda_i d_i\). By calculating the integral over u and v (see Appendix C for details), we obtain the following approximation for the probability distribution of s,
We use the proportionality sign because we do not know the partition function which normalizes \(\tilde{p}(\mathbf {s};\boldsymbol{\varrho},\mathbf {a},\mathbf {b})\).
The approximation in (15) relates to ICA, TICA, and CTA as follows: In the limit where \(b_i \to 0\), \(\tilde{p}\) becomes the Laplace distribution, as often used in ICA with sparse sources (Case 1). In the limit where \(a_i \to 0\) and \(\varrho_i = 0\) for all i, we obtain TICA (Case 2). Using the fixed values \(a_i = b_i = 1\) and \(\varrho_i = -1\), we obtain

which serves as an approximative distribution for the CTA model (Case 4) with positively correlated sources, as we justify in more detail below. Note that this distribution has been previously used as a prior for the regression coefficients in the fused lasso for supervised learning (Tibshirani et al. 2005). However, our application to modeling latent variables is very different.
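Equation (16) itself is not reproduced above, but its fused-lasso form can be sketched as an unnormalized log-density combining an ICA-like sparsity term with a topographic smoothness term. The constants alpha and beta below are placeholders, not the exact values implied by \(a_i = b_i = 1\) and \(\varrho_i = -1\).

```python
import numpy as np

def fused_lasso_logpdf(s, alpha=1.0, beta=1.0):
    """Unnormalized log-density of a fused-lasso-type distribution,
    -alpha * sum_i |s_i| - beta * sum_i |s_i - s_{i+1}|, with the
    ring-like boundary s_{d+1} = s_1. alpha, beta are placeholders."""
    s = np.asarray(s, dtype=float)
    sparsity = np.sum(np.abs(s))                     # ICA-like term
    smoothness = np.sum(np.abs(s - np.roll(s, -1)))  # topographic term
    return -alpha * sparsity - beta * smoothness

# Neighboring components with similar values are more probable:
print(fused_lasso_logpdf([1.0, 0.9, -0.1]))  # -4.2
print(fused_lasso_logpdf([1.0, -0.9, 0.1]))  # -5.8 (sign-flipped neighbor)
```

The smoothness term is what makes the density sensitive to the ordering and the signs of the components, which is exploited in the optimization below.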
Accuracy of the approximation
The two approximations (12) and (13) were used to derive (16). To analyze the implications of these approximations, we compared (16) with the generative model in (2) in terms of correlation and sparsity of the sources.
For the comparison, we sampled from (2) using d = 2 sources and the fixed values \(a_i = b_i = 1\) for different values of \(\lambda_i = \lambda\). We sampled from (16), with d = 2, using slice sampling. For both models, we drew \(10^6\) samples to compute the correlation coefficient between the two sources, the correlation coefficient between their squared values, and their kurtosis.
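The statistics used in this comparison are straightforward to compute from samples; a minimal sketch (where `s` is a (T, 2) array of samples drawn from either model):

```python
import numpy as np
from scipy.stats import kurtosis

def comparison_stats(s):
    """Linear correlation, energy correlation, and excess kurtosis of
    two sources; `s` is a (T, 2) array of samples from either model."""
    lin_corr = np.corrcoef(s[:, 0], s[:, 1])[0, 1]
    energy_corr = np.corrcoef(s[:, 0]**2, s[:, 1]**2)[0, 1]
    kurt = kurtosis(s, axis=0)  # excess kurtosis (Fisher definition)
    return lin_corr, energy_corr, kurt
```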
Figure 2 shows the correlation coefficients for (2) as a function of λ (solid curves), and the correlation coefficients for the approximation (16) as dashed horizontal lines. The plot suggests that the approximation has correlation coefficients qualitatively similar to those of the generative model for λ close to −1.
For the generative model, we found that the (excess) kurtosis of the sources was independent of λ, with a value around 3.4. For the approximation, we obtained a value around 2.1. This means that both the original model and the approximation yield sparse sources.
To conclude, we confirmed that the approximation has qualitatively similar properties to the generative model for λ close to −1. The limitations of the approximation are that, for such λ, its sources are more strongly energy correlated but less sparse than in the original generative model.
Objective function and its optimization
Using the approximative distribution (16), we can compute the log-likelihood for x and obtain the following objective function to estimate the parameter matrix \(\mathbf{W} = (\mathbf{w}_1, \ldots, \mathbf{w}_d)^{\top} = \mathbf{A}^{-1}\):

where we replaced \(|\cdot|\) with \(G(\cdot) = \log\cosh(\cdot)\) for numerical reasons. The vector x(t) denotes the t-th observation of the data, t = 1, 2, …, T. Note that \(J_1\) is the log-likelihood of an ICA model and that \(J_2\) models the topographic part, being sensitive to the order as well as the signs of the \(\mathbf{w}_i\).
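The surrogate G and its derivative (needed for gradient-based optimization in Step 3 below) can be implemented stably as follows; this is a generic sketch, independent of the exact form of (17):

```python
import numpy as np

def G(y):
    """Smooth surrogate for |y|: log cosh(y) grows like |y| for large
    |y| but is differentiable at 0. logaddexp avoids overflow."""
    return np.logaddexp(y, -y) - np.log(2.0)  # = log(cosh(y))

def G_prime(y):
    """Derivative of log cosh, a smooth version of sign(y)."""
    return np.tanh(y)
```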
We now describe a method to optimize the objective function in (17), because basic gradient methods tend to get stuck in local maxima, as we will see in the next section. The proposed algorithm consists of the following three steps:
The final output of the algorithm is \(\mathbf{W}^{(3)}\). Step 1 corresponds to performing ICA, and Step 2 gives the optimal order and the optimal signs of the ICA components in the sense of the objective function \(J_2\). In Step 3, \(\mathbf{W}^{(2)}\) is used as the initial value of W. Therefore, Steps 1 and 2 can be interpreted as a way of finding a good initial value.
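Schematically, the three steps compose as follows. The callables `ica_estimate`, `optimal_ordering`, and `gradient_ascent` are hypothetical placeholders for Steps 1-3, not functions of the actual CTA package:

```python
import numpy as np

def cta_optimize(x, ica_estimate, optimal_ordering, gradient_ascent):
    """Schematic three-step optimization of (17); the three callables
    are hypothetical stand-ins for Steps 1-3."""
    W1 = ica_estimate(x)                 # Step 1: plain ICA estimate
    s1 = x @ W1.T                        # ICA source estimates s^{(1)}
    order, signs = optimal_ordering(s1)  # Step 2: maximize J_2 (by DP)
    W2 = signs[:, None] * W1[order]      # reorder rows and flip signs
    W3 = gradient_ascent(x, W2)          # Step 3: ascend (17) from W2
    return W3
```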
In Step 2, we have to solve a combinatorial optimization problem, which is in general computationally very hard. However, the problem (21) has a nestedness property; in other words, we can divide the main problem into subproblems and thereby solve it efficiently. For example, suppose \(c_1 = 1\) and \(k_1 = 1\). When we want to find the optimal \(c_2\) and \(k_2\) given these \(c_1\) and \(k_1\), we end up solving the smaller subproblem of maximizing the two terms, \(f_{2}(k_{3},c_{3})=\max_{k_{2},c_{2}} [-\frac{1}{T}\sum_{t=1}^{T} \{G(s^{(1)}_{1}-c_{2}s^{(1)}_{k_{2}}) +G(c_{2}s^{(1)}_{k_{2}}-c_{3}s^{(1)}_{k_{3}}) \} ]\), because the other terms do not include \(k_2\) and \(c_2\). Then, we can reuse \(f_2(k_3, c_3)\) when finding the optimal \(c_3\) and \(k_3\). In this situation, dynamic programming (DP) (Bellman 1957; Bellman and Dreyfus 1962; Held and Karp 1962) is an efficient optimization method. The resulting DP algorithm is as follows:
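Equations (23)-(26) are not reproduced above, but the building block they evaluate repeatedly is visible in the definition of \(f_2\): a T-term average of G over a signed pair of component estimates. A minimal sketch of precomputing these chain terms (an illustrative fragment, not Algorithm 2 itself):

```python
import numpy as np

def chain_terms(s):
    """For ICA source estimates s of shape (T, d), precompute
    h[a, b, i, j] = -(1/T) * sum_t G(ci * s_i(t) - cj * s_j(t)),
    where (ci, cj) run over the four sign combinations indexed by
    (a, b). The DP then combines such T-term averages over an
    ordering; each entry costs O(T), matching the run-time analysis."""
    G = lambda y: np.logaddexp(y, -y) - np.log(2.0)  # log cosh
    T, d = s.shape
    signs = (1.0, -1.0)
    h = np.empty((2, 2, d, d))
    for a, ci in enumerate(signs):
        for b, cj in enumerate(signs):
            for i in range(d):
                for j in range(d):
                    h[a, b, i, j] = -np.mean(G(ci * s[:, i] - cj * s[:, j]))
    return h
```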
The last term on the right-hand side of (26) was added because of the ring-like boundary. A MATLAB package for CTA, with which several results presented in this paper can be reproduced, is available at http://www.cs.helsinki.fi/u/ahyvarin/code/cta.
We now briefly describe the run-time cost of the optimization. When the data is high-dimensional, most of the time is spent on the dynamic programming part (Algorithm 2). The computation of (23) requires T additions, and these additions are repeated 4(d−i+1)(d−i) times to build the i-th table (24). This means that the computational cost of the additions is approximately \(O(4T\sum_{i=2}^{d-1}(d-i+1)(d-i))=O(Td^{3})\). Thus, as the dimension of the data increases, more computational time is needed. But, as we will see below, this algorithm dramatically improves the results in terms of topography estimation.