In this section we present our methods. We start with the preliminaries in Sect. 2.1 and then present our main concepts, namely the constraints and the background distribution, in Sect. 2.2. We show how to update the background distribution in Sect. 2.3 and discuss convergence issues in Sect. 2.4. Finally, we show how to find directions in which the data and the background distribution differ using an advanced whitening operation in Sect. 2.5, and summarize our framework for interactive visual data exploration in Sect. 2.6.
Preliminaries
We assume the dataset under analysis consists of n d-dimensional real vectors \(\hat{\mathbf{x}}_i\in {{\mathbb {R}}}^d\), where \(i\in [n]=\{1,\ldots ,n\}\). The whole dataset is represented by a real-valued matrix \(\hat{\mathbf{X}}=\left( \hat{\mathbf{x}}_1\hat{\mathbf{x}}_2\ldots \hat{\mathbf{x}}_n\right) ^T\in {{\mathbb {R}}}^{n\times d}\). We use hatted variables (e.g., \(\hat{\mathbf{X}}\)) to denote the data and non-hatted variables (e.g., \(\mathbf{X}\)) to denote the respective random variables.
Example 1
(Running example, see Fig. 3) To illustrate the central concepts of the approach, we generated a synthetic dataset \(\hat{\mathbf{X}}_{5}\) of 1000 data vectors in five dimensions (denoted by \(\hbox {X}1,\ldots ,\hbox {X}5\)). The dataset is designed so that along dimensions X1–X3 it can be clustered into four clusters (labeled A, B, C, D) and along dimensions X4 and X5 into three clusters (labeled E, F, G). The clusters in dimensions X1–X3 are located such that in any 2-D projection along these dimensions cluster A overlaps with one of the clusters B, C, or D. The cluster structure in dimensions X4 and X5 is loosely related to the cluster structure in dimensions X1–X3: with 75% probability a data vector belonging to clusters B, C, or D belongs to one of the clusters E and F. The remaining points belong to cluster G. The pairplot in Fig. 3 shows the structure of the data (the point types correspond to the cluster identities A, B, C, and D).
Constraints and background distribution
The user interaction consists of selecting a point set (which we refer to as a cluster), studying the statistics of this cluster, and possibly marking this cluster. Subsequently, the system provides a new visualization showing structure complementary to the structure already encoded in the background distribution. To implement the envisioned interaction scheme, we wish to define constraints (specifications of the data) and to construct a background distribution such that the constraints set by the user are satisfied. Intuitively, the more constraints we have, the closer the background distribution should be to the true data, since the added constraints are based on the data. Typically, the constraints will not be sufficient to define a distribution, because there are still many degrees of freedom. Arguably, the most neutral distribution is the distribution of maximum entropy (MaxEnt), because that is the only distribution which does not add any side information (Cover and Thomas 2005).
We must also define some initial background distribution. A reasonable and convenient assumption is that the initial background distribution equals a spherical Gaussian distribution with zero mean and unit variance, given by
$$\begin{aligned} q(\mathbf{X})\propto \exp {\left( -\sum \limits _{i=1}^n{\mathbf{x}_i^T\mathbf{x}_i}/2\right) }. \end{aligned}$$
(1)
This is equivalent to the MaxEnt distribution with known mean and variance (but not co-variance) for all attributes. If we normalize the data, these statistics are obviously zero and one respectively, for every attribute.
As illustrated in Fig. 1, the interaction process is such that the user is shown 2-D projections (Fig. 1c) where the data and the background distribution differ the most. The initial view shown to the user is a projection of the whitened data (see Sect. 2.5) onto the first two PCA components or the two ICA components with the highest score, whichever of these projection methods the user deems more appropriate.
Example 2
Figure 4a shows the projection of the whitened \(\hat{\mathbf{X}}_{5}\) onto the two ICA components with the highest scores using the log-cosh objective function. One can observe the cluster structure in the first three dimensions X1–X3. The gray points represent a sample from the background distribution. When shown together with the data, it becomes evident that the data and the background distribution differ.
Subsequently, we can define constraints on subsets of points in \({{\mathbb {R}}}^{n\times d}\) for a given projection by introducing linear and quadratic constraint functions (Lijffijt et al. 2018). A constraint is parametrized by the subset of rows \(I\subseteq [n]\) that are involved and a projection vector \({\mathbf {w}}\in {\mathbb {R}}^d\). The linear constraint function is defined by
$$\begin{aligned} f_{\text {lin}}({\mathbf {X}},I,{\mathbf {w}})=\sum \nolimits _{i\in I}{{\mathbf {w}}^T{\mathbf {x}}_i}, \end{aligned}$$
(2)
and the quadratic constraint function by
$$\begin{aligned} f_{\text {quad}}({\mathbf {X}},I,{\mathbf {w}})=\sum \nolimits _{i\in I}{\left( {\mathbf {w}}^T\left( {\mathbf {x}}_i-\hat{\mathbf {m}}_I\right) \right) ^2}, \end{aligned}$$
(3)
where we have used
$$\begin{aligned} {\hat{\mathbf {m}}}_I=\sum \nolimits _{i\in I}{{\hat{\mathbf {x}}}_i}/|I|. \end{aligned}$$
(4)
These constraint functions specify the mean and variance for a set of points, for a specific direction \({\mathbf {w}}\). Notice that \({\hat{\mathbf {m}}}_I\) is not a random variable but a constant that depends on the observed data. If it were a random variable, it would introduce cross-terms between rows and the distribution would no longer be independent for different rows. In principle, we could set \({\hat{\mathbf {m}}}\) to any constant value, including zero. However, for the numerical algorithm to converge quickly, we use the value specified by Eq. (4).
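For concreteness, the constraint functions and the cluster mean of Eqs. (2)–(4) can be computed as in the following minimal NumPy sketch (the function and variable names are ours and purely illustrative):

```python
import numpy as np

def f_lin(X, I, w):
    """Linear constraint function, Eq. (2): sum over rows i in I of w^T x_i."""
    return float(np.sum(X[I] @ w))

def m_hat(X_hat, I):
    """Cluster mean of the observed data, Eq. (4); a constant, not a random variable."""
    return X_hat[I].mean(axis=0)

def f_quad(X, I, w, m_hat_I):
    """Quadratic constraint function, Eq. (3): sum of squared deviations of the
    projected points from the projected (fixed) cluster mean."""
    return float(np.sum(((X[I] - m_hat_I) @ w) ** 2))
```

For instance, the observed value \({\hat{v}}\) of a quadratic constraint on the data \(\hat{\mathbf{X}}\) would be `f_quad(X_hat, I, w, m_hat(X_hat, I))`.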
We denote a constraint by a triplet \(C=(c,I,{\mathbf {w}})\), where \(c\in \{{ lin},{ quad}\}\), and the constraint function is then given by \(f_c({\mathbf {X}},I,{\mathbf {w}})\). We can now use the linear and quadratic constraint functions to express several types of knowledge a user may have about the data, e.g., knowledge of a cluster in the data or the marginal distribution of the data, which we can then encode into the background distribution.
To start with, we can encode the mean and variance, i.e., the first and second moment of the marginal distribution, of each attribute:
- (i)
Margin constraint consists of a linear and a quadratic constraint for each of the columns in [d]; the total number of constraints is thus 2d.
We can encode the mean and (co)variance statistics of a point cluster for all attributes:
- (ii)
Cluster constraint is defined as follows. We compute a singular value decomposition (SVD) of the points in the cluster defined by I. Then a linear and a quadratic constraint is defined for each of the eigenvectors. This results in 2d constraints per cluster (a code sketch of this construction is given after this list).
We can also encode the mean and (co)variance statistics of the full data for all attributes:
- (iii)
1-cluster constraint is a special case of a cluster constraint where the full dataset is assumed to form one single cluster (i.e., \(I=[n]\)). Essentially, this means that the data is modeled by its principal components and the correlations are taken into account, unlike with the margin constraints, again resulting in 2d constraints.
Finally, we can encode the mean and variance of a point cluster or the full data as shown in the current 2-D projection:
- (iv)
2-D constraint consists of a linear and a quadratic constraint for each of the two vectors spanning the 2-D projection in question, resulting in four constraints.
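The following sketch shows how the margin constraints (i) and the cluster constraints (ii) could be assembled as triplets \((c,I,{\mathbf {w}})\); it uses the same illustrative representation as the sketch above and is not the original implementation:

```python
import numpy as np

def margin_constraints(X_hat):
    """Constraint type (i): a linear and a quadratic constraint per attribute."""
    n, d = X_hat.shape
    I = np.arange(n)
    cons = []
    for j in range(d):
        w = np.zeros(d)
        w[j] = 1.0
        cons += [('lin', I, w), ('quad', I, w)]
    return cons                                   # 2d constraints in total

def cluster_constraints(X_hat, I):
    """Constraint type (ii): SVD of the (centered) cluster rows, then a linear
    and a quadratic constraint for each right singular vector."""
    Xc = X_hat[I] - X_hat[I].mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc)                  # Vt has shape (d, d)
    cons = []
    for w in Vt:
        cons += [('lin', I, w), ('quad', I, w)]
    return cons                                   # 2d constraints per cluster
```

The 1-cluster constraint (iii) is then simply `cluster_constraints(X_hat, np.arange(n))`, and a 2-D constraint (iv) uses the two vectors spanning the current projection in place of the singular vectors.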
Updating the background distribution
Having formalized the constraints, we are now ready to formulate our main problem, i.e., how to update the background distribution given a set of constraints.
Problem 1
Given a dataset \({\hat{\mathbf {X}}}\) and k constraints \({\mathscr {C}}=\{C^1,\ldots ,C^k\}\), find a probability density p over datasets \(\mathbf{X}\in {{\mathbb {R}}}^{n\times d}\) such that the entropy defined by
$$\begin{aligned} S=-E_{p(\mathbf{X})}\left[ \log {\left( p(\mathbf{X})/q(\mathbf{X})\right) }\right] \end{aligned}$$
(5)
is maximized, while the following constraints are satisfied for all \(t\in [k]\):
$$\begin{aligned} E_{p({\mathbf {X}})}\left[ f_{c^t}({\mathbf {X}},I^t,\mathbf { w}^t)\right] ={\hat{v}}^t, \end{aligned}$$
(6)
where \({\hat{v}}^t= f_{c^t}( {\hat{\mathbf {X}}},I^t,{\mathbf {w}}^t)\) and \(q(\mathbf{X})\propto \exp {\left( -\sum \nolimits _{i=1}^n{\mathbf{x}_i^T\mathbf{x}_i}/2\right) }\).
The distribution p that is a solution to Problem 1 is the background distribution taking into account \({\mathscr {C}}\). Intuitively, the background distribution is the maximally random distribution such that the constraints are preserved in expectation. Due to our choice of the initial background distribution and the constraint functions, the MaxEnt solution to Problem 1 is a multivariate Gaussian distribution. The form of the solution to Problem 1 is given by the following lemma.
Lemma 1
The probability density p that is a solution to Problem 1 is of the form
$$\begin{aligned} p(\mathbf{X})\propto q(\mathbf{X})\times \exp {\left( \sum \nolimits _{t=1}^k{\lambda ^tf_{c^t}(\mathbf{X},I^t,\mathbf{w}^t)}\right) }, \end{aligned}$$
(7)
where \(\lambda ^t\in {{\mathbb {R}}}\) are real-valued parameters.
See, e.g., Cover and Thomas (2005, Chapter 12) for a proof.
We note that adding a margin constraint or a 1-cluster constraint to the background distribution is equivalent to normalizing the data to zero mean and unit variance or to whitening the data, respectively.
Equation (7) can also be written in the form
$$\begin{aligned} p(\mathbf{X}\mid \theta )\propto \exp {\left( -\sum \nolimits _{i=1}^n{ \left( \mathbf{x}_i-\mathbf{m}_i\right) ^T \varSigma _i^{-1} \left( \mathbf{x}_i-\mathbf{m}_i\right) /2} \right) }, \end{aligned}$$
(8)
using the natural parameters collectively denoted by
$$\begin{aligned} \theta =\left\{ \theta _i\right\} _{i\in [n]} =\left\{ \left( \varSigma ^{-1}_i\mathbf{m}_i,\varSigma _i^{-1}\right) \right\} _{i\in [n]}. \end{aligned}$$
By matching the terms linear and quadratic in \(\mathbf{x}_i\) in Eqs. (7) and (8), we can write Eq. (8) as sums of the terms of the form \(\lambda ^tf_{c^t}(\mathbf{X},I^t,\mathbf{w}^t)\). The dual parameters are given by \(\mu =\left\{ \mu _i\right\} _{i\in [n]}= \left\{ \left( \mathbf{m}_i,\varSigma _i\right) \right\} _{i\in [n]}\) and can be obtained from the natural parameters by using matrix inversion and multiplication operations.
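For concreteness, a minimal NumPy sketch of this conversion (per row, or per equivalence class of rows as discussed below) could read:

```python
import numpy as np

def natural_to_dual(theta1, theta2):
    """theta_i = (Sigma_i^{-1} m_i, Sigma_i^{-1})  ->  mu_i = (m_i, Sigma_i)."""
    Sigma = np.linalg.inv(theta2)
    return Sigma @ theta1, Sigma

def dual_to_natural(m, Sigma):
    """mu_i = (m_i, Sigma_i)  ->  theta_i = (Sigma_i^{-1} m_i, Sigma_i^{-1})."""
    P = np.linalg.inv(Sigma)
    return P @ m, P
```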
Problem 1 can be solved numerically as follows. Initially, we set the lambda parameters to \(\lambda ^1=\cdots =\lambda ^k=0\), with the natural and dual parameters then given by \(\theta _i=\mu _i=\left( \mathbf{0},\mathbf{1}\right) \) for all \(i\in [n]\). Given a set of constraints, the lambda parameters are updated iteratively as follows. Given some values for the lambda parameters and the respective natural and dual parameters, we choose a constraint \(t\in [k]\) and find a value for \(\lambda ^t\) such that the constraint in Eq. (6) is satisfied for this chosen t. We then iterate this process over all constraints \(t\in [k]\) until convergence. Due to the convexity of the problem, we are guaranteed to eventually reach a globally optimal solution. For a given set of lambda parameters, we can find the natural parameters in \(\theta \) by simple addition, and the dual parameters \(\mu \) from \(\theta \). Finally, the expectation in Eq. (6) can be computed using the dual parameters and the identities \(E_{p(\mathbf{X}\mid \theta )}\left[ \mathbf{x}_i\mathbf{x}_i^T\right] =\varSigma _i+\mathbf{m}_i\mathbf{m}_i^T\) and \(E_{p(\mathbf{X}\mid \theta )}\left[ \mathbf{x}_i\right] =\mathbf{m}_i\).
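The overall iteration can be summarized by the following sketch. Here `solve_lambda_t` is a hypothetical helper implementing the constraint-specific updates derived below (Eqs. 9 and 10), and `params` stands for the per-row (or per-equivalence-class) natural and dual parameters, which the helper is assumed to update in place:

```python
def fit_background(constraints, v_hat, params, solve_lambda_t,
                   tol=1e-2, max_sweeps=1000):
    """Coordinate-wise sketch of solving Problem 1: sweep over the constraints,
    for each one adjust lambda^t so that the expectation in Eq. (6) matches the
    observed value v_hat[t], and stop when the changes become small."""
    for _ in range(max_sweeps):
        max_change = 0.0
        for t, (c, I, w) in enumerate(constraints):
            dlam = solve_lambda_t(c, I, w, v_hat[t], params)  # change in lambda^t
            max_change = max(max_change, abs(dlam))
        if max_change < tol:
            break
    return params
```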
Example 3
After observing the view in Fig. 4a the user can add a cluster constraint for each of the four clusters visible in the view. The background distribution is then updated to take into account the added constraints by solving Problem 1. In Fig. 4b a sample of the updated background distribution (gray points) is shown together with the data (black points).
Update rules
A straightforward implementation of the above-mentioned optimization process is inefficient because we need to store parameters for n rows and the matrix inversion is an \(O(d^3)\) operation, resulting in a time complexity of \(O(nd^3)\). We can, however, substantially speed up the computations using two observations. First, two rows affected by exactly the same set of constraints will have equal parameters, i.e., we have \(\theta _i=\theta _j\) and \(\mu _i=\mu _j\) for such rows i and j. Thus, we only need to store and compute the parameters \(\theta _i\) and \(\mu _i\) for “equivalence classes” of rows, whose number depends on the number and the overlap of the constraints, but not on n. Second, if we store both the natural and dual parameters at each iteration, the update due to each constraint corresponds to a rank-1 update of the inverse covariance matrix \(\varSigma ^{-1}_i\). We can then use the Woodbury Matrix Identity to compute the inverse in \(O(d^2)\) time instead of \(O(d^3)\).
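A minimal sketch of this rank-1 covariance update via the Woodbury (Sherman–Morrison) identity, here for a single \(\varSigma _i\), could read:

```python
import numpy as np

def woodbury_rank1(Sigma, w, lam):
    """Covariance after adding lam * w w^T to the precision matrix:
    (Sigma^{-1} + lam * w w^T)^{-1}
        = Sigma - lam * (Sigma w)(Sigma w)^T / (1 + lam * w^T Sigma w),
    an O(d^2) operation instead of an O(d^3) full inversion."""
    g = Sigma @ w
    return Sigma - lam * np.outer(g, g) / (1.0 + lam * (w @ g))
```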
A further observation is that by storing the natural and dual parameters at each step, we do not need to explicitly store the values of the lambda parameters. At each iteration we are only interested in the change of \(\lambda ^t\) instead of its absolute value. After these speedups, we expect the optimization process to take \(O(d^2)\) time per constraint and to be asymptotically independent of n. For simplicity, in the following description, we retain the sums of the form \(\sum \nolimits _{i\in I^t}{}\). However, in the implementation we replace these by the more efficient weighted sums over the equivalence classes of rows. To simplify and clarify the notation we use parameters with a tilde (e.g., \({\tilde{\varSigma }}\)) to denote them before the update and parameters without (e.g., \(\varSigma \)) to denote the values after the update, and \(\lambda \) to denote the change in \(\lambda ^t\).
For a linear constraint \(t\), the expectation is given by
$$\begin{aligned}v^t= E_{p(\mathbf{X}\mid \theta )}\left[ f_{lin}({\mathbf {X}},I^t,\mathbf { w}^t) \right] =\sum \nolimits _{i\in I^t}{\mathbf{w}^{tT}\mathbf{m}_i}.\end{aligned}$$
The update rules for the parameters are given by \(\theta _{i1}={\tilde{\theta }}_{i1}+\lambda \mathbf{w}^t\) and \(\mu _{i1}=\varSigma _i\theta _{i1}\). Solving for \(v^t={\hat{v}}^t\) gives the required change in \(\lambda ^t\) as
$$\begin{aligned} \lambda =\left( {\hat{v}}^t-{\tilde{v}}^t\right) /\left( \sum \nolimits _{i\in I^t}{\mathbf{w}^{tT}{\tilde{\varSigma }}_i\mathbf{w}^t} \right) , \end{aligned}$$
(9)
where \({\tilde{v}}^t\) denotes the value of \(v^t\) before the update. Notice that the change in \(\lambda ^t\) is zero if \({\tilde{v}}^t={\hat{v}}^t\), as expected.
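In code, the linear-constraint update of Eq. (9) is a single division; a sketch, assuming the affected rows (or equivalence classes) are represented by their dual parameters \((\mathbf{m}_i,\varSigma _i)\):

```python
def delta_lambda_linear(w, duals, v_hat_t, v_tilde_t):
    """Change in lambda^t for a linear constraint, Eq. (9)."""
    denom = sum(w @ Sigma_i @ w for _, Sigma_i in duals)   # sum_i w^T Sigma_i w
    return (v_hat_t - v_tilde_t) / denom
```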
For a quadratic constraint \(t\), the expectation is given by
$$\begin{aligned}v^t= E_{p({\mathbf {X}}\mid \theta )}\left[ f_{quad}({\mathbf {X}},I^t,\mathbf { w}^t)\right] =\mathbf{w}^{tT} \sum \nolimits _{i\in I^t}{\left( \varSigma _i+\mathbf{q}_i\mathbf{q}_i^T\right) }{} \mathbf{w}^t, \end{aligned}$$
where \(\mathbf{q}_i=\mathbf{m}_i-\hat{\mathbf{m}}_{I^t}\). The update rules for the parameters are
$$\begin{aligned} \theta _{i1}&={\tilde{\theta }}_{i1}+\lambda \delta \mathbf{w}^t,\\ \theta _{i2}&={\tilde{\theta }}_{i2}+\lambda \mathbf{w}^t\mathbf{w}^{tT},\\ \mu _{i1}&=\varSigma _i\theta _{i1}, \text{ and }\\ \mu _{i2}&={\tilde{\varSigma }}_i-\lambda \mathbf{g}_i\mathbf{g}_i^T/\left( 1+\lambda \mathbf{w}^{tT}{} \mathbf{g}_i\right) , \end{aligned}$$
where we have used the short-hands \(\delta =\hat{\mathbf{m}}_{I^t}^T\mathbf{w}^t\) and \(\mathbf{g}_i={\tilde{\varSigma }}_i\mathbf{w}^t\). We use the Woodbury Matrix Identity to avoid explicit matrix inversion in the computation of \(\mu _{i2}\). Again, solving for \(v^t={\hat{v}}^t\) gives the equation
$$\begin{aligned} \phi (\lambda )=\sum \nolimits _{i\in I^t}{\left( \varLambda _ic_i^2-f_i^2c_i^2+2f_ic_i(\delta -e_i) \right) }+\hat{v}^t-{\tilde{v}}^t=0, \end{aligned}$$
(10)
where we have used the following shorthands
$$\begin{aligned} \mathbf{b}_i&={\tilde{\varSigma }}_i{\tilde{\theta }}_{i1},\\ c_i&=\mathbf{b}_i^T\mathbf{w}^t, \\ \varLambda _i&=\lambda /\left( 1+\lambda c_i\right) ,\\ d_i&=\mathbf{b}_i^T{\tilde{\theta }}_{i1},\\ e_i&=\tilde{\mathbf{m}}_i^T\mathbf{w}^t, \text{ and }\\ f_i&=\lambda \delta -\varLambda _id_i-\varLambda _i\lambda \delta c_i. \end{aligned}$$
Notice that \(\varLambda _i\) and \(f_i\) are functions of \(\lambda \). We conclude with the observation that \(\phi (\lambda )\) is a monotone function, whose root can be determined efficiently with a one-dimensional root-finding algorithm.
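Since \(\phi \) is monotone, the root of Eq. (10) can be located with any standard one-dimensional bracketing root finder; a sketch using SciPy (the bracket below is an arbitrary placeholder and must in practice be chosen so that \(\phi \) changes sign and the updated precision matrix remains valid):

```python
from scipy.optimize import brentq

def solve_lambda_quadratic(phi, lo=-1e6, hi=1e6):
    """Find the change in lambda^t for a quadratic constraint by locating the
    root of the monotone function phi(lambda) from Eq. (10)."""
    return brentq(phi, lo, hi)
```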
About convergence
In the runtime experiment (Sect. 3.1, Table 2) we define the optimization to have converged when the maximal absolute change in the lambda parameters is at most \(10^{-2}\), or when the maximal change in the means or in the square roots of the variance constraints is at most \(10^{-2}\) times the standard deviation of the full data. In this section we describe a situation where the convergence is very slow and a fixed time cutoff becomes useful. The iteration is guaranteed to converge eventually, but in certain cases, especially if the size of the dataset (n) is small or the size of some clusters (\(|I^t|\)) is small compared to the dimensionality of the dataset (d), the convergence can be slow, as shown in the following adversarial example.
Example 4
Consider a dataset of three points (\(n=3\)) in two dimensions (\(d=2\)), shown in Fig. 5a and given by
$$\begin{aligned} \hat{\mathbf{X}}=\left( \begin{array}{c@{\quad }c}1&{}0\\ 0&{}1\\ 0&{}0\end{array}\right) , \end{aligned}$$
(11)
and two sets of constraints:
- (A)
The first set of constraints consists of the cluster constraints related to the first and the third row and is given by \({\mathscr {C}}_A=\{C^1,\ldots ,C^4\}\), where \(c^1=c^3={ lin}\), \(c^2=c^4={ quad}\), \(I^1=\cdots =I^4=\{1,3\}\), \(w^1=w^2=(1,0)^T\), and \(w^3=w^4=(0,1)^T\).
- (B)
The second set of constraints has an additional cluster constraint related to the second and the third row and is given by \({\mathscr {C}}_B=\{C^1,\ldots ,C^8\}\), where \(C^1,\ldots ,C^4\) are as above and \(c^5=c^7={ lin}\), \(c^6=c^8={ quad}\), \(I^5=\cdots =I^8=\{2,3\}\), \(w^5=w^6=(1,0)^T\), and \(w^7=w^8=(0,1)^T\).
Next, we consider convergence when solving Problem 1 using these two sets of constraints.
Case A The solution to Problem 1 with constraints in \({{\mathscr {C}}}_A\) is given by \(\mathbf{m}_1=\mathbf{m}_3=(\frac{1}{2},0)^T\), \(\mathbf{m}_2=(0,0)^T\),
$$\begin{aligned} \varSigma _1=\varSigma _3=\left( \begin{array}{c@{\quad }c}\frac{1}{4}&{}0\\ 0&{}0\end{array}\right) ,\quad \text{ and }\quad \varSigma _2=\left( \begin{array}{c@{\quad }c}1&{}0\\ 0&{}1\end{array}\right) . \end{aligned}$$
(12)
Note that if the number of data points in a cluster constraint is at most the number of dimensions in the data, there are necessarily directions in which the variance of the background distribution is zero, see Fig. 5a. However, since we have here a single cluster constraint with no overlapping constraints, the convergence is very fast and, in fact, occurs after one pass over the lambda variables as shown in Fig. 5b (black line).
Case B The solution to Problem 1 with the constraints in \({{\mathscr {C}}}_B\) is given by \(\mathbf{m}_1=(1,0)^T\), \(\mathbf{m}_2=(0,1)^T\), \(\mathbf{m}_3=(0,0)^T\), and
$$\begin{aligned} \varSigma _1=\varSigma _2=\varSigma _3=\left( \begin{array}{c@{\quad }c}0&{}0\\ 0&{}0\end{array}\right) . \end{aligned}$$
(13)
Here we observe that adding a second overlapping cluster constraint, combined with the small-variance directions in both of the constraints, restricts the variance of the third data point to zero. Because both of the clusters have only one additional data point, it follows that the variance of all data points is then zero. The small variance and the overlapping constraints cause the convergence here to be substantially slower, as shown in Fig. 5b (gray line). The variance scales roughly as \((\varSigma _1)_{11}\propto \tau ^{-1}\), where \(\tau \) is the number of optimization steps, the global optimum being at the singular point \((\varSigma _1)_{11}=0\).
The slow convergence in the above example is due to the overlap of the constraints and to quadratic constraints with a small variance (caused here by the small number of points per cluster). A way to speed up the convergence would be, perhaps unintuitively, to add more data points: e.g., to replicate each data point 10 times with random noise added to each replicate. When a data point would be selected into a constraint, all of its replicates would be included as well. This would set a lower limit on the variance of the background model and hence would be expected to speed up the convergence. Another way to resolve the issue is simply to cut off the iterations after some time, which leads to a larger variance than in the optimal solution. The latter approach appears to be typically acceptable in practice.
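A minimal sketch of the replication heuristic described above (the replication factor and noise scale here are arbitrary choices):

```python
import numpy as np

def replicate_with_noise(X_hat, k=10, scale=0.05, rng=None):
    """Repeat every row k times and add small Gaussian jitter to each replicate,
    which bounds the variances of the background model away from zero."""
    rng = np.random.default_rng() if rng is None else rng
    X_rep = np.repeat(X_hat, k, axis=0)
    return X_rep + scale * rng.standard_normal(X_rep.shape)
```

When the original row i is selected into a constraint, its replicate rows (here \(ik,\ldots ,ik+k-1\)) must all be included in the corresponding index set.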
Whitening operation for finding the most informative visualization
Once we have found the distribution that solves Problem 1, the next task is to find and visualize the maximal differences between the data and this background distribution.
Here we use a whitening operation which is similar to ZCA-Mahalanobis whitening (Kessy et al. 2018) to find the directions in which the current background distribution p and the data differ the most. The underlying idea is that a direction-preserving whitening transformation of the data with p results in a unit Gaussian spherical distribution, if the data follows the current background distribution p. Thus, any deviation from the unit sphere distribution in the data whitened using p is a signal of difference between the data and the current background distribution.
More specifically, let the distribution p solving Problem 1 be parametrized by \(\mu =\left\{ \left( \mathbf{m}_i,\varSigma _i\right) \right\} _{i\in [n]}\) and consider \(\mathbf{X}=\left( \mathbf{x}_1 \mathbf{x}_2\ldots \mathbf{x}_n\right) ^T\in {{\mathbb {R}}}^{n\times d}\). We define new whitened data vectors \({\mathbf {y}}_i\) as follows,
$$\begin{aligned} {\mathbf {y}}_i=\varSigma _{i}^{-1/2}\left( {\mathbf {x}}_i-\mathbf { m}_i\right) , \end{aligned}$$
(14)
where \(\varSigma _{i}^{-1/2}=U_iD_i^{1/2}U^T_i\) with the SVD of \(\varSigma _i^{-1}\) given by \(\varSigma ^{-1}_{i}=U_iD_iU^T_i\), where \(U_i\) is an orthogonal matrix and \(D_i\) is a diagonal matrix. Notice that if we used one transformation matrix for the whole data, this would correspond to the normal whitening transformation (Kessy et al. 2018). However, here we may have a different transformation for each of the rows. Furthermore, normally the transformation matrix would be computed from the data, but here we compute it from the constrained model, i.e., using the background distribution.
It is easy to see that if \({\mathbf {x}}_i\) obeys the distribution of Eq. (8), then \(D_i^{1/2}U_i^T\left( \mathbf { x}_i-{\mathbf {m}}_i\right) \) obeys a unit spherical distribution. Hence, any rotation of this vector obeys a unit spherical distribution as well. We rotate this vector back to the direction of \(\mathbf{x}_i\) so that, after the final rotation, the vectors \({\mathbf {y}}_i\) for different rows i have a comparable direction.
Now, we apply the whitening transformation to our data matrix \(\hat{\mathbf{X}}\) and use \(\hat{\mathbf{Y}}= \left( \hat{\mathbf{y}}_1\hat{\mathbf{y}}_2\ldots \hat{\mathbf{y}}_n\right) ^T\) to denote the whitened data matrix. Notice that when there are no constraints, that is, \(\mathbf{m}_i=\mathbf{0}\) and \(\varSigma _i^{-1}=\mathbf{1}\), the whitening operation reduces to the identity operation, i.e., \(\hat{\mathbf{Y}}=\hat{\mathbf{X}}\).
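A minimal sketch of the row-wise whitening of Eq. (14), assuming the dual parameters \((\mathbf{m}_i,\varSigma _i)\) are available and non-singular (in an actual implementation the factor \(\varSigma _i^{-1/2}\) would of course be computed only once per equivalence class of rows):

```python
import numpy as np

def whiten_rows(X_hat, duals):
    """Compute y_i = Sigma_i^{-1/2} (x_i - m_i), Eq. (14), with
    Sigma_i^{-1/2} = U_i D_i^{1/2} U_i^T from the eigendecomposition
    Sigma_i^{-1} = U_i D_i U_i^T."""
    Y = np.empty_like(X_hat, dtype=float)
    for i, (m_i, Sigma_i) in enumerate(duals):
        D, U = np.linalg.eigh(np.linalg.inv(Sigma_i))   # eigenvalues D, eigenvectors U
        Sigma_inv_half = U @ np.diag(np.sqrt(D)) @ U.T
        Y[i] = Sigma_inv_half @ (X_hat[i] - m_i)
    return Y
```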
Example 5
To illustrate the whitening operation, we show in Fig. 6 pairplots of the whitened data matrix \(\hat{\mathbf{Y}}_5\) for the synthetic data \(\hat{\mathbf{X}}_5\) and different background distributions (i.e., sets of constraints). Initially, i.e., without any constraints (Fig. 6a), the whitened data matches \(\hat{\mathbf{X}}_5\). Figure 6b shows the whitened data after the background distribution has been updated to take into account the addition of a cluster constraint for each of the four clusters in Fig. 4a. Now, in the first three dimensions X1–X3 the whitened data no longer differs significantly from a Gaussian distribution, while in dimensions X4 and X5 it does.
In order to find directions where the data and the background distribution differ, i.e., where the whitened data \(\hat{\mathbf{Y}}\) differs from the unit Gaussian distribution with zero mean, an obvious choice is to use Principal Component Analysis (PCA) and look for directions in which the variance of \(\hat{\mathbf{Y}}\) differs most from unity. However, it may happen that the variance is already taken into account in the variance constraints. In this case, PCA becomes non-informative because all directions in \(\hat{\mathbf{Y}}\) have equal mean and variance. Instead, we can use, e.g., Independent Component Analysis (ICA) and the FastICA algorithm (Hyvärinen 1999) with the log-cosh G function as the default method to find non-Gaussian directions. To find the best two ICA components, we compute a full set of d components and then take the two components that score best on the log-cosh objective function. Clearly, when there are no constraints, our approach equals standard PCA and ICA on the original data, but when there are constraints, the output will be different.
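The selection of the two ICA components could be implemented roughly as follows; the sketch uses scikit-learn's FastICA (assuming a recent version with the 'unit-variance' whitening option) and scores each unit-variance source by the distance of its mean log-cosh value from that of a standard normal variable:

```python
import numpy as np
from sklearn.decomposition import FastICA

def top2_ica_directions(Y, random_state=0):
    """Run FastICA with the log-cosh contrast on the whitened data Y and return
    the two unmixing directions whose sources deviate most from Gaussianity."""
    d = Y.shape[1]
    ica = FastICA(n_components=d, fun='logcosh', whiten='unit-variance',
                  random_state=random_state)
    S = ica.fit_transform(Y)                              # sources, shape (n, d)
    # Reference value E[log cosh(Z)] for Z ~ N(0, 1), estimated by Monte Carlo.
    z = np.random.default_rng(random_state).standard_normal(100_000)
    g_gauss = np.log(np.cosh(z)).mean()
    scores = np.abs(np.log(np.cosh(S)).mean(axis=0) - g_gauss)
    top2 = np.argsort(scores)[-2:]
    return ica.components_[top2], scores[top2]
```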
To be able to visualize the background distribution together with the data in the found projection, we use a random dataset that can be obtained by sampling a data point for each \(i\in [n]\) from the multivariate Gaussian distribution parametrized by \(\theta _i\).
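The background sample can be drawn directly from the row-wise Gaussians; a sketch, again assuming the dual parameters \((\mathbf{m}_i,\varSigma _i)\) are available:

```python
import numpy as np

def sample_background(duals, rng=None):
    """Draw one point per row i from N(m_i, Sigma_i); the resulting matrix is
    shown as the gray points alongside the data (cf. Fig. 4)."""
    rng = np.random.default_rng() if rng is None else rng
    return np.stack([rng.multivariate_normal(m_i, Sigma_i) for m_i, Sigma_i in duals])
```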
Example 6
The directions in which the whitened data \(\hat{\mathbf{Y}}_5\) in Fig. 6b differs the most from Gaussian (using ICA) are shown in Fig. 4c. The user can observe the cluster structure in dimensions X4 and X5, which would not be possible to find with non-iterative methods. Furthermore, it is clear that the sample from the background distribution (the points shown in gray in Fig. 4) is different from the data in this projection. After adding a cluster constraint for each of the three visible clusters, the updated background distribution becomes a faithful representation of the data, and thus the whitened data shown in Fig. 6c resembles a unit Gaussian spherical distribution in all dimensions. This is also reflected in a visible drop in ICA scores in Table 1.
Table 1 ICA scores (sorted by absolute value) for all five components computed by FastICA for each of the iterative steps in Fig. 4
A summary of the proposed interactive framework for EDA
Now, we are ready to summarize our framework. Initially, we have the dataset \(\hat{\mathbf{X}}\), the set of constraints \({\mathscr {C}}\) is empty, and the background distribution equals a spherical Gaussian distribution with zero mean and unit variance (Eq. 1). At each iteration, the following steps are performed; the exploration continues until the user is convinced that she has observed the relevant features of the data (i.e., there is no longer a visible difference between the background distribution and the data). A sketch of the full loop is given after the list.
- 1.
The data \(\hat{\mathbf{X}}\) is whitened with respect to the background distribution (Eq. 14).
- 2.
The first two PCA or ICA components of the whitened data \(\hat{\mathbf{Y}}\) are computed to obtain the most informative projection with respect to the current knowledge.
- 3.
The data \(\hat{\mathbf{X}}\) and a sample from the background distribution are projected into the directions found in Step 2.
- 4.
In the projection, the user may observe differences between the data and the background distribution. She then formulates the observations in terms of constraints \(\{C_1,\ldots , C_k\}\), and the set of constraints is updated to \({\mathscr {C}}={\mathscr {C}}\cup \{C_1,\ldots , C_k\}\).
- 5.
The background distribution is updated to take into account the added constraints, i.e., Problem 1 is solved with respect to the updated \({\mathscr {C}}\).
- 6.
The process continues from Step 1.
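Putting the pieces together, the whole loop could be organized as in the following sketch. All helpers are the illustrative functions sketched earlier in this section (with simplified signatures), except `ask_user_for_constraints`, which is a hypothetical stand-in for the interactive Step 4:

```python
def explore(X_hat, fit_background, whiten_rows, top2_ica_directions,
            sample_background, ask_user_for_constraints):
    """Interactive loop of Steps 1-6."""
    constraints = []                                  # C is initially empty
    while True:
        duals = fit_background(X_hat, constraints)    # Step 5 (and the initial fit)
        Y = whiten_rows(X_hat, duals)                 # Step 1, Eq. (14)
        W, _ = top2_ica_directions(Y)                 # Step 2 (PCA is the alternative)
        bg = sample_background(duals)                 # background sample for the view
        new = ask_user_for_constraints(X_hat @ W.T, bg @ W.T)   # Steps 3-4
        if not new:                                   # no visible difference left
            return constraints, duals
        constraints += new                            # Step 6: continue from Step 1
```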
Remark 1
If the user has prior knowledge about the data, this can be represented using a set of constraints \({\mathscr {C}}\ne \emptyset \). Then, one should use the distribution p that is a solution to Problem 1 with respect to \({\mathscr {C}}\) as the initial background distribution, instead of using a spherical Gaussian distribution with zero mean and unit variance.
Remark 2
Throughout the process the background distribution has the form of a multivariate Gaussian distribution with mean and co-variance that may differ from point to point. This is not by assumption, but it is the result of the MaxEnt principle along with constraints that specify the mean and variance, which leads to a Gaussian distribution.