We will use the notational convention that bold face upper case symbols represent matrices, bold face lower case symbols represent column vectors, and standard face lower case symbols represent scalars. We assume that our data set consists of n d-dimensional data vectors \(\mathbf {x}_i\). The data set is represented by a real matrix \(\mathbf {X}=\left( \begin{array}{cccc}\mathbf {x}_1^T&\mathbf {x}_2^T&\cdots&\mathbf {x}_n^T\end{array}\right) ^T\in {\mathbb {R}}^{n\times d}\). More generally, we will denote the transpose of the ith row of any matrix \(\mathbf {A}\) as \(\mathbf {a}_i\) (i.e., \(\mathbf {a}_i\) is a column vector). Finally, we will use the shorthand notation \([n]=\{1,\ldots ,n\}\).
2.1 Projection Tile Patterns in Two Flavours
In the interaction step, the proposed system allows users to declare that they have become aware of (and are thus no longer interested in seeing) the value of the projections of a set of points onto a specific subspace of the data space. We call such information a projection tile pattern for reasons that will become clear later. A projection tile parametrizes a set of constraints on the randomization.
Formally, a projection tile pattern, denoted \(\tau \), is defined by a k-dimensional (with \(k\le d\) and \(k=2\) in the simplest case) subspace of \(\mathbb {R}^d\), and a subset of data points \(\mathcal {I}_\tau \subseteq [n]\). We will formalize the k-dimensional subspace as the column space of an orthonormal matrix \(\mathbf {W}_\tau \in \mathbb {R}^{d\times k}\) with \(\mathbf {W}_\tau ^T\mathbf {W}_\tau =\mathbf {I}\), and can thus denote the projection tile as \(\tau =(\mathbf {W}_\tau ,\mathcal {I}_\tau )\). The proposed tool provides two ways in which the user can define the projection vectors \(\mathbf {W}_\tau \) for a projection tile \(\tau \).
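For concreteness, a projection tile can be represented in code directly as the pair \((\mathbf {W}_\tau ,\mathcal {I}_\tau )\). The following is a minimal sketch in Python with NumPy; the names ProjectionTile, W, and rows are ours, not part of the proposed tool, and later sketches in this section reuse them:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ProjectionTile:
    """A projection tile tau = (W_tau, I_tau); names are illustrative."""
    W: np.ndarray     # (d, k) matrix with orthonormal columns: W.T @ W = I_k
    rows: np.ndarray  # indices I_tau of the marked data points

    def __post_init__(self):
        # Check the orthonormality assumption W_tau^T W_tau = I.
        k = self.W.shape[1]
        assert np.allclose(self.W.T @ self.W, np.eye(k), atol=1e-8)
```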
2D Tiles. The first approach simply chooses \(\mathbf {W}_\tau \) as the (two) weight vectors defining the projection within which the data vectors belonging to \(\mathcal {I}_\tau \) were marked. This approach allows the user to simply specify that they have taken note of the positions of that set of data points within this projection. The user makes no further assumptions: they assimilate solely what they see, without drawing conclusions not supported by direct evidence; see Fig. 3b (left).
Clustering Tiles. It seems plausible, however, that when the marked points are tightly clustered, the user concludes that these points are clustered not just within the two dimensions shown in the scatter plot. To allow the user to express such a belief, the second approach takes \(\mathbf {W}_\tau \) to additionally include a basis for other dimensions along which these data points are strongly clustered, see Fig. 3b (right). This is achieved as follows.
Let \(\mathbf {X}(\mathcal {I}_\tau ,:)\) denote the matrix containing the rows of \(\mathbf {X}\) indexed by \(\mathcal {I}_\tau \). Let \(\mathbf {W}\in \mathbb {R}^{d\times 2}\) contain the two weight vectors onto which the data was projected for the current scatter plot. In addition to \(\mathbf {W}\), we want to find any other dimensions along which these data vectors are clustered. These dimensions can be found as those along which the variance of these data points is not much larger than the variance of the projection \(\mathbf {X}(\mathcal {I}_\tau ,:)\mathbf {W}\).
To find these dimensions, we first project the data onto the subspace orthogonal to \(\mathbf {W}\). Let us represent this subspace by a matrix with orthonormal columns, further denoted as \(\mathbf {W}^{\perp }\). Thus, \({\mathbf {W}^{\perp }}^T\mathbf {W}^{\perp }=\mathbf {I}\) and \(\mathbf {W}^T\mathbf {W}^{\perp }=\mathbf {0}\). Then, Principal Component Analysis (PCA) is applied to the resulting matrix \(\mathbf {X}(\mathcal {I}_\tau ,:)\mathbf {W}^{\perp }\). The principal directions corresponding to a variance smaller than a threshold are then selected and stored as columns in a matrix \(\mathbf {V}\). In other words, the variance of each of the columns of \(\mathbf {X}(\mathcal {I}_\tau ,:)\mathbf {W}^{\perp }\mathbf {V}\) is below the threshold.
The matrix \(\mathbf {W}_\tau \) associated to the projection tile pattern is then taken to be:
$$\begin{aligned} \mathbf {W}_\tau =\left( \begin{array}{cc}\mathbf {W}&\,\,\mathbf {W}^\perp \mathbf {V}\end{array}\right) . \end{aligned}$$
The threshold on the variance used could be a tunable parameter, but was set here to twice the average of the variance of the two dimensions of \(\mathbf {X}(\mathcal {I}_\tau ,:)\mathbf {W}\).
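A sketch of how a clustering tile's basis could be assembled, assuming NumPy, with PCA computed via an SVD of the centered data; clustering_tile_basis and var_factor are our own names:

```python
import numpy as np


def clustering_tile_basis(X, rows, W, var_factor=2.0):
    """Sketch: extend the 2-d projection W with extra directions along
    which the marked points X[rows] are tightly clustered.

    X : (n, d) data matrix; W : (d, 2) orthonormal scatter-plot weights.
    Returns W_tau = (W, W_perp @ V) with orthonormal columns.
    """
    Xs = X[rows]                                    # X(I_tau, :)

    # Orthonormal basis W_perp for the complement of span(W), taken from
    # a complete QR factorization of W.
    Q, _ = np.linalg.qr(W, mode="complete")         # (d, d) orthogonal
    W_perp = Q[:, W.shape[1]:]                      # (d, d-2)

    # PCA of the marked points inside the complement.
    P = Xs @ W_perp
    P = P - P.mean(axis=0)
    _, s, Vt = np.linalg.svd(P, full_matrices=False)
    variances = s**2 / len(rows)

    # Threshold: twice the mean variance of the two displayed dimensions.
    threshold = var_factor * np.var(Xs @ W, axis=0).mean()
    V = Vt[variances < threshold].T                 # low-variance directions

    return np.hstack([W, W_perp @ V])
```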
2.2 The Randomization Procedure
Here we describe the approach to randomizing the data. The randomized data should represent a sample from an implicitly defined background model that represents the user’s belief state about the data.
Initially, our approach assumes the user merely has an idea about the overall scale of the data. However, throughout the interactive exploration, the patterns in the data described by the projection tiles will be maintained in the randomization.
Initial Randomization. The proposed randomization procedure is parametrized by n random rotation matrices \(\mathbf {U}_i\in {\mathbb {R}}^{d\times d}\), where \(i\in [n]\) and each \(\mathbf {U}_i\) satisfies \((\mathbf {U}_i)^T=(\mathbf {U}_i)^{-1}\). We further assume a bijective mapping \(f: [n]\times [d]\rightarrow [n]\times [d]\) that can be used to permute the indices of the data matrix. The randomization proceeds in three steps:
- Random rotation of the rows: Each data vector \(\mathbf {x}_i\) is rotated by multiplication with its corresponding random rotation matrix \(\mathbf {U}_i\), leading to a randomized matrix \(\mathbf {Y}\) with rows \(\mathbf {y}_i^T\) defined by:
$$\begin{aligned} \forall i:\ \mathbf {y}_i = \mathbf {U}_i\mathbf {x}_i. \end{aligned}$$
- Global permutation: The matrix \(\mathbf {Y}\) is further randomized by randomly permuting all its elements, leading to the matrix \(\mathbf {Z}\) defined as:
$$\begin{aligned} \forall i,j:\ \mathbf {Z}_{i,j}=\mathbf {Y}_{f(i,j)}. \end{aligned}$$
- Inverse rotation of the rows: Each randomized data vector in \(\mathbf {Z}\) is rotated with the inverse of the rotation applied in step 1, leading to the fully randomized matrix \(\mathbf {X}^{*}\) with rows \(\mathbf {x}_i^{*}\) defined as follows in terms of the rows \(\mathbf {z}_i^T\) of \(\mathbf {Z}\):
$$\begin{aligned} \forall i:\ \mathbf {x}^{*}_i = {\mathbf {U}_i}^T\mathbf {z}_i. \end{aligned}$$
The random rotations \(\mathbf {U}_i\) and the permutation f are sampled uniformly at random from all possible rotation matrices and permutations, respectively.
Intuitively, this randomization scheme preserves the scale of the data points. Indeed, the random rotations leave their lengths unchanged, and the global permutation subsequently shuffles the values of the d components of the rotated data points. Note that without the permutation step, the two rotation steps would undo each other such that \(\mathbf {X}^{*}=\mathbf {X}\). Thus, it is the combined effect that results in a randomization of the data set.
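A sketch of the unconstrained initial randomization, assuming NumPy and SciPy; Haar-distributed orthogonal matrices from scipy.stats.ortho_group stand in for uniformly random rotations, and all function and variable names are ours:

```python
import numpy as np
from scipy.stats import ortho_group  # Haar-distributed orthogonal matrices


def initial_randomization(X, rng=None):
    """Sketch of the three-step randomization (no tile constraints yet)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape

    # Step 1: rotate each row by its own random rotation U_i.
    # (Sampling n d-by-d matrices is for illustration; it is costly for large n.)
    U = [ortho_group.rvs(d, random_state=rng) for _ in range(n)]
    Y = np.vstack([U[i] @ X[i] for i in range(n)])

    # Step 2: a uniformly random permutation f of all n*d cells.
    Z = rng.permutation(Y.ravel()).reshape(n, d)

    # Step 3: undo each row's rotation.
    X_star = np.vstack([U[i].T @ Z[i] for i in range(n)])
    return X_star, U
```

Returning the rotations alongside \(\mathbf {X}^{*}\) mirrors the fact that the tile constraints introduced next act on these same parameters.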
Accounting for One Projection Tile. Once the user has assimilated the information in a projection tile \(\tau =(\mathbf {W}_\tau ,\mathcal {I}_\tau )\), the randomization scheme should incorporate this information by ensuring that it is also present in all randomized versions of the data. This ensures that the randomized data continues to be a sample from a distribution representing the user's belief state about the data.
This is achieved by imposing the following constraints on the parameters defining the randomization:
- Constraints on the rotation matrices: For each \(i\in \mathcal {I}_\tau \), the component of \(\mathbf {x}_i\) that lies within the column space of \(\mathbf {W}_\tau \) must be mapped onto the first k dimensions of \(\mathbf {y}_i=\mathbf {U}_i\mathbf {x}_i\) by the rotation matrix \(\mathbf {U}_i\). This can be achieved by ensuring that:
$$\begin{aligned} \forall i \in \mathcal {I}_\tau :\ \mathbf {W}_\tau ^T\mathbf {U}_i^T=\left( \begin{array}{cc}\mathbf {I}&\ \mathbf {0}\end{array}\right) . \end{aligned}$$
(1)
- Constraints on the permutation: The permutation should not affect any matrix cells with row indices \(i\in \mathcal {I}_\tau \) and column indices \(j\in [k]\):
$$\begin{aligned} \forall i\in \mathcal {I}_\tau ,j\in [k]:\ f(i,j)=(i,j). \end{aligned}$$
(2)
Proposition 1
Using the above constraints on the rotation matrices \(\mathbf {U}_i\) and the permutation f, it holds that:
$$\begin{aligned} \forall i\in \mathcal {I}_\tau , \mathbf {x}_i^T\mathbf {W}_\tau = {\mathbf {x}_i^{*}}^T\mathbf {W}_\tau . \end{aligned}$$
(3)
Thus, the values of the projections of the points in the projection tile remain unaltered by the constrained randomization. We omit the proof, as the more general Proposition 2 is proved below.
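A rotation matrix satisfying Eq. (1), i.e. \(\mathbf {U}_i\mathbf {W}_\tau =\left( \begin{array}{c}\mathbf {I}\\ \mathbf {0}\end{array}\right) \), can be sampled by fixing the first k rows of \(\mathbf {U}_i\) to \(\mathbf {W}_\tau ^T\) and letting a random rotation act on the orthogonal complement. A sketch under the same NumPy/SciPy assumptions as above, with names that are ours:

```python
import numpy as np
from scipy.stats import ortho_group


def constrained_rotation(W_tau, rng=None):
    """Sample U_i subject to the tile constraint of Eq. (1):
    the first k rows of U_i are W_tau^T, so that U_i @ W_tau = [I; 0];
    the remaining rows are a randomly rotated basis of the complement."""
    rng = np.random.default_rng(rng)
    d, k = W_tau.shape
    Q, _ = np.linalg.qr(W_tau, mode="complete")
    W_perp = Q[:, k:]                  # orthonormal complement of span(W_tau)
    R = ortho_group.rvs(d - k, random_state=rng)
    return np.vstack([W_tau.T, (W_perp @ R).T])
```

The permutation constraint of Eq. (2) then amounts to shuffling only the cells of \(\mathbf {Y}\) outside rows \(\mathcal {I}_\tau \) and columns [k].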
Accounting for Multiple Projection Tiles. Throughout subsequent iterations, additional projection tile patterns will be specified by the user. A set of tiles \(\tau _i\) with pairwise disjoint point sets (\(\mathcal {I}_{\tau _i}\cap \mathcal {I}_{\tau _j}=\emptyset \) for \(i\ne j\)) is combined straightforwardly by applying the relevant constraints on the rotation matrices to the respective rows. When the sets of data points affected by the projection tiles overlap, however, the constraints on the rotation matrices need to be combined. The aim of such a combined constraint should be to preserve, for each data vector, the values of its projections onto the projection directions of every tile it is part of.
The combined effect of a set of tiles will thus be that the constraint on the rotation matrix \(\mathbf {U}_i\) will vary per data vector, and depends on the set of projections \(\mathbf {W}_\tau \) for which \(i\in \mathcal {I}_\tau \). More specifically, we propose to use the following constraint on the rotation matrices:
- Constraints on the rotation matrices: Let \(\mathbf {W}_i\in \mathbb {R}^{d\times d_i}\) denote a matrix whose columns are an orthonormal basis for the space spanned by the union of the columns of the matrices \(\mathbf {W}_\tau \) for all \(\tau \) with \(i\in \mathcal {I}_\tau \). Thus, for any i and \(\tau \) with \(i\in \mathcal {I}_\tau \), it holds that \(\mathbf {W}_\tau =\mathbf {W}_i\mathbf {v}_\tau \) for some \(\mathbf {v}_\tau \in \mathbb {R}^{d_i\times \dim (\mathbf {W}_\tau )}\). Then, for each data vector i, the rotation matrix \(\mathbf {U}_i\) must satisfy:
$$\begin{aligned} \forall i \in [n]:\ \mathbf {W}_i^T\mathbf {U}_i^T=\left( \begin{array}{cc}\mathbf {I}&\ \mathbf {0}\end{array}\right) . \end{aligned}$$
(4)
- Constraints on the permutation: The permutation should not affect any matrix cells in row i and columns \([d_i]\):
$$\begin{aligned} \forall i\in [n],j\in [d_i]:\ f(i,j)=(i,j). \end{aligned}$$
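Before stating the formal guarantee, here is a sketch of how the combined basis \(\mathbf {W}_i\) could be computed, by orthonormalizing the concatenation of all relevant tile bases with a pivoted QR factorization. It reuses the hypothetical ProjectionTile class from above; combined_basis and tol are our names:

```python
import numpy as np
from scipy.linalg import qr


def combined_basis(tiles, i, tol=1e-10):
    """Orthonormal basis W_i for the span of the columns of all W_tau
    whose tile contains row i; None if i appears in no tile."""
    Ws = [t.W for t in tiles if i in t.rows]
    if not Ws:
        return None
    Q, R, _ = qr(np.hstack(Ws), mode="economic", pivoting=True)
    # Pivoted QR sorts |diag(R)| in decreasing order; small entries
    # signal columns already spanned by earlier ones.
    rank = int(np.sum(np.abs(np.diag(R)) > tol * np.abs(R[0, 0])))
    return Q[:, :rank]
```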
Proposition 2
Using the above constraints on the rotation matrices \(\mathbf {U}_i\) and the permutation f, it holds that:
$$\begin{aligned} \forall \tau ,\forall i\in \mathcal {I}_\tau , \mathbf {x}_i^T\mathbf {W}_\tau = {\mathbf {x}_i^{*}}^T\mathbf {W}_\tau . \end{aligned}$$
Proof
We first show that \({\mathbf {x}_i^{*}}^T\mathbf {W}_i=\mathbf {x}_i^T\mathbf {W}_i\):
$$\begin{aligned} {\mathbf {x}_i^{*}}^T\mathbf {W}_i&= \mathbf {z}_i^T\mathbf {U}_i\mathbf {W}_i=\mathbf {z}_i^T\left( \begin{array}{c}\mathbf {I}\\ \mathbf {0}\end{array}\right) =\mathbf {z}_i(1:d_i)^T=\mathbf {y}_i(1:d_i)^T =\mathbf {y}_i^T\left( \begin{array}{c}\mathbf {I}\\ \mathbf {0}\end{array}\right) =\mathbf {x}_i^T\mathbf {W}_i. \end{aligned}$$
The result follows from the fact that \(\mathbf {W}_\tau =\mathbf {W}_i\mathbf {v}_\tau \) for some \(\mathbf {v}_\tau \in \mathbb {R}^{d_i\times \dim (\mathbf {W}_\tau )}\). \(\square \)
Technical Implementation of the Randomization Procedure. To ensure the randomization can be carried out efficiently throughout the process, note that, when a new projection tile \(\tau \) arrives, the matrix \(\mathbf {W}_i\) for each \(i\in \mathcal {I}_\tau \) can be updated by computing an orthonormal basis for \(\left( \begin{array}{cc}\mathbf {W}_i&\ \mathbf {W}_\tau \end{array}\right) \).
Additionally, note that the tiles define an equivalence relation over the row indices, in which i and j are equivalent if they were included in the same set of projection tiles so far. Within each equivalence class, the matrix \(\mathbf {W}_i\) will be constant, such that it suffices to compute it only once, simply keeping track of which points belong to which equivalence class.
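A sketch of this incremental update, using the same pivoted-QR idea as the earlier combined_basis sketch; extend_basis is our name, and it would be invoked once per equivalence class rather than once per row:

```python
import numpy as np
from scipy.linalg import qr


def extend_basis(W_i, W_tau, tol=1e-10):
    """Update the per-class basis W_i with a new tile's W_tau by
    orthonormalizing the concatenation (W_i  W_tau)."""
    M = W_tau if W_i is None else np.hstack([W_i, W_tau])
    Q, R, _ = qr(M, mode="economic", pivoting=True)
    rank = int(np.sum(np.abs(np.diag(R)) > tol * np.abs(R[0, 0])))
    return Q[:, :rank]
```

The equivalence classes themselves can be tracked, for instance, by keying each row on the set of tile identifiers it belongs to.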
2.3 Visualization: Finding the Most Interesting Two-Dimensional Projection
Given the data set \(\mathbf {X}\) and the randomized data set \(\mathbf {X}^{*}\), it is now possible to quantify the extent to which the empirical distributions of the projections \(\mathbf {X}\mathbf {w}\) and \(\mathbf {X}^{*}\mathbf {w}\) onto a weight vector \(\mathbf {w}\) differ. There are various ways in which this difference can be quantified. We investigated a number of possibilities and found that the \(L_1\)-distance between the cumulative distribution functions works particularly well in practice. Thus, with \(F_\mathbf {x}\) denoting the empirical cumulative distribution function for the set of values in \(\mathbf {x}\), the optimal projection is found by solving:
$$\begin{aligned} \max _{\mathbf {w}}&\left\| F_{\mathbf {X}\mathbf {w}}-F_{\mathbf {X}^{*}\mathbf {w}}\right\| _1. \end{aligned}$$
The second dimension of the scatter plot can be sought by optimizing the same objective while requiring its weight vector to be orthogonal to that of the first dimension.
We are unaware of any special structure in this optimization problem that would make solving it particularly efficient. Yet, the standard quasi-Newton solver in R [18] already yields satisfactory results. Note that different runs of the method may converge to different local optima due to the random initialization.
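A sketch of the objective and its optimization, assuming NumPy/SciPy; the paper uses R's quasi-Newton solver, so SciPy's L-BFGS-B stands in here, and all names are ours. Since the raw \(L_1\)-distance scales linearly with \(\Vert \mathbf {w}\Vert \), the sketch normalizes \(\mathbf {w}\) inside the objective:

```python
import numpy as np
from scipy.optimize import minimize


def cdf_l1_distance(a, b):
    """L1 distance between the empirical CDFs of two 1-d samples: both
    CDFs are step functions, constant between consecutive breakpoints."""
    t = np.sort(np.concatenate([a, b]))                       # breakpoints
    Fa = np.searchsorted(np.sort(a), t, side="right") / len(a)
    Fb = np.searchsorted(np.sort(b), t, side="right") / len(b)
    return float(np.sum(np.abs(Fa - Fb)[:-1] * np.diff(t)))


def most_interesting_direction(X, X_star, rng=None):
    """Maximize ||F_{Xw} - F_{X*w}||_1 over unit-norm w."""
    rng = np.random.default_rng(rng)

    def neg_objective(w):
        w = w / np.linalg.norm(w)     # the raw distance scales with ||w||
        return -cdf_l1_distance(X @ w, X_star @ w)

    w0 = rng.standard_normal(X.shape[1])    # random initialization
    res = minimize(neg_objective, w0, method="L-BFGS-B")
    return res.x / np.linalg.norm(res.x)
```

For the second dimension, the same optimization can be run over vectors constrained to (for instance, projected onto) the orthogonal complement of the first solution.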