In recent years, a vast amount of data has been accumulated across various fields in industry and academia, and with the rise of artificial intelligence and machine learning technologies, knowledge discovery and high-precision prediction from such data are in demand. However, real-world data are diverse: they include network data that represent relationships, data with multiple modalities or views, and data distributed across multiple institutions that require a certain level of confidentiality. There are also data that require extracting latent features in complex subspaces for analysis. Therefore, analysis methods that can handle such diversity are needed. In this chapter, we introduce effective methods for such data using novel numerical analysis techniques.

This chapter is organized as follows. Section 4.1 gives an overview of several spectral methods for unsupervised dimensionality reduction and clustering. Section 4.2 describes a recent advanced dimensionality reduction method based on a complex moment-based subspace and matrix trace optimization. Section 4.3 introduces methods that can utilize data with multiple views simultaneously. In Sect. 4.4, we describe the so-called data collaboration analysis, which can securely utilize data distributed across multiple institutions.

In this chapter, we denote the numerical dataset of interest by \(X = [{\boldsymbol{x}}_1, {\boldsymbol{x}}_2, \dots , {\boldsymbol{x}}_n]^\textrm{T} \in \mathbb {R}^{n \times m}\), where n is the number of samples (or data objects) and m is the number of features (or attributes). m is also referred to as the dimensionality of the data objects. We use the symbol \(:=\) for definitions and denote \([n] := \{1,2,\dots ,n\}\). For a matrix A, the (i, j) element is denoted by \([A]_{i,j}\). \(I_n\) and \(O_{n,m}\) denote the identity matrix of order n and the \(n \times m\) zero matrix, respectively.

1 Spectral Methods for Machine Learning

In this section, we describe spectral methods for unsupervised dimensionality reduction and clustering. Spectral methods involve the decomposition or representation of data or relationships in terms of their spectral components. Many spectral methods rely on computing eigenspaces of matrices derived from the data in some form. The spectral methods introduced in this section rely on a natively given graph or a graph computed from similarities between the data objects.

Let \(\mathcal {G}:=(\mathcal {V},\mathcal {E})\) be a (weighted) undirected graph, where \(\mathcal {V}\) is the set of vertices and \(\mathcal {E}\) is the set of edges. Here we assume that \(\mathcal {G}\) is natively given, as, for example, in social network analysis. Well-known approaches to forming the graph from the data matrix X are shown at the end of this section. Let W be the (symmetric) adjacency matrix of \(\mathcal {G}\), whose (i, j) element is \(w_{i,j}\). When \((i,j) \in \mathcal {E}\), a positive weight value is assigned to \(w_{i,j}\); otherwise \(w_{i,j} = 0\). D is a diagonal matrix whose ith diagonal element is \(d_i := \sum _{j} w_{i,j}\) (for \(i \in [n]\)). \(L := D - W\) is called the graph Laplacian and is known to be positive semi-definite. It is known that the number of zero eigenvalues of L coincides with the number of connected components of \(\mathcal {G}\). When \(\mathcal {G}\) is connected, the eigenvector corresponding to the unique zero eigenvalue is constant, i.e., all its elements take the same value. In this section, we assume that \(\mathcal {G}\) is connected, for simplicity. For other cases, refer to [45].

The Laplacian eigenmap [1] is a dimensionality reduction method that minimizes a matrix trace involving L. When one considers dimensionality reduction to \(\ell \) dimensions, the objective function is

$$\begin{aligned} \min _{U \in \mathbb {R}^{n \times \ell }} \textrm{Tr}(U^{\textrm{T}} L U) \end{aligned}$$

with the constraint \(U^{\textrm{T}} D U = I_\ell \). The solution of the minimization can be obtained by computing the eigenvectors corresponding to the \(\ell \) smallest non-zero eigenvalues of the generalized eigenvalue problem \(L \boldsymbol{u} = \lambda D \boldsymbol{u}\). This minimization problem can be regarded as a continuous relaxation of the normalized cut problem [45]. The ith row vector of U is regarded as the \(\ell \)-dimensional representation of the ith data object.

Spectral clustering [34, 45] is one of the most popular clustering methods. It applies k-means to the low-dimensional representation obtained by the Laplacian eigenmap. The locality preserving projection (LPP) [12] can be regarded as a linear approximation of the Laplacian eigenmap. Its objective function is

$$\begin{aligned} \min _{Z \in \mathbb {R}^{m \times \ell }} \textrm{Tr}(Z^{\textrm{T}} X^{\textrm{T}} L X Z) \end{aligned}$$

with the constraint \(Z^{\textrm{T}} X^{\textrm{T}} D X Z = I_\ell \). Here we show the algorithm of spectral clustering in Algorithm 1.

Algorithm 1 Spectral clustering. Input: data matrix \(X \in \mathbb {R}^{n \times m}\). Output: cluster memberships C (5 steps).
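To make the procedure concrete, the following is a minimal Python sketch of spectral clustering, assuming that a symmetric weighted adjacency matrix W of a connected graph is already available; the function and parameter names are illustrative and not part of Algorithm 1.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(W, n_clusters):
    """Minimal sketch: Laplacian eigenmap embedding followed by k-means.
    W: symmetric (n, n) weighted adjacency matrix of a connected graph."""
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W                                   # graph Laplacian L = D - W
    # generalized eigenproblem L u = lambda D u; skip the trivial constant
    # eigenvector (zero eigenvalue) of the connected graph
    _, U = eigh(L, D, subset_by_index=[1, n_clusters])
    # each row of U is the low-dimensional representation of a data object
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)
```

The eigenvector step is exactly the Laplacian eigenmap; spectral clustering simply runs k-means on the resulting row vectors.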

Modularity maximization [33] is a popular optimization formulation for graph clustering and is commonly used for community detection in social network graphs. It is a combinatorial optimization problem in which a random graph model is introduced. \(p_{i,j}\) is the expected weight of the edge between vertices i and j under the random graph model (in the unweighted case, \(p_{i,j}\) is the probability of the occurrence of an edge). Modularity Q is defined as

$$\begin{aligned} Q := \frac{1}{\textrm{Vol}(\mathcal {G})} \sum _{i,j} [w_{i,j} - p_{i,j}] \delta (g_i, g_j). \end{aligned}$$

Here, \(g_i\) is the cluster index of the ith vertex and \(\delta (g_i, g_j) = 1\) if \(g_i = g_j\) and 0 otherwise. Large modularity indicates that the intra-cluster vertices of the graph are more densely connected than expected under the given random graph model. In the Chung–Lu random graph model [33], we define \(p_{i,j} = \frac{1}{\textrm{Vol}(\mathcal {G})} {D_{i,i} D_{j,j}}\), where \(\textrm{Vol}(\mathcal {G}) = \sum _{i} D_{i,i}\), so that the expected degrees of the random graph become the same as those of \(\mathcal {G}\). The modularity matrix is defined as

$$\begin{aligned} M := W - \frac{1}{\textrm{Vol}(\mathcal {G})} \boldsymbol{d} \boldsymbol{d}^{\textrm{T}}, \end{aligned}$$

where \(\boldsymbol{d}\) is the vector formed by the diagonal elements of D. Here we introduce a cluster membership matrix U. \([U]_{i,j} = 1\) when the ith vertex belongs to the jth cluster and \([U]_{i,j} = 0\) otherwise. Using this, modularity can be written as

$$\begin{aligned} Q = \textrm{Tr} (U^{\textrm{T}} M U). \end{aligned}$$

Here we omitted the constant factor. When we relax the binary constraint and allow continuous values (with the constraint \(U^{\textrm{T}} U = I_\ell \)), the solution is obtained by computing the eigenvectors corresponding to the \(\ell \) largest eigenvalues of the standard eigenvalue problem

$$\begin{aligned} M \boldsymbol{u} = \lambda \boldsymbol{u}. \end{aligned}$$

This dimensionality reduction method is not as widely used as the Laplacian eigenmap, but can serve as an alternative to it. It has been proposed as an anomaly detection method for densely connected subgraphs [9].
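As an illustration only, the sketch below forms the modularity matrix M and uses the eigenvectors of its largest eigenvalues as a low-dimensional embedding; all names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def modularity_embedding(W, n_dims):
    """Sketch: embedding from the modularity matrix M = W - d d^T / Vol(G).
    W: symmetric (n, n) weighted adjacency matrix."""
    d = W.sum(axis=1)
    vol = d.sum()                                  # Vol(G) = sum of degrees
    M = W - np.outer(d, d) / vol                   # modularity matrix
    n = W.shape[0]
    # eigenvectors for the n_dims largest eigenvalues of M u = lambda u
    _, U = eigh(M, subset_by_index=[n - n_dims, n - 1])
    return U[:, ::-1]                              # order by decreasing eigenvalue
```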

Now we introduce methods for constructing a graph from the data matrix X. One of the most popular approaches is to build a kNN graph: given pairwise similarities, we set \((i,j) \in \mathcal {E}\) if vertex i is one of the k nearest neighbors of vertex j. Because this edge relation is not necessarily symmetric, some symmetrization step is performed [45]. The complexity of a brute-force approach to compute all pairwise similarities (or distances) is \(O(n^2)\), which is intractable for large n. Approximation methods such as recursive Lanczos bisection [2] and kNN descent [5] aim to reduce this computational complexity.
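A minimal sketch of kNN-graph construction with scikit-learn follows, assuming Euclidean distances and binary edge weights; the symmetrization keeps an edge if it appears in either direction.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_adjacency(X, k=10):
    """Sketch: build a symmetrized kNN graph from the data matrix X (n, m).
    Edge weights here are binary; a Gaussian similarity could be used instead."""
    A = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    W = A.maximum(A.T)    # symmetrize: keep an edge if either direction exists
    return W.toarray()
```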

2 Complex Moment-Based Supervised Eigenmap for Dimensionality Reduction

Increasing the number of features (dimensionality) might seem to lead to better performance. In practice, however, adding more features can degrade performance, a phenomenon known as the curse of dimensionality. Dimensionality reduction reduces the number of features to avoid this. Using dimensionality reduction methods, we can speed up algorithms, reduce the risk of overfitting and improve the accuracy of prediction results.

In this section, we briefly introduce dimensionality reduction methods based on matrix trace optimization. These methods use a few eigenvectors to construct a low-dimensional space while preserving certain properties of the original data. We also introduce a complex moment-based supervised eigenmap [17] that uses a large number of eigenvectors to improve recognition performance.

2.1 Dimensionality Reduction Based on a Matrix Trace Optimization

Let m and n be the numbers of features and samples of the training dataset \(X = [{\boldsymbol{x}}_1, {\boldsymbol{x}}_2, \dots , {\boldsymbol{x}}_n]^\textrm{T} \in \mathbb {R}^{n \times m}\). Here, we consider linear and nonlinear dimensionality reduction methods that construct low-dimensional data \(Y = [{\boldsymbol{y}}_1, {\boldsymbol{y}}_2, \dots , {\boldsymbol{y}}_n]^\textrm{T} \in \mathbb {R}^{n \times \ell }\), which retain some of the properties of the original data. Specifically, linear dimensionality reduction methods reduce the original data X to the low-dimensional data Y using a linear map \(B \in \mathbb {R}^{m \times \ell }\), i.e.

$$\begin{aligned} Y = XB. \end{aligned}$$

In each dimensionality reduction method based on a matrix trace optimization, a symmetric matrix \(A_1 \in \mathbb {R}^{m \times m}\) and a symmetric positive definite matrix \(A_2 \in \mathbb {R}^{m \times m}\) are defined. Then, the linear map B is formulated by the minimization or maximization of a matrix trace:

$$\begin{aligned} \min _{B \in \mathbb {R}^{m \times \ell }} \textrm{Tr} \left( B^\textrm{T} A_1 B \right) \quad \text{ or } \quad \max _{B \in \mathbb {R}^{m \times \ell }} \textrm{Tr} \left( B^\textrm{T} A_1 B \right) \quad \text{ s.t. } \quad B^\textrm{T} A_2 B = I_\ell . \end{aligned}$$

This is solved by computing \(\ell \) eigenvectors of the corresponding generalized eigenvalue problem:

$$\begin{aligned} A_1 {\boldsymbol{t}}_i = \lambda _i A_2 {\boldsymbol{t}}_i. \end{aligned}$$
(4.1)

Here, we have \(B = [{\boldsymbol{t}}_1, {\boldsymbol{t}}_2, \dots , {\boldsymbol{t}}_\ell ]\).
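The following sketch illustrates this family of methods, assuming the matrices \(A_1\) and \(A_2\) of the chosen method are given (for LPP, for example, \(A_1 = X^\textrm{T} L X\) and \(A_2 = X^\textrm{T} D X\) with minimization); the function name is illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def trace_optimization_map(A1, A2, n_dims, maximize=False):
    """Sketch: solve min/max Tr(B^T A1 B) s.t. B^T A2 B = I via the
    generalized eigenvalue problem A1 t = lambda A2 t of Eq. (4.1)."""
    m = A1.shape[0]
    if maximize:
        # eigenvectors of the n_dims largest eigenvalues
        _, B = eigh(A1, A2, subset_by_index=[m - n_dims, m - 1])
    else:
        # eigenvectors of the n_dims smallest eigenvalues
        _, B = eigh(A1, A2, subset_by_index=[0, n_dims - 1])
    return B  # columns satisfy B^T A2 B = I by the normalization of eigh
```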

Principal component analysis (PCA) [24, 39] and locality preserving projections (LPP) [12] are two of the typical unsupervised dimensionality reduction methods in this class. PCA aims to maximize the variance of the projected vectors, while LPP aims to preserve the local similarity of the original data. Discriminant analysis is the typical supervised method that maximizes the between-class scatter and reduces the within-class scatter. A family of discriminant analysis methods are proposed for supervised dimensionality reduction, including Fisher discriminant analysis (FDA) [7, 8], local FDA (LFDA) [43], semi-supervised LFDA (SELF) [44] and locality adaptive discriminant analysis (LADA) [28].

Nonlinear dimensionality reduction methods, which use a nonlinear map and the kernel trick [42], are widely used as improvements over linear dimensionality reduction methods. Nonlinear dimensionality reduction methods transform the original data X to \(\phi (X) = [\phi ({\boldsymbol{x}}_1),\phi ({\boldsymbol{x}}_2), \dots , \phi ({\boldsymbol{x}}_n)]^\textrm{T}\) with a nonlinear kernel function and reduce the dimension of \(\phi (X)\) using a nonlinear map \(\widetilde{B}\) such that \(Y = \phi (X)\widetilde{B}\). With appropriate nonlinear functions, nonlinear dimensionality reduction methods are expected to improve the recognition performance.

In general, we set \(\widetilde{B} = \phi (X)^\textrm{T}\widehat{B}\) with \(\widehat{B} \in \mathbb {R}^{n \times \ell }\) and directly compute the Gram matrix \(K = \phi (X) \phi (X)^\textrm{T} \in \mathbb {R}^{n \times n}\) without computing \(\phi (X)\) to reduce the computational costs. The Gaussian kernel, polynomial kernel and sigmoid kernel are commonly used as kernel functions.
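A small illustrative sketch of the kernel trick with a Gaussian kernel follows: the Gram matrix K takes the place of \(\phi (X) \phi (X)^\textrm{T}\), so the low-dimensional representation is obtained as \(Y = K \widehat{B}\) without ever forming \(\phi (X)\). The map \(\widehat{B}\) below is only a random placeholder standing in for a learned map.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Sketch: nonlinear dimensionality reduction via the kernel trick.
X = np.random.rand(100, 20)       # toy data (n = 100 samples, m = 20 features)
K = rbf_kernel(X, gamma=0.5)      # Gaussian (RBF) Gram matrix, shape (n, n)
B_hat = np.random.rand(100, 2)    # placeholder for a learned map in R^{n x ell}
Y = K @ B_hat                     # low-dimensional representation, shape (n, ell)
```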

2.2 A Complex Moment-Based Supervised Eigenmap

As written in Sect. 4.2.1, most of the existing dimensionality reduction methods use only \(\ell \) eigenvectors to construct the low-dimensional space with dimension \(\ell \), which may lead to a loss of information useful for achieving successful classification. To overcome this information loss, a complex moment-based supervised eigenmap (CMSE) for dimensionality reduction has recently been proposed [17]. CMSE allows us to achieve better recognition performance by using a complex moment-based subspace that includes \(d > \ell \) eigenvectors, where d can be set independently of the dimension \(\ell \) of the low-dimensional space.

The basic concepts of CMSE for high recognition performance are summarized as follows:

  • Use the complex moment-based subspace \(\mathcal {S}_\Omega \), which is equivalent to the invariant subspace spanned by the target eigenvectors.

  • Use a novel minimization problem that combines the matrix trace derived from the dimensionality reduction methods and the squared error computed directly against the ground truth data \(Z \in \mathbb {R}^{n \times \ell }\).

Let \(A_1\) and \(A_2\) be the matrices used in a given dimensionality reduction method such as LPP or LFDA. Based on the concept of complex moment-based eigensolvers [14, 40, 41], we also let \(\mathcal {S}_\Omega \) be a complex moment-based subspace with respect to a given real interval \(\Omega = [a,b] \subset \mathbb {R}\) defined by

$$\begin{aligned} &\mathcal {S}_\Omega = \mathcal {R}(S), \quad S = [S_0, S_1, \dots , S_{M-1}], \nonumber \\ &S_k := \frac{1}{2 \pi \textrm{i}} \oint _\Gamma z^k (z A_2 - A_1)^{-1} A_2 V \textrm{d}z, \end{aligned}$$
(4.2)

where \(L, M \in \mathbb {N}_+, V \in \mathbb {R}^{m \times L}\) and \(\Gamma \) is a positively oriented Jordan curve around \(\Omega \). Then, the subspace \(\mathcal {S}_\Omega = \mathcal {R}(S)\) is equivalent to the subspace spanned by eigenvectors \({\boldsymbol{t}}_i\) corresponding to eigenvalues \(\lambda _i \in \Omega \) of (4.1), that is,

$$\begin{aligned} \mathcal {S}_\Omega = \mathcal {R}(S) = \textrm{span}\{ {\boldsymbol{t}}_i \mid \lambda _i \in \Omega \}. \end{aligned}$$

Then, using \(d = \dim (\mathcal {S}_\Omega ) \ge \ell \) eigenvectors, CMSE introduces the following minimization problem on \(\mathcal {S}_\Omega \) to obtain the linear map \(B \in \mathbb {R}^{m \times \ell }\):

$$\begin{aligned} \min _{B = [{\boldsymbol{b}}_1, {\boldsymbol{b}}_2, \dots , {\boldsymbol{b}}_\ell ], {\boldsymbol{b}}_i \in \mathcal {S}_\Omega } E(B) \quad \text{ s.t. } \quad B ^\textrm{T} A_2 B = I_\ell , \end{aligned}$$
(4.3)

with

$$\begin{aligned} E(B) = (1-\mu ) \textrm{Tr}\left( B^\textrm{T} f(A_1) B \right) + \mu \Vert Z - XB \Vert _\textrm{F}^2, \end{aligned}$$

where the objective function E(B) combines a matrix trace derived from dimensionality reduction methods and a squared error computed directly against the ground truth data Z. Here, \(\mu \in [0,1]\) is a weight parameter balancing the two terms and \(f(\cdot )\) is a (meromorphic) weight function of each eigenvector for the minimization.

In (4.3), the column vectors of the linear map B are constrained to lie in the complex moment-based subspace \(\mathcal {S}_\Omega \) and to be \(A_2\)-orthonormal. Let \(U \in \mathbb {R}^{m \times d}\) be an \(A_2\)-orthogonal matrix (\(U^\textrm{T}A_2U=I_d\)) whose columns form an \(A_2\)-orthonormal basis of \(\mathcal {S}_\Omega \), and let \(T = U^\textrm{T}A_1U\). Then, the minimization problem (4.3) can be written as

$$\begin{aligned} &\min _{C \in \mathbb {R}^{d \times \ell }} \left| \left| \left[ \begin{array}{c} \mu ^{1/2} Z \\ O_{d,\ell } \end{array} \right] - \left[ \begin{array}{c} \mu ^{1/2} XU \\ (1-\mu )^{1/2} f(T)^{1/2} \end{array} \right] C \right| \right| _\textrm{F}^2 \quad \text{ s.t. } \quad C^\textrm{T} C = I_\ell , \end{aligned}$$
(4.4)

where \(B = UC\). A minimization problem with an orthogonal constraint (4.4) is called an unbalanced orthogonal Procrustes (UOP) problem, which is solved using an iterative method [6, 38].

In practice, the contour integral (4.2) is approximated by a numerical integration rule such as the N-point trapezoidal rule, as follows:

$$\begin{aligned} \widehat{S}_k = 2 \sum _{j=1}^{N/2} \textrm{Re}(\omega _j z_j^k (z_j A_2 - A_1)^{-1} A_2 V), \end{aligned}$$
(4.5)

using the symmetry of the problem. In addition, to improve the numerical stability, we apply a low-rank approximation of \(\widehat{S}\) with a truncated singular value decomposition in the \(A_2\)-inner product:

$$\begin{aligned} {\widehat{S}} = [{\widehat{U}}, {\widehat{U}}{'}] \left[ \begin{array}{cc} {\widehat{\Sigma }} &{} \\ &{} {\widehat{\Sigma }}{'} \end{array} \right] \left[ \begin{array}{c} {\widehat{W}}^\text {T} \\ {\widehat{W}{'}}^\text {T} \end{array} \right] \approx {\widehat{U}} {\widehat{\Sigma }} {\widehat{W}}^\textrm{T}, \quad {\widehat{U}}^\textrm{T} A_2 {\widehat{U}} = I_{\widehat{d}}, \quad {\widehat{W}}^\textrm{T} {\widehat{W}} = I_{\widehat{d}}, \end{aligned}$$

where \(\widehat{\Sigma }\) is a diagonal matrix whose diagonal entries are the larger singular values, i.e., those with \(\sigma _i/\sigma _1 \ge \delta \) \((\sigma _1 \ge \sigma _2 \ge \dots \ge \sigma _{LM})\). The columns of \(\widehat{U}\) and \(\widehat{W}\) are the corresponding left and right singular vectors. Let \(\widehat{d}\) be the numerical rank, i.e., \(\sigma _{\widehat{d}} / \sigma _1 \ge \delta > \sigma _{\widehat{d}+1} / \sigma _1\). Then, the UOP problem (4.4) is rewritten as

$$\begin{aligned} \min _{\widehat{C} \in \mathbb {R}^{\widehat{d} \times \ell }} \left| \left| \left[ \begin{array}{c} \mu ^{1/2} Z \\ O_{\widehat{d},\ell } \end{array} \right] - \left[ \begin{array}{c} \mu ^{1/2} X \widehat{U} \\ (1-\mu )^{1/2} f(\widehat{T})^{1/2} \end{array} \right] \widehat{C} \right| \right| _\textrm{F}^2 \quad \text{ s.t. } \quad \widehat{C}^\textrm{T} \widehat{C} = I_\ell , \end{aligned}$$
(4.6)

where \(\widehat{T} = \widehat{U}^\textrm{T} A_1 \widehat{U}\) and the map is obtained from \(B = \widehat{U} \widehat{C}\).

The algorithm of CMSE is summarized in Algorithm 2. One of the most time-consuming parts of CMSE is computing the solutions of the N/2 linear systems with L right-hand sides in (4.5) and in Step 2 of Algorithm 2, as follows:

$$\begin{aligned} (z_j A_2 - A_1) P_j = A_2 V, \quad j = 1, 2, \dots , N/2. \end{aligned}$$

Since CMSE has hierarchical parallelism for solving these linear systems, it shows high parallel efficiency, as demonstrated in [48].
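The following is a minimal sketch of the quadrature (4.5), assuming a circular contour centered on \(\Omega = [a,b]\) and dense direct solves; in practice, the N/2 shifted linear systems would be solved independently (and in parallel) with solvers suited to the problem.

```python
import numpy as np

def complex_moment_basis(A1, A2, V, a, b, N=16, M=4):
    """Sketch of Eq. (4.5): approximate S_k (k = 0, ..., M-1) by the N-point
    trapezoidal rule on a circle enclosing Omega = [a, b].
    A1: symmetric, A2: symmetric positive definite, V: real (m, L) matrix."""
    gamma, rho = (a + b) / 2.0, (b - a) / 2.0    # center and radius of the circle
    S = [np.zeros(V.shape) for _ in range(M)]
    # quadrature points in the upper half-plane; conjugate symmetry of the
    # real problem supplies the lower half, hence the factor 2 * Re(...)
    for j in range(N // 2):
        theta = 2.0 * np.pi * (j + 0.5) / N
        z = gamma + rho * np.exp(1j * theta)     # quadrature point z_j
        w = rho * np.exp(1j * theta) / N         # trapezoidal weight omega_j
        P = np.linalg.solve(z * A2 - A1, A2 @ V) # shifted system (z_j A2 - A1) P_j = A2 V
        for k in range(M):
            S[k] += 2.0 * np.real(w * z**k * P)
    return np.hstack(S)                          # [S_0, S_1, ..., S_{M-1}]
```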

Algorithm 2 Complex moment-based supervised eigenmap (CMSE). Input: training dataset and parameters. Output: linear map \(B \in \mathbb {R}^{m \times \ell }\) (4 steps).

3 Multi-view Data Analysis

Multi-view data are ubiquitous in real-world applications. Multi-view data refers to data that include multiple sets of features or representations, each providing a different perspective or view of the objects or entities being observed. For instance, pictures often have textual tags and descriptions, and the same semantic meaning can be represented in multiple languages. Each individual view has its specific properties, while different views often contain complementary information [47]. Multi-view learning can be roughly divided into supervised and unsupervised methods. Here, we mainly introduce the unsupervised setting, i.e., multi-view clustering, which has emerged as a powerful tool for exploring the underlying structure of data from different sources.

Next, we introduce three kinds of multi-view clustering methods, including multi-kernel learning, multi-view graph clustering and multi-view subspace clustering.

3.1 Multi-Kernel Learning

Multi-kernel learning, which aims to optimally combine a group of predefined kernels to improve clustering performance, has been widely applied to multi-view data [47]. A general approach for multi-view data is to use a linear combination of several kernel functions. Since different views are not equally important, the weights of the different kernels should be taken into consideration. An auto-weighted multi-kernel method has been proposed to weigh the views and kernels simultaneously [51]. First, kernel principal component analysis (KPCA) is used on each view to reduce the dimension of the multi-view data. Then, a weighted Gaussian kernel is applied to the low-dimensional multi-view data. The weighted Gaussian kernel integrates the advantages of the Gaussian kernel and the polynomial kernel, and is formulated as

$$\begin{aligned} K(x,y) = \bigg [\exp \bigg (-{\frac{\Vert x-y \Vert ^2}{2\sigma ^2}}\bigg )+R\bigg ] ^d, \quad \forall R\ge 0,\ d\ge 0,\ d\in \mathbb {N}. \end{aligned}$$

The above formula can be expanded based on the binomial theorem as follows.

$$\begin{aligned} \begin{aligned} K(x,y) &= \sum ^{d}_{s=0} \left( \begin{array}{r} d \\ s \end{array} \right) R^{d-s}\exp \bigg (-{\frac{s\Vert x-y \Vert ^2}{2\sigma ^2}}\bigg ) \\ &=R^{d}+\sum ^{d}_{s=1}\left( \begin{array}{r} d \\ s \end{array} \right) R^{d-s}\exp \bigg (-{\frac{\Vert x-y \Vert ^2}{2\sigma ^2/s}}\bigg ). \end{aligned} \end{aligned}$$

Thus, the weighted Gaussian kernel is a combination of d Gaussian kernels with different widths ranging from \(\sigma ^2\) down to \(\sigma ^2/d\). The weight \(R^{d-s}\) reflects the importance of the kernel with width \(\sigma ^2/s\) and enlarges the linear translation of the distance between points.
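A minimal sketch of the weighted Gaussian kernel, computed from squared Euclidean distances, is given below; the function name and default parameters are illustrative.

```python
import numpy as np

def weighted_gaussian_kernel(X, Y, sigma=1.0, R=1.0, d=2):
    """Sketch of the weighted Gaussian kernel
    K(x, y) = (exp(-||x - y||^2 / (2 sigma^2)) + R)^d.
    X: (n, m) and Y: (p, m) arrays; returns the (n, p) kernel matrix."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return (np.exp(-sq_dists / (2.0 * sigma**2)) + R) ** d
```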

For multi-view data with m views, n samples and k clusters, the objective function based on the K-means and weighted Gaussian kernel is formulated as

$$\begin{aligned} \begin{aligned} \min \sum ^{m}_{v=1}\sum ^{n}_{i=1}\omega _{iv}\sum ^{k}_{j=1}\delta _{ij}\Vert \phi ^v(x_i^v)-\phi ^v(c_j^v)\Vert ^2, \\ s.t., \omega _{iv}\ge 0, \prod _{v}\omega _{iv}=1, \end{aligned} \end{aligned}$$

where \(c_j^v\) is the cluster center and \(\delta _{ij}\) is the indicator variable with \(\delta _{ij}=1\) if \(x_i \in c_j\) and \(\delta _{ij}=0\) otherwise. This step derives the weight of each view and the cluster centers. After a finite number of iterations, the procedure arrives at the final clustering result.

By substituting the weighted Gaussian kernel \(K(x,y)=\phi (x)\cdot \phi (y)\), the above formula can be rewritten as

$$\begin{aligned} \begin{aligned} \min \sum ^{m}_{v=1}\sum ^{n}_{i=1}\omega _{iv}\sum ^{k}_{j=1}2\delta _{ij}[(1+R)^d-K(x_i^v,c_j^v)], \\ s.t., \omega _{iv}\ge 0, \prod _{v}\omega _{iv}=1. \end{aligned} \end{aligned}$$

The above formula inherits the properties of K-means and the kernel method, while the weighted Gaussian kernel integrates the advantages of the Gaussian kernel and the polynomial kernel.

3.2 Multi-view Graph Clustering

The objective of multi-view graph clustering is to find a fusion graph across all views and then use other methods such as spectral clustering on the fusion graph to obtain the final clustering result. Here, we introduce a parameter-free Auto-weighted Multiple Graph Learning (AMGL) method [37], which implements the automatic allocation of weights by modifying the traditional spectral clustering model without any hyperparameters.

In spectral clustering, based on the Laplacian matrix, the objective function can be defined as follows.

$$\begin{aligned} \min _{F^{\textrm{T}}F=I} \textrm{Tr}(F^{\textrm{T}}LF). \end{aligned}$$

Based on the above formula, replacing L with \(L^v\), the Laplacian matrix of each view v, leads to a new general framework for multiple graph learning:

$$\begin{aligned} \min _{F\in C}\sum ^{m}_{v=1}\sqrt{\textrm{Tr}(F^{\textrm{T}}L^vF)}, \end{aligned}$$

where no weight factors are explicitly defined. By taking the derivative of the Lagrange function of the above formula, the weight of each view \(\alpha ^v\) can be obtained based on F as

$$\begin{aligned} \alpha ^v=\frac{1}{2\sqrt{\textrm{Tr}(F^{\textrm{T}}L^vF)}}. \end{aligned}$$
(4.7)

Then, the objective function of the AMGL method is set as

$$\begin{aligned} \min _{F\in C}\sum ^{m}_{v=1}\alpha ^v{\textrm{Tr}(F^{\textrm{T}} L^vF)}. \end{aligned}$$

In the above formula, F is continuously used to update \(\alpha ^v\) according to Eq. (4.7), which suggests an alternating optimization strategy that computes F and \(\alpha ^v\) iteratively.
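A minimal sketch of this alternating strategy follows, assuming the per-view graph Laplacians are given and the constraint set C is \(F^\textrm{T} F = I\); a final k-means on the rows of F would give the cluster memberships. All names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def amgl(laplacians, n_clusters, n_iter=20):
    """Sketch of auto-weighted multiple graph learning (AMGL).
    laplacians: list of (n, n) graph Laplacians L^v, one per view."""
    n_views = len(laplacians)
    alpha = np.full(n_views, 1.0 / n_views)        # initial view weights
    for _ in range(n_iter):
        # for fixed weights, F minimizes sum_v alpha^v Tr(F^T L^v F) s.t. F^T F = I
        L_sum = sum(a * L for a, L in zip(alpha, laplacians))
        _, F = eigh(L_sum, subset_by_index=[0, n_clusters - 1])
        # Eq. (4.7): update the weight of each view from the current F
        alpha = np.array([1.0 / (2.0 * np.sqrt(np.trace(F.T @ L @ F) + 1e-12))
                          for L in laplacians])
    return F, alpha
```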

3.3 Multi-view Subspace Clustering

Multi-view subspace clustering learns a unified representation or a latent space from all views. The unified representation is then fed into an off-the-shelf clustering model to obtain the clustering results. Here, we introduce a Non-negative Matrix Factorization (NMF)-based method.

NMF aims to find two non-negative matrices \(U \in \mathbb {R}^{n\times p}\) and \(V \in \mathbb {R}^{p\times m}\) that adequately approximate the original matrix \(X \in \mathbb {R}^{n\times m}\). The reconstruction process can be formulated as the following optimization problem.

$$\begin{aligned} \min _{U,V}\Vert X-UV \Vert _F^2, \quad s.t., U\ge 0, V\ge 0. \end{aligned}$$

Here, U is termed the basis matrix, V is the indicator matrix, and p denotes the desired reduced dimension. Due to the non-negativity constraints, NMF can learn a parts-based representation.

For multi-view data, the objective is to combine multi-view information in the NMF framework. Generally, a common indicator matrix \(V^*\) is enforced in the NMF among different views to perform multi-view clustering. One of the widely used methods is to push each view-dependent indicator matrix \(V^v\) toward a common indicator matrix \(V^*\) [29]. The optimization problem is formulated as

$$\begin{aligned} \begin{aligned} \min _{U^v,V^v,v=1,...,m}\sum ^{m}_{v=1}\Vert X^v-U^{v}V^{v} \Vert _F^2+\lambda _v\Vert V^{v}-V^*\Vert _F^2, \\ s.t., \forall\, 1\le k\le K, \Vert U^{v}_{.,k}\Vert _1=1, U^{v},V^{v},V^*\ge 0. \end{aligned} \end{aligned}$$

The constraint \(\Vert U^{v}_{.,k}\Vert _1=1\) guarantees that \(V^{v}\) lies within the same range for different views. After obtaining the consensus matrix \(V^*\), the cluster label of data point i is computed as \(\arg \max _k V^*_{i,k}\).
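As a rough illustration only, the sketch below is a heavily simplified, one-shot variant of this idea: each view is factorized independently with scikit-learn's NMF, the sample-wise factors (playing the role of the indicator matrices) are normalized and averaged into a consensus \(V^*\), and labels are read off by the arg-max rule. It is not the joint optimization of [29].

```python
import numpy as np
from sklearn.decomposition import NMF

def multiview_nmf_labels(views, n_clusters):
    """Heavily simplified sketch of consensus-based multi-view NMF clustering.
    views: list of non-negative (n, m_v) arrays over the same n samples."""
    indicators = []
    for Xv in views:
        model = NMF(n_components=n_clusters, init="nndsvda", max_iter=500)
        Vv = model.fit_transform(Xv)                       # sample-wise factor, (n, n_clusters)
        Vv = Vv / (Vv.sum(axis=1, keepdims=True) + 1e-12)  # comparable range across views
        indicators.append(Vv)
    V_star = np.mean(indicators, axis=0)                   # crude consensus indicator V*
    return np.argmax(V_star, axis=1)                       # cluster label of each sample
```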

4 Data Collaboration Analysis

In various real-world applications, there is a growing demand for integrated data analysis, in which datasets owned by multiple parties in a distributed manner are collaboratively analyzed. For example, in medical data analysis for rare diseases, it was reported that when the analysis is conducted using only data from a single institution, the accuracy is insufficient because of the small sample size [30]. However, sharing the original medical data is difficult because of privacy concerns, and even if it could be achieved, cross-institutional and cross-border communications incur huge costs. Therefore, methods that achieve privacy-preserving analysis, in which datasets are collaboratively analyzed without sharing the original data, are attracting attention.

In this section, we briefly introduce typical privacy-preserving methods and describe a data collaboration analysis.

4.1 Privacy-Preserving Integrated Data Analysis

A compelling application scenario for privacy-preserving integrated data analysis is medical data analysis involving multiple institutions. Consider a scenario in which there are several municipalities, each with its own set of patients or data samples. Within each municipality, patients receive medical examinations or treatments at multiple medical institutions. There is a variety of medical data (i.e., features), such as red blood cell count and white blood cell count, assigned to each patient, and the types of data vary by institution. Therefore, medical data are partitioned by samples into multiple municipalities (i.e., horizontal data partitioning) and by features into multiple medical institutions (i.e., vertical data partitioning).

If the analysis is conducted using only data from a single institution (local analysis), the accuracy may be insufficient because of the small sample size and limited features. By integrating data from multiple medical institutions in multiple municipalities (centralized analysis), one can achieve a highly accurate analysis; however, data sharing is difficult from the perspective of data confidentiality. Thus, privacy-preserving integrated data analysis for horizontally and vertically partitioned data is essential.

Cryptographic computation is one of the most well-known approaches to ensuring privacy preservation [4, 11, 23]. Cryptographic methods can compute a function over distributed data while retaining the privacy of the data. Any given function can be computed by applying fully homomorphic encryption [10]. However, this approach is not feasible for constructing machine learning models on large datasets because of the large computational cost, even with the latest implementations [3, 50].

In the context of model construction, a typical technology for privacy-preserving integrated data analysis is the federated learning system, whose concept was first proposed by Google [25], typically for Android phone model updates [31]. Federated learning is primarily based on (deep) neural networks and updates the model iteratively.

To update the model, federated stochastic gradient descent (FedSGD) and federated averaging (FedAvg) are typical strategies [31]. FedSGD is a direct extension of the stochastic gradient descent method. In each iteration, each party locally computes a gradient of the shared model using its local dataset and sends it to the server. The shared gradients are averaged and used to update the model. In FedAvg, instead, each party performs more than one batch update using the local dataset and sends the updated model to the server. Then, the shared models are averaged to update the global model. Federated learning methods, including more recent ones such as FedProx [27] and FedCodl [36], require cross-institutional communication in each iteration. For more details, we refer to [26, 46] and references therein.

Recently, a non-model-share-type federated learning method called data collaboration (DC) analysis has been proposed [18, 21]. Instead of sharing models as in the federated learning methods above, DC analysis centralizes dimensionality-reduced intermediate representations. The centralized intermediate representations are transformed into incorporable forms called collaboration representations. Then, the collaboration representation is analyzed as a single dataset. DC analysis does not require iterative cross-institutional communication.

DC analysis has been extended to interpretable model construction [15], novelty detection [22], feature selection [49] and survival analysis [20]. The privacy and accuracy of DC analysis were analyzed in [13]. In addition, the identifiability of the shared intermediate representation, which is essential for analyzing personal information, was analyzed in [19]. The paper [19] proposed a non-readily identifiable DC analysis that realizes privacy-preserving analysis by sharing only non-readily identifiable intermediate representations.

4.2 Data Collaboration Analysis

Let m and n denote the numbers of features (the dimensionality of each sample) and training data samples. Let \(X = [{\boldsymbol{x}}_{1}, {\boldsymbol{x}}_{2}, \dots , {\boldsymbol{x}}_{n}]^\textrm{T} \in \mathbb {R}^{n \times m}\) and \(Y = [{\boldsymbol{y}}_1, {\boldsymbol{y}}_2, \dots ,\) \({\boldsymbol{y}}_n]^\textrm{T} \in \mathbb {R}^{n \times \ell }\) be the training dataset and the corresponding ground truth or labels. Here, for privacy-preserving integrated data analysis among multiple parties, we introduce the algorithm of DC analysis for supervised learning on horizontally partitioned data, that is, data samples are partitioned into c parties as follows:

$$\begin{aligned} X = \left[ \begin{array}{c} X_{1} \\ X_{2} \\ \vdots \\ X_{c} \end{array} \right] , \quad Y = \left[ \begin{array}{c} Y_{1} \\ Y_{2} \\ \vdots \\ Y_{c} \end{array} \right] . \end{aligned}$$

Then, the ith party has a partial dataset and the corresponding ground truth,

$$\begin{aligned} X_{i} \in \mathbb {R}^{n_i \times m}, \quad Y_i \in \mathbb {R}^{n_i \times \ell }, \end{aligned}$$

where \(n = \sum _{i=1}^c n_i\). Note that DC analysis is applicable to datasets with partially common features [32] and to horizontally and vertically partitioned data [21].

DC analysis involves two roles: workers and a master. Each worker has the private dataset \(X_{i}\) and the corresponding ground truth \(Y_i\), which must be analyzed without sharing \(X_{i}\). The master supports the collaborative analysis.

First, all workers generate the same anchor data \(X^\textrm{anc} \in \mathbb {R}^{r \times m}\), which is shareable data consisting of public data or randomly constructed dummy data. Although random anchor data generally functions well for DC analysis [18, 21, 22], using anchor data whose distribution is close to that of the raw dataset can improve the recognition performance [16]. Then, each worker constructs intermediate representations,

$$\begin{aligned} \widetilde{X}_{i} = f_{i}(X_{i}) \in \mathbb {R}^{n_i \times \widetilde{m}_{i}}, \quad \widetilde{X}_{i}^\textrm{anc} = f_{i}(X^\textrm{anc}) \in \mathbb {R}^{r \times \widetilde{m}_{i}}, \end{aligned}$$

where \(\widetilde{m}_{i} < m\), and centralizes them to the master. Here, we can use unsupervised and supervised dimensionality reduction methods, e.g., those in Sect. 4.2.

At the master side, the mapping function \(g_i\) for the collaboration representation is constructed such that \(g_i(\widetilde{X}_i^\textrm{anc}) \approx g_{i'}(\widetilde{X}_{i'}^\textrm{anc})\) \((i, i' = 1, 2, \dots , c)\) in some sense. In practice, \(g_i\) is set as a linear function \(g_i(\widetilde{X}_i^\textrm{anc}) = \widetilde{X}_i^\textrm{anc} G_i\) with \(G_i \in \mathbb {R}^{\widetilde{m}_i \times \widehat{m}}\) and is constructed using the following minimal perturbation problem:

$$\begin{aligned} \min _{E_i, G_i' (i = 1, 2, \dots , c), \Vert Z\Vert _\textrm{F} = 1} \sum _{i=1}^c \Vert E_i \Vert _\textrm{F}^2 \quad \text{ s.t. } (\widetilde{X}_{i}^\textrm{anc} + E_i) G_i' = Z. \end{aligned}$$

This can be solved by a singular value decomposition (SVD)-based algorithm for total least squares problems. Let

$$\begin{aligned}{}[\widetilde{X}^\textrm{anc}_1, \widetilde{X}^\textrm{anc}_2, \dots , \widetilde{X}^\textrm{anc}_c] \approx U_{\widehat{m}} \Sigma _{\widehat{m}} V_{\widehat{m}}^\textrm{T} \end{aligned}$$
(4.8)

be the rank \(\widehat{m}\) approximation based on SVD. Then, the target matrix \(G_i\) is obtained as follows:

$$\begin{aligned} G_i = (\widetilde{X}_i^\textrm{anc})^\dagger U_{\widehat{m}} C, \end{aligned}$$
(4.9)

where \(\dagger \) denotes the Moore–Penrose inverse and \(C \in \mathbb {R}^{\widehat{m} \times \widehat{m}}\) is a nonsingular matrix, for example, \(C=I_{\widehat{m}}\) and \(C=\Sigma _{\widehat{m}}\) are used in practice. The collaboration representations are analyzed as a single dataset, that is,

$$\begin{aligned} Y \approx h(\widehat{X}), \quad \widehat{X} = [ \widehat{\boldsymbol{x}}_1, \widehat{\boldsymbol{x}}_2, \dots , \widehat{\boldsymbol{x}}_n]^\textrm{T} = \left[ \begin{array}{c} \widehat{X}_1 \\ \widehat{X}_2\\ \vdots \\ \widehat{X}_{c} \end{array} \right] = \left[ \begin{array}{c} \widetilde{X}_1 G_1 \\ \widetilde{X}_2 G_2 \\ \vdots \\ \widetilde{X}_c G_c \end{array} \right] \in \mathbb {R}^{n \times \widehat{m}} \end{aligned}$$

with the shared ground truth \(Y_i\), using some supervised machine learning or deep learning method to construct the model function h on the collaboration representation \(\widehat{X}\). The functions \(g_i\) and h are returned to the ith worker.

Let \(X_i^\textrm{test} \in \mathbb {R}^{s_i \times m}\) be a test dataset of the ith party. In the prediction phase, the prediction result \(Y_i^\textrm{pred}\) for \(X_i^\textrm{test}\) is obtained by the following equation:

$$\begin{aligned} Y_i^\textrm{pred} = h( g_i(f_i (X_{i}^\textrm{test}))) \end{aligned}$$

through the intermediate and collaboration representations.
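To summarize the whole procedure, the following is a compact end-to-end sketch under simplifying assumptions: each worker uses a plain SVD (PCA-like) projection as \(f_i\), the collaboration functions follow (4.8) and (4.9) with \(C = I_{\widehat{m}}\), and ridge regression stands in for the model h. All data are synthetic and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def make_f(X, dim):
    """Worker-side dimensionality reduction f_i: projection onto the top
    right singular vectors of the local data (an SVD/PCA-like linear map)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:dim].T                                  # (m, dim)
    return lambda Z: Z @ P

# toy horizontally partitioned data: c parties sharing the same m features
c, m, ell, r = 3, 20, 1, 50
X_parts = [rng.normal(size=(40, m)) for _ in range(c)]
Y_parts = [X @ rng.normal(size=(m, ell)) + 0.1 * rng.normal(size=(40, ell))
           for X in X_parts]
X_anc = rng.normal(size=(r, m))                     # shared (random) anchor data

# worker side: intermediate representations of the local data and the anchor data
f = [make_f(Xi, dim=8) for Xi in X_parts]
X_tilde = [f[i](X_parts[i]) for i in range(c)]
X_tilde_anc = [f[i](X_anc) for i in range(c)]

# master side: collaboration functions g_i via Eqs. (4.8)-(4.9), with C = I
m_hat = 6
U_full, _, _ = np.linalg.svd(np.hstack(X_tilde_anc), full_matrices=False)
U_mhat = U_full[:, :m_hat]                          # rank-m_hat left singular vectors
G = [np.linalg.pinv(Xa) @ U_mhat for Xa in X_tilde_anc]

# analyze the collaboration representation as a single dataset (model h)
X_hat = np.vstack([X_tilde[i] @ G[i] for i in range(c)])
Y_all = np.vstack(Y_parts)
h = Ridge(alpha=1.0).fit(X_hat, Y_all)              # stand-in for the model h

# prediction phase at party i: Y_pred = h(g_i(f_i(X_test)))
X_test = rng.normal(size=(5, m))
Y_pred = h.predict(f[0](X_test) @ G[0])
```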

Algorithm 3 Data collaboration analysis. Input: \(X_i \in \mathbb {R}^{n_i \times m}\), \(Y_i \in \mathbb {R}^{n_i \times \ell }\) and \(X_i^\textrm{test}\), held individually by each party. Output: prediction results \(Y_i^\textrm{pred}\).

The algorithm of the DC analysis is summarized in Algorithm 3, where \(g_i\) is set by (4.9). The DC analysis requires only three cross-institutional communications, Steps 1, 4 and 9 in Algorithm 3. This is a major advantage over federated learning.

As in [13], the DC analysis has the following double privacy layer for the protection of private data \(X_{i}\):

  • No one can infer the private data \(X_{i}\) under the protocol;

  • Even if \(f_{i}\) is stolen, the private data \(X_{i}\) is still protected in terms of \(\varepsilon \)-DR privacy [35].

Under the protocol (Algorithm 3), the function \(f_i\) for the intermediate representation is private and cannot be inferred by others, because no other party possesses both the input and the output of \(f_i\). Therefore, it is impossible to infer the original data \(X_i\) from the shared intermediate representation \(\widetilde{X}_i = f_i(X_i)\) alone. In addition, \(f_i\) is set to a dimensionality reduction function with \(\widetilde{m}_i < m\). Therefore, it is impossible to recover the original data \(X_i\) from \(\widetilde{X}_i = f_i(X_i)\) even when \(f_i\) is available.