Introduction

Medical big data is increasingly used to improve healthcare quality and clinical research, for example in clinical decision support systems [3, 5, 44, 50], the identification of patients for clinical trials [36], and post-marketing surveillance of drugs [31, 51]. While machine learning is one of the critical techniques for analyzing medical data [5, 6, 34, 40, 43, 45], patients’ privacy must be protected in the learning process. Well-known privacy-preserving approaches to machine learning include encryption [8, 15, 29, 32], differential privacy [1, 12, 30], and federated learning [33, 38]. However, encryption incurs a huge computational cost [7], and the accuracy of analysis under differential privacy tends to decrease as the protection becomes stronger [35]. Federated learning, which accumulates encrypted models from the institutions, provides good analysis accuracy and also makes it difficult to estimate the original data. However, the risk of information leakage tends to be high when the shared models carry personal information.

Recently, data collaboration analysis for data distributed among institutions has been proposed as a secure method that does not require data encryption [4, 26, 27, 28, 54]. The method converts raw data to “intermediate representations (IRs)” at each institution by feature extraction; that is, some of the information in the raw data is discarded. Therefore, it is impossible to estimate the original information from the IRs. The IRs gathered from all institutions can be analyzed jointly without being converted back to the original data. Since each institution generates its IR independently, the IRs cannot be compared directly in the way the original data can. Therefore, the data collaboration method learns further transformations that make the IRs comparable. These transformations can be derived by sharing common dummy data among the institutions. The data collaboration method should be appropriate for analyzing medical data, since small datasets distributed among hospitals (e.g., those with regional characteristics or on rare diseases) cannot be shared in their original form. With the method, however, the distributed data can be integrated and analyzed. Thus, the hospitals can make a diagnosis based on information obtained from other institutions.

In this study, the data collaboration analysis method is applied to medical data for identifying patients diagnosed with diabetes mellitus. One of the most useful pieces of information for the secondary use of medical data is diagnosis. Since structured data on diagnoses are limited in terms of accuracy and completeness [14, 37, 53], automated techniques for identifying patients diagnosed with a particular disease based on medical data have increased [11, 21, 22, 41, 52]. Our aim is to show that the data collaboration method identifies patients diagnosed with a particular disease accurately while protecting privacy. The main contributions of this paper include the following items. (1) Application of the data collaboration method to real medical data. (2) Clarification of the influences of anchor data similar to raw data on the classification of disease. (3) Improvement of the classification accuracy by non-linearity of transformation.

The remainder of this paper is organized as follows. Section 2 introduces the data collaboration method in detail, and Sect. 3 illustrates how to apply it to medical data. Section 4 explains the medical data and experimental settings. The results of data collaboration analysis are shown in Sect. 5. Finally, we discuss the results and conclude this study in Sect. 6.

Data Collaboration Analysis

To analyze distributed data without sharing the original datasets, the data collaboration analysis method was originally proposed by Imakura and Sakurai (2019) [24, 25, 26, 27] as a non-model share-type federated learning system and was developed for classification and regression problems [28] and feature selection [54]. A performance comparison between model share-type and non-model share-type federated learning was also reported in [4]. The data collaboration method centralizes only the so-called intermediate representations, which are constructed individually, instead of the original datasets. The algorithm of the data collaboration method comprises the following three steps. (1) Each institution constructs intermediate representations from its raw data individually and sends them to an analyst, called the data collaborator. (2) From the gathered intermediate representations, the collaboration representations are constructed. (3) The collaboration representations, which integrate the individual original datasets, are analyzed as one dataset.

Here, we briefly introduce the practical algorithm. The m-dimensional data of d institutions, \(X_i\), are described as follows:

$$\begin{aligned} X_i = [x_{i1}, x_{i2}, ..., x_{in_i}] \in {\mathbb {R}}^{m \times n_i}~~~(1 \le i \le d), \end{aligned}$$
(1)

where \(n_i\) indicates the number of data samples in the ith institution, and

$$\begin{aligned} \sum _i n_i = n. \end{aligned}$$
(2)

Each institution independently constructs the intermediate representation, \({\tilde{X}}_i\), by a map, \(f_i\), such that

$$\begin{aligned} {\tilde{X}}_i = f_i(X_i)\in {\mathbb {R}}^{{\tilde{m}}_i \times n_i}. \end{aligned}$$
(3)

Note that the \({\tilde{m}}_i\) are not required to be the same. As the map function, \(f_i\), dimensionality reduction methods, including principal component analysis, independent component analysis [23], locally linear embedding [46], and locality preserving projections (LPP) [20], are considered.

Here, because \(f_i\) depends on the institution, i, the intermediate representations of the same data differ, i.e., \(f_i(x) \ne f_j(x)\) \((i \ne j)\). In this case, we cannot combine the intermediate representations and analyze them as one dataset. To overcome this difficulty, the intermediate representations are transformed again to the collaboration representation, \(\hat{X_i}\) = \(g_i({\tilde{X}}_i)\) \(\in {\mathbb {R}}^{{\hat{m}} \times n_i}\), with the function, \(g_i\), satisfying

$$\begin{aligned} g_i(f_i(x))\approx g_j(f_j(x)) \quad (i\ne j). \end{aligned}$$
(4)

Note that \({\hat{X}}_i\) is not an approximation of \(X_i\). The dimensions of \(X_i\) and \({\tilde{X}}_i\) can differ. Instead of the intermediate representation, one can analyze the collaboration representation as one dataset, as follows:

$$\begin{aligned} {\hat{X}} = [\hat{X_1}, \hat{X_2},...,\hat{X_d}] \in \mathbb {R}^{{\hat{m}} \times n}. \end{aligned}$$
(5)

To construct the map, \(g_i\), we introduce shareable data, referred to as an anchor dataset, comprising public data or pseudo-data constructed randomly as follows:

$$\begin{aligned} X^{\text {anc}} = [x_{1}^{\text {anc}}, x_{2}^{\text {anc}}, ..., x_{r}^{\text {anc}}]\in {\mathbb {R}}^{m \times r}, \end{aligned}$$
(6)

where r indicates the amount of anchor data. Applying each map, \(f_i\), to the anchor data, we have the ith intermediate representation of the anchor dataset,

$$\begin{aligned} \tilde{X_i}^{\text {anc}} = f_i(X^{\text {anc}}) \in {\mathbb {R}}^{{\tilde{m}}_i \times r}. \end{aligned}$$
(7)

Then, we share \({\tilde{X}}_i^{\text {anc}}\) and construct \(g_i\), satisfying

$$\begin{aligned} {\hat{X}}_i^{\text {anc}} \approx {\hat{X}}_j^{\text {anc}}, {\hat{X}}_i^{\text {anc}} = g_i({\tilde{X}}_i^{\text {anc}}). \end{aligned}$$
(8)

Imakura and Sakurai (2019) [26] introduce a practical method for constructing \(g_i\) when \(g_i\) is linear. In this situation, the function, \(g_i\), can be computed by solving the minimization problem,

$$\begin{aligned} \min _{g_1, g_2, ..., g_d} \sum _{i=1}^d \Vert Z - g_i(\tilde{X_i}^{\text {anc}})\Vert ^2_\mathrm {F} \end{aligned}$$
(9)

where \(Z =[z_1, z_2, \dots , z_r]\in {\mathbb {R}}^{{\hat{m}} \times r}\) is a target for the collaboration representations, \({\hat{X}}_i^{\text {anc}}\). For the details, we refer to [26, 27].
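
As a rough illustration of Eq. (9) with linear maps, the following NumPy sketch computes a target Z and one matrix per institution from the anchor representations. It is not the implementation of [26, 27]: it assumes that Z is taken from the leading right singular vectors of the stacked anchor representations and that each \(g_i\) is the least-squares solution of \(\Vert Z - G_i {\tilde{X}}_i^{\text {anc}}\Vert _\mathrm {F}\). The function name `collaboration_maps` and the random linear maps standing in for \(f_i\) are hypothetical.

```python
import numpy as np

def collaboration_maps(anchor_irs, m_hat):
    """anchor_irs: list of arrays of shape (m_tilde_i, r). Returns (Z, [G_i])."""
    # Assumed target: leading right singular vectors of the stacked anchor IRs.
    stacked = np.vstack(anchor_irs)
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    Z = vt[:m_hat]                                    # shape (m_hat, r)
    # Least-squares solution of min ||Z - G_i @ A_i||_F for each institution.
    maps = [np.linalg.lstsq(a.T, Z.T, rcond=None)[0].T for a in anchor_irs]
    return Z, maps

rng = np.random.default_rng(0)
m, r, m_hat = 6, 200, 3
X_anc = rng.normal(size=(m, r))                       # shared anchor data
# Hypothetical linear maps f_i with different output dimensions (4 and 5).
F = [rng.normal(size=(4, m)), rng.normal(size=(5, m))]
anchor_irs = [Fi @ X_anc for Fi in F]

Z, G = collaboration_maps(anchor_irs, m_hat)

# Collaboration representation of new raw data held by institution 0.
X_0 = rng.normal(size=(m, 10))
X_hat_0 = G[0] @ (F[0] @ X_0)                         # shape (m_hat, 10)
```

In this sketch each institution needs only its own map and the shared anchor data, while the collaborator sees only the intermediate representations.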

Data Collaboration for Medical Data

Overview

Figure 1 provides an overview of the proposed method. First, the data collaborator constructs the virtual data generator, denoted by G, and distributes it to each hospital or medical institution. By using the generator G, institution i obtains the virtual patient data corresponding to the anchor data, \(X_i^{\text {anc}}\) (Step 1). For privacy protection, the institutions share only a random number seed to generate the virtual data. Second, each institution can arbitrarily select a map by which the raw data \(X_i\) and virtual data \(X^{\text {anc}}_i\), both with dimension M, are converted to intermediate representations (Step 2). The intermediate representation takes the form of extracted features, so its dimension is usually lower than that of the original data. Thus, the privacy problem is resolved, since it is impossible to estimate the original data from the representation. Each hospital has its own dimension for the intermediate representation (denoted by \(\tilde{M_1}\) and \(\tilde{M_2}\) in Fig. 1) because strategies and regulations for sharing medical data differ. Next, the data collaborator gathers the intermediate representations and constructs the reconverting function, \(g_i\), based on the intermediate representations of the anchor data, \({\tilde{X}}^{\text {anc}}\) (Step 3). Finally, the collaboration representations with dimension K, denoted by \({\hat{X}}_i\), are obtained via the reconverting function \(g_i\) and are used as input to the machine learning method (Step 4). In this study, the classification of patients diagnosed with diabetes mellitus (DM) was chosen for its social importance: more than 425 million people worldwide were estimated to have DM [13], and the disease can cause other critical diseases [50].

Fig. 1 Overview of this study

Generating the Intermediate Representation by LPP

Regarding the map f, this study uses LPP [20], which has a low computational cost. LPP is a linear approximation of the nonlinear Laplacian eigenmap. The algorithm has three steps: (1) Constructing the adjacency graph by k-nearest neighbors (KNN) [2]. In this study, k=5. (2) Choosing the symmetric weight matrix W, which has one row and one column per data sample, by calculating the weights as:

$$\begin{aligned} W_{ij} = e^{-\frac{\Vert \mathbf {X}_i-\mathbf {X}_j\Vert ^2}{t}}, \end{aligned}$$
(10)

where \(W_{ij}\) is the weight between nodes i and j, \(\mathbf {X}_i\) here denotes the ith data sample, and \(t \in {\mathbb {R}}\) is a parameter. (3) Solving the generalized eigenvector problem. The eigenvectors and eigenvalues are calculated as follows:

$$\begin{aligned} XLX^T\mathbf{{a}} = \lambda XDX^T\mathbf{{a}}, \end{aligned}$$
(11)

where D is a diagonal matrix whose entries are the column sums of W, and \(L=D-W\) is the Laplacian matrix. Finally, the eigenvectors \(\mathbf {a}\) associated with the smallest eigenvalues determine f.
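
The three steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the function name `lpp`, the default `t=1.0`, the symmetrization of the KNN graph, and the small ridge term added to \(XDX^T\) for numerical stability are all assumptions. The generalized problem of Eq. (11) is reduced to a standard symmetric eigenproblem via a Cholesky factor.

```python
import numpy as np

def lpp(X, n_components, k=5, t=1.0, ridge=1e-8):
    """X has shape (m, n), one sample per column.

    Returns (A, eigvals); A has shape (m, n_components) and A.T @ X is the
    intermediate representation.
    """
    m, n = X.shape
    # Squared Euclidean distances between all pairs of samples.
    sq = np.sum(X * X, axis=0)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)
    # Step 1: symmetric k-nearest-neighbour adjacency without self-loops.
    masked = d2.copy()
    np.fill_diagonal(masked, np.inf)
    nearest = np.argsort(masked, axis=1)[:, :k]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), k), nearest.ravel()] = True
    adj = adj | adj.T
    # Step 2: heat-kernel weights of Eq. (10) on adjacent pairs only.
    W = np.where(adj, np.exp(-d2 / t), 0.0)
    D = np.diag(W.sum(axis=0))
    L = D - W
    # Step 3: X L X^T a = lambda X D X^T a, solved through rhs = C C^T.
    lhs = X @ L @ X.T
    rhs = X @ D @ X.T + ridge * np.eye(m)   # ridge term is an assumption
    C_inv = np.linalg.inv(np.linalg.cholesky(rhs))
    S = C_inv @ lhs @ C_inv.T
    eigvals, Y = np.linalg.eigh((S + S.T) / 2.0)      # ascending eigenvalues
    A = C_inv.T @ Y[:, :n_components]                 # smallest eigenvalues first
    return A, eigvals[:n_components]

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 60))                     # synthetic data, not patient data
A_demo, lam = lpp(X_demo, n_components=2)
ir_demo = A_demo.T @ X_demo                           # shape (2, 60)
```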

Node2vec for Reconverting Function

The previous study [26] used SVD to construct a linear reconverting function g. As an alternative, this paper uses Node2Vec [19], which is non-linear. Node2Vec is a graph-embedding [18],[?] and network-embedding method [10] that converts graph and network structures to vectors. Specifically, both DeepWalk [42] and Node2Vec [19] estimate vector representations from the graph structure, and both are based on a skip-gram model [39]. Node2Vec uses random walks, i.e., sequences of nodes sampled along the edges of the graph [10]. We consider the weight matrix, \({\tilde{W}} \in {\mathbb {R}}^{r \times r}\), which is obtained by integrating the KNN adjacency matrices, \(\tilde{W_i} \in {\mathbb {R}}^{r \times r}\). In this study, \({\tilde{W}}\) is the weighted sum of the \({\tilde{W}}_i\) so that the KNN relations are maintained after reconversion. Therefore, \({\tilde{W}}\) is defined as follows:

$$\begin{aligned} {\tilde{W}} = \sum _{i=1}^d w_i \tilde{W_i}. \end{aligned}$$
(12)
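
A small NumPy sketch of Eq. (12) is given below. It assumes binary, symmetrized KNN adjacency matrices and equal weights \(w_i = 1/d\) when none are supplied; neither choice is stated in the text, and the function names `knn_adjacency` and `integrate` are hypothetical.

```python
import numpy as np

def knn_adjacency(ir, k=5):
    """ir: anchor representation of shape (m_tilde, r). Returns a 0/1 (r, r) matrix."""
    sq = np.sum(ir * ir, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (ir.T @ ir)
    np.fill_diagonal(d2, np.inf)                      # exclude self-neighbours
    nearest = np.argsort(d2, axis=1)[:, :k]
    r = ir.shape[1]
    adj = np.zeros((r, r))
    adj[np.repeat(np.arange(r), k), nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)                     # symmetrize (assumption)

def integrate(adjacencies, weights=None):
    """Weighted sum of the per-institution adjacency matrices, as in Eq. (12)."""
    if weights is None:                               # equal weights (assumption)
        weights = np.full(len(adjacencies), 1.0 / len(adjacencies))
    return sum(w * a for w, a in zip(weights, adjacencies))

rng = np.random.default_rng(0)
# Synthetic anchor representations of two institutions with r = 30 anchors.
anchor_irs = [rng.normal(size=(3, 30)), rng.normal(size=(4, 30))]
W_tilde = integrate([knn_adjacency(ir) for ir in anchor_irs])
```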

The uth node of the graph associated with \({\tilde{W}}\) is denoted by \({\tilde{v}}_u\). The initial node, which is selected randomly, is represented by \(c_0\), and the jth node of the random walk is represented by \(c_j\). Then, \(c_j\) is sampled from the distribution shown below:

$$\begin{aligned} \text {Pr}(c_j = {\tilde{v}}_t|c_{j-1}={\tilde{v}}_{s}) = {\left\{ \begin{array}{ll} \dfrac{\pi _{st}}{C} &{} \text {if } {\tilde{W}}_{st} > 0,\\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(13)

C is a normalizing constant. \(\pi _{st}\) is the unnormalized transition probability from node s to node t, given by \(\pi _{st}=\alpha _{pq}(u,t){\tilde{W}}_{st}\), where \(u = c_{j-2}\) and \(\alpha \) is defined by

$$\begin{aligned} \alpha _{pq}(u,t) = {\left\{ \begin{array}{ll} \dfrac{1}{p} &{} \text {if } d_{ut} = 0,\\ 1 &{} \text {if } d_{ut} = 1,\\ \dfrac{1}{q} &{} \text {if } d_{ut} = 2, \end{array}\right. } \end{aligned}$$
(14)

where \(d_{ut}\) is the shortest-path distance (number of edges) between nodes u and t. p is called the return parameter and controls the likelihood of returning to the previous node u, and q is called the in-out parameter and controls the likelihood of moving away from node u. Let l be the length of the random walk; then \(c_0, c_1, c_2, ..., c_l\) is obtained. By applying a sufficient number of random walks to the skip-gram model, the vector representation of node \({\tilde{v}}_u\) is obtained. The integrated vectors are denoted by \(Z \in {\mathbb {R}}^{{\hat{m}} \times r}\), where \({\hat{m}}\) can be selected freely. In this study, \(p=0.5\) and \(q=1\) are used. The number of random walks is 1,000, and the walk length is 500.
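
The sampling rule of Eqs. (13) and (14) can be sketched in plain Python as follows. This is an illustration only: the function names and the toy graph are hypothetical, the first step (which has no previous node) is assumed to use the edge weights alone, and the skip-gram training on the resulting walks is omitted.

```python
import random

def search_bias(prev, nxt, graph, p, q):
    """alpha_pq of Eq. (14); prev plays the role of u and nxt that of t."""
    if nxt == prev:                 # d_ut = 0: return to the previous node
        return 1.0 / p
    if nxt in graph[prev]:          # d_ut = 1: stay near the previous node
        return 1.0
    return 1.0 / q                  # d_ut = 2: move away from the previous node

def biased_walk(graph, start, length, p=0.5, q=1.0, rng=None):
    """graph maps each node to {neighbour: positive weight}. Returns c_0, ..., c_l."""
    rng = rng or random.Random()
    walk = [start]
    for _ in range(length):
        cur = walk[-1]
        candidates = list(graph[cur])
        if not candidates:
            break
        if len(walk) == 1:
            # No previous node yet: use the edge weights alone (assumption).
            scores = [graph[cur][t] for t in candidates]
        else:
            prev = walk[-2]
            scores = [search_bias(prev, t, graph, p, q) * graph[cur][t]
                      for t in candidates]
        # random.choices normalizes the scores, which plays the role of C.
        walk.append(rng.choices(candidates, weights=scores, k=1)[0])
    return walk

# Toy undirected graph with unit weights (hypothetical, four nodes).
toy = {0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 2: 1.0},
       2: {0: 1.0, 1: 1.0, 3: 1.0}, 3: {2: 1.0}}
walk = biased_walk(toy, start=0, length=10, rng=random.Random(0))
```

With \(p=0.5\), stepping back to the previous node is weighted twice as heavily as moving to a common neighbour, so the walks tend to stay local.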

Experimental Settings

We perform numerical experiments with real medical data for verifying the following two tasks.

  • Task 1: The effect of the similarity of anchor data to raw data on the classification score is verified.

  • Task 2: The effect of the non-linearity of the reconverting function on the classification score is verified.

Task 1 is carried out with the reconverting function being linear or non-linear, and Task 2 is carried out with anchor data generated from random numbers or from virtual patient data.

Subjects and Anchor Data

This study analyzed 29,802 patients (mean age: 59.9; gender: 50.3% female) who were tested for HbA1c and random blood glucose. They were hospitalized at least once at the University of Tsukuba Hospital between 01/03/2013 and 30/09/2018. In accordance with the Ethical Guidelines for Medical and Health Research Involving Human Subjects, our research was carried out with opt-out consent and was approved by the Ethics Committee of the University of Tsukuba Hospital (Permission number: H30-187). We used only the maximum values of glucose and HbA1c for each patient. The other basic statistics are shown in Table 1.

The anchor data consist of 1,000 samples and were generated by three methods. The first set was generated from random numbers bounded by the maximum and minimum values of the original data. This type of anchor data was used in a previous study [26], and data adjusted to the statistical properties of real data have been applied to machine learning in medical settings [49]. The second set was generated by a generative adversarial network (GAN) [9, 16, 17], which generates data similar to the raw data in terms of statistical distribution. The raw data of 1,000 randomly selected patients were used for the GAN, and 1,000 virtual patients were obtained as the anchor data. The statistics of the anchor data generated by the GAN are shown in Table 1. The third set is a randomly selected part of the raw data, used as anchor data for verification.
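
The first type of anchor data can be sketched as follows. The text does not state the distribution, so uniform sampling between the per-feature minimum and maximum is an assumption here, as are the function name `random_anchor` and the toy values, which are made up and are not patient data. Fixing the seed illustrates how institutions that share a seed obtain identical anchor data.

```python
import numpy as np

def random_anchor(X_raw, r=1000, seed=0):
    """X_raw: (m, n) data. Returns (m, r) anchor data within per-feature bounds."""
    rng = np.random.default_rng(seed)
    low = X_raw.min(axis=1, keepdims=True)
    high = X_raw.max(axis=1, keepdims=True)
    # Uniform sampling is an assumption; the source only says "bounded".
    return rng.uniform(low, high, size=(X_raw.shape[0], r))

# Made-up values for two features (rows), four samples (columns).
X_toy = np.array([[5.0, 6.5, 7.2, 9.1],
                  [80.0, 120.0, 200.0, 310.0]])
anchor = random_anchor(X_toy, r=1000, seed=42)
```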

Table 1 The statistics of raw data and 1000 virtual patients

Similarity of Virtual Data (Anchor Data) to Raw Data

Anchor data similar to raw data have not yet been investigated in the context of data collaboration. This study generates several types of virtual patient data and verifies their similarity to the raw data via the earth mover’s distance (EMD) [47]. The EMD is calculated for three datasets composed of random numbers, the virtual patients, and the raw data. Each dataset has 100 data samples, which are chosen randomly five times. The EMD is calculated between the raw dataset and each of the others, and the mean EMDs over the five repetitions are shown in Table 2. The virtual patient data are similar to the raw data, as expected, and we evaluate the performance of data collaboration with respect to these data.
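
To illustrate the repeated-sampling procedure, the sketch below computes a one-dimensional EMD for a single feature. This is a simplification and an assumption: the EMD of [47] is defined for general distributions, and how the features were combined in this study is not stated. For two equal-size samples with uniform weights, the one-dimensional EMD equals the mean absolute difference of the sorted values. The function names are hypothetical.

```python
import random
from statistics import mean

def emd_1d(a, b):
    """EMD between two equal-size 1-D samples with uniform weights."""
    if len(a) != len(b):
        raise ValueError("equal sample sizes are assumed in this sketch")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def mean_emd(raw, other, n_samples=100, n_repeats=5, seed=0):
    """Mean EMD over repeated random subsamples of both datasets."""
    rng = random.Random(seed)
    return mean(
        emd_1d(rng.sample(raw, n_samples), rng.sample(other, n_samples))
        for _ in range(n_repeats)
    )

# Synthetic single-feature values (not patient data).
gen = random.Random(1)
raw_feature = [gen.gauss(6.0, 1.0) for _ in range(1000)]
close_feature = [gen.gauss(6.0, 1.0) for _ in range(1000)]
far_feature = [gen.uniform(0.0, 20.0) for _ in range(1000)]
d_close = mean_emd(raw_feature, close_feature)
d_far = mean_emd(raw_feature, far_feature)
```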

Calculated data of virtual patients and their intermediate representations are shown in the additional information.

Table 2 Mean of earth-movers’ distance between raw and target data

Evaluation of the Performance of the Classification Task

The classification of DM is carried out next. We designed two types of settings for collaboration: in the first, the number of collaborating institutions increases while each hospital has the same amount of medical data; in the second, the number of divisions increases while the total size of the data is fixed.

In the first setting, the raw data are divided so that each institution has 40 samples, and the number of institutions, which are regarded as independent hospitals, is increased up to 25. In the second setting, the total size of the data is fixed and the number of divisions is increased in steps of 20 hospitals from 2 to 202. We used 14,401 samples for the training data and the same number for the test data. Therefore, the size of the dataset per hospital ranges from 1402 to 144, where half of the samples are used as test samples.

For the individual analysis, the median score among all institutions was adopted to avoid the effect of data selection bias. The integrated analysis that shares all raw data is called “ALL-RAW”, and the individual analysis that uses only the raw data at one institution is called “EACH-RAW”. Further, the results of data collaboration are separated by the two types of reconverting functions and anchor data: “SVD-RANDOM” and “SVD-GAN”, and “Node2Vec-RANDOM” and “Node2Vec-GAN”. We adopt logistic regression with the L2 penalty as the classification method, and the area under the receiver operating characteristic (ROC) curve is calculated as the evaluation metric.
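
As a small illustration of the evaluation, the sketch below computes the area under the ROC curve from its pairwise-ranking definition and takes the median over institutions, as described for “EACH-RAW”. The labels and scores are made up for illustration, the function name `roc_auc` is hypothetical, and the classifier itself is omitted.

```python
from statistics import median

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up (labels, predicted scores) for three hypothetical institutions.
per_institution = [
    ([0, 1, 0, 1], [0.2, 0.9, 0.4, 0.6]),
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([1, 0, 1, 0], [0.3, 0.5, 0.7, 0.6]),
]
aucs = [roc_auc(y, s) for y, s in per_institution]
each_raw = median(aucs)
```

The pairwise form is quadratic in the sample size, which is acceptable for a sketch but not for large test sets.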

Result

Figure 2 shows the area under the ROC curve for the case where the number of collaborative hospitals increases, and Fig. 3 shows the case where the number of divisions increases. The horizontal axis indicates the number of hospitals with the same amount of data for each hospital (Fig. 2) or the number of divisions with a decreasing amount of data for each hospital (Fig. 3). The vertical axis indicates the area under the ROC curve for the results obtained by data collaboration. The blue and light blue lines indicate “ALL-RAW” and “EACH-RAW”, respectively, which serve as baselines for comparing the scores. The green line represents the non-linear reconverting function, Node2Vec, and the red line represents the linear reconverting function, SVD, which was originally proposed by Imakura [26]. Light-colored as well as broken lines indicate the results of using anchor data generated randomly. Finally, the lightly colored area around each line represents the standard error over 10 trials.

To verify Task 1, scores with the same reconverting functions (i.e., “Node2Vec-GAN” and “Node2Vec-RANDOM”, or “SVD-GAN” and “SVD-RANDOM”) are compared. In addition, to verify Task 2, scores with the same type of anchor data (i.e., “Node2Vec-GAN” and “SVD-GAN”) are compared. The final results for the scores from 25 hospitals in Fig. 2 and 202 hospitals in Fig. 3 are shown in Table 3.

Table 3 Results of the analysis of 25 hospitals (unit: %)

Fig. 2 The performance of identifying DM subjects depending on SVD and Node2Vec with Ridge regression

Fig. 3 The performance of identifying DM subjects depending on SVD and Node2Vec with Ridge regression

As expected, the scores of data collaboration are lower than that of the centralized analysis “ALL-RAW” (see Figs. 2 and 3). First, to verify Task 1, the different types of anchor data (i.e., “SVD-RANDOM” vs “SVD-GAN” and “Node2Vec-RANDOM” vs “Node2Vec-GAN”) are compared. Under both experimental conditions (Figs. 2 and 3), the AUC scores of the data collaboration method using GAN-anchors are greater than those using RANDOM-anchors. Specifically, the AUC scores with GAN-anchors improve by more than 10% compared to RANDOM-anchors in all cases in Table 3. Therefore, the result demonstrates that the similarity of the anchor data to real medical data improves the performance of data collaboration.

Next, we verify Task 2 by comparing the linear and non-linear reconverting functions, SVD and Node2Vec, respectively. As shown in Fig. 2, data collaboration with non-linear reconversion outperformed linear reconversion in terms of the AUC score. In Fig. 2, the score of Node2Vec-GAN with 25 collaborative hospitals is about 10% higher than that of “EACH-RAW”. In contrast, the score of SVD-GAN exceeds it only slightly, by 3%. In addition, the AUC scores for SVD for all numbers of divisions in Fig. 3 are lower than that of the individual analysis, “EACH-RAW”. These results suggest that the SVD reconverting function is insufficient for applying data collaboration analysis to medical data. One conceivable reason is that medical data frequently contain extreme values as abnormal findings, and SVD may have removed them as noise [48]. Interestingly, as shown in Fig. 3, the performance of data collaboration with Node2Vec-GAN turns upward when the number of hospitals grows over 140 and the size of data per hospital is reduced to 100. This suggests that Node2Vec is robust to a large number of participants with small data sizes. From these results, we conclude that anchor data following the real data distribution and an appropriate non-linear reconverting function improve the performance of data collaboration analysis.

Summary and Discussion

In this study, the data collaboration analysis has been applied to medical data for diagnosis of diabetes mellitus (DM). The following conditions play an important role for the performance of the analysis:

  1. The distribution of anchor data is similar to that of raw data.

  2. The reconverting function is non-linear.

Both were satisfied in the present analysis, and the score in the identification of DM by the proposed method was up to 20% higher than that by the previous method. This study introduced a way of applying data collaboration analysis that is suitable for medical data.

Adequate analysis of medical data requires a high degree of accuracy together with privacy protection. This study achieved both high accuracy and privacy preservation in disease classification by using dimension reduction methods and appropriate non-linear transformations. A problem of the proposed method is that the analysis cannot be interpreted medically. For example, the relation between the classification and the non-linearly transformed features (e.g., patients with higher HbA1c being classified as DM) is uncertain. This problem should be addressed in further experiments.

The limitations of this study are that the data were obtained from a single hospital and divided into datasets treated as virtual hospitals. Furthermore, our approach was evaluated only with numerical data, whereas distributed clinical data may include images and text. The applicability of the proposed method to data gathered from many independent hospitals should be verified in the next stage.

As further future work, we should consider the case where the dimension of the virtual patient data generated by the GAN differs among institutions. This would contribute to real medical data analyses, since the output from each piece of medical equipment would differ in practice. Additionally, virtual patient data would contain values in the normal range for a given parameter (e.g., age, disease threshold). On the other hand, the virtual patient data can be made to include abnormal values associated with some disease. These virtual patients can also be used for educational purposes.